Traces in evaluation result are only saving the last prompt per trace name #1871

Open
VladTabakovCortea opened this issue Jan 22, 2025 · 2 comments
Assignees
Labels
bug Something isn't working module-metrics this is part of metrics module

Comments

@VladTabakovCortea

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
The traces property of the evaluation result keeps only one trace per trace name, which is inconsistent with ragas_traces. For example, FactualCorrectness calls decompose_claims multiple times, but only one of those calls appears in traces.

Ragas version: 0.2.11
Python version: 3.11

Code to Reproduce

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({"response": ["Eifel tower is in Paris"], "reference": ["Paris, France is a city where Eifel tower is located"]})
fc = FactualCorrectness()
score = evaluate(
    dataset, metrics=[fc]
)
print(score.scores)
# [{'factual_correctness': 0.67}]
print(score.traces[0]['factual_correctness'])
{'claim_decomposition_prompt': {'input': ClaimDecompositionInput(response='Paris, France is a city where Eifel tower is located'),
                                'output': ClaimDecompositionOutput(claims=['Paris is a city in France.', 'The Eiffel Tower is located in Paris.'])},
 'n_l_i_statement_prompt': {'input': NLIStatementInput(context='Eifel tower is in Paris', statements=['Paris is a city in France.', 'The Eiffel Tower is located in Paris.']),
                            'output': NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Paris is a city in France.', reason='The statement about Paris being a city in France is a well-known fact and can be inferred from general knowledge, but it is not directly stated in the context provided.', verdict=0), StatementFaithfulnessAnswer(statement='The Eiffel Tower is located in Paris.', reason='The context explicitly states that the Eiffel Tower is in Paris, making this statement directly inferable.', verdict=1)])}}

As the example shows, the traces for that metric contain only one entry per prompt name instead of the expected two. The code calls each of these prompts twice, which can be confirmed by checking score.ragas_traces:

In [34]: [score.metadata for score in score.ragas_traces.values()]
Out[34]: 
[{'type': <ChainType.EVALUATION: 'evaluation'>},
 {'type': <ChainType.ROW: 'row'>, 'row_index': 0},
 {'type': <ChainType.METRIC: 'metric'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>}]
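
The listing above can be tallied by type to make the mismatch explicit. This is only a minimal sketch: it does not import ragas, the ChainType enum below is a stand-in that mirrors the values printed above, and the metadata list is copied by hand from that output.

```python
from collections import Counter
from enum import Enum


class ChainType(Enum):
    """Stand-in enum mirroring the values printed above (not imported from ragas)."""
    EVALUATION = "evaluation"
    ROW = "row"
    METRIC = "metric"
    RAGAS_PROMPT = "ragas_prompt"


# Metadata copied by hand from the Out[34] listing above.
metadata = [
    {"type": ChainType.EVALUATION},
    {"type": ChainType.ROW, "row_index": 0},
    {"type": ChainType.METRIC},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
]

# Count how many runs of each type were recorded.
counts = Counter(entry["type"] for entry in metadata)
prompt_runs = counts[ChainType.RAGAS_PROMPT]
print(prompt_runs)
```

Four prompt runs are recorded in ragas_traces, while score.traces[0]['factual_correctness'] above holds only two entries.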

Error trace

Expected behavior
result.traces should contain ALL traces per prompt, including those from repeated calls to a prompt with the same name.

Additional context
I think the problem is in ragas/callbacks.py::parse_run_traces(), line 158: it assigns by trace name, so if the same prompt is called multiple times within one metric, only the last trace is saved.

I couldn't find anything in the open issues that matches this, so I assumed it was new. It also seems like unexpected behaviour; please let me know if it is intended.
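
The overwrite described above can be reproduced without ragas. The sketch below is an illustration only: the list of (name, payload) tuples is a made-up stand-in for the four prompt runs, and the payload strings are placeholders rather than real trace contents.

```python
# Hypothetical stand-in for the four prompt runs of one metric, in call order.
prompt_calls = [
    ("claim_decomposition_prompt", "call 1"),
    ("n_l_i_statement_prompt", "call 1"),
    ("claim_decomposition_prompt", "call 2"),
    ("n_l_i_statement_prompt", "call 2"),
]

# Keying a plain dict by name means a later call replaces an earlier one.
prompt_traces = {}
for name, payload in prompt_calls:
    prompt_traces[name] = payload

print(len(prompt_calls), len(prompt_traces))
print(prompt_traces["claim_decomposition_prompt"])
```

Four calls go in, two entries come out, and each surviving entry belongs to the last call with that name.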

@VladTabakovCortea VladTabakovCortea added the bug Something isn't working label Jan 22, 2025
@dosubot dosubot bot added the module-metrics this is part of metrics module label Jan 22, 2025
@VladTabakovCortea VladTabakovCortea changed the title from "Traces in evaluation result are only saving the last prompt per metric name" to "Traces in evaluation result are only saving the last prompt per trace name" Jan 22, 2025
@shahules786
Member

Hey @VladTabakovCortea, thanks for reporting this. This looks like a bug to me. @jjmachan, can you check where this is coming from?

@Vidit-Ostwal
Contributor

Vidit-Ostwal commented Jan 23, 2025

@shahules786, @jjmachan

This happens because the "prompt_traces" dictionary uses "prompt_trace.name" as its key, and that name is not unique when the same prompt is called more than once.

The repeated prompt calls are in ragas/metrics/_factual_correctness.py, lines 281, 282, 287, and 288.

[Image: screenshot of those lines]

The assignment is in ragas/callbacks.py, lines 170-172:

def parse_run_traces(
    traces: t.Dict[str, ChainRun],
    parent_run_id: t.Optional[str] = None,
) -> t.List[t.Dict[str, t.Any]]:
    
    print(traces)
    print(parent_run_id)

    root_traces = [
        chain_trace
        for chain_trace in traces.values()
        if chain_trace.parent_run_id == parent_run_id
    ]

    if len(root_traces) > 1:
        raise ValueError(
            "Multiple root traces found! This is a bug on our end, please file an issue and we will fix it ASAP :)"
        )
    root_trace = root_traces[0]

    # get all the row traces
    parased_traces = []
    for row_uuid in root_trace.children:
        row_trace = traces[row_uuid]
        metric_traces = MetricTrace()
        for metric_uuid in row_trace.children:
            metric_trace = traces[metric_uuid]
            metric_traces.scores[metric_trace.name] = metric_trace.outputs.get(
                "output", {}
            )
            # get all the prompt IO from the metric trace
            prompt_traces = {}
            for i, prompt_uuid in enumerate(metric_trace.children):
                prompt_trace = traces[prompt_uuid]
                output = prompt_trace.outputs.get("output", {})
                output = output[0] if isinstance(output, list) else output
                prompt_traces[f"{prompt_trace.name}"] = {
                    "input": prompt_trace.inputs.get("data", {}),
                    "output": output,
                }
            metric_traces[f"{metric_trace.name}"] = prompt_traces
        parased_traces.append(metric_traces)

    return parased_traces

A quick solution I would like to suggest is to append the first four characters of the run_id to the key, to make it unique:

            for i, prompt_uuid in enumerate(metric_trace.children):
                prompt_trace = traces[prompt_uuid]
                output = prompt_trace.outputs.get("output", {})
                output = output[0] if isinstance(output, list) else output
                prompt_traces[f"{prompt_trace.name}_{prompt_trace.run_id[:4]}"] = {
                    "input": prompt_trace.inputs.get("data", {}),
                    "output": output,
                }
            metric_traces[f"{metric_trace.name}"] = prompt_traces

which results in output like this:

{'claim_decomposition_prompt_8e37': {'input': ClaimDecompositionInput(response='Eifel tower is in Paris', sentences=['Eifel tower is in Paris']), 'output': ClaimDecompositionOutput(decomposed_claims=[['Eiffel
tower is in Paris']])}, 'n_l_i_statement_prompt_e622': {'input': NLIStatementInput(context='Paris, France is a city where Eifel tower is located', statements=['Eiffel tower is in Paris']), 'output':
NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Eiffel tower is in Paris', reason='The context explicitly states that Eiffel tower is located in Paris, France.', verdict=1)])},
'claim_decomposition_prompt_b5de': {'input': ClaimDecompositionInput(response='Paris, France is a city where Eifel tower is located', sentences=['Paris, France is a city where Eifel tower is located']),
'output': ClaimDecompositionOutput(decomposed_claims=[['Paris is a city in France.'], ['Eiffel Tower is located in Paris.']])}, 'n_l_i_statement_prompt_c96e': {'input': NLIStatementInput(context='Eifel tower is
in Paris', statements=['Paris is a city in France.', 'Eiffel Tower is located in Paris.']), 'output': NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Paris is a city in France.',
reason='The context does not provide any information about the location of Paris.', verdict=0), StatementFaithfulnessAnswer(statement='Eiffel Tower is located in Paris.', reason='The context explicitly states
that the Eiffel Tower is in Paris.', verdict=1)])}}

Please provide your feedback on this solution.
Do you have any alternative approaches that might be more effective?
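
One possible alternative, offered only as a sketch, is to keep the prompt name as the key and store a list of calls under it, so that call order is preserved and no suffix is needed. The StubPromptTrace class and the group_prompt_traces helper below are hypothetical names introduced for illustration; they are not part of ragas, and the stub only imitates the name, inputs, and outputs attributes used in parse_run_traces above.

```python
import typing as t
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubPromptTrace:
    """Hypothetical stand-in for a prompt-level trace (not the ragas ChainRun class)."""
    name: str
    inputs: t.Dict[str, t.Any] = field(default_factory=dict)
    outputs: t.Dict[str, t.Any] = field(default_factory=dict)


def group_prompt_traces(
    prompt_runs: t.Iterable[StubPromptTrace],
) -> t.Dict[str, t.List[t.Dict[str, t.Any]]]:
    """Collect every call under its prompt name instead of overwriting."""
    grouped: t.Dict[str, t.List[t.Dict[str, t.Any]]] = defaultdict(list)
    for prompt_trace in prompt_runs:
        # Same unwrapping as in parse_run_traces: take the first item of a list output.
        output = prompt_trace.outputs.get("output", {})
        output = output[0] if isinstance(output, list) else output
        grouped[prompt_trace.name].append(
            {"input": prompt_trace.inputs.get("data", {}), "output": output}
        )
    return dict(grouped)


# Placeholder runs: two calls per prompt, as in FactualCorrectness.
runs = [
    StubPromptTrace("claim_decomposition_prompt", {"data": "response"}, {"output": ["claims A"]}),
    StubPromptTrace("n_l_i_statement_prompt", {"data": "claims A"}, {"output": ["verdicts A"]}),
    StubPromptTrace("claim_decomposition_prompt", {"data": "reference"}, {"output": ["claims B"]}),
    StubPromptTrace("n_l_i_statement_prompt", {"data": "claims B"}, {"output": ["verdicts B"]}),
]

grouped = group_prompt_traces(runs)
print({name: len(calls) for name, calls in grouped.items()})
```

The trade-off is that each value becomes a list rather than a single dict, which would change the shape of result.traces for existing consumers; the run_id suffix keeps the flat shape but produces keys that cannot be predicted in advance.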


4 participants