Traces in evaluation result are only saving the last prompt per trace name #1871

VladTabakovCortea · 2025-01-22T14:53:00Z

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
The evaluation result's traces property keeps only one trace per trace.name, which is inconsistent with ragas_traces. For example, FactualCorrectness makes multiple calls to decompose_claims, but only the last one is kept.

Ragas version: 0.2.11
Python version: 3.11

Code to Reproduce

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness
from ragas import evaluate
from datasets import Dataset

dataset = Dataset.from_dict({"response": ["Eifel tower is in Paris"], "reference": ["Paris, France is a city where Eifel tower is located"]})
fc = FactualCorrectness()
score = evaluate(
    dataset, metrics=[fc]
)
print(score.scores)
# [{'factual_correctness': 0.67}]
print(score.traces[0]['factual_correctness'])

{'claim_decomposition_prompt': {'input': ClaimDecompositionInput(response='Paris, France is a city where Eifel tower is located'),
                                'output': ClaimDecompositionOutput(claims=['Paris is a city in France.', 'The Eiffel Tower is located in Paris.'])},
 'n_l_i_statement_prompt': {'input': NLIStatementInput(context='Eifel tower is in Paris', statements=['Paris is a city in France.', 'The Eiffel Tower is located in Paris.']),
                            'output': NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Paris is a city in France.', reason='The statement about Paris being a city in France is a well-known fact and can be inferred from general knowledge, but it is not directly stated in the context provided.', verdict=0), StatementFaithfulnessAnswer(statement='The Eiffel Tower is located in Paris.', reason='The context explicitly states that the Eiffel Tower is in Paris, making this statement directly inferable.', verdict=1)])}}

As you can see in the example, the traces for that metric include only one entry per prompt name instead of two. The code calls each of these prompts twice, which can be confirmed by checking score.ragas_traces:

In [34]: [score.metadata for score in score.ragas_traces.values()]
Out[34]: 
[{'type': <ChainType.EVALUATION: 'evaluation'>},
 {'type': <ChainType.ROW: 'row'>, 'row_index': 0},
 {'type': <ChainType.METRIC: 'metric'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>},
 {'type': <ChainType.RAGAS_PROMPT: 'ragas_prompt'>}]
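The mismatch can be made explicit by counting run types. The following is a minimal sketch: the ChainType enum is a stand-in that mirrors the values printed above rather than an import from ragas, and the metadata list is copied from the Out[34] listing.

```python
from collections import Counter
from enum import Enum


class ChainType(Enum):
    # Stand-in mirroring the values printed above; not imported from ragas.
    EVALUATION = "evaluation"
    ROW = "row"
    METRIC = "metric"
    RAGAS_PROMPT = "ragas_prompt"


# Metadata copied from the Out[34] listing above.
metadata = [
    {"type": ChainType.EVALUATION},
    {"type": ChainType.ROW, "row_index": 0},
    {"type": ChainType.METRIC},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
    {"type": ChainType.RAGAS_PROMPT},
]

counts = Counter(entry["type"] for entry in metadata)
recorded_prompt_runs = counts[ChainType.RAGAS_PROMPT]

# score.traces[0]['factual_correctness'] above holds two keys.
kept_prompt_entries = 2

print(recorded_prompt_runs, kept_prompt_entries)  # 4 2
```

Against a real result, the same count could presumably be taken over the metadata of score.ragas_traces.values(), as in the In [34] expression.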

Error trace

Expected behavior
result.traces contains ALL traces per prompt, including traces that share the same name.

Additional context
I think the problem is in ragas/callbacks.py::parse_run_traces(), line 158: it assigns by name, so if several traces with the same name occur in the same call, only the last one is saved.
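The overwrite can be reproduced without ragas. This is a minimal sketch under assumptions: FakePromptTrace is a hypothetical stand-in for a prompt-level trace (not the ragas ChainRun class), and the input and output values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FakePromptTrace:
    """Hypothetical stand-in for a prompt-level trace; not the ragas ChainRun class."""

    name: str
    inputs: dict
    outputs: dict


# Two calls that share one prompt name, as in the report above.
prompt_runs = [
    FakePromptTrace("claim_decomposition_prompt", {"data": "response text"}, {"output": "claims A"}),
    FakePromptTrace("claim_decomposition_prompt", {"data": "reference text"}, {"output": "claims B"}),
]

prompt_traces = {}
for run in prompt_runs:
    # Keyed by name alone, so the second assignment replaces the first.
    prompt_traces[run.name] = {
        "input": run.inputs.get("data", {}),
        "output": run.outputs.get("output", {}),
    }

print(len(prompt_runs))    # 2
print(len(prompt_traces))  # 1
print(prompt_traces["claim_decomposition_prompt"]["input"])  # reference text
```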

I couldn't find anything in the open issues that matches this, so I assume it is a new one. It also seems like unexpected behaviour; please let me know otherwise.


shahules786 · 2025-01-23T04:32:52Z

Hey @VladTabakovCortea, thanks for reporting this. This seems to be a bug to me. @jjmachan, can you check where this is coming from?

Vidit-Ostwal · 2025-01-23T17:59:30Z

@shahules786, @jjmachan

This is because the "prompt_traces" dictionary uses "prompt_trace.name" as the key, which is not unique when the same prompt is called more than once.

The repeated prompt calls are in ragas.metrics._factual_correctness, lines 281, 282, 287, 288.

The key assignment is in ragas/callbacks.py, lines 170, 171, 172:

def parse_run_traces(
    traces: t.Dict[str, ChainRun],
    parent_run_id: t.Optional[str] = None,
) -> t.List[t.Dict[str, t.Any]]:
    

    root_traces = [
        chain_trace
        for chain_trace in traces.values()
        if chain_trace.parent_run_id == parent_run_id
    ]

    if len(root_traces) > 1:
        raise ValueError(
            "Multiple root traces found! This is a bug on our end, please file an issue and we will fix it ASAP :)"
        )
    root_trace = root_traces[0]

    # get all the row traces
    parased_traces = []
    for row_uuid in root_trace.children:
        row_trace = traces[row_uuid]
        metric_traces = MetricTrace()
        for metric_uuid in row_trace.children:
            metric_trace = traces[metric_uuid]
            metric_traces.scores[metric_trace.name] = metric_trace.outputs.get(
                "output", {}
            )
            # get all the prompt IO from the metric trace
            prompt_traces = {}
            for i, prompt_uuid in enumerate(metric_trace.children):
                prompt_trace = traces[prompt_uuid]
                output = prompt_trace.outputs.get("output", {})
                output = output[0] if isinstance(output, list) else output
                prompt_traces[f"{prompt_trace.name}"] = {
                    "input": prompt_trace.inputs.get("data", {}),
                    "output": output,
                }
            metric_traces[f"{metric_trace.name}"] = prompt_traces
        parased_traces.append(metric_traces)

    return parased_traces

A quick fix I would like to suggest is to append four characters of the run_id to the key to make it unique. The snippet below uses run_id[:4], i.e. the first four characters.

            for i, prompt_uuid in enumerate(metric_trace.children):
                prompt_trace = traces[prompt_uuid]
                output = prompt_trace.outputs.get("output", {})
                output = output[0] if isinstance(output, list) else output
                prompt_traces[f"{prompt_trace.name}_{prompt_trace.run_id[:4]}"] = {
                    "input": prompt_trace.inputs.get("data", {}),
                    "output": output,
                }
            metric_traces[f"{metric_trace.name}"] = prompt_traces

which results in output like this:

{'claim_decomposition_prompt_8e37': {'input': ClaimDecompositionInput(response='Eifel tower is in Paris', sentences=['Eifel tower is in Paris']),
                                     'output': ClaimDecompositionOutput(decomposed_claims=[['Eiffel tower is in Paris']])},
 'n_l_i_statement_prompt_e622': {'input': NLIStatementInput(context='Paris, France is a city where Eifel tower is located', statements=['Eiffel tower is in Paris']),
                                 'output': NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Eiffel tower is in Paris', reason='The context explicitly states that Eiffel tower is located in Paris, France.', verdict=1)])},
 'claim_decomposition_prompt_b5de': {'input': ClaimDecompositionInput(response='Paris, France is a city where Eifel tower is located', sentences=['Paris, France is a city where Eifel tower is located']),
                                     'output': ClaimDecompositionOutput(decomposed_claims=[['Paris is a city in France.'], ['Eiffel Tower is located in Paris.']])},
 'n_l_i_statement_prompt_c96e': {'input': NLIStatementInput(context='Eifel tower is in Paris', statements=['Paris is a city in France.', 'Eiffel Tower is located in Paris.']),
                                 'output': NLIStatementOutput(statements=[StatementFaithfulnessAnswer(statement='Paris is a city in France.', reason='The context does not provide any information about the location of Paris.', verdict=0), StatementFaithfulnessAnswer(statement='Eiffel Tower is located in Paris.', reason='The context explicitly states that the Eiffel Tower is in Paris.', verdict=1)])}}

Please provide your feedback on this solution.
Do you have any alternative approaches that might be more effective?
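Two possible alternatives are sketched below, under assumptions. A four-character slice of a run id is not guaranteed to be unique, so the sketch instead uses either the call's position among the metric's children (the loop above already has this as i from enumerate) or a list of calls per prompt name. The prompt_runs records are hypothetical placeholders, not ragas objects, and the second option changes the shape of the stored value, which may affect code that reads result.traces.

```python
from collections import defaultdict

# Hypothetical (name, input, output) records standing in for prompt-level traces.
prompt_runs = [
    ("claim_decomposition_prompt", "response text", "claims A"),
    ("n_l_i_statement_prompt", "reference as context", "verdicts A"),
    ("claim_decomposition_prompt", "reference text", "claims B"),
    ("n_l_i_statement_prompt", "response as context", "verdicts B"),
]

# Option 1: suffix each key with the call's position, which is unique by construction.
indexed = {
    f"{name}_{i}": {"input": inp, "output": out}
    for i, (name, inp, out) in enumerate(prompt_runs)
}

# Option 2: keep the plain prompt name as key and collect every call in a list.
grouped = defaultdict(list)
for name, inp, out in prompt_runs:
    grouped[name].append({"input": inp, "output": out})

print(sorted(indexed))
print({name: len(calls) for name, calls in grouped.items()})
```

Option 1 keeps the flat dictionary shape and gives deterministic keys across runs; option 2 keeps the existing key names but stores a list under each.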

VladTabakovCortea added the bug Something isn't working label Jan 22, 2025

dosubot bot added the module-metrics this is part of metrics module label Jan 22, 2025

VladTabakovCortea changed the title ~~Traces in evaluation result are only saving the last prompt per metric name~~ Traces in evaluation result are only saving the last prompt per trace name Jan 22, 2025

shahules786 assigned jjmachan Jan 23, 2025

Vidit-Ostwal mentioned this issue Jan 25, 2025

Changed the parse_run_traces to include last 4 letters of run_id #1880

Open
