Traces in evaluation result are only saving the last prompt per trace name #1871
Comments
Hey @VladTabakovCortea, thanks for reporting this; it looks like a bug to me. @jjmachan, can you check where this is coming from?
This happens because the `prompt_traces` dictionary uses `prompt_trace.name` as its key, which is not unique when the same prompt is called more than once.
As a quick solution, I would suggest appending the last 4 characters of the run_id to the key to make it unique, which results in an output like this:
Please provide your feedback on this solution.
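A minimal sketch of the suggested fix, assuming hypothetical prompt names and run ids (the helper `unique_trace_key` is illustrative, not Ragas API):

```python
# Hypothetical sketch: disambiguate duplicate prompt-trace names by
# appending the last 4 characters of the run_id to prompt_trace.name.
def unique_trace_key(name: str, run_id: str) -> str:
    return f"{name}_{run_id[-4:]}"

# Two calls to the same prompt, with made-up run ids.
calls = [
    ("decompose_claims", "f47ac10b-58cc-4372-a567-0e02b2c3aaaa"),
    ("decompose_claims", "f47ac10b-58cc-4372-a567-0e02b2c3bbbb"),
]

prompt_traces = {}
for name, run_id in calls:
    prompt_traces[unique_trace_key(name, run_id)] = f"trace for {name}"

# Both calls now survive instead of the second overwriting the first.
assert sorted(prompt_traces) == ["decompose_claims_aaaa", "decompose_claims_bbbb"]
```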
[x] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
The evaluation result's `traces` property only keeps one trace per `trace.name`, which is inconsistent with `ragas_traces`. In `FactualCorrectness`, for example, there are multiple calls to `decompose_claims`.

Ragas version: 0.2.11
Python version: 3.11
Code to Reproduce
As you can see, the example traces for that metric include only one prompt per prompt name instead of 2 as they should, since the code calls these metrics twice. This can be verified by checking `score.ragas_traces`.
Error trace
Expected behavior
`result.traces` should contain ALL traces per prompt, including metrics with the same name.
Additional context
I think the problem is in `ragas/callbacks.py::parse_run_traces()` at line 158: it assigns by metric name, so if multiple prompts share a name in the same call, only the last trace is saved.
I couldn't find anything in the open issues matching mine, so I assumed this was new; it does seem like unexpected behaviour, but please let me know otherwise.
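A minimal sketch (with made-up trace values, not the actual `parse_run_traces()` code) of why keying by name alone drops earlier traces, and one possible fix that accumulates them into lists:

```python
from collections import defaultdict

# Hypothetical data: one metric calling the same prompt twice.
calls = [("decompose_claims", "trace 1"), ("decompose_claims", "trace 2")]

# Current behavior: a plain dict keyed by prompt name keeps only the last trace.
by_name = {}
for name, trace in calls:
    by_name[name] = trace
assert by_name == {"decompose_claims": "trace 2"}

# One possible fix: accumulate every trace under its name.
by_name_all = defaultdict(list)
for name, trace in calls:
    by_name_all[name].append(trace)
assert by_name_all["decompose_claims"] == ["trace 1", "trace 2"]
```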