Hi, thanks for your great work and neat code! I tried to reproduce the results in Table 3 of the ProtoQA paper with the command:

protoqa_evaluator evaluate --similarity_function wordnet data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.gpt2finetuned.json

However, I obtained slightly different results. For example, I got:
Evaluating Max Incorrect - 1...
Max Incorrect - 1: 0.23908368645487507
Evaluating Max Incorrect - 3...
Max Incorrect - 3: 0.4145232659361979
Evaluating Max Incorrect - 5...
Max Incorrect - 5: 0.4740800451445922
However, the reported results in Table 3 for Max Incorrect values of 1, 3, and 5 are 26.1, 41.7, and 48.2, respectively (my scores correspond to 23.9, 41.5, and 47.4 when converted to percentages).
I also tried another similarity function, exact_match, and another prediction file, dev.predictions.human.jsonl, and still could not reproduce the corresponding results in the paper. I am not sure whether the discrepancy comes from randomness in each run or from a setting I am missing. Could you please clarify how to reproduce the results correctly? Thank you very much.
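In case it helps pinpoint where my setup diverges: here is my rough mental model of what a wordnet similarity function does, as a minimal sketch using NLTK's WordNet corpus. The function name, threshold, and matching rule below are my own assumptions for illustration, not the actual protoqa_evaluator implementation.

```python
# Minimal sketch of a WordNet-based answer matcher, assuming NLTK's
# wordnet corpus is available (nltk.download("wordnet")). This is NOT
# the protoqa_evaluator implementation -- only an illustration of why
# wordnet scoring can accept answers that exact_match would reject.
from nltk.corpus import wordnet as wn

def wordnet_match(predicted: str, gold: str, threshold: float = 0.9) -> bool:
    """Treat two answers as a match if any pair of their synsets is
    sufficiently similar under WordNet path similarity.

    path_similarity is 1/(1 + path length), so a threshold of 0.9
    effectively requires the two words to share a synset.
    """
    if predicted == gold:  # exact_match behavior as a fast path
        return True
    for pred_syn in wn.synsets(predicted):
        for gold_syn in wn.synsets(gold):
            sim = pred_syn.path_similarity(gold_syn)
            if sim is not None and sim >= threshold:
                return True
    return False

print(wordnet_match("car", "automobile"))  # True: both share the synset car.n.01
print(wordnet_match("car", "banana"))      # False: no sufficiently similar synsets
```

If the real matching rule differs substantially from this sketch (e.g., a different similarity measure or threshold), that alone might explain part of the gap I am seeing.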