Inaccuracies in reproducing the reported results in Table 3 #5

Open
araloak opened this issue Jul 1, 2024 · 0 comments
araloak commented Jul 1, 2024

Hi, thanks for your great work and neat code! I tried to reproduce the results in Table 3 of the ProtoQA paper with the following command:

protoqa_evaluator evaluate --similarity_function wordnet data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.gpt2finetuned.json

However, I obtained slightly different results. For example, I got:

Evaluating Max Incorrect - 1...
Max Incorrect - 1: 0.23908368645487507
Evaluating Max Incorrect - 3...
Max Incorrect - 3: 0.4145232659361979
Evaluating Max Incorrect - 5...
Max Incorrect - 5: 0.4740800451445922

In contrast, the results reported in Table 3 for Max Incorrect values 1, 3, and 5 are 26.1, 41.7, and 48.2, respectively.
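To put the gap in percentage points, here is a quick check that just restates the numbers above next to the Table 3 values (a minimal sketch, not output from the evaluator itself):

# Compare my reproduced scores (fractions from the evaluator output above)
# with the percentages reported in Table 3 of the paper.
reproduced = {1: 0.23908368645487507, 3: 0.4145232659361979, 5: 0.4740800451445922}
reported = {1: 26.1, 3: 41.7, 5: 48.2}

for k in (1, 3, 5):
    mine = reproduced[k] * 100        # convert fraction to a percentage
    gap = reported[k] - mine          # percentage points below the paper
    print(f"Max Incorrect - {k}: mine = {mine:.1f}%, reported = {reported[k]}%, gap = {gap:.1f} points")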

I also tried another similarity function, exact_match, and another prediction file, dev.predictions.human.jsonl, and still could not reproduce the corresponding results in the paper (the invocations I used for those runs are listed at the end of this comment). I am not sure whether the discrepancies come from randomness in each run or from my missing some important settings. Could you please clarify how to reproduce the results correctly? Thank you very much.
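For reference, the additional runs used the same command form as above, only swapping the similarity function or the prediction file, e.g.:

protoqa_evaluator evaluate --similarity_function exact_match data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.gpt2finetuned.json

protoqa_evaluator evaluate --similarity_function wordnet data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.human.jsonl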
