Inaccuracies in reproducing the reported results in Table 3 #5

Open
araloak opened this issue Jul 1, 2024 · 0 comments
araloak commented Jul 1, 2024

Hi, thanks for your great work and neat code! I tried to reproduce the results in Table 3 of the ProtoQA paper with the following command:

protoqa_evaluator evaluate --similarity_function wordnet data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.gpt2finetuned.json

However, I obtained slightly different results. For example, I got:

Evaluating Max Incorrect - 1...
Max Incorrect - 1: 0.23908368645487507
Evaluating Max Incorrect - 3...
Max Incorrect - 3: 0.4145232659361979
Evaluating Max Incorrect - 5...
Max Incorrect - 5: 0.4740800451445922

In contrast, the results reported in Table 3 for Max Incorrect values 1, 3, and 5 are 26.1, 41.7, and 48.2, respectively.
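To put the gap in percentage points, here is a quick check that just restates the numbers above next to the Table 3 values (a minimal sketch, not output from the evaluator itself):

# Compare my reproduced scores (fractions from the evaluator output above)
# with the percentages reported in Table 3 of the paper.
reproduced = {1: 0.23908368645487507, 3: 0.4145232659361979, 5: 0.4740800451445922}
reported = {1: 26.1, 3: 41.7, 5: 48.2}

for k in (1, 3, 5):
    mine = reproduced[k] * 100        # convert fraction to a percentage
    gap = reported[k] - mine          # percentage points below the paper
    print(f"Max Incorrect - {k}: mine = {mine:.1f}%, reported = {reported[k]}%, gap = {gap:.1f} points")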

I also tried another similarity function, exact_match, and another prediction file, dev.predictions.human.jsonl, and still could not reproduce the corresponding results in the paper (the invocations I used for those runs are listed at the end of this comment). I am not sure whether the discrepancies come from randomness in each run or from my missing some important settings. Could you please clarify how to reproduce the results correctly? Thank you very much.
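For reference, the additional runs used the same command form as above, only swapping the similarity function or the prediction file, e.g.:

protoqa_evaluator evaluate --similarity_function exact_match data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.gpt2finetuned.json

protoqa_evaluator evaluate --similarity_function wordnet data/dev/dev.crowdsourced.jsonl data/dev/dev.predictions.human.jsonl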
