Errors in huggingface dataset #1
Thank you for your feedback! In this case, the phrase "a month with four weeks" has been removed from the original GSM8K question. We are aware of potential ambiguities or errors in certain questions and are actively working on a new version of GSM8K-Plus, which will be released soon. We greatly appreciate your help in identifying any issues you come across. Additionally, we will release a concise test set for efficient evaluation.
Okay. Do you have a release date in mind? I have a model that may achieve strong results on your benchmark, and I would like to get results as soon as possible. Thanks!
We're delighted to learn that you've built a robust model. Feel free to share the model with us, and we can promptly evaluate it on our dataset. The release of the finalized dataset may take some time; our current target is late March.
Nice! I strongly encourage you to review the entire dataset. I found a lot of calculation errors. Looking forward to seeing the final version of the benchmark.
Interesting question:
@qtli any updates on the new dataset version?
Hey SantiDianaClibrain, thanks a lot for your feedback! We've been working tirelessly to fix the unrealistic numbers you brought up and some perturbation failures. We've just released an updated version of GSM-Plus and a smaller subset called testmini. We invite you to try out the latest version of GSM-Plus at Huggingface Datasets and we welcome any feedback you may have.
Hi ekmb, thanks for reaching out! We just released an updated version of GSM-Plus (including a smaller subset, called testmini), available on Huggingface Datasets. It resolves issues with unrealistic numbers and non-compliant question contexts. Check it out and share your thoughts!
Hi @qtli! I am really interested in this dataset, but I still find some errors, even in the gsm-plus mini subset. Here are two examples of the "adding operation" category with incorrect answer fields.
Do you happen to have an estimate of the number of errors in each category, so I know how large an accuracy gap to ignore? Or do you plan to release a new version? It would be great to have a clean version.
Hi @dgtm777l, thank you very much for your valuable feedback! Please allow us a few days to thoroughly review the question-solution pairs in the minitest set. Once the checking process is complete, I will promptly notify you of the results.
Hi! I would like to point out that I believe I found at least one error in the dataset. Is that possible? For the following question, the ground truth answer is "None", while I would say that the question can be answered.
Britany records 18 4-minute TikTok videos each week. She spends 2 hours a week writing amateur songs to sing on TikTok, and 15 minutes six days a week doing her makeup before filming herself for TikTok. How much time does Britany spend on TikTok in a month?
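As a quick check of the arithmetic, here is a minimal sketch. It assumes a month of four weeks, the phrase mentioned earlier in this thread as having been removed from the original GSM8K question; without that assumption the length of "a month" is unspecified, which may be why the answer field is "None". The function names and the `weeks_per_month` parameter are illustrative and not part of the dataset.

```python
# Quantities taken directly from the question text (all times in minutes).
VIDEOS_PER_WEEK = 18
MINUTES_PER_VIDEO = 4
SONGWRITING_MINUTES_PER_WEEK = 2 * 60   # 2 hours a week
MAKEUP_MINUTES_PER_DAY = 15
MAKEUP_DAYS_PER_WEEK = 6


def weekly_minutes() -> int:
    """Total minutes spent on TikTok-related activities in one week."""
    recording = VIDEOS_PER_WEEK * MINUTES_PER_VIDEO
    makeup = MAKEUP_MINUTES_PER_DAY * MAKEUP_DAYS_PER_WEEK
    return recording + SONGWRITING_MINUTES_PER_WEEK + makeup


def monthly_minutes(weeks_per_month: int = 4) -> int:
    """Monthly total; weeks_per_month=4 is an assumption, not stated in the question."""
    return weekly_minutes() * weeks_per_month


if __name__ == "__main__":
    print(weekly_minutes())    # 72 + 120 + 90 = 282
    print(monthly_minutes())   # 282 * 4 = 1128
```

Under the four-week assumption the total is 1128 minutes (18.8 hours); a different month length would change the result, so the answer depends on how "a month" is read.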