Errors in huggingface dataset #1

Open
SantiDianaClibrain opened this issue Mar 8, 2024 · 10 comments

@SantiDianaClibrain

Hi! I would like to point out that I believe I found at least one error in the dataset. Is that possible? For the following question, the ground truth answer is "None", but I would say the question can be answered.

Britany records 18 4-minute TikTok videos each week. She spends 2 hours a week writing amateur songs to sing on TikTok, and 15 minutes six days a week doing her makeup before filming herself for TikTok. How much time does Britany spend on TikTok in a month?

@qtli
Owner

qtli commented Mar 8, 2024

Thank you for your feedback!

In this case, "a month with four weeks" has been removed from the original GSM8K question. We are aware of potential ambiguities or errors in certain questions and are actively working on a new version of GSM8K-Plus, which will be released soon.

We greatly appreciate your assistance in identifying any issues you come across.

Additionally, we will release a concise test set for efficient evaluation.
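
To make the ambiguity concrete, here is a minimal sketch of the arithmetic for the Britany question above. The weekly total follows directly from the question text; the weeks-per-month value is an assumption (the four-week figure is the phrase that was removed), and the variable names are only illustrative.

```python
# Weekly time in minutes, taken from the question text.
videos_min = 18 * 4        # 18 videos of 4 minutes each
songwriting_min = 2 * 60   # 2 hours of songwriting
makeup_min = 15 * 6        # 15 minutes on six days
weekly_min = videos_min + songwriting_min + makeup_min

# ASSUMPTION: a month of exactly four weeks, as in the removed phrase.
weeks_per_month = 4
monthly_min = weekly_min * weeks_per_month

print(weekly_min, monthly_min)
```

Without a stated number of weeks, only the weekly figure is fixed by the question, which is consistent with a "None" label for the perturbed version.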

@SantiDianaClibrain
Author

Okay. Do you have a date in mind? I have a model that may achieve strong results on your benchmark, and I would like to get results ASAP. Thanks!

@qtli
Owner

qtli commented Mar 12, 2024

We're delighted to learn that you've built a robust model. Feel free to share the developed model with us, and we can conduct a prompt evaluation using our dataset. The release of the finalized dataset may take some time, with our current aim being late March.

@SantiDianaClibrain
Author

Nice! I strongly encourage you to review the whole dataset. I found a lot of calculation errors. Looking forward to seeing the final version of the benchmark.

@SantiDianaClibrain
Author

Interesting question: Raymond and Samantha are cousins. Raymond was born 60 years before Samantha. Raymond had a son at the age of 230. If Samantha is now 310, how many years ago was Raymond's son born?
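
A quick check of the arithmetic, as a sketch with illustrative variable names: the question is solvable as written, but the ages it implies are not realistic.

```python
# Values taken from the question text.
years_raymond_older = 60      # Raymond was born 60 years before Samantha
raymond_age_at_son_birth = 230
samantha_age_now = 310

raymond_age_now = samantha_age_now + years_raymond_older
years_since_son_born = raymond_age_now - raymond_age_at_son_birth

print(raymond_age_now, years_since_son_born)
```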

@ekmb

ekmb commented Jul 3, 2024

@qtli any updates on the new dataset version?

@qtli
Owner

qtli commented Jul 8, 2024

Hey SantiDianaClibrain, thanks a lot for your feedback! We've been working tirelessly to fix the unrealistic numbers you brought up and some perturbation failures. We've just released an updated version of GSM-Plus and a smaller subset called testmini. We invite you to try out the latest version of GSM-Plus at Huggingface Datasets and we welcome any feedback you may have.

@qtli
Owner

qtli commented Jul 8, 2024

Hi ekmb, thanks for reaching out! We just released an updated version of GSM-Plus (including a smaller subset, called testmini), available on Huggingface Datasets. It resolves issues with unrealistic numbers and non-compliant question contexts. Check it out and share your thoughts!

@dgtm777

dgtm777 commented Jul 9, 2024

Hi @qtli! I am really interested in this dataset, but I still find some errors, even in the gsm-plus mini subset. Here are two examples of the "adding operation" category with incorrect answer fields.

  1. index in the dataset - 1220
    Question: "A trader buys some bags of wheat from a farmer at a rate of $20 per bag. If it costs $2 to transport each bag from the farm to the warehouse, and after selling all the bags at a rate of $30 each and paid $50 for the storage fee, the trader made a total profit of $400 how many bags did he sell?"
    Answer: 55
    The solution states that 450/8 = 55, which is wrong (450/8 = 56.25)

  2. index in the dataset - 2380
    Question: "Ruby is 6 times older than Sam. In 21 years, Ruby will be 3 times as old as Sam. How old is Sam now?"
    Answer: 30
    I believe the correct answer here should be 14

Do you happen to have an estimate of the number of errors in each category, so I know how much of an accuracy gap to ignore? Or do you plan to release a new version? It would be great to have a clean one.
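
Both examples can be checked with a few lines of exact arithmetic. This is only a sketch: it assumes "6 times older" means Ruby's age is 6 times Sam's, and the variable names are illustrative.

```python
from fractions import Fraction

# Index 1220: profit = 30n - 20n - 2n - 50 = 400, so 8n = 450.
bags = Fraction(400 + 50, 30 - 20 - 2)

# Index 2380: 6s + 21 = 3 * (s + 21), so 3s = 42.
# ASSUMPTION: "6 times older" is read as "6 times as old".
sam_age = Fraction(3 * 21 - 21, 6 - 3)

print(float(bags), sam_age)
```

Under this reading the first question has no whole-number answer, so the issue there is with the perturbed question itself and not only with the rounding in the solution.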

@qtli
Owner

qtli commented Jul 10, 2024

Hi @dgtm777, thank you very much for your valuable feedback! Please allow us a few days to thoroughly review the question-solution pairs in the testmini set. Once the check is complete, I will notify you of the results.
