Dataset Checksums #708
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Hello! Thanks so much for publicly sharing the checksums. It's incredibly helpful to be able to check that datasets were correctly downloaded. I've run the checksum commands on my local copies, but some of my values don't match the ones in the README.
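For context, a directory-level checksum like the ones discussed in this thread can be generated with the checksumdir utility. A minimal sketch, assuming the PyPI checksumdir package and a placeholder dataset path:

```bash
# Sketch only: produce a single checksum for one dataset directory
# with the checksumdir CLI. /data/ogbg is a placeholder path.
pip install checksumdir
checksumdir /data/ogbg
```

checksumdir derives one hash from the files under the directory, so any extra, missing, or modified file changes the result.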
Hi! Thanks for giving this a try! Yes, there is certainly something else going on -- but I don't know what could be causing this! Did you check the output of the tree commands? Could you share your checksums of a couple of simple ones like ogbg or criteo1tb? I don't know how to debug this, but perhaps having your checksum values might help in some way :)
Hi @chandramouli-sastry,
@tfaod did you check the number of files (e.g. via find . -type f | wc -l) and the directory sizes against the values reported in the README?
Thanks for the quick reply! EDIT re: @fsschneider -- I have the same checksum for criteo1tb, but not for ogbg or wmt (with the generated model file). I'll generate the rest of the checksums to compare. I've checked all of our 1) file counts (find -type f | wc -l) and 2) directory sizes (du -sch --apparent-size wmt/), and they are all consistent with the MLCommons-provided values. I've included my outputs for the tree and checksumdir commands for ogbg, wmt, and criteo1tb below; the commands themselves are collected in the sketch after the listings.
ogbg:
criteo1tb:
wmt:
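Collecting the commands referenced above in one place, the per-dataset check looks roughly like this (a sketch; /data/wmt is a placeholder path):

```bash
# Sketch of the checks discussed in this thread, run against one
# dataset directory (placeholder path /data/wmt).
find /data/wmt -type f | wc -l      # file count, compared against the README
du -sch --apparent-size /data/wmt   # apparent directory size
tree /data/wmt                      # directory structure
checksumdir /data/wmt               # directory-level checksum
```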
Thanks @fsschneider for generating the checksums! I think all of this suggests that the data downloaded on kasimbeg-8 in /home/kasimbeg/data is incomplete or corrupted -- I had to re-download and extract fastmri, and that one now seems to match the checksum generated by Frank, so that's good! It's also good that the checksum obtained by @tfaod on criteo matches Frank's! I wanted to avoid including the vocab files in the checksum because the serialized data might not be consistent across runs -- but I'm not sure! I mainly just wrote a short script and copy-pasted its outputs into the README:
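The script itself isn't reproduced in the thread; the following is only a sketch of what such a checksum-generation script might look like, assuming the checksumdir CLI, a placeholder data root, and a placeholder dataset list (the actual script also handled excluding the wmt vocab files, which this sketch does not):

```bash
#!/usr/bin/env bash
# Sketch only: iterate over dataset directories under a placeholder
# data root and print README-style "name: checksum" lines.
DATA_DIR=/data                               # placeholder data root
for name in ogbg criteo1tb wmt fastmri; do   # placeholder dataset list
  echo "${name}: $(checksumdir "${DATA_DIR}/${name}")"
done
```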
I think we could then append the output of this code, run on the data directory we believe is correctly downloaded, to the end of the README?
Stale; differences left to be resolved.
@chandramouli-sastry did you detect any differences on kasimbeg-8 between the directory structure and file sizes that @fsschneider reported in the README before running your script? If so, can you please document them in this thread? To resolve this, I think we should use @fsschneider's data setup as the source of truth. @fsschneider could you start a new PR that contains just the hash commands and results?
I just downloaded As a result, I don't think that the checksums by
@priyakasimbeg I closed this PR. Feel free to reopen if you disagree.
Addresses #647