
Dataset Checksums #708

Closed
Conversation

chandramouli-sastry
Contributor

Addresses #647

  • For librispeech and wmt, the generated hashes do not contain the tokenizer vocabulary.
  • For Imagenet-pytorch, I have generated the checksum by keeping the imagenet-v2 together with the train and val files similar to imagenet-jax.
  • I have included outputs of the tree command to show exactly what folders were considered for the checksum.
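For anyone wanting to reproduce these values: checksumdir hashes every file under a directory and combines the per-file digests into a single directory-level digest. A rough pure-Python sketch of that idea follows; the real tool's traversal and combination rules may differ, so this digest is not guaranteed to match checksumdir's actual output.

```python
# Illustrative sketch only -- not checksumdir itself. Hashes every file
# under `root` with MD5, then hashes the sorted list of per-file digests
# to get one order-independent directory-level digest.
import hashlib
import os

def dir_checksum(root: str) -> str:
    file_hashes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large .tfrecord files fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            file_hashes.append(md5.hexdigest())
    combined = hashlib.md5("".join(sorted(file_hashes)).encode())
    return combined.hexdigest()
```

Because the per-file digests are sorted before combining, the result is independent of filesystem traversal order, but it still changes if any file's bytes change.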

@chandramouli-sastry chandramouli-sastry requested a review from a team as a code owner March 18, 2024 00:58

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@tfaod
Contributor

tfaod commented Mar 19, 2024

Hello! Thanks so much for publicly sharing the checksums. It's incredibly helpful to be able to check that datasets were correctly downloaded.

I've run the checksumdir command on my directories and get completely different values (including for ogbg, etc.).
Assuming not all of my datasets were downloaded incorrectly, I'm curious whether there could be another reason for the difference in checksum values. Do you have any idea what could be causing this? Thanks!!

@chandramouli-sastry
Contributor Author

Hi! Thanks for giving this a try! Yes, there is certainly something else going on, but I don't know what could be causing it! Did you check the output of the tree commands? Could you share your checksums for a couple of simple ones like ogbg or criteo1tb? I'm not sure how to debug this, but perhaps having your checksum values will help in some way :)

@fsschneider
Contributor

Hi @chandramouli-sastry,
thanks a lot for adding this. I think this could be very beneficial.
However, I am also getting different checksums, even though the tree output (and file sizes and counts) are identical. Below, I am posting my checksums; perhaps they are identical to @tfaod's?

  • OGBG: 4808a6652bc4d129c1638ac55b219bfe
  • WMT: 3764a73cdc19d7572c042ce19e59c74b (but this is after generating the wmt_sentencepiece_model)
  • FastMRI: cd8c6452d9fa5fe89d050df969e98f70
  • ImageNet: Both the JAX and the PyTorch versions look slightly different in our cluster setup (e.g. additional files that other groups need), so I didn't compare the checksums here. I only provide the checksum for the separate ImageNet v2 directory, i.e. checksumdir imagenet_v2: a7f24a2250469706827eb2dff360590d.
  • Criteo1TB: aeb5217d11610ab6c679df572faadc7e. But here I also get a different output when running tree --filelimit 30: it reports that 885 (not 347) entries exceed the filelimit, which matches the total of 885 files I find via find -type f | wc -l.
  • LibriSpeech: 071e7582d63c92e51797f3f11967fb74 (but this is after generating spm_model.vocab). The tree output is identical to what you reported, apart from the additional spm_model.vocab file.

@tfaod did you check the number of files (e.g. via find -type f | wc -l) and the total file size (e.g. via du -sch --apparent-size librispeech/) vs. what we report? This was intended as a first check of whether the data download worked.
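The two sanity checks above can be wrapped in a small shell helper, sketched below. $DATA_DIR is a placeholder for wherever the datasets live, and du --apparent-size assumes GNU coreutils, as used elsewhere in this thread.

```shell
# Sketch of a combined sanity check: total file count and total apparent
# size of a dataset directory, to compare against the values in the README.
check_dataset() {
  dir="$1"
  printf 'files: '
  find "$dir" -type f | wc -l
  printf 'size:\n'
  du -sch --apparent-size "$dir"
}

# Example (hypothetical path): check_dataset "$DATA_DIR/librispeech"
```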

@tfaod
Contributor

tfaod commented Mar 19, 2024

Thanks for the quick reply!

EDIT: re @fsschneider - I get the same checksum on criteo1tb, but not on ogbg or wmt (with the generated model file). I'll generate the rest of the checksums to compare.

I've checked all our 1/ file counts (find -type f | wc -l) and 2/ directory sizes (du -sch --apparent-size wmt/), and they are all consistent with the MLCommons-provided values.

I've included my outputs for the tree and checksumdir commands for ogbg, wmt, and criteo1tb.
There are 2 differences in the tree results:

  1. criteo1tb displays "885 entries exceeds filelimit" rather than the 347 from the README.
  2. wmt has an additional file, wmt_sentencepiece_model, which the README shows in the final directory listing but not in its tree output.

ogbg

  • tree $DATA_DIR/ogbg --filelimit 30
└── ogbg_molpcba
    └── 0.1.3
        ├── dataset_info.json
        ├── features.json
        ├── metadata.json
        ├── ogbg_molpcba-test.tfrecord-00000-of-00001
        ├── ogbg_molpcba-train.tfrecord-00000-of-00008
        ├── ogbg_molpcba-train.tfrecord-00001-of-00008
        ├── ogbg_molpcba-train.tfrecord-00002-of-00008
        ├── ogbg_molpcba-train.tfrecord-00003-of-00008
        ├── ogbg_molpcba-train.tfrecord-00004-of-00008
        ├── ogbg_molpcba-train.tfrecord-00005-of-00008
        ├── ogbg_molpcba-train.tfrecord-00006-of-00008
        ├── ogbg_molpcba-train.tfrecord-00007-of-00008
        └── ogbg_molpcba-validation.tfrecord-00000-of-00001

2 directories, 13 files
  • checksumdir $DATA_DIR/ogbg: 88420b94329a574d9308360dacf0778f

criteo1tb

  • tree $DATA_DIR/criteo1tb --filelimit 30
criteo1tb  [885 entries exceeds filelimit, not opening dir]


0 directories, 0 files
  • checksumdir $DATA_DIR/criteo1tb: aeb5217d11610ab6c679df572faadc7e

wmt

  • tree wmt --filelimit 30
├── wmt14_translate
│   └── de-en
│       └── 1.0.0
│           ├── dataset_info.json
│           ├── features.json
│           ├── wmt14_translate-test.tfrecord-00000-of-00001
│           ├── wmt14_translate-train.tfrecord-00000-of-00016
│           ├── wmt14_translate-train.tfrecord-00001-of-00016
│           ├── wmt14_translate-train.tfrecord-00002-of-00016
│           ├── wmt14_translate-train.tfrecord-00003-of-00016
│           ├── wmt14_translate-train.tfrecord-00004-of-00016
│           ├── wmt14_translate-train.tfrecord-00005-of-00016
│           ├── wmt14_translate-train.tfrecord-00006-of-00016
│           ├── wmt14_translate-train.tfrecord-00007-of-00016
│           ├── wmt14_translate-train.tfrecord-00008-of-00016
│           ├── wmt14_translate-train.tfrecord-00009-of-00016
│           ├── wmt14_translate-train.tfrecord-00010-of-00016
│           ├── wmt14_translate-train.tfrecord-00011-of-00016
│           ├── wmt14_translate-train.tfrecord-00012-of-00016
│           ├── wmt14_translate-train.tfrecord-00013-of-00016
│           ├── wmt14_translate-train.tfrecord-00014-of-00016
│           ├── wmt14_translate-train.tfrecord-00015-of-00016
│           └── wmt14_translate-validation.tfrecord-00000-of-00001
├── wmt17_translate
│   └── de-en
│       └── 1.0.0
│           ├── dataset_info.json
│           ├── features.json
│           ├── wmt17_translate-test.tfrecord-00000-of-00001
│           ├── wmt17_translate-train.tfrecord-00000-of-00016
│           ├── wmt17_translate-train.tfrecord-00001-of-00016
│           ├── wmt17_translate-train.tfrecord-00002-of-00016
│           ├── wmt17_translate-train.tfrecord-00003-of-00016
│           ├── wmt17_translate-train.tfrecord-00004-of-00016
│           ├── wmt17_translate-train.tfrecord-00005-of-00016
│           ├── wmt17_translate-train.tfrecord-00006-of-00016
│           ├── wmt17_translate-train.tfrecord-00007-of-00016
│           ├── wmt17_translate-train.tfrecord-00008-of-00016
│           ├── wmt17_translate-train.tfrecord-00009-of-00016
│           ├── wmt17_translate-train.tfrecord-00010-of-00016
│           ├── wmt17_translate-train.tfrecord-00011-of-00016
│           ├── wmt17_translate-train.tfrecord-00012-of-00016
│           ├── wmt17_translate-train.tfrecord-00013-of-00016
│           ├── wmt17_translate-train.tfrecord-00014-of-00016
│           ├── wmt17_translate-train.tfrecord-00015-of-00016
│           └── wmt17_translate-validation.tfrecord-00000-of-00001
└── wmt_sentencepiece_model

6 directories, 41 files
  • checksumdir $DATA_DIR/wmt: 5921e54f13a9968d31dc2e3eec4f9f34

@chandramouli-sastry
Contributor Author

chandramouli-sastry commented Mar 19, 2024

Thanks @fsschneider for generating the checksums! I think all of this suggests that the data downloaded on kasimbeg-8 in /home/kasimbeg/data is incomplete/corrupted -- I had to re-download and extract fastmri, and that one now matches the checksum generated by Frank, so that's good! It's also good that the checksum obtained by @tfaod on criteo matches Frank's! I wanted to avoid including the vocab files in the checksum because the serialized data might not be consistent across runs -- but I'm not sure!

I basically just wrote this script and copy-pasted the outputs into the README:

import glob
import os

for dirname in glob.glob("data/*"):
    os.system(f"tree {dirname} --filelimit 30")
    os.system(f"checksumdir {dirname}")

I think we could then run this on the data directory we believe is correctly downloaded and append its output to the end of the README?
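If the output is meant to end up in the README, a variant of the script above that captures the command output instead of only printing it might be more convenient. This is a sketch; `collect_report` and its default glob pattern are made-up names, and it assumes the tree and checksumdir executables are on PATH when actually run against data.

```python
# Sketch: capture tree + checksumdir output per dataset directory so it
# can be written to a file or pasted into the README in one go.
import glob
import subprocess

def collect_report(data_glob="data/*"):
    lines = []
    for dirname in sorted(glob.glob(data_glob)):
        tree = subprocess.run(
            ["tree", dirname, "--filelimit", "30"],
            capture_output=True, text=True)
        checksum = subprocess.run(
            ["checksumdir", dirname],
            capture_output=True, text=True)
        lines.append(tree.stdout)
        lines.append(checksum.stdout)
    return "\n".join(lines)
```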

priyakasimbeg
priyakasimbeg previously approved these changes Mar 20, 2024
@priyakasimbeg priyakasimbeg self-requested a review March 20, 2024 19:24
@priyakasimbeg priyakasimbeg dismissed their stale review March 20, 2024 19:25

stale, differences left to be resolved.

@priyakasimbeg
Contributor

@chandramouli-sastry did you detect any differences on kasimbeg-8 between the directory structure and file sizes and what @fsschneider reported in the README before running your script? If so, can you please document them in this thread?

To resolve this I think we should use @fsschneider's data setup as the source of truth. @fsschneider could you start a new PR that contains just the hash commands and results?

@fsschneider
Contributor

I just downloaded ogbg twice on the same computer (within seconds of each other). Even without any apparent differences, the checksums provided by checksumdir don't match. So I am assuming (judging by the fact that we mainly see differences for TFDS datasets) that TFDS embeds a timestamp (from downloading) or something similarly non-deterministic in the .tfrecord files.

As a result, I don't think that the checksums by checksumdir provide meaningful information. I would suggest closing this PR and not providing checksums.
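For completeness, if one still wanted an integrity check that survives this kind of non-determinism, one option is to hash only the file manifest (relative paths and sizes) rather than the file contents. This is a sketch of the idea, not something the repo provides, and it deliberately trades away the ability to detect bit-level corruption inside files of unchanged size.

```python
# Sketch: checksum over the (relative path, size) manifest of a directory.
# Stable across re-downloads whose .tfrecord bytes differ (e.g. embedded
# timestamps), but blind to corruption that preserves file sizes.
import hashlib
import os

def manifest_checksum(root: str) -> str:
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            entries.append(f"{rel}\t{os.path.getsize(path)}")
    # Sort so the digest is independent of traversal order.
    digest = hashlib.md5("\n".join(sorted(entries)).encode())
    return digest.hexdigest()
```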

@fsschneider
Contributor

@priyakasimbeg I closed this PR. Feel free to reopen if you disagree.

@fsschneider fsschneider closed this Apr 2, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 2, 2024