
Fix data file module inference #7132

Open
wants to merge 1 commit into main
Conversation

@HennerM commented Aug 29, 2024

I saved a dataset with two splits to disk with DatasetDict.save_to_disk. The train split is bigger and ended up in 10 shards, whereas the test split resulted in only 1 shard.

Now when trying to load the dataset, an error is raised saying that not all splits have the same data format:

ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}

This is not expected because both splits are saved as arrow files.

I did some debugging and found that this is the case because the list of data_files includes a state.json file.

This means that for the train split there are 10 ".arrow" files and 1 ".json" file. Since datasets picks the module based on the most common extension, this is correctly inferred as "arrow". In the test split, however, there is 1 ".arrow" and 1 ".json" file. The function description says:

It picks the module based on the most common file extension.
In case of a draw ".parquet" is the favorite, and then alphabetical order.

This is not quite true though, because in a tie the extensions are actually sorted in reverse-alphabetical order:

for (ext, _), _ in sorted(extensions_counter.items(), key=sort_key, reverse=True):

This leads to the module being wrongly inferred as "json", whereas it should be "arrow", matching the train split.
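The tie-break behavior can be reproduced with a small simulation (a simplified sketch of the inference, not the actual datasets implementation; `infer_module` and `sort_key` are hypothetical names):

```python
from collections import Counter

def infer_module(extensions):
    """Simplified sketch: pick the most common extension, favor ".parquet"
    in a tie, and otherwise fall back to the extension string itself."""
    def sort_key(item):
        ext, count = item
        # count first, then the ".parquet" preference, then the extension
        return (count, ext == ".parquet", ext)
    # reverse=True flips the whole key, so equal counts fall back to
    # *reverse*-alphabetical order of the extension
    for ext, _ in sorted(Counter(extensions).items(), key=sort_key, reverse=True):
        return ext

# Train split: 10 arrow shards + state.json -> ".arrow" wins by count
print(infer_module([".arrow"] * 10 + [".json"]))  # .arrow
# Test split: 1 arrow shard + state.json -> tie, ".json" beats ".arrow"
print(infer_module([".arrow", ".json"]))  # .json
```

With a single shard per split, the metadata file ties with the data file and reverse-alphabetical order picks ".json".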

I first thought about adding "state.json" to the list of excluded files for the inference: https://github.com/huggingface/datasets/blob/main/src/datasets/load.py#L513. However, after digging into the code, I think the right thing to do is to exclude it from the list of data_files in the first place, because it is a metadata file rather than a data file.

state.json is not a data_file. Not excluding it can lead to an error when trying to load a dataset with a split that has only one shard.

Excluding state.json should be fine because it is excluded later on anyway once the format is inferred as e.g. "arrow".
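The proposed exclusion can be sketched as a simple filter over the data files (a hypothetical helper illustrating the idea, not the actual patch; `filter_data_files` is an assumed name):

```python
import os

def filter_data_files(data_files):
    """Drop the save_to_disk metadata file "state.json" from the list of
    candidate data files before the format-inference step runs."""
    return [f for f in data_files if os.path.basename(f) != "state.json"]

files = ["test/data-00000-of-00001.arrow", "test/state.json"]
print(filter_data_files(files))  # ['test/data-00000-of-00001.arrow']
```

After filtering, the test split contains only ".arrow" files, so the tie with ".json" never occurs and both splits infer the same format.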
@lhoestq (Member) commented Sep 2, 2024

Hi ! datasets saved using save_to_disk should be loaded with load_from_disk ;)

@HennerM (Author) commented Sep 2, 2024

It is convenient to just pass in a path, whether to a local dataset or to one from the Hub, and use the same function to load it. Would it be possible to get this fix merged to allow that?

@lhoestq (Member) commented Sep 2, 2024

We can modify save_to_disk to write the dataset in a structure supported by the Hub in this case; it's kind of a legacy function anyway.
