Feature request: IterableDataset.push_to_hub #5665

Open
NielsRogge opened this issue Mar 23, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@NielsRogge
Contributor

NielsRogge commented Mar 23, 2023

Feature request

It'd be great to have a lazy push to hub, similar to the lazy loading we have with IterableDataset.

Suppose you'd like to filter LAION based on certain conditions, but as LAION doesn't fit into your disk, you'd like to leverage streaming:

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", streaming=True, split="train")

Then you could filter the dataset based on certain conditions:

filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)

To persist this dataset and push it back to the Hub, one currently has to write the entire filtered dataset to local disk first and then push it:

from datasets import Dataset

Dataset.from_generator(filtered_dataset.__iter__).push_to_hub(...)

It would be great if we could instead lazily push the data to the Hub (basically stream the data to the Hub), without being limited by our disk size:

filtered_dataset.push_to_hub("my-filtered-dataset")
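Until something like this exists, the idea can be approximated by hand: cut the stream into fixed-size shards, write one shard to a temporary file, upload it, delete it, and move on, so that only one shard is on disk at a time. The following is only a rough sketch of that idea, not how `datasets` would implement it. The names `iter_shards`, `push_in_shards` and `make_hub_uploader`, the JSON Lines shard format and the `data/train-XXXXX.jsonl` layout are all assumptions made for illustration; it also assumes every example is JSON-serializable.

```python
import itertools
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, Iterator, List


def iter_shards(examples: Iterable[Dict], shard_size: int) -> Iterator[List[Dict]]:
    """Group a stream of examples into lists of at most `shard_size` items."""
    iterator = iter(examples)
    while True:
        shard = list(itertools.islice(iterator, shard_size))
        if not shard:
            return
        yield shard


def push_in_shards(
    examples: Iterable[Dict],
    upload: Callable[[str, str], None],
    shard_size: int = 10_000,
) -> int:
    """Write each shard to a temporary JSON Lines file, upload it, then delete it.

    `upload(local_path, path_in_repo)` is a caller-supplied callback (hypothetical
    interface). Returns the number of shards that were uploaded.
    """
    num_shards = 0
    for index, shard in enumerate(iter_shards(examples, shard_size)):
        fd, local_path = tempfile.mkstemp(suffix=".jsonl")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                for example in shard:
                    f.write(json.dumps(example) + "\n")
            # Hypothetical file layout inside the dataset repo.
            upload(local_path, f"data/train-{index:05d}.jsonl")
        finally:
            # Free the disk space before the next shard is written.
            os.remove(local_path)
        num_shards += 1
    return num_shards


def make_hub_uploader(repo_id: str) -> Callable[[str, str], None]:
    """Assumed wiring to the Hub; needs `huggingface_hub` installed and a login."""
    from huggingface_hub import HfApi  # imported lazily so the rest stays stdlib-only

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    def upload(local_path: str, path_in_repo: str) -> None:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )

    return upload
```

With the streaming dataset from above, usage would look like `push_in_shards(filtered_dataset, make_hub_uploader("username/my-filtered-dataset"))`, where the repo name is a placeholder. Each shard becomes its own upload, which keeps the sketch simple but is a lot of commits for a dataset the size of LAION; a real implementation would likely write Parquet and batch the uploads.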

Motivation

This feature would be very useful for people who want to filter huge datasets without having to store the entire dataset, or a filtered version of it, on their local disk.

Your contribution

Happy to test out a PR :)

@ducha-aiki

+1

@phineas-pta

+1

@Jourdelune

+1, should be possible now? :) https://huggingface.co/blog/xethub-joins-hf

@lhoestq
Member

lhoestq commented Aug 21, 2024

Haha, we're working hard to integrate Xet in the HF back-end, it will enable cool use cases :)

Anyway, about IterableDataset.push_to_hub, I'd be happy to provide guidance and answer questions if anyone wants to start a first simple implementation of this.

@meg-huggingface
Contributor

+1
