Haha, we're working hard to integrate Xet in the HF back-end; it will enable cool use cases :)
Anyway, about IterableDataset.push_to_hub: I'd be happy to provide guidance and answer questions if anyone wants to start a first simple implementation of this.
Feature request
It'd be great to have a lazy push to hub, similar to the lazy loading we have with `IterableDataset`.

Suppose you'd like to filter LAION based on certain conditions, but since LAION doesn't fit on your disk, you'd like to leverage streaming:
Then you could filter the dataset based on certain conditions:
In order to persist this dataset and push it back to the hub, one currently needs to first load the entire filtered dataset onto disk and then push:
It would be great if we could instead lazily push the data to the hub (basically stream the data to the hub), without being limited by our disk size:
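One possible shape of such an implementation (not an existing `datasets` API, just a sketch of the idea): consume the stream in fixed-size shards, write each shard to a temporary file, upload it, and delete it, so disk usage stays bounded by a single shard. The `upload` callback is a placeholder; a real implementation could wrap `huggingface_hub`'s `HfApi.upload_file`.

```python
import itertools
import json
import os
import tempfile

def lazy_push(examples, upload, shard_size=1000):
    """Upload an iterable of dict examples shard by shard.

    `upload(path, shard_index)` is a callback; in a real implementation
    it could wrap huggingface_hub's HfApi.upload_file.
    """
    it = iter(examples)
    for shard_index in itertools.count():
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        fd, path = tempfile.mkstemp(suffix=f"-{shard_index:05d}.jsonl")
        with os.fdopen(fd, "w") as f:
            for example in shard:
                f.write(json.dumps(example) + "\n")
        try:
            upload(path, shard_index)
        finally:
            os.remove(path)  # only one shard ever lives on disk
```

Usage would be something like `lazy_push(filtered, upload=my_upload, shard_size=10_000)`, where `filtered` is the streamed, filtered `IterableDataset`.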
Motivation
This feature would be very useful for people who want to filter huge datasets without having to load the entire dataset, or a filtered version thereof, onto their local disk.
Your contribution
Happy to test out a PR :)