modify data split to use HF api #65

Merged: 2 commits, Feb 21, 2024

@tianyu-l tianyu-l commented Feb 17, 2024

Stack from ghstack (oldest at bottom):

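Just found out that HF datasets has its own [API](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node), `split_dataset_by_node`, to split data across DP ranks. Verified that it gives the expected data behavior: same data on SP ranks, different data on DP ranks.

Note: This is still a map-style dataset, which has to be loaded into memory. Setting `streaming=True` for [load_dataset](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset) returns an IterableDataset whose data doesn't have to fit in memory, but data loading is significantly slower.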
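
For readers following along, here is a minimal sketch of how such a split might look. It is not the code from this PR. The helper names `dp_coordinates` and `load_shard`, the toy in-memory dataset, and the assumption that SP is the innermost dimension of the device mesh are all illustrative. The only library calls used are `datasets.Dataset.from_dict` and `datasets.distributed.split_dataset_by_node`.

```python
from typing import Tuple


def dp_coordinates(global_rank: int, world_size: int, sp_degree: int) -> Tuple[int, int]:
    """Return (dp_rank, dp_degree) for a global rank.

    Assumption (not from the PR): ranks are laid out on a (dp, sp) mesh
    with SP as the innermost dimension, so consecutive ranks share a DP rank.
    """
    if sp_degree <= 0 or world_size % sp_degree != 0:
        raise ValueError("world_size must be a positive multiple of sp_degree")
    return global_rank // sp_degree, world_size // sp_degree


def load_shard(global_rank: int, world_size: int, sp_degree: int):
    """Return this rank's shard of a toy map-style dataset."""
    # Imported lazily so dp_coordinates can be used without `datasets` installed.
    from datasets import Dataset
    from datasets.distributed import split_dataset_by_node

    # Hypothetical stand-in for the real training corpus.
    ds = Dataset.from_dict({"text": [f"sample {i}" for i in range(8)]})

    dp_rank, dp_degree = dp_coordinates(global_rank, world_size, sp_degree)
    # Shard by DP coordinates only: ranks that differ only in their SP
    # position pass the same (rank, world_size) and get the same shard.
    return split_dataset_by_node(ds, rank=dp_rank, world_size=dp_degree)
```

With `world_size=4` and `sp_degree=2`, ranks 0 and 1 map to DP rank 0 and ranks 2 and 3 map to DP rank 1, so each pair of SP ranks receives the same shard while the two DP groups receive different ones. `split_dataset_by_node` also accepts the `IterableDataset` that `load_dataset(..., streaming=True)` returns, so the same call should carry over to the streaming case mentioned in the note.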

tianyu-l added a commit that referenced this pull request Feb 17, 2024
ghstack-source-id: 489d666dd77ddcae80b139147ad82f4b1e6888da
Pull Request resolved: #65
@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 17, 2024

@wanchaol wanchaol left a comment


lgtm

tianyu-l added a commit that referenced this pull request Feb 21, 2024
ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939
Pull Request resolved: #65
@tianyu-l tianyu-l merged commit 5ebe2e7 into gh/tianyu-l/1/base Feb 21, 2024
3 checks passed
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch February 21, 2024 20:08
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939
Pull Request resolved: #65
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939
Pull Request resolved: pytorch#65
3 participants