modify data split to use HF api #65

tianyu-l · 2024-02-17T04:10:39Z

Stack from ghstack (oldest at bottom):

-> modify data split to use HF api #65

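Just found out that HF datasets has its own [API](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node) (`split_dataset_by_node`) for splitting data across DP ranks. Verified that it gives the expected behavior: ranks in the same SP group see the same data, and different DP ranks see different data.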
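For illustration, a minimal sketch of how the split could be wired up. The helper names (`dp_rank_of`, `shard_for_rank`) and the rank layout (SP ranks contiguous within each DP group) are assumptions made for this example, not the code in this PR; only `split_dataset_by_node(dataset, rank, world_size)` comes from the HF library.

```python
def dp_rank_of(global_rank: int, sp_degree: int) -> int:
    """Map a global rank to its DP rank.

    Assumption: SP ranks are laid out contiguously, so global ranks
    [0, sp_degree) form DP group 0, the next sp_degree ranks form
    DP group 1, and so on.
    """
    return global_rank // sp_degree


def shard_for_rank(dataset, global_rank: int, dp_degree: int, sp_degree: int):
    """Return the part of `dataset` assigned to this rank's DP group."""
    # Imported lazily so the rank helper above works without `datasets` installed.
    from datasets.distributed import split_dataset_by_node

    # Split by DP rank only: all SP ranks inside one DP group pass the same
    # (rank, world_size) pair and therefore receive the same shard.
    return split_dataset_by_node(
        dataset,
        rank=dp_rank_of(global_rank, sp_degree),
        world_size=dp_degree,
    )


# With sp_degree=2 on 4 GPUs, global ranks 0-1 share DP rank 0 and
# global ranks 2-3 share DP rank 1.
dp_ranks = [dp_rank_of(r, sp_degree=2) for r in range(4)]
```

Passing the DP rank and DP degree (rather than the global rank and world size) is what yields "same on SP ranks, different on DP ranks".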

Note: This is still a map-style dataset, which has to be loaded into memory. Setting `streaming=True` in [load_dataset](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset) returns an IterableDataset whose data doesn't have to fit in memory, but data loading is significantly slower.
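A rough sketch of the two loading modes, for reference. `load_split` and `take` are hypothetical helpers written for this example, and the plain list and generator below are only stand-ins for a map-style dataset and a streamed one; the `streaming` flag on `load_dataset` is the real switch.

```python
from itertools import islice


def load_split(path: str, split: str = "train", streaming: bool = False):
    """Load one split either map-style (default) or as a stream."""
    # Imported lazily so `take` below works without `datasets` installed.
    from datasets import load_dataset

    # streaming=False returns a map-style Dataset;
    # streaming=True returns an IterableDataset read lazily.
    return load_dataset(path, split=split, streaming=streaming)


def take(dataset, n: int) -> list:
    """Return the first n examples; works for both dataset kinds,
    since both support plain iteration."""
    return list(islice(iter(dataset), n))


# Stand-ins: a list behaves like a map-style dataset (indexable, has a length),
# a generator behaves like a streamed one (sequential access only).
in_memory = [{"text": f"doc {i}"} for i in range(5)]
streamed = ({"text": f"doc {i}"} for i in range(5))

first_two = take(in_memory, 2)
first_two_streamed = take(streamed, 2)
```

Code that only iterates over the dataset works with either mode, so switching to `streaming=True` later should mostly be a loading-side change.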

[ghstack-poisoned]

ghstack-source-id: 489d666dd77ddcae80b139147ad82f4b1e6888da Pull Request resolved: #65

wanchaol

lgtm


ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: #65


modify data split to use HF api

869c684

[ghstack-poisoned]

tianyu-l added a commit that referenced this pull request Feb 17, 2024

modify data split to use HF api

1c2bb3a

ghstack-source-id: 489d666dd77ddcae80b139147ad82f4b1e6888da Pull Request resolved: #65

facebook-github-bot added the CLA Signed label Feb 17, 2024

wanchaol approved these changes Feb 21, 2024


tianyu-l added a commit that referenced this pull request Feb 21, 2024

modify data split to use HF api

a5597ad

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: #65

tianyu-l merged commit 5ebe2e7 into gh/tianyu-l/1/base Feb 21, 2024
3 checks passed

tianyu-l added a commit that referenced this pull request Feb 21, 2024

modify data split to use HF api

55a6b0b

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: #65

tianyu-l deleted the gh/tianyu-l/1/head branch February 21, 2024 20:08

lessw2020 pushed a commit that referenced this pull request Apr 18, 2024

modify data split to use HF api

2daf53f

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: #65

philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024

modify data split to use HF api

8a74077

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: pytorch#65
