
streaming datasets doesn't work properly with multi-node #6623

Open
rohitgr7 opened this issue Jan 27, 2024 · 23 comments
Labels
enhancement New feature or request

Comments

@rohitgr7

rohitgr7 commented Jan 27, 2024

Feature request

Let’s say I have a dataset with 5 samples with values [1, 2, 3, 4, 5], with 2 GPUs (for DDP) and a batch size of 2. This dataset is an IterableDataset since I am streaming it.

Now I split the dataset using split_dataset_by_node to ensure it doesn’t get repeated. And since it’s already split, I don’t have to use a DistributedSampler (they don’t work with iterable datasets anyway), right?
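
For context, my setup looks roughly like this (a minimal sketch; the dataset name is just a placeholder):

import torch.distributed as dist
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

dist.init_process_group()  # DDP is already set up in the real training script
rank, world_size = dist.get_rank(), dist.get_world_size()

ds = load_dataset("my_dataset", split="train", streaming=True)  # placeholder dataset name
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
dl = DataLoader(ds, batch_size=2)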

But in this case I noticed the following:

First iteration:
first GPU will get → [1, 2]
second GPU will get → [3, 4]

Second iteration:
first GPU will get → [5]
second GPU will get → nothing

This actually creates an issue, since with a DistributedSampler the samples are repeated internally to ensure that none of the GPUs is missing data at any iteration, so gradient sync never hangs.

So my questions are:

  1. Since the splitting happens beforehand here, how can I make sure each GPU gets a batch at every iteration to avoid gradient sync issues?
  2. Do we need to use a DistributedSampler? If yes, how?
  3. in the docstrings of split_dataset_by_node, this is mentioned: "If the dataset has a number of shards that is a factor of world_size (i.e. if dataset.n_shards % world_size == 0), then the shards are evenly assigned across the nodes, which is the most optimized. Otherwise, each node keeps 1 example out of world_size, skipping the other examples." Can you explain the last part here?
  4. If dataset.n_shards % world_size != 0, is it possible to shard the streaming dataset on the fly to avoid the case where data is missing?

Motivation

Streaming datasets should work properly with DDP: training big LLMs requires a lot of data, DDP/multi-node is what's mostly used to train such models, and streaming can really help solve the data side of it.

Your contribution

Yes, I can help by submitting a PR once we reach a mutual understanding of how it should behave.

@rohitgr7 rohitgr7 added the enhancement New feature or request label Jan 27, 2024
@rohitgr7
Author

@mariosasko, @lhoestq, @albertvillanova
hey guys! can anyone help? or can you guys suggest who can help with this?

@lhoestq
Member

lhoestq commented Jan 31, 2024

Hi !

  1. When the dataset runs out of examples, the last batches received by a GPU can be incomplete or empty/missing. We haven't yet implemented a way to ignore the last batch. It would probably require the dataset to provide the number of examples per shard, so that we know when to stop.
  2. Samplers are not compatible with IterableDatasets in PyTorch.
  3. If dataset.n_shards % world_size != 0, then all the nodes read/stream the full dataset in order (possibly reading/streaming the same data multiple times), BUT each node only yields one example out of every world_size, so that each example goes to exactly one GPU (see the sketch below this list).
  4. No, sharding should be done up front, and it can take some time depending on the dataset size and format.
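
To illustrate point 3, the behavior is roughly equivalent to this (just a sketch, not the actual implementation):

# each node streams the full dataset but keeps only every world_size-th example,
# offset by its rank, so each example ends up on exactly one GPU
def examples_for_rank(dataset, rank, world_size):
    for idx, example in enumerate(dataset):
        if idx % world_size == rank:
            yield example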

@rohitgr7
Author

If dataset.n_shards % world_size != 0, then all the nodes read/stream the full dataset in order (possibly reading/streaming the same data multiple times), BUT each node only yields one example out of every world_size, so that each example goes to exactly one GPU.

Considering there's just 1 shard and 2 worker nodes, do you mean each worker node will load the whole dataset but still only receive half of that shard while streaming?

@lhoestq
Member

lhoestq commented Feb 1, 2024

Yes, both nodes will stream from the single shard, but each node will skip half of the examples. This way each example is seen exactly once in total during your distributed training.

Though in terms of I/O, the dataset is effectively read/streamed twice.

@rohitgr7
Author

rohitgr7 commented Feb 1, 2024

What if the number of samples in that shard % num_nodes != 0? Will it break/get stuck? Or is the data repeated in that case for gradient sync?

@lhoestq
Member

lhoestq commented Feb 2, 2024

In that case at least one of the nodes will get an empty/incomplete batch. The data is not repeated. If the training loop doesn't take this into account, it can indeed lead to unexpected behavior.

In the future we'd like to add a feature that lets the nodes ignore the last batch; that way all the nodes would only get full batches.

@kkkjyu

kkkjyu commented Mar 8, 2024

In that case at least one of the nodes will get an empty/incomplete batch. The data is not repeated. If the training loop doesn't take this into account, it can indeed lead to unexpected behavior.

In the future we'd like to add a feature that lets the nodes ignore the last batch; that way all the nodes would only get full batches.

Is there a way to change a dataset's n_shards? Is changing the number of files enough, i.e. one file == one shard?

@lhoestq
Member

lhoestq commented Mar 8, 2024

Is changing the number of files enough, i.e. one file == one shard?

Yep, one file == one shard :)
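
For example (a sketch; the file layout is made up), when streaming from a set of files, n_shards simply reflects the number of data files:

from datasets import load_dataset

# made-up layout: 8 JSON Lines files -> 8 shards
data_files = [f"data/part-{i:05d}.jsonl" for i in range(8)]
ds = load_dataset("json", data_files=data_files, split="train", streaming=True)
print(ds.n_shards)  # 8, one shard per file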

@alex-hh
Contributor

alex-hh commented Sep 22, 2024

Hi @lhoestq, do you have any advice on how to handle the case dataset.n_shards % world_size != 0 while a fix is not supported in the library?

It seems essential for performing validation in a DDP setting.

Simply limiting the number of files is a bit brittle, as it relies on the world size being consistent across runs to ensure they all see the same data.

How should a user either ignore the last batch or handle the empty batch?

Is the issue of overhanging batches also relevant for map-style datasets?

@lhoestq
Member

lhoestq commented Sep 23, 2024

How should a user either ignore the last batch or handle the empty batch?

Check the batch size in the training loop and use all_reduce (or any communication method) to make sure all the nodes got their data before passing it to the model. If some data is missing, you can decide to stop the training loop or to repeat examples until all the nodes have exhausted their data.
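
Something along these lines, for example (a rough sketch; the helper name is made up, and there is a fuller runnable example further down in this thread):

import torch
import torch.distributed as dist

def everyone_has_data(local_count: int) -> bool:
    # local_count: number of examples this rank received for the current step
    flag = torch.tensor([local_count], dtype=torch.long)  # move to the GPU if using the NCCL backend
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)  # becomes 0 on every rank if any rank ran dry
    return flag.item() > 0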

Cc @andrewkho in case you know a way to make the DataLoader stop or add extra samples automatically in case of distributed + unevenly divisible iterable dataset

Is the issue of overhanging batches also relevant for map-style datasets?

For map-style datasets, the DistributedSampler makes the data evenly divisible across ranks: by default it pads by repeating a few samples, or it drops the tail instead if drop_last=True.
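
For the map-style case, a minimal sketch (the toy list is just a stand-in for a real map-style dataset):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

map_style_ds = list(range(10))  # stand-in for any map-style dataset

# drop_last=True drops the tail so every rank sees the same number of full batches;
# with the default drop_last=False the sampler repeats a few samples to pad instead
sampler = DistributedSampler(map_style_ds, num_replicas=2, rank=0, drop_last=True)
dl = DataLoader(map_style_ds, batch_size=2, sampler=sampler)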

@andrewkho

@lhoestq Unfortunately for IterableDataset there isn't a way to do this in general without introducing communication between ranks, or having all the ranks read all the data before starting just to figure out when to stop (which is pretty impractical). My recommendation for these situations where you don't know the total number of samples a priori is to configure the iterable dataset to yield a fixed number of samples before raising StopIteration, and if necessary, repeat/reshuffle samples to hit that number.
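
A rough sketch of what I mean (the wrapper name and sample count are made up):

from itertools import islice
from torch.utils.data import IterableDataset

class FixedLengthIterable(IterableDataset):
    """Yield exactly num_samples examples, cycling over the source again if it runs short."""

    def __init__(self, source, num_samples):
        self.source = source
        self.num_samples = num_samples

    def __iter__(self):
        def cycle():
            while True:
                yield from self.source  # re-iterating the source here is also where you could reshuffle

        return islice(cycle(), self.num_samples)

# usage (sketch): every rank yields the same fixed number of samples per epoch
# fixed_ds = FixedLengthIterable(split_dataset_by_node(ds, rank, world_size), num_samples=1000)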

@andrewkho

A heads-up that we're planning to land something new in torchdata by end of year to help with these scenarios; we'll update this thread when we have some code landed.

@lhoestq
Member

lhoestq commented Sep 27, 2024

I made a quick example with communication between ranks to stop once all the data from all the ranks is exhausted (repeating data if necessary to end up with a number of samples that is evenly divisible):

import torch
import torch.distributed as dist
from datasets import Dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader


# simulate a streaming dataset
num_shards = 1  # change here if you want to simulate a dataset made of many files/shards
ds = Dataset.from_dict({"x": [1, 2, 3, 4, 5]}).to_iterable_dataset(num_shards=num_shards)

# split the dataset for distributed training
dist.init_process_group()
rank, world_size = dist.get_rank(), dist.get_world_size()
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
dl = DataLoader(ds)

exhausted = torch.zeros(world_size, dtype=torch.bool)

# IMPORTANT: Loop over the local dataset until the data from each rank has been exhausted

def loop():
    while True:
        yield from dl
        yield "end"

for x in loop():
    if x == "end":
        exhausted[rank] = True
        continue
    # stop once the data from all the ranks are exhausted
    dist.all_reduce(exhausted)
    if torch.all(exhausted):
        break
    # do your forward pass + loss here
    # model.forward(...)
    print(x)

On my laptop I run torchrun --nnodes=1 --nproc-per-node=2 main.py and I get:

{'x': tensor([2])}
{'x': tensor([1])}
{'x': tensor([3])}
{'x': tensor([4])}
{'x': tensor([5])}
{'x': tensor([2])}

We indeed end up with 6 samples: {'x': tensor([2])} was repeated to reach 6 examples in total, which is divisible by the world size of 2.

I also tried with more ranks and with num_workers in the DataLoader and it works as expected (don't forget to add if __name__ == '__main__': if necessary for DataLoader multiprocessing).
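
i.e. something like this (sketch):

# keep the DataLoader / training loop under the main guard so that
# DataLoader worker processes can safely re-import this module
if __name__ == "__main__":
    main()  # hypothetical main() wrapping the example above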

EDIT: replaced cycle(chain(dl, ["end"])) by loop() after comment #6623 (comment) by @ragavsachdeva

@alex-hh
Contributor

alex-hh commented Sep 27, 2024

Great, thanks for the example, will give it a try!

@alex-hh
Contributor

alex-hh commented Oct 2, 2024

@lhoestq In the case where dataset.n_shards is divisible by world_size, is it important that each shard contains exactly the same number of samples? What happens if this isn't the case (in what circumstances will this cause a timeout)?

@lhoestq
Member

lhoestq commented Oct 2, 2024

If your data is not evenly divisible (whether dataset.n_shards is divisible by world_size only changes how the data is distributed), you'll need some logic to keep the GPUs happy at the end of training, e.g. my example above, which stops once all the data from all the ranks is exhausted (repeating data if necessary to end up with an evenly divisible number of samples).

Though if dataset.n_shards is divisible by world_size and each shard contains the same amount of data, then your data IS evenly divisible, so you're all good.

@alex-hh
Contributor

alex-hh commented Oct 2, 2024

OK, makes sense, thanks for the explanation. I guess even if the shards all contain the same amount of data, you still have an issue if you do any filtering (#6719).

What do you think of dataset.repeat(n).take(samples_per_epoch) as a simple way of handling this kind of situation? (c.f. issue I just opened #7192 ).
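
i.e. something like this (a sketch only: it assumes the repeat() method requested in #7192, which was not available at the time of writing; take() already exists on IterableDataset):

from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(5))}).to_iterable_dataset(num_shards=1)
samples_per_epoch = 8  # made-up number

# repeat the (possibly filtered) stream, then cap it at a fixed per-epoch length
# so every rank yields the same number of examples; repeat() assumed per #7192
epoch_ds = ds.repeat(2).take(samples_per_epoch)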

@lhoestq
Member

lhoestq commented Oct 3, 2024

Yes, it makes sense indeed.

@ragavsachdeva

I made a quick example with communication between ranks to stop once all the data from all the ranks are exhausted (and repeating data if necessary to end up with a number of samples evenly divisible)

from itertools import cycle, chain
...
# IMPORTANT: Loop over the local dataset until the data from each rank has been exhausted
for x in cycle(chain(dl, ["end"])):
    if x == "end":
        exhausted[rank] = True
        continue
    # stop once the data from all the ranks are exhausted
    dist.all_reduce(exhausted)
    if torch.all(exhausted):
        break
    # do your forward pass + loss here
    # model.forward(...)
    print(x)

Just in case someone copy-pastes this into their code (like I did), please be aware of pytorch/pytorch#23900 and use pytorch/pytorch#23900 (comment) instead.

@lhoestq
Member

lhoestq commented Oct 9, 2024

Thanks for noticing @ragavsachdeva! I edited my code to fix the issue.

@JohnHerry

I have a node with 8 cards and the training data split into 56 sub-files, so my n_shards = 56 / 8 = 7. My initial num_workers was 32, and it reported that n_shards = 7 < num_workers, so 25 workers were stopped; as a result, my training only uses 7 CPU cores. Should I set my num_workers to less than 7 to get more CPU cores working?

@lhoestq
Member

lhoestq commented Oct 15, 2024

In your case each rank has a DataLoader with 7 running workers (and 25 stopped workers) so actually in total there are 8*7=56 DataLoader workers running (one per shard).

If you want to use more CPU for the DataLoader you can shard your dataset in more files than 56. E.g. if you want each rank to run 32 DataLoader workers you need 8*32=256 files.
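
For example (a sketch; the toy dataset and paths are placeholders):

from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(1000))})  # stand-in for your actual dataset

# option 1: pick the number of shards when converting a map-style Dataset to an iterable one
ids = ds.to_iterable_dataset(num_shards=256)  # 8 ranks * 32 DataLoader workers

# option 2: re-save the data into more files (one file == one shard when streaming)
for index in range(256):
    ds.shard(num_shards=256, index=index).to_parquet(f"data/shard-{index:05d}.parquet")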

@JohnHerry

Thank you for the help, I will give it a try.
