Assigning Different Microbatches to Each Rank #425

Open
purefall opened this issue Sep 11, 2024 · 3 comments

purefall commented Sep 11, 2024

Context:

We are following the FSDP example and trying to understand the mechanism behind how different microbatches are assigned to each rank during training, and specifically the role of the global_rank variable in this process.

In the code, it appears that global_rank is used as a seed for dataset shuffling, as shown below:

data = load_dataset(dataset, name=name, streaming=True, split=split, trust_remote_code=True).shuffle(42 + global_rank)

However, we encountered a few uncertainties regarding the initialization of global_rank and how it ensures non-overlapping data across ranks.

Questions:

  1. Initialization of global_rank:

    • Is global_rank meant to be passed as an argument, or is it inferred from the environment (e.g., the rank in distributed training)?
  2. Shuffling and Data Partitioning:

    • How does shuffling with global_rank ensure that different ranks receive different, non-overlapping samples? While the shuffling function modifies the random seed using global_rank, it's unclear how this alone guarantees distinct data across ranks without overlap.
  3. Use of DistributedSampler:
    In the current example, the DataLoader does not use a DistributedSampler, which is typically utilized to partition datasets across ranks. The DataLoader setup looks like this:

    train_dataloader = DataLoader(train_concat_dataset,
                                  batch_size=batch_size,
                                  num_workers=workers,
                                  pin_memory=True,
                                  prefetch_factor=4,
                                  timeout=600)
    • Is there any additional mechanism beyond shuffling (e.g., use of a DistributedSampler) that ensures non-overlapping data across ranks? Should we consider adding a DistributedSampler in this case?

Request:

Could you provide clarification on:

  • The intended role and correct initialization of global_rank.
  • How microbatches are distributed across ranks, especially in the absence of a DistributedSampler.

Any guidance on how to avoid potential overlap in samples across different ranks would be greatly appreciated.
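
To make the concern in question 2 concrete, here is a minimal sketch in plain Python. It assumes a small in-memory list as a stand-in for the streaming dataset, and `shuffled_view` is a hypothetical helper, not part of the example. It only illustrates that a per-rank seed changes the order each rank reads in, not the set of samples it reads.

```python
import random

def shuffled_view(samples, global_rank, base_seed=42):
    """Order one rank would see when shuffling with seed base_seed + global_rank."""
    order = list(samples)
    random.Random(base_seed + global_rank).shuffle(order)
    return order

samples = list(range(8))  # stand-in for dataset sample ids (assumption)
world_size = 2
views = [shuffled_view(samples, r) for r in range(world_size)]

# Every rank still iterates over the full set of samples ...
same_content = all(sorted(v) == samples for v in views)
# ... so over a full pass the ranks overlap on every sample.
overlap = set(views[0]) & set(views[1])
print(same_content, len(overlap))  # prints: True 8
```

If this stand-in reflects what the streaming shuffle does, each sample would be processed once per rank in every full pass, which is the overlap we would like to avoid.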

@pbelevich
Collaborator

@purefall thanks for reporting this issue; we are working on improving the example. The current data-loading code in create_streaming_dataloader is a mock and is not designed for production use. To answer your questions: in general, an FSDP data-loading setup should look like this:

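import os

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# torchrun sets these environment variables for every process it launches
local_rank = int(os.environ['LOCAL_RANK'])  # index of this process on its node
rank = int(os.environ['RANK'])              # global index across all nodes
world_size = int(os.environ['WORLD_SIZE'])  # total number of processes

# Gives each rank its own subset of indices; needs a map-style dataset with __len__
sampler = DistributedSampler(your_dataset, rank=rank, num_replicas=world_size, shuffle=True)

# your_dataset, batch_size, and workers are placeholders defined by the caller
train_dataloader = DataLoader(your_dataset,
                              sampler=sampler,
                              batch_size=batch_size,
                              num_workers=workers,
                              pin_memory=True,
                              prefetch_factor=4,
                              timeout=600)

# Call sampler.set_epoch(epoch) at the start of each epoch to reshuffle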

Please refer to the PyTorch FSDP example while we are working on improving our FSDP example. Thank you!
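
Note that `DistributedSampler` needs a dataset with a known length, so it does not apply directly to a streaming (iterable) dataset like the one in the original example. For that case, one common approach is to shard the stream by rank. The sketch below is only an illustration in plain Python: `shard_stream` is a hypothetical helper and `range(10)` stands in for the stream. Hugging Face `datasets` also provides `split_dataset_by_node` in `datasets.distributed` for this purpose; please check its documentation for the version you have installed.

```python
from itertools import islice

def shard_stream(stream, rank, world_size):
    """Yield every world_size-th item of the stream, starting at offset rank."""
    return islice(stream, rank, None, world_size)

stream = range(10)  # stand-in for a streaming dataset (assumption)
world_size = 4
shards = [list(shard_stream(iter(stream), r, world_size))
          for r in range(world_size)]

# Rank 0 reads items 0, 4, 8; rank 1 reads 1, 5, 9; and so on.
for r, shard in enumerate(shards):
    print(r, shard)
```

Two caveats with this sketch: shards can have different lengths when the stream size is not a multiple of `world_size`, which matters because all ranks are expected to take the same number of steps, and with `num_workers > 0` each DataLoader worker gets its own copy of an iterable dataset, so the stream would also need to be split across workers.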

@maxschmitt

Will

rank = int(os.environ['RANK'])

lead to the same result as

import torch.distributed as dist
rank = dist.get_rank()

?
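
As far as I understand, when the job is launched with torchrun and the process group is initialized with the default `env://` method, `dist.get_rank()` returns the value the launcher placed in `RANK`, so both should agree. They differ in when they are usable: `dist.get_rank()` needs `init_process_group` to have been called, while `os.environ['RANK']` exists only if the launcher set it. A minimal sketch that handles both cases, where `resolve_rank` is a hypothetical helper and not part of the example:

```python
import os

def resolve_rank(default=0):
    """Return the global rank from the process group if initialized, else from RANK."""
    try:
        import torch.distributed as dist
        if dist.is_available() and dist.is_initialized():
            return dist.get_rank()
    except ImportError:
        pass  # torch is not installed; fall back to the launcher's environment
    # Falls back to `default` when no launcher has set RANK (single-process run)
    return int(os.environ.get("RANK", default))
```

In a single-process run without a launcher this returns `default`, which is usually what a local debugging session wants.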


This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Dec 11, 2024
5 participants