Pytorch parallelism #5916

Open: wants to merge 2 commits into base: main
Conversation

andersooi (Contributor)

Description

  • Added new concept entry for PyTorch Distributed Data Parallelism

Issue Solved

Closes #5871

Type of Change

  • Adding a new entry

Checklist

  • All writings are my own.
  • My entry follows the Codecademy Docs style guide.
  • My changes generate no new warnings.
  • I have performed a self-review of my own writing and code.
  • I have checked my entry and corrected any misspellings.
  • I have made corresponding changes to the documentation if needed.
  • I have confirmed my changes are not being pushed from my forked main branch.
  • I have confirmed that I'm pushing from a new branch named after the changes I'm making.
  • I have linked any issues that are relevant to this PR in the Issues Solved section.

@Radhika-okhade self-assigned this on Jan 7, 2025
@Radhika-okhade added the status: under review (Issue or PR is currently being reviewed) and pytorch (PyTorch) labels on Jan 7, 2025
@Radhika-okhade (Collaborator)

Hey @andersooi! Please correct the file path. The correct path is: docs/content/pytorch/concepts/distributed-data-parallelism/distributed-data-parallelism.md

@@ -0,0 +1,104 @@
---
Title: 'Distributed Data Parallelism'
Description: 'An overview of distributed data parallelism in PyTorch.'

@Radhika-okhade (Collaborator):

The description is too generic. Explain what distributed data parallelism is within 1-2 lines.
https://github.com/Codecademy/docs/blob/main/documentation/style-guide.md

Comment on lines +8 to +10
- 'PyTorch'
- 'Data'
- 'Data Parallelism'

@Radhika-okhade (Collaborator):

Suggested change (reorder the tags):

- 'Data'
- 'Data Parallelism'
- 'PyTorch'


## Introduction to Distributed Data Parallelism

Distributed Data Parallelism (DDP) in PyTorch is a module that enables users to efficiently train models across multiple GPUs and machines. By splitting the training process across multiple machines, DDP helps reduce training time and facilitates scaling to larger models and datasets. It achieves parallelism by splitting the input data into smaller chunks, processing them on different GPUs, and aggregating results for updates. Compared to `DataParallel`, DDP offers better performance and scalability by minimising device communication overhead.

@Radhika-okhade (Collaborator):

Suggested change (split into two paragraphs and use the spelling "minimizing"):

Distributed Data Parallelism (DDP) in PyTorch is a module that enables users to efficiently train models across multiple GPUs and machines. By splitting the training process across multiple machines, DDP helps reduce training time and facilitates scaling to larger models and datasets.

It achieves parallelism by splitting the input data into smaller chunks, processing them on different GPUs, and aggregating results for updates. Compared to `DataParallel`, DDP offers better performance and scalability by minimizing device communication overhead.
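
As an editorial aside (not part of the PR diff), the "splitting the input data into smaller chunks" step is typically handled by `torch.utils.data.distributed.DistributedSampler`. Below is a minimal sketch, assuming the default process group has already been initialized; the dataset contents and the `build_loader` helper name are placeholders:

```py
# Sketch: sharding a dataset across DDP processes with DistributedSampler.
# Assumes dist.init_process_group(...) has already been called in this process.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(batch_size=32):
    # Placeholder dataset: 1024 samples with 10 features each.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # Give this rank a distinct, non-overlapping 1/world_size shard of the data.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

During the backward pass, DDP then all-reduces (averages) the gradients across ranks, which corresponds to the "aggregating results for updates" step described in the paragraph above.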


To use DDP, a distributed process group needs to be initialized and the model wrapped with `torch.nn.parallel.DistributedDataParallel`.

```py
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8000'
    dist.init_process("nccl", rank=rank, world_size=world_size)
```

@Radhika-okhade (Collaborator), on the opening code fence:

Suggested change: use a `pseudo` fence instead of the `py` fence.

@Radhika-okhade (Collaborator), on the `dist.init_process` call:

Suggested change:

Before: dist.init_process("nccl", rank=rank, world_size=world_size)
After: dist.init_process_group("nccl", rank=rank, world_size=world_size)
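
Putting the corrected call together with the surrounding pieces, a minimal end-to-end sketch might look like the following. This is an illustration rather than the PR's actual entry content; the linear model, port number, and single training step are placeholders, and it assumes a machine with one or more CUDA GPUs and the NCCL backend:

```py
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous settings shared by all processes (port taken from the PR snippet).
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8000'
    # Corrected call from the review suggestion above.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Placeholder model; each process drives the GPU matching its rank.
    model = nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One toy step: DDP all-reduces the gradients during backward().
    inputs = torch.randn(32, 10).to(rank)
    targets = torch.randn(32, 1).to(rank)
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per available GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```

Here `mp.spawn` launches one training process per GPU, and wrapping the model in `DistributedDataParallel` is what synchronizes gradients across those processes during `loss.backward()`.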

@Radhika-okhade (Collaborator)

Hey @andersooi, thank you for contributing to Codecademy Docs. I have made a few suggestions; please go through them and make the necessary changes.

@Radhika-okhade added the status: waiting for author label and removed the status: under review (Issue or PR is currently being reviewed) label on Jan 15, 2025
Successfully merging this pull request may close this issue: [Concept Entry] PyTorch: Distributed Data Parallelism