Pytorch parallelism #5916

Open: wants to merge 2 commits into base: main
Conversation

andersooi (Contributor)

Description

  • Added new concept entry for PyTorch Distributed Data Parallelism

Issue Solved

Closes #5871

Type of Change

  • Adding a new entry

Checklist

  • All writings are my own.
  • My entry follows the Codecademy Docs style guide.
  • My changes generate no new warnings.
  • I have performed a self-review of my own writing and code.
  • I have checked my entry and corrected any misspellings.
  • I have made corresponding changes to the documentation if needed.
  • I have confirmed my changes are not being pushed from my forked main branch.
  • I have confirmed that I'm pushing from a new branch named after the changes I'm making.
  • I have linked any issues that are relevant to this PR in the Issues Solved section.

@Radhika-okhade self-assigned this on Jan 7, 2025
@Radhika-okhade added the status: under review (Issue or PR is currently being reviewed) and pytorch (PyTorch) labels on Jan 7, 2025
@Radhika-okhade (Collaborator)

Hey @andersooi! Please correct the file path. The correct path is: docs/content/pytorch/concepts/distributed-data-parallelism/distributed-data-parallelism.md

@@ -0,0 +1,104 @@
---
Title: 'Distributed Data Parallelism'
Description: 'An overview of distributed data parallelism in PyTorch.'

@Radhika-okhade (Collaborator):

The description is too generic. Explain what distributed data parallelism is within 1-2 lines.
https://github.com/Codecademy/docs/blob/main/documentation/style-guide.md

Comment on lines +8 to +10
- 'PyTorch'
- 'Data'
- 'Data Parallelism'

@Radhika-okhade (Collaborator):

Suggested change (reorder the tags):

- 'Data'
- 'Data Parallelism'
- 'PyTorch'


## Introduction to Distributed Data Parallelism

Distributed Data Parallelism (DDP) in PyTorch is a module that enables users to efficiently train models across multiple GPUs and machines. By splitting the training process across multiple machines, DDP helps reduce training time and facilitates scaling to larger models and datasets. It achieves parallelism by splitting the input data into smaller chunks, processing them on different GPUs, and aggregating results for updates. Compared to `DataParallel`, DDP offers better performance and scalability by minimising device communication overhead.

@Radhika-okhade (Collaborator):

Suggested change (split into two paragraphs and use the spelling "minimizing"):

Distributed Data Parallelism (DDP) in PyTorch is a module that enables users to efficiently train models across multiple GPUs and machines. By splitting the training process across multiple machines, DDP helps reduce training time and facilitates scaling to larger models and datasets.

It achieves parallelism by splitting the input data into smaller chunks, processing them on different GPUs, and aggregating results for updates. Compared to `DataParallel`, DDP offers better performance and scalability by minimizing device communication overhead.
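
As an editorial aside (not part of the PR diff), the "splitting the input data into smaller chunks" step is typically handled by `torch.utils.data.distributed.DistributedSampler`. Below is a minimal sketch, assuming the default process group has already been initialized; the dataset contents and the `build_loader` helper name are placeholders:

```py
# Sketch: sharding a dataset across DDP processes with DistributedSampler.
# Assumes dist.init_process_group(...) has already been called in this process.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(batch_size=32):
    # Placeholder dataset: 1024 samples with 10 features each.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # Give this rank a distinct, non-overlapping 1/world_size shard of the data.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

During the backward pass, DDP then all-reduces (averages) the gradients across ranks, which corresponds to the "aggregating results for updates" step described in the paragraph above.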


To use DDP, a distributed process group needs to be initialized and the model wrapped with `torch.nn.parallel.DistributedDataParallel`.

```py
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8000'
    dist.init_process("nccl", rank=rank, world_size=world_size)
```

@Radhika-okhade (Collaborator), on the opening code fence:

Suggested change: use a `pseudo` fence instead of the `py` fence.

@Radhika-okhade (Collaborator), on the `dist.init_process` call:

Suggested change:

Before: dist.init_process("nccl", rank=rank, world_size=world_size)
After: dist.init_process_group("nccl", rank=rank, world_size=world_size)
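
Putting the corrected call together with the surrounding pieces, a minimal end-to-end sketch might look like the following. This is an illustration rather than the PR's actual entry content; the linear model, port number, and single training step are placeholders, and it assumes a machine with one or more CUDA GPUs and the NCCL backend:

```py
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous settings shared by all processes (port taken from the PR snippet).
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8000'
    # Corrected call from the review suggestion above.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Placeholder model; each process drives the GPU matching its rank.
    model = nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One toy step: DDP all-reduces the gradients during backward().
    inputs = torch.randn(32, 10).to(rank)
    targets = torch.randn(32, 1).to(rank)
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per available GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```

Here `mp.spawn` launches one training process per GPU, and wrapping the model in `DistributedDataParallel` is what synchronizes gradients across those processes during `loss.backward()`.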

@Radhika-okhade (Collaborator)

Hey @andersooi, thank you for contributing to Codecademy Docs. I have made a few suggestions; please go through them and make the necessary changes.

@Radhika-okhade added the status: waiting for author label and removed the status: under review (Issue or PR is currently being reviewed) label on Jan 15, 2025
Successfully merging this pull request may close this issue: [Concept Entry] PyTorch: Distributed Data Parallelism