Tensor Parallelism Integration #3269

Merged: 89 commits merged into mosaicml:dev on May 24, 2024

@mvpatel2000 commented May 8, 2024

What does this PR do?

This PR adds Tensor Parallelism (TP) integration. As part of this, we simplify the Trainer interface for parallelism.

Current limitations:

  • TP requires FSDP
  • TP is incompatible with load_monolith_rank0_only
  • TP does not support monolith checkpointing
  • TP must be applied to all layers with the same TP world size
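
The limitations above can be read as preconditions on the parallelism config. Below is a minimal sketch of such a check, not the validation this PR implements. The dict layout (`'fsdp'` and `'tp'` keys) and the per-layer `layer_tp_degrees` mapping are assumptions made for illustration; only `load_monolith_rank0_only` is a name taken from the list above. Monolith checkpointing is left out because the PR text does not say how it is configured.

```python
from typing import Any


def check_tp_limitations(parallelism_config: dict[str, Any]) -> list[str]:
    """Return the limitations a config violates (empty list if none).

    The dict layout ('fsdp' / 'tp' keys, 'layer_tp_degrees') is hypothetical
    and only illustrates the limitations listed in the PR description.
    """
    problems: list[str] = []
    tp_config = parallelism_config.get('tp')
    if tp_config is None:
        return problems  # No TP requested, so none of the TP limits apply.

    fsdp_config = parallelism_config.get('fsdp')
    if fsdp_config is None:
        problems.append('TP requires FSDP')
    elif fsdp_config.get('load_monolith_rank0_only', False):
        problems.append('TP does not support load_monolith_rank0_only')

    # Hypothetical per-layer mapping: every TP layer must share one degree.
    degrees = set(tp_config.get('layer_tp_degrees', {}).values())
    if len(degrees) > 1:
        problems.append('all TP layers must use the same TP world size')
    return problems


if __name__ == '__main__':
    config = {
        'fsdp': {'load_monolith_rank0_only': True},
        'tp': {'layer_tp_degrees': {'fc1': 2, 'fc2': 4}},
    }
    for problem in check_tp_limitations(config):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a caller report every violated limitation at once.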

Task List

  • Add TensorParallelism integration
  • Switch to parallelism_config interface
  • Code cleanup with respect to where device_mesh is tracked
  • Make device_mesh/DTensor on by default with FSDP
  • Docs update
  • Move all distributed work to a new folder

Test List

  • Add TP train test
  • Add TP sharded resumption test
  • Add TP full state dict resumption test

Follow-on PR:

  • Switch to Dataclass
  • Support load_monolith_rank0_only
  • Support training with TP only
  • Add TP vs. FSDP determinism test (blocked by no monolith checkpointing)

Note:

  • simulate_tp.py will be deleted before merge. It is currently used for testing.

@dakinggg marked this pull request as draft May 8, 2024 21:01

@b-chu left a comment

Minor comments, but looks good! Thanks for the PR!

@mvpatel2000 merged commit 09f14f9 into mosaicml:dev on May 24, 2024
15 checks passed
dakinggg added a commit to dakinggg/composer that referenced this pull request May 25, 2024
dakinggg added a commit that referenced this pull request May 25, 2024
* Revert "Bugfixes to FSDP + TP (#3323)"

This reverts commit 79e79eb.

* Revert "Tensor Parallelism Integration (#3269)"

This reverts commit 09f14f9.
@mvpatel2000 mentioned this pull request May 28, 2024
8 tasks
6 participants