Add Sequence Parallelism to llama #32
Conversation
Somehow torch.compile is not working even though eager sequence parallelism works, so don't turn it on by default for now.
    parallelize_plan=layer_plan,
)

rank0_log("Applied Sequence Parallelism to the model...")
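The core idea the plan applies can be shown without any distributed setup. Below is a minimal torch-free sketch (plain Python lists standing in for activation tensors; `shard_sequence` and `all_gather` are illustrative names, not torchtitan or PyTorch APIs): sequence parallelism shards activations along the sequence dimension across ranks, and an all-gather reassembles the full sequence before any op that needs it.

```python
# Toy illustration of sequence parallelism: shard along the sequence
# dimension, then gather the full sequence back. No torch, no processes;
# each inner list plays the role of one rank's local shard.

def shard_sequence(tokens, world_size):
    """Split per-token activations into contiguous per-rank shards."""
    n = len(tokens)
    assert n % world_size == 0, "sequence length must divide world_size"
    chunk = n // world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(shards):
    """Reassemble the full sequence (stand-in for the collective op)."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

tokens = list(range(8))                      # pretend activations for 8 positions
shards = shard_sequence(tokens, world_size=4)  # each rank holds 2 positions
restored = all_gather(shards)
```

The point of the real implementation is that between the shard and the gather, each rank only computes on (and stores activations for) its own slice of the sequence.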
I wonder if it's useful to log more info about the SP plan. I was thinking about the same for PP: what info do we want to print? Should each parallelism print its own summary, or should we have one overall function that prints all parallelism info in a unified way?
🤔 That's a good point. I think we should probably log the parallelize plan for SP. This would require some changes in PyTorch to add `__str__` to our ParallelStyles; I can add the log once the PyTorch PR is merged.
Should each parallelism print its own summary, or should we have one overall function that prints overall parallel info in a unified way

My two cents: it's a bit tricky to give an overall summary. I think we should first figure out how to print the intended summary for each parallelism individually. For example, when many TransformerBlocks are stacked, we can't log the parallel plan of every layer, so maybe we print the PP degree of the TransformerBlocks, and we might not want to print the SP plan for each PP TransformerBlock.
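The per-parallelism summary discussed above could be sketched as follows. This is a hypothetical helper (the function name, the tuple layout, and the plan strings are all made up for illustration): each parallelism contributes one compact line with its degree and a single representative plan, rather than dumping the plan of every stacked block.

```python
# Hypothetical unified summary: one compact line per parallelism,
# showing a representative plan once instead of repeating it per layer.

def summarize_parallelisms(parallelisms):
    """parallelisms: list of (name, degree, representative_plan) tuples."""
    lines = []
    for name, degree, plan in parallelisms:
        lines.append(
            f"{name}: degree={degree}, plan={plan} "
            f"(shown once, applied to all TransformerBlocks)"
        )
    return "\n".join(lines)

print(summarize_parallelisms([
    ("SP", 8, "{attention: ColwiseParallel, feed_forward: RowwiseParallel}"),
    ("PP", 2, "TransformerBlock stages"),
]))
```

A rank-0-only logger could call something like this once after all parallelisms are applied, which sidesteps the question of each parallelism printing independently.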
Looks great to me! One inline question:
    distribute_rmsnorm(transformer_block.attention_norm, tp_mesh)
    distribute_rmsnorm(transformer_block.ffn_norm, tp_mesh)
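For context on what is being distributed here, the RMSNorm computation itself is small: the only cross-element reduction is the mean of squares over the hidden dimension, after which each element is scaled independently by its own weight. A minimal pure-Python version (plain lists instead of tensors; not the torchtitan implementation) makes that structure visible:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i over the hidden dim."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

out = rmsnorm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
```

Because each output element depends on the full hidden dimension only through that single scalar `rms`, the per-element weight can be sharded or replicated independently of how the sequence dimension is split, which is what makes distributing the norm cheap.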
Shall we also apply it to the final norm after all the transformer blocks?
Not something currently enabled, but I think we can explore this in real training and see if sharding the final norm gives additional memory/perf benefits :)
Stack from ghstack (oldest at bottom):

Somehow torch.compile is not working even though eager sequence parallelism works, so don't turn it on by default for now.