Add Pipeline Parallel (and 2D PP+FSDP) support #161
Conversation
ghstack-source-id: 14902407f0c573a4b4e9f615495b805af0ed8afc Pull Request resolved: #161
train.py
Outdated
logger.info(
    f"{Color.blue}Extracting pipeline module for stage {pp_mesh.get_local_rank()}{Color.reset}"
)
model = pmod.get_stage_module(pp_mesh.get_local_rank())
nit: watch out for rank-stage inequality in case of Interleaved 1F1B.
yea, i need to switch to an interleaved schedule and clean this up
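For context on the rank-stage inequality: with an interleaved schedule each pipeline rank owns several virtual stages, so the stage index is no longer equal to the PP rank. A minimal sketch of the usual round-robin mapping (all values below are illustrative assumptions, not code from this PR):

```python
# Illustrative assumption: Interleaved 1F1B with 2 virtual stages per rank.
pp_rank = 1          # local rank in the PP mesh
pp_size = 4          # PP degree
virtual_stages = 2   # stages owned by each rank

# Round-robin assignment: rank r owns stages r, r + pp_size, r + 2*pp_size, ...
stage_ids = [pp_rank + v * pp_size for v in range(virtual_stages)]
print(stage_ids)  # [1, 5]
```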
Thanks a lot for the demo! LGTM!
traced module is burning in a 'meta' device arg for one 'ones' op which breaks runtime after moving model to 'cuda'. Haven't worked on loss fn yet. ghstack-source-id: 47735f666b6086e179699b1bbfb06168b488d4d4 Pull Request resolved: #161
Haven't worked on loss fn yet. ghstack-source-id: 4c438ddd2989e427489c4e2d5a9ddd35711bdb78 Pull Request resolved: #161
(fake) Loss now runs and propagates to logger ghstack-source-id: b5a290878909ebc67bbcfda25809be439e222523 Pull Request resolved: #161
Loss now runs and propagates to logger, but optimizer isn't working ghstack-source-id: 56b0ef0ed92d181126e6866a153316f00431c7e7 Pull Request resolved: #161
Loss now runs and propagates to logger, but optimizer isn't working ghstack-source-id: 4ede08f5a9d1bc994448cb057bb491d24866d078 Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now - supports 1D parallelism currently. WIP: support 2D/3D parallel and clean up seed-checkpoint ux ghstack-source-id: 7055ffe515b79fa6edad58a72543d9bc8e866f80 Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: 6bd801399be3f77a45d1dda11bc87e9a90b92df4 Pull Request resolved: #161
LGTM!
Thanks for pulling PP in!
print("labels: ", labels.shape, labels.dtype) | ||
|
||
# Create a pipeline representation from the model | ||
pipe = pipeline(model, parallel_dims.pp, example_args=(input_ids,)) |
nit: strictly speaking, the second arg is the number of microbatches -- it is okay to use the PP dim to represent it for now. Longer term I think it should be exposed as a field in the config file.
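A minimal sketch of what exposing that as a config field could look like, reusing the names from the diff above; the field name `pipeline_parallel_microbatches` and the exact JobConfig path are hypothetical, not part of this PR:

```python
# Hypothetical config field; fall back to the PP dim, which is what the diff above uses today.
num_microbatches = getattr(
    job_config.training, "pipeline_parallel_microbatches", parallel_dims.pp
)
pipe = pipeline(model, num_microbatches, example_args=(input_ids,))
```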
)

# Get example input
label_shape = input_shape = (8, 2048)  # TODO
hmmm, would PP work for all cases that are not this shape, or does it require this to be the exact input shape at runtime?
need to double check how this works and fix.
# TODO(whc) need to fix PP + FSDP-mixed-precision
# tracer for PP assumes f32 and is caught off guard when runtime FSDP interacts using bf16 inputs
# param_dtype=torch.bfloat16, reduce_dtype=torch.float32
param_dtype=torch.float32,
we shouldn't change this by default; this would make the FSDP and FSDP + TP cases use fp32 instead of bf16
I wonder if supporting bf16 should be a criterion for landing. I would imagine that training with FSDP + PP in fp32 is not really viable efficiency-wise (at least for larger jobs).
I think we should fix this before landing the PP change. I think there was a possible way to fix this in the tracer, but I lost track of it; will dig it up.
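For reference, a sketch of the bf16 default the reviewers want to preserve, using FSDP2's mixed-precision policy; this shows the configuration the PP tracer currently can't handle, not a fix:

```python
import torch
from torch.distributed._composable.fsdp import MixedPrecisionPolicy

# bf16 params with fp32 gradient reduction -- the usual FSDP / FSDP + TP default.
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
)
# fully_shard(transformer_block, mesh=dp_mesh, mp_policy=mp_policy)
```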
train.py
Outdated
# there are virtual stages
if parallel_dims.pp_enabled:
    stage = PipelineStage(
        pipe=pipe_meta,
should this be pipe_meta or model?
it's correct. Ke proposed an alternative, but we'd still have to pass the pipe_info and the model into _PipelineStage in that case. I could make this change.
pipe=pipe_meta,
stage_index=pp_rank,
device=device,
group=pp_mesh.get_group(),
wondering if we should put the stage creation into parallelize_llama; IMO we only need pp_schedule in train.py
yea, I think this question and Ke's suggestion about returning a PipelineStage from parallelize_llama are better taken up in the context of a follow-up PR that also adds support for looped schedules.
Looped schedules further complicate things bc the PP logic first needs to chunk up the model, then apply the DP/TP portion of parallelize_llama on each chunk, and finally pass all the chunks into the schedule.
I think in the end, I might prefer to separate PP out from parallelize_llama, and have a flow where we take the return from the PP apply function and iteratively call parallelize_llama on those chunks.
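A rough sketch of the flow described in the last paragraph, under stated assumptions: `split_model_into_stages` and `build_pipeline_schedule` are hypothetical helper names used only to illustrate the ordering, not this PR's API.

```python
# Hypothetical helpers -- illustrating the proposed ordering only:
# 1) PP splits the model into per-stage chunks,
# 2) parallelize_llama applies the DP/TP portion to each chunk,
# 3) all chunks go into the (possibly looped) schedule.
stage_modules = split_model_into_stages(model, parallel_dims.pp)       # hypothetical
stage_modules = [
    parallelize_llama(m, world_mesh, parallel_dims, job_config) for m in stage_modules
]
pp_schedule = build_pipeline_schedule(stage_modules, n_microbatches)   # hypothetical
```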
loss = (
    torch.mean(torch.stack(losses))
    if is_last_stage
    else torch.Tensor([-1.0])
Why do we need the default -1 value? Is it just for logging purposes?
oh, yea i could make it a 'None' but then i have to update logger to not log at all. maybe that's actually a better way to do it. let me try that.
ok, so what I could do is try to alter the metrics code so that on non-last-stage ranks we either omit printing loss or print "loss: None" instead of -1.
The change will add more lines of code, since I need to deal with several places that expect loss and global_[avg/mean]_loss to be valid numbers:
- avoid writing them into the metrics dict
- replace their format string with a string value instead of a float value in the logger.info
- avoid calling loss.item() in the first place
I agree in principle that's the "right" fix, but I'm not sure it's worth the LOC / complexity. I don't totally hate the -1 thing.
Another option I considered is to skip the whole codeblock of '# log metrics' on non-last-stage ranks. I ruled this out, since it is still useful to log mfu, memory for other ranks.
So let me know what you want to do here @wanchaol
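A small sketch of the "loss: None" alternative being weighed here, assuming the variable names from the diff above; the metrics keys and logging lines are illustrative only:

```python
# Only the last PP stage has real losses; other ranks carry None instead of -1.
loss = torch.mean(torch.stack(losses)) if is_last_stage else None

if loss is not None:
    metrics["loss_metrics/global_avg_loss"] = loss.item()
    logger.info(f"loss: {loss.item():.4f}  mfu: {mfu:.2f}%")
else:
    # still log mfu / memory on non-last-stage ranks, just omit the loss value
    logger.info(f"loss: None  mfu: {mfu:.2f}%")
```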
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: 205f8b08eac15bb7bee66ecdec439b9828b0949c Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: cbbb628fd823d579064a8038e6511ec77457ef19 Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: 94f89f90787cca27310cb966a7edf7ea9bbc0098 Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: ac8c37124f79f8246155e14da23c2f5cfd75c0de Pull Request resolved: #161
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: feb45e115f7bbee37179887bb196c12d21d93b43 Pull Request resolved: #161
    for i in range(1, parallel_dims.pp)
}
# Get example input
label_shape = input_shape = (8, 2048)  # TODO
@kwen2501 any ideas for a clean way to do this in torchtrain? do we expect people to get a batch out of their dataloader and then reset it? or do we expect people to hardcode it?
i think what i might do is directly pass input_shape from train.py, and in train.py i can set input_shape = (job_config.batch_size, job_config.seq_len) or something. is that clean enough?
ok, pushed a variation on this. not sure if it's better to hide this inside parallelize since we already have the job config there, or to make it explicit from train.py that we are passing input_shape in for some reason
Either way sounds okay to me -- eventually, the shape comes from the config.
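For concreteness, the config-driven variant sketched in the comment above; the exact JobConfig field paths are an assumption taken from the comment, not this PR's code:

```python
# In train.py: derive the example input shape from the job config instead of
# hardcoding (8, 2048); field names follow the comment above.
input_shape = (job_config.batch_size, job_config.seq_len)
label_shape = input_shape
# ... then pass input_shape into parallelize_llama / the pipeline tracer.
```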
- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage - hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules) - supports 2D parallelism currently, 3D (TP) is work in progress ghstack-source-id: 0616a1c0d40f8e51ddfc1b2d330dbddc491e00e2 Pull Request resolved: #161
layers_per_rank = len(model.layers) // parallel_dims.pp
split_spec = {
    f"layers.{i * layers_per_rank}": SplitPoint.BEGINNING
    for i in range(1, parallel_dims.pp)
I'm new to the PP API and have a question: if layers_per_rank = 5 and parallel_dims.pp = 2, what should the split_spec be? My straightforward thought is that SplitPoint.BEGINNING should contain i = 1, 3, 5, but according to the code it's just i = 1.
parallel_dims.pp refers to the number of pipeline stages we split the model into. For example, if model.layers = 10 and parallel_dims.pp = 2, then 10 // 2 = 5, so we put 5 layers per stage (i.e. layers_per_rank = 5). Hence we make a single cut, at model.layers.5 -- that is, (nRanks - 1) split points.
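To make that concrete, a small worked example under the numbers in the reply (SplitPoint as imported in the diff above):

```python
# model.layers = 10, parallel_dims.pp = 2  ->  layers_per_rank = 5
layers_per_rank = 10 // 2
split_spec = {
    f"layers.{i * layers_per_rank}": SplitPoint.BEGINNING
    for i in range(1, 2)  # range(1, parallel_dims.pp)
}
# split_spec == {"layers.5": SplitPoint.BEGINNING}: one cut, i.e. (nRanks - 1) split points.
```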
squashed
- dcp load seems to work now - need to pull in schedule object ghstack-source-id: cbbb8c9cd3b343952003b6314f1f2cc4a7a9e0cf Pull Request resolved: #161
Stack from ghstack (oldest at bottom):

- uses pipeline tracer frontend to extract a graph and partition it into chunks per stage
- hardcodes one schedule (1F1B) for now (need to expose option to switch schedule and test other schedules)
- supports 2D parallelism currently, 3D (TP) is work in progress