Enable FSDP2 cpu offloading #624

Merged: 17 commits into pytorch:main on Oct 28, 2024
Conversation


@mori360 (Contributor) commented Oct 17, 2024

Resolves #620.
Adds a config option: `--training.enable_cpu_offload`

Command: `CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh`

For non-pp case:
![Screenshot 2024-10-23 at 1 45 56 PM](https://github.com/user-attachments/assets/8692f8a6-c0f3-460e-8eb6-7f7195bed370)

For pp case:
![cpu offload+pp](https://github.com/user-attachments/assets/73e40861-47e2-4845-a41c-4bfea2860109)
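For reference, the new option can also be set in the training TOML config rather than on the command line. A minimal sketch, assuming the CLI flag `--training.enable_cpu_offload` maps onto the `[training]` table the way other `training.*` options do (the exact TOML key is inferred from the flag name, not shown in this excerpt):

```toml
# train_configs/llama3_8b.toml (fragment, illustrative)
[training]
enable_cpu_offload = true  # offload FSDP2-sharded parameters/gradients to CPU
```

Under the hood, the FSDP2 wrapping presumably passes a CPU offload policy to `fully_shard` when this flag is set; that detail is inferred and not visible in the excerpted diff.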

@facebook-github-bot added the CLA Signed label Oct 17, 2024
@awgu changed the title from "Enable FDSP2 cpu offloading" to "Enable FSDP2 cpu offloading" Oct 18, 2024
train.py (outdated)
@@ -169,13 +166,15 @@ def loss_fn(pred, labels):
else "cuda"
)
model.to_empty(device=init_device)
model.init_weights()
if job_config.training.enable_cpu_offload:
with torch.device("cuda"):
Contributor
I somehow think the `model.init_weights(buffer_device="cuda")` change sounds better. It is straightforward about what we want to achieve, while making minimal changes to the code.
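A minimal pure-Python sketch of why threading a `buffer_device` argument through `init_weights` is attractive. `Buffer` and `ToyModel` below are hypothetical stand-ins (no torch required); in the real model the buffer in question is llama's RoPE frequency table:

```python
# Toy illustration of the buffer_device approach discussed above.
# With CPU offload, parameters are initialized on CPU, but persistent
# buffers (e.g. RoPE frequencies) should be materialized directly on GPU.

class Buffer:
    """Hypothetical stand-in for a tensor that records its device."""
    def __init__(self, device: str):
        self.device = device

class ToyModel:
    def init_weights(self, buffer_device: str = "cpu") -> None:
        # Only the buffer is placed on buffer_device; parameter init is
        # untouched, so no `with torch.device("cuda"):` wrapper is needed
        # around the whole call.
        self.freqs_cis = Buffer(device=buffer_device)

model = ToyModel()
model.init_weights(buffer_device="cuda")  # params stay put, buffer on GPU
print(model.freqs_cis.device)  # -> cuda
```

The design point is that the device decision is made once, at the call site, instead of being implied by an ambient device context around the entire init.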

test_runner.py (outdated review thread, resolved)
@mori360 mori360 requested a review from tianyu-l October 24, 2024 23:52
@mori360 mori360 marked this pull request as ready for review October 24, 2024 23:52
@tianyu-l (Contributor) left a comment


Looks good in general. Had some inline comments. In particular, let's figure out whether CPU offloading and PP should coexist; if so, we should add support for that as well.

test_runner.py (outdated, resolved)
torchtitan/config_manager.py (outdated, resolved)
torchtitan/models/llama/model.py (outdated, resolved)
train.py (outdated, resolved)
train.py (outdated)
init_device = (
"cpu"
if job_config.checkpoint.create_seed_checkpoint
or job_config.training.enable_cpu_offload
Contributor
Don't we need to do the same for the PP case (several lines above)? Or are we assuming that if PP is used, CPU offloading is not an option?

Contributor (Author)
I tested it with PP. Based on llama2_7b.toml, CPU offload also works, but adds noticeable latency to training.
[screenshot: Screenshot 2024-10-25 at 3 40 25 PM]
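The device selection in the diff above reduces to a small predicate. A sketch with a hypothetical helper name (train.py inlines this conditional rather than naming it): seed-checkpoint creation and CPU offloading both need weights materialized on CPU, everything else initializes directly on GPU.

```python
def choose_init_device(create_seed_checkpoint: bool, enable_cpu_offload: bool) -> str:
    """Pick where to materialize model weights after meta-device construction.

    Hypothetical helper mirroring the inlined conditional in train.py:
    both seed-checkpoint creation and FSDP2 CPU offloading require the
    parameters to start on CPU; otherwise initialize on GPU directly.
    """
    return "cpu" if create_seed_checkpoint or enable_cpu_offload else "cuda"

print(choose_init_device(create_seed_checkpoint=False, enable_cpu_offload=True))   # -> cpu
print(choose_init_device(create_seed_checkpoint=False, enable_cpu_offload=False))  # -> cuda
```

Per the review discussion, the same rule would need to apply on the pipeline-parallel init path as well.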

@mori360 mori360 marked this pull request as draft October 25, 2024 02:46
"Enable CPU Offload with PP",
"enable_cpu_offload+PP",
ngpu=4,
),
@mori360 (Contributor, Author) commented Oct 28, 2024
Tested with PP; we could remove PP later if it is not necessary for the CI test.
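The CI entry excerpted above, sketched in full with a simplified stand-in for test_runner.py's entry type. The field names and the override flags below are illustrative assumptions, not copied from the repo:

```python
# Simplified stand-in for torchtitan's test_runner.py test definition.
from dataclasses import dataclass
from typing import List

@dataclass
class OverrideDefinitions:
    """Illustrative: a named integration test plus the CLI overrides it runs with."""
    override_args: List[List[str]]
    test_descr: str
    test_name: str
    ngpu: int = 8

cpu_offload_pp_test = OverrideDefinitions(
    [[
        "--training.enable_cpu_offload",              # flag added by this PR
        "--experimental.pipeline_parallel_degree 2",  # illustrative PP override
    ]],
    "Enable CPU Offload with PP",
    "enable_cpu_offload+PP",
    ngpu=4,
)
print(cpu_offload_pp_test.test_name)  # -> enable_cpu_offload+PP
```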

train.py (outdated, resolved)
@mori360 mori360 marked this pull request as ready for review October 28, 2024 21:55
@mori360 mori360 requested a review from tianyu-l October 28, 2024 21:55
@mori360 mori360 requested a review from awgu October 28, 2024 21:57
@tianyu-l (Contributor) left a comment
lgtm! thanks!

@mori360 mori360 merged commit 193ce98 into pytorch:main Oct 28, 2024
5 checks passed
mori360 added a commit to mori360/torchtitan that referenced this pull request Nov 26, 2024
Labels
CLA Signed (label managed by the Meta Open Source bot)
Development
Successfully merging this pull request may close these issues:
Is there way to offload training memory to DRAM (using FSDP2?) for training Llama3-8B with torchtitan? (#620)
3 participants