Add truncated llama style model init via reset_parameters() #54
Conversation
Additional: I also commented out --compile because I cannot compile with the latest nightlies - opened issue #55.
Looks great! I have a few comments inlined, then we are good to go.
Also, I just built the latest main branch of pytorch and tried compile; it still seems to be working on my side.
Thanks for letting me know! I turned compile back on. On further investigation after your update, it seems the error is specific to doing CUDA kernel work on the same machine - inductor gets confused about which CUDA backend files to use.
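For context on the --compile discussion above, here is a minimal sketch of how such a toggle is typically gated in a training script. The flag handling and `maybe_compile` helper are hypothetical illustrations, not the PR's actual code:

```python
# Hypothetical sketch: gating torch.compile behind a CLI flag so it can be
# turned off when nightlies hit inductor/backend issues (see issue #55).
import argparse

import torch
import torch.nn as nn


def maybe_compile(model: nn.Module, enable: bool) -> nn.Module:
    # torch.compile is available in PyTorch 2.x; fall back to eager mode if disabled
    return torch.compile(model) if enable else model


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--compile", action="store_true")
    args = parser.parse_args()

    model = nn.Linear(8, 8)  # stand-in for the real llama model
    model = maybe_compile(model, args.compile)
```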
Looking forward to trying this out! Thanks @lessw2020
This PR adds the following:
1 - via reset_parameters(), a full layerwise init for the llama models under /llama. This uses the total model depth as part of the init (see the sketch after this list) via:
`self.weight_init_std = 0.02 / (2 * self.num_layers) ** 0.5`
2 - The final output FFN (head) is initialized based on the sqrt of the model's own dim, with a slightly wider cutoff factor of 3.
3 - tangential change - updates run_llama_train.sh with MODEL and MODEL_CONF params to allow for direct model control via the sh script. (There was a MODEL param already, but it was incorrectly being used in place of MODEL_CONF... we should update this, as it's not intuitive.)
4 - made the debugmodel default to 2 layers as an improved debug check.
5 - added 1B and 40B configs for additional testing. I can't currently run 70B on my H100 due to OOM, but can run 40B.
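As a rough illustration of items 1 and 2, here is a minimal sketch of what a depth-scaled truncated-normal reset_parameters() could look like. The attribute names (num_layers, dim, output) are assumptions for the example, and "sqrt of the dim" is read here as a 1/sqrt(dim) std; the PR's actual implementation may differ:

```python
# Minimal sketch (not the PR's exact code) of a depth-scaled truncated-normal init.
# Assumed names: num_layers, dim, output; real attention/FFN blocks omitted.
import torch.nn as nn


class ToyTransformer(nn.Module):
    def __init__(self, num_layers: int, dim: int, vocab_size: int):
        super().__init__()
        self.num_layers = num_layers
        self.dim = dim
        # final output projection ("head")
        self.output = nn.Linear(dim, vocab_size, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        # item 1: per-layer std scales with total model depth
        self.weight_init_std = 0.02 / (2 * self.num_layers) ** 0.5
        # each transformer layer's weights would then be drawn from a truncated
        # normal with this std, e.g.
        #   nn.init.trunc_normal_(w, mean=0.0, std=self.weight_init_std,
        #                         a=-2 * self.weight_init_std,
        #                         b=2 * self.weight_init_std)

        # item 2: output head uses a std tied to the model dim (assumed here to
        # mean 1/sqrt(dim)) and a wider cutoff factor of 3
        final_out_std = self.dim ** -0.5
        cutoff_factor = 3
        nn.init.trunc_normal_(
            self.output.weight,
            mean=0.0,
            std=final_out_std,
            a=-cutoff_factor * final_out_std,
            b=cutoff_factor * final_out_std,
        )
```

Scaling the std by 1/sqrt(2 * num_layers) keeps the variance of the residual stream roughly constant as depth grows, which is the usual motivation for this GPT-2/llama-style init.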
Testing:
Verified proper init and training with 7B, 13B and ~40B:
![Screenshot 2024-02-11 at 10 39 12 PM](https://github.com/pytorch-labs/torchtrain/assets/46302957/049037ed-63a4-4ab0-bebc-f297857aab72)