[BE] Normalized to use `model_args: ModelArgs` #58

awgu · 2024-02-13T17:10:35Z

Some modules used `args: ModelArgs`, others `params: ModelArgs`, and others `model_args: ModelArgs`. This PR normalizes everything to use `model_args: ModelArgs` for consistency. (`params` might be confused with `nn.Parameter`s, and `model_args` was more explicit than `args`.)
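
As a rough illustration of the convention (not the actual diff), the sketch below shows two constructors that both take `model_args: ModelArgs`. The field names `dim` and `n_heads` and the class bodies are assumptions made for this example, and plain Python classes stand in for `nn.Module` subclasses so the snippet runs without PyTorch.

```python
from dataclasses import dataclass


@dataclass
class ModelArgs:
    # Hypothetical fields for illustration; the real ModelArgs may differ.
    dim: int = 256
    n_heads: int = 8


class Attention:
    # Stand-in for an nn.Module subclass (torch omitted to stay dependency-free).
    def __init__(self, model_args: ModelArgs):
        # Previously this argument might have been named `args` or `params`.
        self.n_heads = model_args.n_heads
        self.head_dim = model_args.dim // model_args.n_heads


class TransformerBlock:
    # Every constructor uses the same argument name, so call sites read uniformly.
    def __init__(self, layer_id: int, model_args: ModelArgs):
        self.layer_id = layer_id
        self.attention = Attention(model_args)


block = TransformerBlock(layer_id=0, model_args=ModelArgs())
```

With one name everywhere, keyword calls such as `Attention(model_args=...)` work for every module, and a reader is less likely to mistake the config object for learnable parameters.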

Test Plan

./run_llama_train.sh

Output

+ TRAINER_DIR=/home/andgu/local/torchtrain
+ MODEL=debugmodel
+ NGPU=8
+ PP=1
+ SP=1
+ DP=-1
+ LOG_RANK=0
+ CHECKPOINT_FOLDER=
+ CHECKPOINT_INTERVAL=5
+ torchrun --nproc_per_node=8 --local-ranks-filter 0 --role rank --tee 3 train.py --steps 10 --compile --pp_degree 1 --sp_degree 1 --dp_degree -1
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] 
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[rank0]:2024-02-13 09:53:33,644 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [8]
[rank0]:2024-02-13 09:53:36,955 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-13 09:53:36,955 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:/home/andgu/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
[rank0]:  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
[rank0]:2024-02-13 09:53:41,571 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Compiling model llama with torch.compile...
[rank0]:2024-02-13 09:53:43,892 - root - INFO - Profiling active.  Traces will be saved at ./torchtrain/outputs/profiling/traces
[rank0]:NCCL version 2.19.3+cuda12.0
[rank0]:[rank0]:[2024-02-13 09:53:43,995] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:/data/users/andgu/pytorch/torch/_inductor/lowering.py:1697: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank0]:2024-02-13 09:54:06,085 - root - INFO - step: 1, current loss: 10.54707145690918, lr: [0.0002666666666666667]
[rank0]:2024-02-13 09:54:06,153 - root - INFO - step: 2, current loss: 10.481386184692383, lr: [0.0005333333333333334]
[rank0]:2024-02-13 09:54:06,222 - root - INFO - step: 3, current loss: 10.334623336791992, lr: [0.0008]
[rank0]:2024-02-13 09:54:06,288 - root - INFO - step: 4, current loss: 10.121940612792969, lr: [0.0007]
[rank0]:2024-02-13 09:54:06,355 - root - INFO - step: 5, current loss: 9.922933578491211, lr: [0.0006000000000000001]
[rank0]:2024-02-13 09:54:06,422 - root - INFO - step: 6, current loss: 9.710294723510742, lr: [0.0005]
[rank0]:2024-02-13 09:54:06,487 - root - INFO - step: 7, current loss: 9.587849617004395, lr: [0.0004]
[rank0]:2024-02-13 09:54:06,773 - root - INFO - step: 8, current loss: 9.474313735961914, lr: [0.00030000000000000003]
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-13 09:54:06,845 - root - INFO - step: 9, current loss: 9.282522201538086, lr: [0.0002]
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-13 09:54:06,999 - root - INFO - exporting profile traces to ./torchtrain/outputs/profiling/traces/iteration_10
[rank0]:2024-02-13 09:54:07,002 - root - INFO - step: 10, current loss: 9.34823989868164, lr: [0.0001]

lessw2020

nice, thanks for improving this!


facebook-github-bot added the CLA Signed label Feb 13, 2024

awgu changed the title ~~Normalized to use model_args: ModelArgs~~ [BE] Normalized to use model_args: ModelArgs Feb 13, 2024

lessw2020 approved these changes Feb 13, 2024


awgu force-pushed the be branch from 5633abe to fb5f4fc Compare February 13, 2024 17:56

[BE] Normalized to use model_args: ModelArgs

d0094b5

awgu force-pushed the be branch from fb5f4fc to d0094b5 Compare February 13, 2024 17:59

awgu marked this pull request as ready for review February 13, 2024 17:59

awgu merged commit 58b706d into pytorch:main Feb 13, 2024
3 checks passed

awgu deleted the be branch February 13, 2024 18:41
