
[BE] Normalized to use model_args: ModelArgs #58

Merged
merged 1 commit into pytorch:main on Feb 13, 2024
Conversation

@awgu (Contributor) commented on Feb 13, 2024

Some modules used `args: ModelArgs`, others `params: ModelArgs`, and others `model_args: ModelArgs`. This PR normalizes everything to use `model_args: ModelArgs` for consistency. (`params` might be confused with `nn.Parameter`s, and `model_args` is more explicit than `args`.)
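
For illustration, a minimal before/after sketch of the rename (the `ModelArgs` fields and module bodies below are placeholders, not the actual torchtrain code):

```python
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class ModelArgs:
    dim: int = 256
    n_heads: int = 8


# Before: modules disagreed on the constructor parameter name.
class AttentionBefore(nn.Module):
    def __init__(self, args: ModelArgs):  # some modules used `args`
        super().__init__()
        self.wq = nn.Linear(args.dim, args.dim, bias=False)


class FeedForwardBefore(nn.Module):
    def __init__(self, params: ModelArgs):  # others used `params`
        super().__init__()
        self.w1 = nn.Linear(params.dim, params.dim, bias=False)


# After: every module takes `model_args: ModelArgs`.
class Attention(nn.Module):
    def __init__(self, model_args: ModelArgs):
        super().__init__()
        self.wq = nn.Linear(model_args.dim, model_args.dim, bias=False)


class FeedForward(nn.Module):
    def __init__(self, model_args: ModelArgs):
        super().__init__()
        self.w1 = nn.Linear(model_args.dim, model_args.dim, bias=False)
```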

Test Plan

./run_llama_train.sh
Output
+ TRAINER_DIR=/home/andgu/local/torchtrain
+ MODEL=debugmodel
+ NGPU=8
+ PP=1
+ SP=1
+ DP=-1
+ LOG_RANK=0
+ CHECKPOINT_FOLDER=
+ CHECKPOINT_INTERVAL=5
+ torchrun --nproc_per_node=8 --local-ranks-filter 0 --role rank --tee 3 train.py --steps 10 --compile --pp_degree 1 --sp_degree 1 --dp_degree -1
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] 
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[rank0]:2024-02-13 09:53:33,644 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [8]
[rank0]:2024-02-13 09:53:36,955 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-13 09:53:36,955 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:/home/andgu/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
[rank0]:  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
[rank0]:2024-02-13 09:53:41,571 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Compiling model llama with torch.compile...
[rank0]:2024-02-13 09:53:43,892 - root - INFO - Profiling active.  Traces will be saved at ./torchtrain/outputs/profiling/traces
[rank0]:NCCL version 2.19.3+cuda12.0
[rank0]:[rank0]:[2024-02-13 09:53:43,995] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:/data/users/andgu/pytorch/torch/_inductor/lowering.py:1697: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank0]:2024-02-13 09:54:06,085 - root - INFO - step: 1, current loss: 10.54707145690918, lr: [0.0002666666666666667]
[rank0]:2024-02-13 09:54:06,153 - root - INFO - step: 2, current loss: 10.481386184692383, lr: [0.0005333333333333334]
[rank0]:2024-02-13 09:54:06,222 - root - INFO - step: 3, current loss: 10.334623336791992, lr: [0.0008]
[rank0]:2024-02-13 09:54:06,288 - root - INFO - step: 4, current loss: 10.121940612792969, lr: [0.0007]
[rank0]:2024-02-13 09:54:06,355 - root - INFO - step: 5, current loss: 9.922933578491211, lr: [0.0006000000000000001]
[rank0]:2024-02-13 09:54:06,422 - root - INFO - step: 6, current loss: 9.710294723510742, lr: [0.0005]
[rank0]:2024-02-13 09:54:06,487 - root - INFO - step: 7, current loss: 9.587849617004395, lr: [0.0004]
[rank0]:2024-02-13 09:54:06,773 - root - INFO - step: 8, current loss: 9.474313735961914, lr: [0.00030000000000000003]
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-13 09:54:06,845 - root - INFO - step: 9, current loss: 9.282522201538086, lr: [0.0002]
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-13 09:54:06,999 - root - INFO - exporting profile traces to ./torchtrain/outputs/profiling/traces/iteration_10
[rank0]:2024-02-13 09:54:07,002 - root - INFO - step: 10, current loss: 9.34823989868164, lr: [0.0001]

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Feb 13, 2024
@awgu changed the title from "Normalized to use model_args: ModelArgs" to "[BE] Normalized to use model_args: ModelArgs" on Feb 13, 2024
@lessw2020 (Contributor) left a comment

nice, thanks for improving this!

@awgu marked this pull request as ready for review on February 13, 2024 at 17:59
@awgu merged commit 58b706d into pytorch:main on Feb 13, 2024
3 checks passed
@awgu deleted the be branch on February 13, 2024 at 18:41
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
Labels: CLA Signed
3 participants