[BE] Normalized to use model_args: ModelArgs (#58)
Some modules used `args: ModelArgs`, others `params: ModelArgs`, and others `model_args: ModelArgs`. This PR normalizes everything to use `model_args: ModelArgs` for consistency. (`params` might be confused with `nn.Parameter`s, and `model_args` is more explicit than `args`.) See the sketch after the test output below.

**Test Plan**

```
./run_llama_train.sh
```

<details>
<summary> Output </summary>

```
+ TRAINER_DIR=/home/andgu/local/torchtrain
+ MODEL=debugmodel
+ NGPU=8
+ PP=1
+ SP=1
+ DP=-1
+ LOG_RANK=0
+ CHECKPOINT_FOLDER=
+ CHECKPOINT_INTERVAL=5
+ torchrun --nproc_per_node=8 --local-ranks-filter 0 --role rank --tee 3 train.py --steps 10 --compile --pp_degree 1 --sp_degree 1 --dp_degree -1
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING]
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-13 09:53:31,345] torch.distributed.run: [WARNING] *****************************************
[rank0]:2024-02-13 09:53:33,644 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [8]
[rank0]:2024-02-13 09:53:36,955 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-13 09:53:36,955 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:/home/andgu/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
[rank0]:  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
[rank0]:2024-02-13 09:53:41,571 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-13 09:53:41,572 - root - INFO - Compiling model llama with torch.compile...
[rank0]:2024-02-13 09:53:43,892 - root - INFO - Profiling active. Traces will be saved at ./torchtrain/outputs/profiling/traces
[rank0]:NCCL version 2.19.3+cuda12.0
[rank0]:[rank0]:[2024-02-13 09:53:43,995] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:/data/users/andgu/pytorch/torch/_inductor/lowering.py:1697: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank0]:2024-02-13 09:54:06,085 - root - INFO - step: 1, current loss: 10.54707145690918, lr: [0.0002666666666666667]
[rank0]:2024-02-13 09:54:06,153 - root - INFO - step: 2, current loss: 10.481386184692383, lr: [0.0005333333333333334]
[rank0]:2024-02-13 09:54:06,222 - root - INFO - step: 3, current loss: 10.334623336791992, lr: [0.0008]
[rank0]:2024-02-13 09:54:06,288 - root - INFO - step: 4, current loss: 10.121940612792969, lr: [0.0007]
[rank0]:2024-02-13 09:54:06,355 - root - INFO - step: 5, current loss: 9.922933578491211, lr: [0.0006000000000000001]
[rank0]:2024-02-13 09:54:06,422 - root - INFO - step: 6, current loss: 9.710294723510742, lr: [0.0005]
[rank0]:2024-02-13 09:54:06,487 - root - INFO - step: 7, current loss: 9.587849617004395, lr: [0.0004]
[rank0]:2024-02-13 09:54:06,773 - root - INFO - step: 8, current loss: 9.474313735961914, lr: [0.00030000000000000003]
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-13 09:54:06,845 - root - INFO - step: 9, current loss: 9.282522201538086, lr: [0.0002]
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-13 09:54:06 3243810:3243810 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-13 09:54:06,999 - root - INFO - exporting profile traces to ./torchtrain/outputs/profiling/traces/iteration_10
[rank0]:2024-02-13 09:54:07,002 - root - INFO - step: 10, current loss: 9.34823989868164, lr: [0.0001]
```

</details>
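To make the renaming concrete, here is a minimal before/after sketch. The module and the `ModelArgs` fields shown are illustrative assumptions, not the actual torchtrain code touched by this PR.

```python
# Minimal sketch of the naming convention this PR adopts. The module and the
# ModelArgs fields below are illustrative stand-ins, not the real torchtrain code.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class ModelArgs:
    dim: int = 256
    n_heads: int = 8


# Before: modules variously took `args` or `params`; `params` in particular
# reads like it could hold nn.Parameter objects.
class AttentionBefore(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.wq = nn.Linear(params.dim, params.dim, bias=False)


# After: every module takes `model_args: ModelArgs`, so the constructor
# signature is uniform and unambiguous across the codebase.
class AttentionAfter(nn.Module):
    def __init__(self, model_args: ModelArgs):
        super().__init__()
        self.wq = nn.Linear(model_args.dim, model_args.dim, bias=False)
```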