Add basic integration test template #104

Closed
wants to merge 0 commits into from
Conversation

gnadathur
Contributor

Summary:
Create a test config and runner for the integration test, and run it as part of CI.

Test Plan:

+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ search_dir=./integration_test/test_configs
+ for config_file in "$search_dir"/*
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./integration_test/test_configs/basic_integration_test.toml
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717]
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] *****************************************
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-01 16:07:15,292 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-01 16:07:16,375 - root - INFO - Starting job: 2DParallel with debug model
[rank0]:2024-03-01 16:07:16,375 - root - INFO - Building llama
[rank0]:2024-03-01 16:07:16,383 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-01 16:07:16,383 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-01 16:07:17,864 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-01 16:07:17,866 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-01 16:07:17,867 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-01 16:07:17,867 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-01 16:07:18,911 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-01 16:07:18,913 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-01 16:07:18,913 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240301-1607.
[rank0]:2024-03-01 16:07:19,529 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-01 16:07:21,462 - root - INFO - step:  1  loss: 10.8086  iter:  1.8985  data: 0.0307  lr: 0.00026667
[rank0]:2024-03-01 16:07:21,545 - root - INFO - step:  2  loss: 10.7591  iter:  0.0502  data: 0.0317  lr: 0.00053333
[rank0]:2024-03-01 16:07:21,626 - root - INFO - step:  3  loss: 10.6249  iter:  0.0482  data: 0.0324  lr: 0.0008
[rank0]:2024-03-01 16:07:21,706 - root - INFO - step:  4  loss: 10.4011  iter:  0.0476  data: 0.0316  lr: 0.0007
[rank0]:2024-03-01 16:07:21,787 - root - INFO - step:  5  loss: 10.1362  iter:  0.0484  data: 0.0313  lr: 0.0006
[rank0]:2024-03-01 16:07:21,866 - root - INFO - step:  6  loss:  9.8866  iter:  0.0485  data: 0.0305  lr: 0.0005
[rank0]:2024-03-01 16:07:21,945 - root - INFO - step:  7  loss:  9.7258  iter:  0.0471  data: 0.0312  lr: 0.0004
[rank0]:2024-03-01 16:07:22,088 - root - INFO - step:  8  loss:  9.3857  iter:  0.0481  data: 0.0315  lr: 0.0003
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-01 16:07:22,213 - root - INFO - step:  9  loss:  9.2071  iter:  0.0897  data: 0.0307  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-01 16:07:22,590 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-01 16:07:22,602 - root - INFO - step: 10  loss:  9.1143  iter:  0.0904  data: 0.0304  lr: 0.0001
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Average iter time: 0.0600 seconds
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Average data load time: 0.0310 seconds
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
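Based on the set -x trace at the top of the test plan, the new runner script presumably looks roughly like the following. This is a hedged reconstruction from the trace only, not the actual contents of integration_test/run.sh; defaults and variable handling are assumptions.

#!/usr/bin/env bash
# Reconstruction from the trace above (not the actual file): iterate over every
# config in integration_test/test_configs and launch train.py with torchrun.
set -ex

export USE_LIBUV=1
NGPU=${NGPU:-4}
LOG_RANK=${LOG_RANK:-0}

search_dir=./integration_test/test_configs
for config_file in "$search_dir"/*
do
  torchrun --nproc_per_node=${NGPU} --rdzv_endpoint="localhost:5972" \
    --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
    train.py --job.config_file "${config_file}"
done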

Reviewers:

Subscribers:

Tasks:

Tags:

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 2, 2024
@gnadathur requested review from wconstab and wanchaol on Mar 2, 2024 at 00:19
# TorchTrain Config.toml
[job]
dump_folder = "./outputs"
description = "2DParallel with debug model"
Contributor

Since we call it a 2D test here, shall we enable SP/PP below besides FSDP?

Contributor Author

I tried SP but it's crashing, so I renamed the test to 1D. Once we fix SP, we can add another test for 2D.

@@ -37,7 +37,7 @@ jobs:
python -m pip install -r requirements.txt
python -m pip install -r dev-requirements.txt
python -m pip install -e .
- name: Run NGPU=4 ./run_llama_train.sh
run: NGPU=4 ./run_llama_train.sh
- name: Run NGPU=4 ./integration_test/run.sh
Contributor

If there's a good reason to make a new .sh for testing purposes, then disregard this. But there is some value in testing the actual run_llama_train.sh script that users will see, so we might as well just call that script from the integration tests.

Note that you can override the .toml while calling that script, e.g.
CONFIG_FILE=blah ./run_llama_train.sh

and in that way achieve various test combos. A couple of illustrative combos are sketched below.
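For illustration only, two such invocations could look like this. The first config path is the one exercised in this PR's test plan; the second path (debug_model.toml under train_configs/) is hypothetical and only stands in for an existing public config.

# Illustrative combos reusing the existing launch script via env-var overrides.
CONFIG_FILE=./integration_test/test_configs/basic_integration_test.toml NGPU=4 ./run_llama_train.sh
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  # hypothetical config path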

Contributor

Actually, same comment about the .toml files themselves.

If we are offering .toml files for various model configs, we might as well just test those directly (except for the ones that are too big, etc.).

I am OK with having separate files in the integration_test/test_configs/ folder when they are not duplicating a config we have elsewhere, but if we end up with very similar files in two places, we should probably delete the one in integration_test/test_configs/ and directly test the one in the other config folder.

Contributor Author
@gnadathur Mar 2, 2024

This is a good question.
IMO, there can be some key differences between the config files in the test directory and the ones in train_configs:

  1. Train configs can be tailored to the optimal config we recommend for a particular model. These are the "public" configs. For example, the number of steps can be whatever is needed to show good numerics.
  2. Test configs, on the other hand, can be short but also granular (for example, with and without SAC) to test the various combinations.

Agreed on the comment about de-duping configs.
Here's a scheme that might work better:

  1. Create train_configs/test for all the test configs.
  2. The train_configs directory can be used for public configs.
  3. There is no duplication between config files in the root directory and the test subdirectory.

Contributor

IMO, if we want to have separate test-related scripts, we can create scripts/test/ and put everything under that folder.

Similarly, if we want to test different configs, we can have train_configs/test/.

Contributor Author

@wanchaol

How about this scheme: we add a job-level config option, job.use_for_integration_test, which is False by default. For the configs we want to include in the integration test, we set this option to True. This way nothing changes w.r.t. the location of configs, etc. The runner can check for this flag and run only those configs.
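A minimal sketch of what such a runner could look like, assuming the flag lives under the [job] table of each config, configs sit under train_configs/, and run_llama_train.sh honors the CONFIG_FILE and NGPU overrides shown earlier. The grep-based flag check is a deliberate simplification; a real runner would parse the TOML properly.

#!/usr/bin/env bash
# Sketch only: run every config that opts in via use_for_integration_test = true.
set -euo pipefail

NGPU="${NGPU:-4}"

for config_file in ./train_configs/*.toml; do
  # Naive flag check; a real runner would parse the TOML instead of grepping.
  if grep -q '^[[:space:]]*use_for_integration_test[[:space:]]*=[[:space:]]*true' "$config_file"; then
    echo "Running integration test with ${config_file}"
    CONFIG_FILE="$config_file" NGPU="$NGPU" ./run_llama_train.sh
  fi
done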
