Add basic integration test template #104

Closed
wants to merge 0 commits into from
Conversation

gnadathur
Contributor

Summary:
Create a test config and runner for the integration test, and run it as part of CI.

Test Plan:

+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ search_dir=./integration_test/test_configs
+ for config_file in "$search_dir"/*
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./integration_test/test_configs/basic_integration_test.toml
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717]
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] *****************************************
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0301 16:07:13.452000 140363802604544 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-01 16:07:15,292 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-01 16:07:16,375 - root - INFO - Starting job: 2DParallel with debug model
[rank0]:2024-03-01 16:07:16,375 - root - INFO - Building llama
[rank0]:2024-03-01 16:07:16,383 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-01 16:07:16,383 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-01 16:07:17,864 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-01 16:07:17,866 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-01 16:07:17,867 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-01 16:07:17,867 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-01 16:07:18,911 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-01 16:07:18,913 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-01 16:07:18,913 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240301-1607.
[rank0]:2024-03-01 16:07:19,529 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-01 16:07:21,462 - root - INFO - step:  1  loss: 10.8086  iter:  1.8985  data: 0.0307  lr: 0.00026667
[rank0]:2024-03-01 16:07:21,545 - root - INFO - step:  2  loss: 10.7591  iter:  0.0502  data: 0.0317  lr: 0.00053333
[rank0]:2024-03-01 16:07:21,626 - root - INFO - step:  3  loss: 10.6249  iter:  0.0482  data: 0.0324  lr: 0.0008
[rank0]:2024-03-01 16:07:21,706 - root - INFO - step:  4  loss: 10.4011  iter:  0.0476  data: 0.0316  lr: 0.0007
[rank0]:2024-03-01 16:07:21,787 - root - INFO - step:  5  loss: 10.1362  iter:  0.0484  data: 0.0313  lr: 0.0006
[rank0]:2024-03-01 16:07:21,866 - root - INFO - step:  6  loss:  9.8866  iter:  0.0485  data: 0.0305  lr: 0.0005
[rank0]:2024-03-01 16:07:21,945 - root - INFO - step:  7  loss:  9.7258  iter:  0.0471  data: 0.0312  lr: 0.0004
[rank0]:2024-03-01 16:07:22,088 - root - INFO - step:  8  loss:  9.3857  iter:  0.0481  data: 0.0315  lr: 0.0003
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-01 16:07:22,213 - root - INFO - step:  9  loss:  9.2071  iter:  0.0897  data: 0.0307  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-01 16:07:22 1275772:1275772 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-01 16:07:22,590 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-01 16:07:22,602 - root - INFO - step: 10  loss:  9.1143  iter:  0.0904  data: 0.0304  lr: 0.0001
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Average iter time: 0.0600 seconds
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Average data load time: 0.0310 seconds
[rank0]:2024-03-01 16:07:22,603 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
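Based on the set -x trace at the top of the test plan, the new runner script presumably looks roughly like the following. This is a hedged reconstruction from the trace only, not the actual contents of integration_test/run.sh; defaults and variable handling are assumptions.

#!/usr/bin/env bash
# Reconstruction from the trace above (not the actual file): iterate over every
# config in integration_test/test_configs and launch train.py with torchrun.
set -ex

export USE_LIBUV=1
NGPU=${NGPU:-4}
LOG_RANK=${LOG_RANK:-0}

search_dir=./integration_test/test_configs
for config_file in "$search_dir"/*
do
  torchrun --nproc_per_node=${NGPU} --rdzv_endpoint="localhost:5972" \
    --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
    train.py --job.config_file "${config_file}"
done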

Reviewers:

Subscribers:

Tasks:

Tags:

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 2, 2024
@gnadathur requested review from wconstab and wanchaol on Mar 2, 2024 at 00:19
# TorchTrain Config.toml
[job]
dump_folder = "./outputs"
description = "2DParallel with debug model"
Contributor

Since we call it a 2D test here, shall we enable SP/PP below besides FSDP?

Contributor Author

I tried SP but it's crashing, so I renamed the test to 1D. Once we fix SP, we can add another test for 2D.

@@ -37,7 +37,7 @@ jobs:
python -m pip install -r requirements.txt
python -m pip install -r dev-requirements.txt
python -m pip install -e .
- name: Run NGPU=4 ./run_llama_train.sh
run: NGPU=4 ./run_llama_train.sh
- name: Run NGPU=4 ./integration_test/run.sh
Contributor

If there's a good reason to make a new .sh for testing purposes, then disregard this. But there is some value in testing the actual run_llama_train.sh script that users will see, so we might as well just call that script from the integration tests.

Note that you can override the .toml while calling that script, e.g.
CONFIG_FILE=blah ./run_llama_train.sh

and in that way achieve various test combos. A couple of illustrative combos are sketched below.
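For illustration only, two such invocations could look like this. The first config path is the one exercised in this PR's test plan; the second path (debug_model.toml under train_configs/) is hypothetical and only stands in for an existing public config.

# Illustrative combos reusing the existing launch script via env-var overrides.
CONFIG_FILE=./integration_test/test_configs/basic_integration_test.toml NGPU=4 ./run_llama_train.sh
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  # hypothetical config path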

Contributor

Actually, same comment about the .toml files themselves.

If we are offering .toml files for various model configs, we might as well just test those directly (except for the ones that are too big, etc.).

I am OK with having separate files in the integration_test/test_configs/ folder when they are not duplicating a config we have elsewhere, but if we end up with very similar files in two places, we should probably delete the one in integration_test/test_configs/ and directly test the one in the other config folder.

Contributor Author
@gnadathur Mar 2, 2024

This is a good question.
IMO, there can be some key differences between the config files in the test directory and the ones in train_configs:

  1. Train configs can be tailored to the optimal config we recommend for a particular model. These are the "public" configs. For example, the number of steps can be whatever is needed to show good numerics.
  2. Test configs, on the other hand, can be short but also granular (for example, with and without SAC) to test the various combinations.

Agreed on the comment about de-duping configs.
Here's a scheme that might work better:

  1. Create train_configs/test for all the test configs.
  2. The train_configs directory can be used for public configs.
  3. There is no duplication between config files in the root directory and the test subdirectory.

Contributor

IMO, if we want to have separate test-related scripts, we can create scripts/test/ and put everything under that folder.

Similarly, if we want to test different configs, we can have train_configs/test/.

Contributor Author

@wanchaol

How about this scheme: we add a job-level config option, job.use_for_integration_test, which is False by default. For the configs we want to include in the integration test, we set this option to True. This way nothing changes w.r.t. the location of configs, etc. The runner can check for this flag and run only those configs.
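A minimal sketch of what such a runner could look like, assuming the flag lives under the [job] table of each config, configs sit under train_configs/, and run_llama_train.sh honors the CONFIG_FILE and NGPU overrides shown earlier. The grep-based flag check is a deliberate simplification; a real runner would parse the TOML properly.

#!/usr/bin/env bash
# Sketch only: run every config that opts in via use_for_integration_test = true.
set -euo pipefail

NGPU="${NGPU:-4}"

for config_file in ./train_configs/*.toml; do
  # Naive flag check; a real runner would parse the TOML instead of grepping.
  if grep -q '^[[:space:]]*use_for_integration_test[[:space:]]*=[[:space:]]*true' "$config_file"; then
    echo "Running integration test with ${config_file}"
    CONFIG_FILE="$config_file" NGPU="$NGPU" ./run_llama_train.sh
  fi
done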
