
Provide Memory Benchmarking Feature to Benchmarking Code #14

Merged
merged 19 commits into dev on May 27, 2024

Conversation

@achew010 (Contributor) commented May 17, 2024

Description

This PR adds GPU memory logging to the benchmark script, as requested in #8, and updates the benchmark README with usage instructions.

There are two approaches to logging memory:

  • Using Nvidia's SMI CLI tool
  • Using Huggingface's HFTrainer's API

Note: Issue #19 has been created to address grouping the memory values under a common prefix; this will be handled in a future PR.

Usage

1. Nvidia's SMI CLI tool

Set environment variable MEMORY_LOGGING=nvidia to use run_benchmarks.sh with nvidia logging

For each experiment,

  • Before the experiment calls subprocess.run, an asynchronous nvidia-smi process is started to monitor only the GPU indices in $CUDA_VISIBLE_DEVICES and log to FILE_MEM inside Experiment.save_dir
  • After the experiment's subprocess call completes, the asynchronous process is terminated
  • At the end of each experiment, the log is read and aggregated to an average memory over time (MiB per second) for each device; the average across all devices is then saved as gpu_mem in the main result logging function Experiment.write_results (see the sketch below)
  • Since the measurement runs in an independent process, no slowdown in training speed is expected

Each experiment directory will have a GPU log containing <Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>.

The memory readings are reflected in the results raw_summary.csv under the column 'nvidia_mem_reserved', with raw values reported in MiB.
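
A minimal sketch of this monitoring flow, assuming illustrative names (the helper names, log file name, and exact nvidia-smi flags here are not necessarily the ones used in benchmark.py):

```python
import subprocess
import pandas as pd

def start_gpu_monitor(save_dir: str, gpu_indices: str, interval_secs: int = 1):
    """Launch an asynchronous nvidia-smi process that appends one CSV row per
    selected device every `interval_secs` seconds until it is terminated."""
    log_file = f"{save_dir}/gpu_memory_log.csv"  # illustrative file name
    proc = subprocess.Popen([
        "nvidia-smi",
        "--query-gpu=timestamp,name,index,memory.used",
        "--format=csv",
        "-l", str(interval_secs),
        "-i", gpu_indices,   # e.g. the value of $CUDA_VISIBLE_DEVICES, "0,1"
        "-f", log_file,
    ])
    return proc, log_file

def stop_and_aggregate(proc, log_file: str) -> float:
    """Terminate the monitor, average memory.used over time per device,
    then return the mean across devices (MiB)."""
    proc.terminate()
    proc.wait()
    df = pd.read_csv(log_file, skipinitialspace=True)
    mem_col = "memory.used [MiB]"  # header emitted by nvidia-smi's csv format
    df[mem_col] = df[mem_col].str.replace(" MiB", "", regex=False).astype(float)
    return df.groupby("index")[mem_col].mean().mean()
```

In the benchmark, this per-device average is what ends up in gpu_mem and is reported under nvidia_mem_reserved.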

2. Torch CUDA through Huggingface's HFTrainer's API

Set environment variable MEMORY_LOGGING=huggingface to use run_benchmarks.sh with huggingface logging (default)

HFTrainer has a feature to log memory through the skip_memory_metrics=False training argument. Its documentation mentions that setting this argument to False will affect training speed. In our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using this argument.

A set of fine-grained GPU readings will appear as additional columns in the results raw_summary.csv, with raw values reported in bytes.

(screenshot of the additional HFTrainer memory columns in raw_summary.csv)
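
For reference, a minimal sketch of switching on the HFTrainer-based logging (model and train_dataset are assumed to come from the benchmark's own setup; the output directory and step count are illustrative):

```python
from transformers import Trainer, TrainingArguments

# skip_memory_metrics defaults to True; setting it to False makes HFTrainer
# record CPU/GPU memory deltas around init/train/evaluate/predict and attach
# them to the metrics returned by train().
training_args = TrainingArguments(
    output_dir="./benchmark_output",   # illustrative
    per_device_train_batch_size=4,
    max_steps=50,
    skip_memory_metrics=False,
)

# `model` and `train_dataset` are placeholders for the benchmark's own setup.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
metrics = trainer.train().metrics
print({k: v for k, v in metrics.items() if "mem" in k})
```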

3. Log Both

Set environment variable MEMORY_LOGGING=all to use run_benchmarks.sh with both logging methods

4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API

1. The Nvidia-SMI utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, since it does not involve probing the trainer. It uses the NVML library to fetch reserved memory for each device ID, which includes:

  • All running processes using that device
  • All GPU memory used inside the training script
  • Any other GPU memory cached by the training script

Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.

2. The HFTrainer API is a more precise tool that logs memory usage for a few specific operations inside HFTrainer.

It uses torch.cuda.memory_allocated to probe the trainer, taking snapshots of allocated memory and storing the difference between the before and after of each stage. The following stages are probed:

  • Before Trainer init
  • Trainer.__init__
  • Trainer.train
  • Trainer.evaluate
  • Trainer.predict

Note: Any GPU memory used outside these stages, or outside HFTrainer entirely, will not be tracked. If the training script does not use the Huggingface trainer, this API will not work either.
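
A small standalone sketch of the snapshot-and-delta idea (illustrative only; HFTrainer's internal bookkeeping is more involved and also tracks CPU memory):

```python
import torch

def probe_stage(stage_name, fn, metrics):
    """Run `fn` and record, for this stage, the change in allocated GPU memory
    and how far the high-water mark rose above the post-stage allocation."""
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    result = fn()
    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    metrics[f"{stage_name}_mem_gpu_alloc_delta"] = after - before
    metrics[f"{stage_name}_mem_gpu_peaked_delta"] = max(0, peak - after)
    return result

# usage sketch (Trainer construction omitted):
# metrics = {"before_init_mem_gpu": torch.cuda.memory_allocated()}
# trainer = probe_stage("init", lambda: Trainer(...), metrics)
# probe_stage("train", trainer.train, metrics)
```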

Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP

This is an example of the memory values that HFTrainer produces in the output of train():

```python
output_metrics = {
    'train_runtime': 191.2491,
    'train_samples_per_second': 0.209,
    'train_steps_per_second': 0.052,
    'train_tokens_per_second': 428.342,
    'train_loss': 1.0627506256103516,
    'init_mem_cpu_alloc_delta': 4096,
    'init_mem_gpu_alloc_delta': 0,
    'init_mem_cpu_peaked_delta': 0,
    'init_mem_gpu_peaked_delta': 0,
    'train_mem_cpu_alloc_delta': 839086080,
    'train_mem_gpu_alloc_delta': -17491768832,
    'train_mem_cpu_peaked_delta': 0,
    'train_mem_gpu_peaked_delta': 26747825664,
    'before_init_mem_cpu': 5513297920,
    'before_init_mem_gpu': 36141687296,
    'epoch': 0.01
}
```

We refer to the keys of the memory metrics in this order

  • before_init_mem_X as stage0
  • init_mem_X as stage1
  • train_mem_X as stage2
  • ...

We currently compute the memory values in the report by taking the largest of sums. For example:

For allocated memory value

```
max([
  stage0_mem + stage1_allocated_delta,
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta,
  ...
])
```

For peak memory value

```
max([
  stage0_mem + stage1_allocated_delta + stage1_peaked_delta,
  stage0_mem + stage1_allocated_delta + stage2_allocated_delta + stage2_peaked_delta,
  ...
])
```

Notice that we do not include stage0_mem alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode. GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The stage0_mem value of GPTQ-LoRA + FSDP is larger because the model is fully loaded before the trainer is initialized and only sharded afterwards inside trainer.prepare.

This could lead to misleading comparisons when other variants are loaded in low-memory mode and show smaller stage0_mem consumption than GPTQ-LoRA + FSDP before sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include stage0_mem back in the max computation.
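
A short Python sketch of this largest-of-sums rule applied to the GPU keys of the metrics dict above (the helper name and default stage list are illustrative):

```python
def summarize_gpu_memory(metrics, stages=("init", "train")):
    """Return (allocated, peak) GPU memory in bytes using the largest-of-sums
    rule; stage0_mem on its own is deliberately excluded from the max."""
    running = metrics["before_init_mem_gpu"]   # stage0_mem
    alloc_candidates, peak_candidates = [], []
    for stage in stages:
        running += metrics.get(f"{stage}_mem_gpu_alloc_delta", 0)
        alloc_candidates.append(running)
        peak_candidates.append(running + metrics.get(f"{stage}_mem_gpu_peaked_delta", 0))
    return max(alloc_candidates), max(peak_candidates)

# With the example output_metrics above this evaluates to
# alloc = max(stage0 + init_delta, stage0 + init_delta + train_delta)
# peak  = max(stage0 + init_delta + init_peaked,
#             stage0 + init_delta + train_delta + train_peaked)
```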

Tests

Memory Measurement Accuracy and Potential Side Effects

1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training

For both the Mistral-7B and Mixtral models, introducing the memory probes does not significantly impact the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.

A. <100 toks/sec slowdown after introducing the memory probes for Mistral,

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 3465 | 3432 |
| mistralai/Mistral-7B-v0.1 | 2 | 2 | 2973 | 2931 |
| mistralai/Mistral-7B-v0.1 | 1 | 8 | 3489 | 3508 |
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 3383 | 3298 |

B. <100 toks/sec slowdown after introducing the memory probes for Mixtral

| model_name_or_path | num gpus | per device batch size | throughput with no mem probe (toks/sec) | throughput with mem probe (toks/sec) |
|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1785 | 1776 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 2 | 1518 | 1442 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 8 | 1938 | 1933 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1757 | 1724 |

2. Torch/HF provides more granular memory readings (peak memory and actual allocated memory) than Nvidia's reserved memory. This is more helpful when analyzing the memory actually allocated for each model.

We compare the two memory tracking methods (Nvidia vs. Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's peak mem reserved reports larger values than Torch/HF's peak mem alloc, and torch mem alloc shows that the memory actually in use is smaller still.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 1 | 4 | 19.46 | 15.86 | 4.84 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 1 | 4 | 19.97 | 15.89 | 4.87 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 37.49 | 36.22 | 25.2 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 36.59 | 35.53 | 24.51 |
| NousResearch/Llama-2-70b-hf | 1 | 4 | 71.12 | 68.16 | 37.35 |
| TheBloke/Llama-2-70B-GPTQ | 1 | 4 | 70.51 | 65.9 | 36.29 |

3. Memory Usage Decreases on Distributed Finetuning

When running large models on multiple devices, torch mem alloc shows that memory usage decreases as the models are sharded (compare with the table above).

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 2 | 4 | 20.97 | 16.59 | 2.73 |
| TheBloke/Mistral-7B-v0.1-GPTQ | 2 | 4 | 23.75 | 16.26 | 3.01 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 32.59 | 29.33 | 13.22 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 51.1 | 27.74 | 12.79 |

We verified that torch mem alloc for GPTQ-LoRA on Llama2-70B hovers around 19 GiB once sharded after trainer.prepare and during training. The values are similar to the manually probed values from #15.

| model_name_or_path | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | 2 | 2 | 51.49 | 46.52 | 19.17 |
| TheBloke/Llama-2-70B-GPTQ | 2 | 2 | 78.69 | 45.4 | 18.65 |

Benchmarks

Run tox -e run_benches to produce benchmarks. Full benchmark details can be referenced here

4. For small models, LoRA runs faster than the quantized PEFT methods, likely because it does not require an additional dequantization operation before the base-layer and LoRA matmuls. However, it also consumes significantly more memory than the quantized PEFT methods.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | lora | none | 1 | 4 | 29.03 | 26.11 | 15.12 | 3597 |
| mistralai/Mistral-7B-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 19.46 | 15.86 | 4.84 | 3428 |
| TheBloke/Mistral-7B-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 19.97 | 15.89 | 4.87 | 3254 |

5. We observe that on single-device finetuning of larger models (e.g. the 49B Mixtral), PEFT begins to run out of memory while the quantized PEFT methods maintain low memory consumption.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | none | none | 1 | 4 | 79.14 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | none | 1 | 4 | 79.06 | OOM | OOM | OOM |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | baseline-peft-bnb | 1 | 4 | 47.18 | 46.42 | 25.73 | 1396 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | lora | accelerated-peft-bnb | 1 | 4 | 37.5 | 36.22 | 25.2 | 1764 |
| TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | lora | accelerated-peft-autogptq | 1 | 4 | 36.58 | 35.53 | 24.51 | 1864 |

6. In distributed finetuning of large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption at comparable throughput.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 2 | 79 | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 2 | 51.4 | 46.52 | 19.17 | 418 |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 2 | 78.5 | 45.4 | 18.65 | 426 |

7. When the batch size is increased, GPTQ-LoRA is the only experiment that does not run out of memory.

| model_name_or_path | Training Type | Accel. Config Type | num gpus | per device batch size | peak nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | lora | none | 2 | 4 | OOM | OOM | OOM | OOM |
| NousResearch/Llama-2-70b-hf | lora | accelerated-peft-bnb | 2 | 4 | OOM | OOM | OOM | OOM |
| TheBloke/Llama-2-70B-GPTQ | lora | accelerated-peft-autogptq | 2 | 4 | 78.48 | 70.67 | 18.65 | 451 |

@achew010 achew010 changed the base branch from main to dev May 17, 2024 07:02
@achew010 achew010 marked this pull request as ready for review May 17, 2024 08:45
@achew010 achew010 requested a review from fabianlim as a code owner May 17, 2024 08:45
@fabianlim (Contributor) left a comment


It will be good to put a link to this issue somewhere in the code so that we can view the GPU log schema.

@fabianlim (Contributor) commented May 20, 2024

Can you run tox -e lint from the top-level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@achew010 (Contributor, Author) commented May 21, 2024

> Can you run tox -e lint from the top-level directory? The linting is not automated yet (#7). Also, do we plan to activate the memory logging by default in run_benchmarks.sh?

@fabianlim okay, linted. Do we want to set nvidia-smi as the default memory logging approach, or wait until we establish that the speed degradation from using the HF memory logging API is insignificant and use that instead?

@fabianlim (Contributor) commented May 21, 2024

> okay, linted. Do we want to set nvidia-smi as the default memory logging approach, or wait until we establish that the speed degradation from using the HF memory logging API is insignificant and use that instead?

@achew010 let's merge that commit on top of this one and let me review.

@fabianlim fabianlim linked an issue May 23, 2024 that may be closed by this pull request
achew010 added a commit to achew010/fms-acceleration that referenced this pull request May 24, 2024
@fabianlim (Contributor) commented May 24, 2024

@achew010 I approved. After we update the csv we can merge. Also, can you run tox -e lint?

@fabianlim fabianlim merged commit f1895b7 into foundation-model-stack:dev May 27, 2024
2 checks passed
@fabianlim (Contributor) commented:

We have noted the memory keys should be renamed; to be addressed later in #19.

fabianlim added a commit that referenced this pull request May 27, 2024
…or GPTQ-LoRA (#20)

* Add GitHub Workflow for Linting , Formatting and Test. Activate Workflow for Framework (#7)

* add lint workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add pylintrc, update .tox fix files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* activate test and minor fix

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint benchmarks.py and add workflow to dev

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Improvements to Benchmark Scripts and Config Generation Workflow (#13)

* fix benches and add verify configs

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update readme and add workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add packaging dep

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update torch dep in framework and run-benches

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* take host env in run-benches

* add display bench results script

* rename summary.csv to raw_summary.csv and update run_benchmarks.sh

* export environment variables in shell command

* dump out pip requirements for repro, and add default FHT_branch

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Added support for running official HF baseline FSDP-QLoRA benchmark (#16)

* new baseline scenario

* rename variables

* added warning when plugin allows SFTTrainer to handle PEFT on single device

* Fix FSDP when performing GPTQ-LoRA with Triton V2  (#15)

* wrap in parameters and torch view to correct dtype

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor to apply patch only on FSDP and simplify

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Provide Memory Benchmarking Feature to Benchmarking Code (#14)

* add gpu memory logging support

* made improvements to GPU reference and result collation

* Renamed memory logging argument to reflect its readings as reserved memory using nvidia-smi and changed aggregation function in result collation

* variable renames

* manual linting

* added memory logging functionality via HFTrainer

* added support to benchmark memory using HFTrainer and updated README with explanation of the 2 memory benchmarking options

* addressed changes requested in PR #14

* fix bug and simplify gpu logs aggregation logic

* fixes to calculation of HFTrainer Mem Logging values

* fix calculations

* more fixes

* fix to avoid including stage0 alone in the max calculation of alloc memory

* more comments and README updates

* added fix to keyerror due to empty output dict from OOM

* manual linting

* added benchmark results to refs

* remove unnecessary columns in results gathering

* made changes to results gathering

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: achew010 <[email protected]>
@achew010 achew010 deleted the memory-benchmarks branch May 29, 2024 02:14
@fabianlim (Contributor) commented:
@achew010 can we move all the memory computation logic out of write_result into gather_report? That way results.json only holds the raw data, and gather_report holds all the logic to preprocess the data for human consumption.

@fabianlim (Contributor) commented Jun 1, 2024

@achew010 one more consideration: we should only have the huggingface mem probes in benchmark.csv, because command.sh cannot easily replay the nvidia-smi measurements. Actually, there may be more issues, because results.json is not even properly populated by command.sh.

Unless we have the tool do a proper replay and start nvidia-smi properly. Update: this is addressed in the commit below.

fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request Jun 1, 2024
fabianlim added a commit that referenced this pull request Jun 2, 2024
* refactor

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* fixes

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor mistral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add mixtral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* some refactoring after introducing mlp

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* remove extranous files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add bnb

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint + fmt and improvements to readme

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* bench fixes

* need to handle lora adapters device due to #26

* allow replay of failed benches, addressing comment in #14

* update benches (remove l40)

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Successfully merging this pull request may close these issues.

Add GPU measurements to Benchmark Script