Provide Memory Benchmarking Feature to Benchmarking Code #14
Conversation
Will be good to put a link to this issue somewhere in the code so that we can view the gpu log schema.
@achew010 I approved. After we update the csv we can merge. Also, can you run a …
We have noted that the memory keys should be renamed; to be addressed later in #19.
Referenced commit (squash merge): …or GPTQ-LoRA (#20), containing:

- Add GitHub Workflow for Linting, Formatting and Test. Activate Workflow for Framework (#7): add lint workflow; add pylintrc, update .tox fix files; activate test and minor fix; lint benchmarks.py and add workflow to dev
- Improvements to Benchmark Scripts and Config Generation Workflow (#13): fix benches and add verify configs; update readme and add workflow; add packaging dep; update torch dep in framework and run-benches; take host env in run-benches; add display bench results script; rename summary.csv to raw_summary.csv and update run_benchmarks.sh; export environment variables in shell command; dump out pip requirements for repro, and add default FHT_branch
- Added support for running official HF baseline FSDP-QLoRA benchmark (#16): new baseline scenario; rename variables; added warning when plugin allows SFTTrainer to handle PEFT on single device
- Fix FSDP when performing GPTQ-LoRA with Triton V2 (#15): wrap in parameters and torch view to correct dtype; refactor to apply patch only on FSDP and simplify
- Provide Memory Benchmarking Feature to Benchmarking Code (#14): add gpu memory logging support; made improvements to GPU reference and result collation; renamed memory logging argument to reflect its readings as reserved memory using nvidia-smi and changed aggregation function in result collation; variable renames; manual linting; added memory logging functionality via HFTrainer; added support to benchmark memory using HFTrainer and updated README with explanation of the 2 memory benchmarking options; addressed changes requested in PR #14; fix bug and simplify gpu logs aggregation logic; fixes to calculation of HFTrainer Mem Logging values; fix calculations; more fixes; fix to ignore including stage inside max calculation of alloc memory; more comments and README updates; added fix to keyerror due to empty output dict from OOM; manual linting; added benchmark results to refs; remove unnecessary columns in results gathering; made changes to results gathering
@achew010 can we move all the memory computation logic out of …
@achew010 also one more consideration is that for memory we should only have … Or unless we have the tool do a proper replay and start the …
Referenced commit (squash merge): refactor; fixes; refactor mistral; add mixtral; some refactoring after introducing mlp; remove extraneous files; add bnb; lint + fmt and improvements to readme; bench fixes; need to handle lora adapters device due to #26; allow replay of failed benches, addressing comment in #14; update benches (remove l40)
Description
This PR adds GPU memory logging features to the benchmark script, as described in #8, along with an updated benchmark README for usage instructions.
There are two approaches to logging memory, described under Usage below.
Note: Issue #19 has been created to address the grouping of memory values under a common prefix and will be addressed in the future.
Usage
1. Nvidia's SMI CLI tool
Set environment variable `MEMORY_LOGGING=nvidia` to use `run_benchmarks.sh` with nvidia logging (see the sketch at the end of this subsection).

For each experiment's `subprocess.run` call, an async `nvidia-smi` process is opened to monitor only the gpu indices in `$CUDA_VISIBLE_DEVICES` and log to `FILE_MEM` inside `Experiment.save_dir`. Once the `subprocess` call is completed, the async process is terminated. The memory readings are then recorded under `gpu_mem` in the main result logging function `Experiment.write_results`.
Each experiment directory will have a gpu log with lines of the form `<Timestamp>, <GPU Name>, <GPU ID>, <GPU Memory Used>`.

The memory readings will be reflected in the results `raw_summary.csv` under the column `nvidia_mem_reserved`, where the raw values are reported in MiB.
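A minimal sketch of how such an async `nvidia-smi` monitor can be launched and torn down around an experiment (illustrative only; the log file name, polling interval, and function names here are assumptions rather than the exact internals of `run_benchmarks.sh`):

```python
import os
import subprocess

def start_gpu_monitor(save_dir: str, interval_secs: int = 1) -> subprocess.Popen:
    """Launch nvidia-smi in the background to log memory used by the visible GPUs."""
    gpu_ids = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    log_path = os.path.join(save_dir, "gpu_memory_logs.csv")  # stand-in for FILE_MEM
    log_file = open(log_path, "w")
    return subprocess.Popen(
        [
            "nvidia-smi",
            "--query-gpu=timestamp,name,index,memory.used",
            "--format=csv",
            "-l", str(interval_secs),  # poll every interval_secs seconds
            "-i", gpu_ids,             # restrict to GPUs in CUDA_VISIBLE_DEVICES
        ],
        stdout=log_file,
    )

# monitor = start_gpu_monitor(experiment_save_dir)
# subprocess.run(experiment_command)   # run the actual experiment
# monitor.terminate()                  # stop logging once the experiment exits
```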
2. Torch CUDA through Huggingface's HFTrainer API
Set environment variable `MEMORY_LOGGING=huggingface` to use `run_benchmarks.sh` with huggingface logging (default).

HFTrainer has a feature to log memory through the `skip_memory_metrics=False` training argument. Its documentation mentions that setting this argument to `False` will affect training speed. In our tests so far (below), we do not see a significant difference in throughput (tokens/sec) when using this argument.

A set of fine-grained GPU readings will show up as additional columns in the results `raw_summary.csv`, where the raw values are reported in bytes.
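A minimal sketch of enabling and reading these metrics through the Trainer API (model and dataset setup are omitted; treat those variable names as placeholders):

```python
from transformers import Trainer, TrainingArguments

# skip_memory_metrics defaults to True; setting it to False makes the Trainer
# record GPU memory deltas around __init__, train, evaluate and predict.
args = TrainingArguments(
    output_dir="results",  # placeholder output directory
    skip_memory_metrics=False,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
train_output = trainer.train()

# Memory metrics (reported in bytes) appear alongside the usual training metrics.
memory_metrics = {k: v for k, v in train_output.metrics.items() if "mem" in k}
print(memory_metrics)
```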
3. Log Both
Set environment variable `MEMORY_LOGGING=all` to use `run_benchmarks.sh` with both logging methods.
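A hypothetical sketch of how a script could branch on this variable (the names below are illustrative, not the benchmark code's actual internals):

```python
import os

# Default matches the benchmark's default of huggingface logging.
memory_logging = os.environ.get("MEMORY_LOGGING", "huggingface")
use_nvidia_smi_logging = memory_logging in ("nvidia", "all")
use_hf_trainer_logging = memory_logging in ("huggingface", "all")
```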
4. Difference between Nvidia-SMI Utility and Torch CUDA through HFTrainer API
1. The Nvidia-SMI Utility is a coarse measurement tool that captures anything that takes up GPU memory. It is simple and non-intrusive, as it doesn't involve probing the trainer. It uses the NVML library to fetch reserved memory for each device ID.
Note: To get accurate measurements, no other processes should be running on the device apart from the target process itself.
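For illustration, the kind of per-device reading behind such a measurement can also be fetched through NVML's Python bindings (a sketch only; the benchmark script itself relies on the `nvidia-smi` CLI):

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # mem.used counts everything resident on the device, not just our process.
        print(f"GPU {i}: {mem.used / 1024**2:.0f} MiB used / {mem.total / 1024**2:.0f} MiB total")
finally:
    pynvml.nvmlShutdown()
```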
2. The HFTrainer API is a more precise tool that logs memory usage for a few specific operations inside HFTrainer.

It uses `torch.cuda.memory_allocated` to probe the trainer, taking snapshots of allocated memory and storing the difference between the before and after of each stage. The following stages are probed: `Trainer.__init__`, `Trainer.train`, `Trainer.evaluate`, `Trainer.predict`.

Note: Any GPU memory accessed and used outside of these stages, or not part of HFTrainer, will not be tracked. If the training script does not use the Huggingface trainer, then this API will not work either.
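A simplified sketch of the snapshot-delta idea, using `torch.cuda.memory_allocated` (an illustration of the technique, not the actual `TrainerMemoryTracker` implementation):

```python
import torch

def measure_stage(stage_fn, *args, **kwargs):
    """Run one stage and report how much GPU memory it left allocated, plus its peak."""
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    result = stage_fn(*args, **kwargs)
    alloc_delta = torch.cuda.memory_allocated() - before
    peaked_delta = torch.cuda.max_memory_allocated() - before
    return result, {"alloc_delta": alloc_delta, "peaked_delta": peaked_delta}

# e.g. wrapping the training stage of a Trainer instance:
# train_output, train_mem = measure_stage(trainer.train)
```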
Note: Details on Memory Calculations from HFTrainer for GPTQ-LoRA + FSDP
This is an example of the memory values that HFTrainer will produce in the outputs of `train()`.
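For illustration, such metrics have the following shape (the numbers below are made up; the key names follow HFTrainer's memory tracker, with values in bytes):

```python
# Illustrative only -- every value here is invented.
metrics = {
    "before_init_mem_gpu": 1_600_000_000,         # stage0: memory held before Trainer.__init__
    "init_mem_gpu_alloc_delta": 300_000_000,      # stage1: allocated during __init__
    "init_mem_gpu_peaked_delta": 50_000_000,      # stage1: transient peak during __init__
    "train_mem_gpu_alloc_delta": 4_200_000_000,   # stage2: allocated during train()
    "train_mem_gpu_peaked_delta": 7_500_000_000,  # stage2: transient peak during train()
}
```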
We refer to the keys of the memory metrics in this order:

- `before_init_mem_X` as stage0
- `init_mem_X` as stage1
- `train_mem_X` as stage2

We currently compute the memory values in the report by taking the largest of sums (see the sketch below). For example:

- For the allocated memory value
- For the peak memory value
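A hedged sketch of what such a "largest of sums" computation could look like in terms of the stage values above (the exact expressions used by the benchmark script are not reproduced here, so treat this purely as an illustration):

```python
def summarize_memory(stage0_mem, stage1_mem, stage2_mem, stage1_peaked, stage2_peaked):
    """Illustrative largest-of-sums summary; stage0_mem is never taken on its own."""
    alloc_mem = max(
        stage0_mem + stage1_mem,
        stage0_mem + stage1_mem + stage2_mem,
    )
    peak_mem = max(
        stage0_mem + stage1_mem + stage1_peaked,
        stage0_mem + stage1_mem + stage2_mem + stage2_peaked,
    )
    return alloc_mem, peak_mem
```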
Notice that we do not include `stage0_mem` alone when computing the max value. This is to avoid misleading comparisons between GPTQ-LoRA and other approaches that support low-memory mode. GPTQ-LoRA + FSDP currently does not support low-memory mode, as mentioned in #18.

The `stage0_mem` value of GPTQ-LoRA + FSDP will reflect a larger value, since the model is loaded fully before the trainer is initialized and is only subsequently sharded internally in `trainer.prepare`. This might cause misleading comparisons when other variants are loaded in low-memory mode and have smaller `stage0_mem` consumption than GPTQ-LoRA + FSDP before its sharding. Once low-memory mode is supported for GPTQ-LoRA, we will include `stage0_mem` back inside the max computation.

Tests
Memory Measurement Accuracy and Potential Side Effects
1. No Significant Slowdown From Using HFTrainer Memory Probes API on QLoRA Training
For both the Mistral7B and Mixtral models, introducing the memory probes does not show a significant impact on the throughput of the training run (50 steps). Generally, with larger batch sizes and models, the overhead of memory logging becomes insignificant.
A. <100 toks/sec slowdown after introducing the memory probes for Mistral:

| gpus | device batch size | with no mem probe (toks/sec) | with mem probe (toks/sec) |
| --- | --- | --- | --- |
B. <100 toks/sec slowdown after introducing the memory probes for Mixtral:

| gpus | device batch size | with no mem probe (toks/sec) | with mem probe (toks/sec) |
| --- | --- | --- | --- |
2. Torch/HF shows more granular memory usage, reporting both peak memory and actual allocated memory, than Nvidia's reserved memory. This is more helpful when analyzing the actual memory allocated for each model.

We compare the two memory tracking methods (Nvidia vs Torch/HF) on single devices for both GPTQ-LoRA and QLoRA. Nvidia's `peak mem reserved` reports larger values than Torch/HF `peak mem alloc`, while `torch mem alloc` shows that the actual memory usage is lower.

| gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
| --- | --- | --- | --- | --- |
3. Memory Usage Decreases on Distributed Finetuning
When running large models on multiple devices, `torch mem alloc` shows that memory usage decreases as the models are sharded (compare with the table above).

| gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
| --- | --- | --- | --- | --- |
Verified that `torch mem alloc` for GPTQ-LoRA on Llama2-70B hovers at 19 GiB when sharded after `trainer.prepare` and during training. The values are similar to the manually probed values from #15.

| gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) |
| --- | --- | --- | --- | --- |
Benchmarks
Run `tox -e run_benches` to produce benchmarks. Full benchmark details can be referenced here.

4. For small models, LoRA runs faster than the Quantized PEFT methods. One likely reason is that it does not require an additional dequantization operation before the base layer + LoRA matmuls. That said, we also observe that it consumes significantly more memory than the Quantized PEFT methods.

| Type | Config Type | gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
5. We observe that in single-device finetuning of larger models (e.g. the 49B Mixtral), PEFT begins to run out of memory while the Quantized PEFT methods continue to maintain low memory consumption.

| Type | Config Type | gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
6. In distributed finetuning for large models like Llama2-70B, GPTQ-LoRA shows the lowest memory consumption with the same throughput.
| Type | Config Type | gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
7. When increasing the batch size, GPTQ-LoRA is the only experiment that does not run out of memory.

| Type | Config Type | gpus | device batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |