
Extracted Subset of AutoGPTQ library into Accelerated-Peft Plugin #48

Merged: 20 commits into foundation-model-stack:main from extracted_autogptq on Jul 15, 2024

Conversation

@achew010 (Contributor) commented on Jul 3, 2024

Description

This PR addresses #38 and extracts a subset of GPTQModel, a refactored fork of AutoGPTQ, into fms_acceleration_peft/src/gptqmodel, doing away with the problematic installation of AutoGPTQ.

This is because:

  • AutoGPTQ hasn't had a release since Mar '24, and newer additions such as the triton quantization kernels can only be used with a custom installation from the main branch.
  • Installation from the main branch is unnecessarily slow: the bulk of the time is spent compiling the CUDA kernels, while this plugin only uses the triton kernel and some of the model code. Extracting the subset also makes it easier to test and build the plugin without dependencies like cudatoolkit.

Additions

  • new folder in src/gptqmodel containing the extracted code
  • modified the autogptq plugin to work with either the external AutoGPTQ library (if installed and available) or the local refactored subset gptqmodel (see the sketch after this list)
  • unit tests in tests/test_gptqmodel.py to ensure the extracted subset maintains the same behaviour as the original
  • fixes to the FOAK plugin for compatibility with the local extracted gptq package
  • a comparison tool to check new results against reference benchmarks
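
As a rough illustration of the compatibility point above, the backend selection can be sketched as below. This is a hedged sketch, not the plugin's actual code: the function name select_gptq_backend and the prefer_external flag are hypothetical, and the exact import path of the local subset may differ from what is shown.

# Illustrative sketch only: choose between the official auto_gptq package
# (if installed) and the locally extracted gptqmodel subset.
# Names here (select_gptq_backend, prefer_external) are hypothetical.
import importlib.util


def select_gptq_backend(prefer_external: bool = True):
    """Return the GPTQ implementation module to use."""
    if prefer_external and importlib.util.find_spec("auto_gptq") is not None:
        import auto_gptq  # official AutoGPTQ, only if the user installed it

        return auto_gptq

    # fall back to the local refactored subset extracted from GPTQModel
    # (exact import path is an assumption; see src/gptqmodel in this PR)
    import gptqmodel

    return gptqmodel

The real plugin decides this at load time; the sketch only shows the shape of the fallback.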

Issues:

  1. Comparing new benchmarks against our current reference scripts/benchmarks/ref, we noticed that a non-zero lora dropout incurs some memory overhead that makes experiments for large models run out of memory (elaborated in #50, Quantized Peft Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout). The comparison tool will pick up this difference in experiment results as an outlier, but will also flag the parameter change in the report.

  2. Temporary fix to the FOAK dequantization triton kernel to apply the offset only when the official AutoGPTQ package is used (a toy illustration of the offset mismatch follows this list).

    • Without the fix, the dequantization produces wrong base outputs that badly affect the loss when the FOAK plugin is used with the local gptqmodel package.
    • The reason is that FOAK runs a dequant function that is not compatible with the local package in accelerated-peft. The FOAK plugin currently maintains its own triton kernels for the GPTQ fused ops, e.g. dequantization (similar to the official AutoGPTQ kernels), but the offset was removed in the dequantization function of our local package (see here).
    • A more permanent fix would be for the FOAK plugin to rely on the accelerated-peft plugin to manage which dequantization function to use (local gptqmodel package or official AutoGPTQ) rather than maintaining a similar set of functions itself. #52 (Introduce a Better Dequantization Fix on Triton Function for FOAK Plugin's GPTQ Fused Operations) has been created to follow up on this.
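
To make the offset issue concrete, here is a toy, self-contained illustration (not the FOAK triton kernel itself) of how mixing the two zero-point conventions corrupts the dequantized weights. The helper name dequant and the sample values are made up, and the +1 convention is assumed from the description in the bullets above.

# Toy illustration of the zero-point offset mismatch (not the actual triton kernel).
# Assumption: the official AutoGPTQ kernels add +1 to the stored zero-points during
# dequantization, while the local gptqmodel subset stores the zero-points directly.
import numpy as np

def dequant(q, zeros, scales, zeros_need_plus_one: bool):
    if zeros_need_plus_one:
        zeros = zeros + 1  # undo the "stored as zero - 1" convention
    return (q.astype(np.float32) - zeros) * scales

q = np.array([[3, 7], [12, 0]], dtype=np.int32)    # toy 4-bit quantized values
scales = np.array([0.1, 0.2], dtype=np.float32)    # toy per-column scales
zeros_local = np.array([8, 8], dtype=np.int32)     # local convention: actual zero-points
zeros_official = zeros_local - 1                   # official convention: zero-points minus one

w_ok = dequant(q, zeros_local, scales, zeros_need_plus_one=False)
w_also_ok = dequant(q, zeros_official, scales, zeros_need_plus_one=True)
assert np.allclose(w_ok, w_also_ok)

# Mixing the conventions (what the FOAK kernel effectively did before the fix)
# shifts every weight by one scale step and produces wrong base outputs.
w_wrong = dequant(q, zeros_local, scales, zeros_need_plus_one=True)
assert not np.allclose(w_ok, w_wrong)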

Benchmarks

There seems to be an improvement in throughput with the new library on FOAK. The tables below compare the throughput from our reference against the updated benches on Mistral-7B-GPTQ and Llama2-70B-GPTQ.

  1. We see throughput similar to the previous reference for the accelerated-peft-autogptq plugin.

  2. We see higher throughput on the FOAK rows (relative gains are computed in the snippet after this list):

  • Mistral-7B-GPTQ

    • 3954 -> 4101 tokens/s (1 gpu, bs=4)
    • 3911 -> 4040 tokens/s (2 gpu, bs=4)
  • Llama2-70B-GPTQ

    • 500 -> 554 tokens/s (1 gpu, bs=4)
    • 496 -> 551 tokens/s (2 gpu, bs=4)
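
As a quick sanity check on the figures quoted in the bullets above, the relative gains work out to roughly 3-4% on Mistral-7B-GPTQ and about 11% on Llama2-70B-GPTQ:

# Relative FOAK throughput gains based on the tokens/s figures quoted above.
pairs = {
    "Mistral-7B-GPTQ, 1 gpu, bs=4": (3954, 4101),
    "Mistral-7B-GPTQ, 2 gpu, bs=4": (3911, 4040),
    "Llama2-70B-GPTQ, 1 gpu, bs=4": (500, 554),
    "Llama2-70B-GPTQ, 2 gpu, bs=4": (496, 551),
}
for name, (ref, new) in pairs.items():
    print(f"{name}: {100 * (new - ref) / ref:.1f}% faster")
# prints roughly 3.7%, 3.3%, 10.8%, 11.1%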

Mistral-7B-GPTQ

Reference

| model_name_or_path | framework_config | num_gpus | batch_size | tokens_per_second | mem_alloc_in_GIB | peak_mem_alloc_in_GIB |
| --- | --- | --- | --- | --- | --- | --- |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq | 1 | 4 | 3332 | 4.87 | 15.8 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq | 2 | 4 | 3236 | 2.74 | 16.0 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq-foak | 1 | 4 | 3954 | 4.87 | 13.6 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq-foak | 2 | 4 | 3911 | 2.74 | 15.6 |

Updated

| model_name_or_path | framework_config | num_gpus | batch_size | tokens_per_second | mem_alloc_in_GIB | peak_mem_alloc_in_GIB |
| --- | --- | --- | --- | --- | --- | --- |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq | 1 | 4 | 3404 | 4.87 | 15.9 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq | 2 | 4 | 3293 | 2.79 | 16.7 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq-foak | 1 | 4 | 3965 | 4.87 | 13.6 |
| TheBloke/Mistral-7B-v0.1-GPTQ | accelerated-peft-autogptq-foak | 2 | 4 | 3944 | 2.79 | 16.1 |

Llama2-70B-GPTQ

Reference

| model_name_or_path | framework_config | num_gpus | batch_size | tokens_per_second | mem_alloc_in_GIB | peak_mem_alloc_in_GIB |
| --- | --- | --- | --- | --- | --- | --- |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 1 | 4 | 450 | 36.2 | 65.8 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 4 | 444 | 18.1 | 70.1 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq-foak | 1 | 4 | 500 | 36.2 | 65.0 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq-foak | 2 | 4 | 496 | 18.1 | 69.4 |

Updated

| model_name_or_path | framework_config | num_gpus | batch_size | tokens_per_second | mem_alloc_in_GIB | peak_mem_alloc_in_GIB |
| --- | --- | --- | --- | --- | --- | --- |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 1 | 4 | 455 | 36.2 | 67.2 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 4 | 446 | 18.3 | 71.7 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq-foak | 1 | 4 | 499 | 36.2 | 66.1 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq-foak | 2 | 4 | 495 | 18.1 | 70.5 |

Unit Tests

=================================================================================================================== test session starts ===================================================================================================================
platform linux -- Python 3.10.12, pytest-8.2.2, pluggy-1.5.0
rootdir: /data/aaron/experimental/fms-acceleration/plugins/accelerated-peft
configfile: pyproject.toml
collected 7 items                                                                                                                                                                                                                                         

tests/test_gptqmodel.py ..                                                                                                                                                                                                                          [ 28%]
tests/test_peft_plugins.py ..                                                                                                                                                                                                                       [ 57%]
tests/test_q4_triton.py ..                                                                                                                                                                                                                          [ 85%]
tests/test_triton.py .                                                                                                                                                                                                                              [100%]

==================================================================================================================== warnings summary =====================================================================================================================
.tox/py/lib/python3.10/site-packages/transformers/utils/hub.py:124
  /data/aaron/experimental/fms-acceleration/plugins/accelerated-peft/.tox/py/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
    warnings.warn(

tests/test_gptqmodel.py::test_pre_quantized_model_outputs_match
tests/test_gptqmodel.py::test_quantizing_pretrained_model_outputs_match
tests/test_gptqmodel.py::test_quantizing_pretrained_model_outputs_match
tests/test_q4_triton.py::TestsQ4Triton::test_generation_desc_act_false
tests/test_q4_triton.py::TestsQ4Triton::test_generation_desc_act_true
tests/test_triton.py::TestTriton::test_triton_qlinear
  /data/aaron/experimental/fms-acceleration/plugins/accelerated-peft/.tox/py/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(

tests/test_gptqmodel.py::test_pre_quantized_model_outputs_match
  /data/aaron/experimental/fms-acceleration/plugins/accelerated-peft/.tox/py/lib/python3.10/site-packages/auto_gptq/utils/peft_utils.py:360: UserWarning: You can just ignore this warning if the peft type you use isn't in ['LORA', 'ADALORA'].
  LlamaGPTQForCausalLM supports injecting fused attention but not enables this time. If you are training adapters, you must also disable fused attention injection when loading quantized base model at inference time, otherwise adapters may not be added to base model properly. If you are loading adapters to do inference, you can reference to adapter's config file to check whether the adapters are trained using base model that not enable fused attention injection.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== 7 passed, 8 warnings in 130.42s (0:02:10) ========================================================================================================

Comparison Tool

The tool compares a set of benchmark results against a previous reference. It generates a chart for every metric compared (e.g. train_loss, train_tokens_per_second, mem_alloc...) as well as a csv file of outliers that differ significantly from the reference. A rough sketch of the comparison logic is shown after the usage example below.

Usage

python scripts/benchmarks/compare_with_reference.py \
--result_dir $BENCHMARK_RESULTS_DIR \
--reference_benchmark_filepath $REFERENCE_BENCHMARK_CSV_FILEPATH
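
For readers unfamiliar with the tool, the core comparison can be pictured roughly as in the sketch below. This is an illustrative snippet, not the actual compare_with_reference.py implementation: the merge keys, the metric list, and the 10% threshold are assumptions.

# Illustrative sketch of comparing new benchmark results against a reference
# and flagging outliers; not the actual compare_with_reference.py logic.
import pandas as pd

KEYS = ["model_name_or_path", "framework_config", "num_gpus", "batch_size"]
METRICS = ["train_loss", "train_tokens_per_second"]  # assumed subset of compared metrics
THRESHOLD = 0.10  # assumed: flag metrics that deviate by more than 10%

def find_outliers(reference_csv: str, new_csv: str) -> pd.DataFrame:
    ref = pd.read_csv(reference_csv)
    new = pd.read_csv(new_csv)
    merged = ref.merge(new, on=KEYS, suffixes=("_reference", "_new"))
    flagged = []
    for metric in METRICS:
        rel_diff = (
            (merged[f"{metric}_new"] - merged[f"{metric}_reference"]).abs()
            / merged[f"{metric}_reference"].abs()
        )
        mask = rel_diff > THRESHOLD
        rows = merged.loc[mask, KEYS].copy()
        rows["metric"] = metric
        rows["reference"] = merged.loc[mask, f"{metric}_reference"].to_numpy()
        rows["new"] = merged.loc[mask, f"{metric}_new"].to_numpy()
        flagged.append(rows)
    return pd.concat(flagged, ignore_index=True)

# e.g. find_outliers("scripts/benchmarks/ref/benchmarks.csv", "new_benchmarks.csv")
#      .to_csv("outliers.csv", index=False)  # file names here are hypothetical

The actual tool additionally appends any hyperparameters that changed between the two runs, as described in the Table section below.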

Chart:

Generally we see that the new benchmark results from the extracted gptq package (New axis) match closely with those of the previous benchmark using the official autogptq package (Ref axis).

Table:

In the table below, the values in the reference column are those seen in previous benchmarks and the values in the new column are those seen in the current benchmark. Outliers will show a significant difference between the two columns. The outliers seen below are reported due to the OOM issue in #50.

Note: Any hyperparameter that differs between the new bench results and the reference will appear as the rightmost columns, appended after reference and new.

outlier.csv

@fabianlim (Contributor) left a comment:

it looks quite good overall, but requesting first round of changes

Review threads (resolved):

  • plugins/accelerated-peft/pyproject.toml
  • plugins/accelerated-peft/requirements.txt
  • plugins/accelerated-peft/src/gptqmodel/__init__.py
  • plugins/accelerated-peft/tests/test_gptqmodel.py
  • plugins/accelerated-peft/src/gptqmodel/utils/peft.py
  • plugins/framework/tox.ini
@achew010 marked this pull request as ready for review on July 4, 2024 09:29
@fabianlim (Contributor) commented on Jul 4, 2024:

@achew010 this needs formatting, and some bench results

@fabianlim (Contributor) left a comment:

approved

@fabianlim merged commit c3a069c into foundation-model-stack:main on Jul 15, 2024 (4 checks passed).
@achew010 mentioned this pull request on Jul 15, 2024.
@achew010 deleted the extracted_autogptq branch on July 26, 2024.
fabianlim added a commit that referenced this pull request on Nov 5, 2024.

fabianlim added a commit that referenced this pull request on Nov 8, 2024:

  * remove skip on test now #48 is complete
  * fix fusedops test
  * fix model patching in test
  * fix test to tail on input grads
  * fix dropout in fused_lora
  * fmt + lint

Signed-off-by: Yu Chin Fabian Lim <[email protected]>