LLMEasyQuant is a package for easy quantization deployment in LLM applications. Existing packages such as TensorRT and Quanto rely on many layers of internal structure and self-invoking internal functions, which makes them hard to customize and hard to learn from when deploying models. LLMEasyQuant is developed to address this problem.
Authors: Dong Liu, Meng Jiang, Kaiser Pister
import torch  # only needed if the CUDA device check below is enabled
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
# Set device to CPU for now
device = 'cpu'
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load model and tokenizer
model_id = 'gpt2' # 137m F32 params
# model_id = 'facebook/opt-1.3b' # 1.3b f16 params
# model_id = 'mistralai/Mistral-7B-v0.1' # 7.24b bf16 params, auth required
# model_id = 'meta-llama/Llama-2-7b-hf' # auth required
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
model_int8.name_or_path += "_int8"
absmax
quantizers = []  # collects each Quantizer so they can all be run and compared later
absq = Quantizer(model, tokenizer, absmax_quantize)
quantizers.append(absq)
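As a rough illustration of what absmax quantization does, the sketch below scales values by their largest absolute value so they fit the signed 8-bit range, rounds them, and then dequantizes. It is a minimal stand-alone sketch on a plain Python list, not the package's `absmax_quantize`; the helper name `absmax_quantize_list` and the `bits` parameter are hypothetical.

```python
def absmax_quantize_list(weights, bits=8):
    """Symmetric absmax quantization of a flat list of floats (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto qmax; fall back to 1.0 for an all-zero input.
    scale = qmax / max_abs if max_abs > 0 else 1.0
    quantized = [round(w * scale) for w in weights]
    dequantized = [q / scale for q in quantized]
    return quantized, dequantized


weights = [0.4, -1.0, 0.25, 0.0]
q, dq = absmax_quantize_list(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dq))
print(q, max_error)
```

Because the range is symmetric around zero, the value 0.0 always maps exactly to the integer 0, but part of the integer range is wasted when the values are skewed to one side.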
zeropoint
zpq = Quantizer(model, tokenizer, zeropoint_quantize)
quantizers.append(zpq)
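For comparison, zero-point quantization maps the full `[min, max]` range of the values onto the integer range and shifts it with an offset, which helps when the values are not centered on zero. The sketch below is a minimal stand-alone version on a plain Python list, not the package's `zeropoint_quantize`; the helper name `zeropoint_quantize_list` and its return values are hypothetical.

```python
def zeropoint_quantize_list(weights, bits=8):
    """Asymmetric (zero-point) quantization of a flat list of floats (illustrative only)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128 and 127 for 8 bits
    w_min, w_max = min(weights), max(weights)
    value_range = w_max - w_min
    # Stretch the observed range over all integer levels; 1.0 guards a constant input.
    scale = (qmax - qmin) / value_range if value_range > 0 else 1.0
    # The zero point is the integer offset that sends w_min to qmin.
    zero_point = round(qmin - w_min * scale)
    quantized = [min(qmax, max(qmin, round(w * scale + zero_point))) for w in weights]
    dequantized = [(q - zero_point) / scale for q in quantized]
    return quantized, dequantized, scale, zero_point


weights = [-1.0, 0.0, 0.5, 3.0]
q, dq, scale, zero_point = zeropoint_quantize_list(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dq))
print(q, zero_point, max_error)
```

Here the smallest value lands on -128 and the largest on 127, so the skewed input uses every available integer level.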
smoothquant
smooth_quant = SmoothQuantMatrix(alpha=0.5)
smoothq = Quantizer(model, tokenizer, smooth_quant.smooth_quant_apply)
quantizers.append(smoothq)
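The `alpha=0.5` argument above corresponds to the migration strength in the published SmoothQuant formulation, where each input channel `j` gets a factor `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. Activations are divided by `s_j` and weights multiplied by it, so the product is unchanged while activation outliers shrink. The sketch below illustrates that idea on nested Python lists; it is an assumption about the general technique, not the internals of `SmoothQuantMatrix`, and the helpers `smooth_activations_and_weights` and `matmul` are hypothetical.

```python
def smooth_activations_and_weights(x, W, alpha=0.5):
    """Per-channel smoothing sketch.

    x: activations, one row per token, each of length n_in.
    W: weights, n_in rows of length n_out.
    Assumes every input channel has a nonzero activation and weight maximum.
    """
    n_in = len(W)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    x_smooth = [[row[j] / s[j] for j in range(n_in)] for row in x]
    W_smooth = [[v * s[j] for v in W[j]] for j in range(n_in)]
    return x_smooth, W_smooth, s


def matmul(a, b):
    """Plain nested-list matrix product, used only to check equivalence."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row))
             for j in range(len(b[0]))] for row in a]


x = [[1.0, 100.0], [-2.0, 50.0]]      # second channel carries an outlier
W = [[0.5, -0.25], [0.1, 0.2]]
x_smooth, W_smooth, s = smooth_activations_and_weights(x, W)
y, y_smooth = matmul(x, W), matmul(x_smooth, W_smooth)
print(s, x_smooth)
```

The outputs `y` and `y_smooth` agree up to floating-point error, while the largest activation in the outlier channel is much smaller after smoothing, which makes the activations easier to quantize.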
simquant
simq = Quantizer(model, tokenizer, sim_quantize)
quantizers.append(simq)
symquant, zeroquant, and knowledge distillation of each
symq = Quantizer(model, tokenizer, sym_quantize_8bit)
zeroq = Quantizer(model, tokenizer, sym_quantize_8bit, zeroquant_func)
quantizers.extend([symq, zeroq])
AWQ
awq = Quantizer(model, tokenizer, awq_quantize)
quantizers.append(awq)
BiLLM
billmq = Quantizer(model, tokenizer, billm_quantize)
quantizers.append(billmq)
QLora
qloraq = Quantizer(model, tokenizer, qlora_quantize)
quantizers.append(qloraq)
for q in quantizers:
    q.quantize()
dist_plot([model, model_int8] + [q.quant for q in quantizers])
generated = compare_generation([model, model_int8] + [q.quant for q in quantizers], tokenizer, max_length=200, temperature=0.8)
ppls = compare_ppl([model, model_int8] + [q.quant for q in quantizers], tokenizer, list(generated.values()))
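Perplexity is commonly defined as the exponential of the average negative log-likelihood per token, so lower values mean the model assigns higher probability to the text. The sketch below shows that calculation on a list of per-token log-probabilities. How `compare_ppl` obtains and aggregates them is not shown in this README, so the helper `perplexity` is a hypothetical stand-in rather than the package's implementation.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
ppl_uniform = perplexity([math.log(0.25)] * 4)

# Higher per-token probabilities give a lower perplexity.
ppl_confident = perplexity([math.log(0.9), math.log(0.8), math.log(0.95)])
print(ppl_uniform, ppl_confident)
```

When comparing a quantized model against the original, a small increase in perplexity on the same text suggests the quantization has preserved most of the model's predictive quality.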
In this work, we develop LLMEasyQuant, a user-friendly package for easy quantization deployment that remains practical when computational resources are limited.
Feature/Package | AWQ | BiLLM | QLora | TensorRT | Quanto | LLMEasyQuant |
---|---|---|---|---|---|---|
Hardware Requirements | GPU required | GPU required | GPU required | GPU required | GPU required | Supports CPU and GPU |
Deployment Steps | Multiple complex steps | Detailed setup and tuning required | Intricate steps and parameter adjustments | Complex setup with CUDA dependencies | Complex setup with multiple dependencies | Streamlined, minimal setup, includes AWQ, BiLLM, QLora |
Quantization Methods | Manual adjustments and configurations | Detailed configurations needed | Specific configurations for each method | Limited to specific optimizations | Limited to specific optimizations | Variety of methods with simple interface, includes AWQ, BiLLM, QLora |
Supported Methods | AWQ | BiLLM | QLora | TensorRT-specific methods | Quanto-specific methods | Absmax, Zeropoint, SmoothQuant, SimQuant, SymQuant, ZeroQuant, AWQ, BiLLM, QLora |
Integration Process | Complex library installation and setup | Extensive documentation and dependencies | Intricate library setup | Requires integration with NVIDIA stack | Requires integration with specific frameworks | Simple integration with transformers |
Visualization Tools | Additional setup required | Additional setup required | Additional setup required | External tools needed | External tools needed | Built-in visualization functions |
Performance Analysis | External tools needed | External tools needed | External tools needed | External tools needed | External tools needed | Built-in performance analysis functions |
- Hardware Flexibility: Supports both CPU and GPU, providing flexibility for developers with different hardware resources.
- Simplified Deployment: Requires minimal setup steps, making it user-friendly and accessible.
- Comprehensive Quantization Methods: Offers a wide range of quantization methods, including AWQ, BiLLM, and QLora, with easy-to-use interfaces.
- Built-in Visualization and Analysis: Includes tools for visualizing and comparing model performance, simplifying the evaluation process.
If you find LLMEasyQuant useful or relevant to your project or research, please cite our paper:
@misc{liu2024llmeasyquanteasyuse,
title={LLMEasyQuant -- An Easy to Use Toolkit for LLM Quantization},
author={Dong Liu and Meng Jiang and Kaiser Pister},
year={2024},
eprint={2406.19657},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2406.19657},
}