This directory demonstrates how to chain multiple optimization techniques, namely pruning, distillation, and quantization, to achieve the best performance on a given model.

This example shows how to compress a Hugging Face BERT large model for Question Answering using a combination of `modelopt.torch.prune`, `modelopt.torch.distill`, and `modelopt.torch.quantize`. More specifically, we will:
- Prune the BERT large model to 50% FLOPs with the GradNAS algorithm and fine-tune it with distillation (a minimal distillation sketch follows this list)
- Quantize the fine-tuned model to INT8 precision with Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with distillation
- Export the quantized model to ONNX format for deployment with TensorRT
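Both the pruning and quantization steps recover accuracy by fine-tuning with distillation, using the original unpruned model as the teacher. Below is a minimal sketch of how a student can be wrapped for knowledge distillation with `modelopt.torch.distill`; the variable names and the exact criterion/balancer choices are illustrative assumptions, and the real configuration lives in the example script.

```python
import modelopt.torch.distill as mtd

# Sketch: wrap a student for knowledge distillation. `student_model` and
# `teacher_model` are placeholders for the pruned/quantized BERT and the
# original fine-tuned BERT, respectively.
kd_config = {
    "teacher_model": teacher_model,             # frozen, unpruned teacher
    "criterion": mtd.LogitsDistillationLoss(),  # soft-target loss on logits
    "loss_balancer": mtd.StaticLossBalancer(),  # combines task and KD losses
}
student_model = mtd.convert(student_model, mode=[("kd_loss", kd_config)])

# Inside the training loop, the combined loss comes from the wrapper:
#   loss = student_model.compute_kd_loss(student_loss=task_loss)
```

The wrapper runs the teacher alongside the student on every forward pass, so the existing training loop can be reused with only the loss computation changed.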
The main Python file is `bert_prune_distill_quantize.py`, and the scripts for running all 3 steps are available in the `scripts` directory.
NOTE: This example has been tested on 8 x 24GB A5000 GPUs with PyTorch 2.4 and CUDA 12.4. It takes about 2 hours to complete all the stages of the optimization. Most of the time is spent on fine-tuning and QAT.
Install Model Optimizer with the optional torch and huggingface dependencies:

```bash
pip install "nvidia-modelopt[torch,hf]" --extra-index-url https://pypi.nvidia.com
```
To run the example, execute the following scripts in order:
- First, we prune the BERT large model to 50% FLOPs with the GradNAS algorithm. Then, we fine-tune the pruned model with distillation from the unpruned teacher model to recover over 99% of the initial F1 score (93.15). We recommend using multiple GPUs for fine-tuning. Note that we fine-tune for more epochs than the 2 epochs originally used to fine-tune BERT without distillation: distillation needs more epochs to converge, but it achieves much better results.

  ```bash
  bash scripts/1_prune.sh
  ```
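For reference, GradNAS pruning boils down to a single `mtp.prune` call. The sketch below follows the documented `modelopt.torch.prune` API; the variable names (`model`, `dummy_input`, `train_loader`, `loss_func`) are placeholders, and the exact config used by the script may differ.

```python
import modelopt.torch.prune as mtp

# Sketch: prune BERT-large to half its FLOPs with GradNAS. GradNAS ranks
# prunable components by gradient sensitivity, so it takes a data loader
# and a loss function rather than a validation-score function.
pruned_model, _ = mtp.prune(
    model=model,                      # fine-tuned BERT-large QA model
    mode="gradnas",
    constraints={"flops": "50%"},     # target: 50% of the original FLOPs
    dummy_input=dummy_input,          # example inputs used for tracing
    config={
        "data_loader": train_loader,  # a few batches suffice for scoring
        "loss_func": loss_func,       # maps (model output, batch) -> loss
    },
)
```

The pruned model is then fine-tuned with distillation as sketched earlier to recover the lost accuracy.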
- Quantize the fine-tuned model to INT8 precision and run calibration (PTQ). Note that PTQ results in a slight drop in F1 score, which we then recover with QAT. The QAT step again uses distillation from the unpruned teacher model.

  ```bash
  bash scripts/2_int8_quantize.sh
  ```
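For reference, PTQ with `modelopt.torch.quantize` is a single `mtq.quantize` call that inserts fake-quantization ops and calibrates their ranges with a user-supplied forward loop. A minimal sketch, with `model` and `calib_loader` as placeholders:

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a modest number of calibration batches through the model so the
    # inserted quantizers can collect activation statistics.
    for batch in calib_loader:
        model(**batch)

# Insert INT8 fake-quantization ops and calibrate them (PTQ).
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```

After calibration, QAT is simply continued fine-tuning of this quantized model (here, again with distillation from the teacher); the fake-quantization ops support backpropagation, so the same training loop works unchanged.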
- Export the quantized model to ONNX format for deployment with TensorRT.

  ```bash
  bash scripts/3_onnx_export.sh
  ```
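ModelOpt's fake-quantization ops are exported as ONNX Q/DQ nodes that TensorRT can consume, so the export can go through the standard `torch.onnx.export` path. A minimal sketch; the input shapes, file name, and output names below are illustrative assumptions rather than the script's exact settings:

```python
import torch

model.eval()
# For Hugging Face models, plain-tensor outputs make tracing simpler:
model.config.return_dict = False

seq_len = 384  # typical SQuAD sequence length; adjust as needed
dummy_input = (
    torch.ones(1, seq_len, dtype=torch.long),   # input_ids
    torch.ones(1, seq_len, dtype=torch.long),   # attention_mask
    torch.zeros(1, seq_len, dtype=torch.long),  # token_type_ids
)
torch.onnx.export(
    model,
    dummy_input,
    "bert_large_qa_int8.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={name: {0: "batch"} for name in
                  ["input_ids", "attention_mask", "token_type_ids"]},
    opset_version=14,  # Q/DQ export requires opset >= 13
)
```

The resulting ONNX file can then be compiled into a TensorRT engine, for example with `trtexec`.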