- PTQ for LLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment for popular pre-trained models from frameworks like
- PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with TensorRT. The Diffusers example in this repo is complementary to the demoDiffusion example in TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
- QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or 4-bit in NVIDIA Blackwell platform).
- Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
- Pruning demonstrates how to optimally prune Linear and Conv layers, and Transformer attention heads, MLP, and depth using the Model Optimizer for following frameworks:
- NVIDIA NeMo / NVIDIA Megatron-LM GPT-style models (e.g. Llama 3, Mistral NeMo, etc.)
- Hugging Face language models like BERT and GPT-J
- Computer Vision models like NVIDIA Tao framework detection models.
- ONNX PTQ shows how to quantize the ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
- Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increasing the accuracy and/or convergence speed for finetuning / QAT.
- Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
- Model Hub provides an example to deploy and run quantized Llama 3.1 8B instruct model from Nvidia's Hugging Face model hub on both TensorRT-LLM and vLLM.