Friendli Model Optimizer (FMO) for supercharging generative AI serving 🚀


Overview

Friendli Model Optimizer (FMO) is a tool that provides model optimizations for efficient generative AI serving with Friendli Engine. The optimizations improve generative AI serving performance without compromising task accuracy.

FMO is designed to work with the Hugging Face transformers library. You can use FMO to optimize any supported model from the Hugging Face Model Hub.

FMO currently supports the following PTQ (Post-Training Quantization) algorithms: FP8, INT8, and AWQ.

[!NOTE] FMO currently runs optimizations on a single GPU, yet it can still generate optimized checkpoints for large models such as LLaMA-3.1-70B and LLaMA-3.1-405B. Additionally, even for FP8 precision, you are not restricted to GPUs with native FP8 support.

What's NEW? (latest: v0.8.0)

  • Further optimizations for FP8 and INT8 quantization.
  • Support for automatically searching the calibration dataset batch size.
  • Support for AWQ (Activation-aware Weight Quantization).
  • Support for ExaoneForCausalLM.

Table of Contents

  • Quick Installation
  • Supported Features & Model Architecture
  • User Guides
  • Support & Issues

Quick Installation

pip install friendli-model-optimizer

Supported Features & Model Architecture

FMO currently supports the following PTQ (Post-Training Quantization) techniques:

FP8

FP8 is an 8-bit floating-point format that offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. This leads to increased throughput and reduced latency while maintaining high output quality with minimal degradation.

FMO offers a pedantic level setting, which controls the trade-off between accuracy and processing time for FP8. Higher pedantic levels produce a more accurate quantized model but increase the time required to generate it, and may sometimes slow down inference. Lower pedantic levels allow for faster quantization, though they may reduce model accuracy. Each quantization mode supports a different range of pedantic levels.

FP8 supports pedantic levels 1-2. Defaults to 1.
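
For example, to trade longer quantization time for a more accurate result, you can raise the pedantic level when invoking the CLI (a minimal sketch; the full argument reference is in the User Guides section below):

export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3-8B-Instruct"
export OUTPUT_DIR="./"

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "fp8" \
--pedantic-level 2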

Important

FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.

Note

For now, we only support the E4M3 (4-bit exponent and 3-bit mantissa) encoding format.

Supported Model Architectures for FP8 Quantization

  • CohereForCausalLM
  • ExaoneForCausalLM
  • Gemma2ForCausalLM
  • LlamaForCausalLM
  • MistralForCausalLM
  • MixtralForCausalLM
  • MptForCausalLM
  • Phi3ForCausalLM
  • Qwen2ForCausalLM

INT8

INT8 Quantization represents weights and activations using the INT8 format with acceptable accuracy drops. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly during runtime.
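
As a minimal sketch, INT8 quantization uses the same CLI entry point with the mode set to int8 (arguments as documented in the User Guides section below):

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "int8" \
--device "cuda:0"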

Supported Model Architectures for INT8 Quantization

  • CohereForCausalLM
  • ExaoneForCausalLM
  • Gemma2ForCausalLM
  • LlamaForCausalLM
  • MistralForCausalLM
  • MixtralForCausalLM
  • MptForCausalLM
  • Phi3ForCausalLM
  • Qwen2ForCausalLM

AWQ

Activation-Aware Weight Quantization (AWQ) is a technique that optimizes neural networks for efficiency without compromising accuracy. Unlike traditional weight quantization methods, AWQ leverages a deep understanding of the data distribution within neural networks during inference.

To learn more about AWQ, refer to this article.
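
AWQ is expected to follow the same CLI pattern as the other modes. Note that the mode value below is an assumption, since the argument reference in this README only lists fp8 and int8 for the mode option; confirm the exact value with the CLI help or the official documentation:

# "awq" as the mode value is an assumption, not confirmed by this README
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "awq" \
--device "cuda:0"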

Supported Model Architectures for AWQ Quantization

  • CohereForCausalLM
  • ExaoneForCausalLM
  • Gemma2ForCausalLM
  • LlamaForCausalLM
  • MistralForCausalLM
  • MixtralForCausalLM
  • MptForCausalLM
  • Phi3ForCausalLM
  • Qwen2ForCausalLM

User Guides

You can run the quantization process with the command below:

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload

The command-line arguments mean:

  • model-name-or-path: Hugging Face pretrained model name or directory path of the saved model checkpoint.
  • output-dir: Directory path to save the quantized checkpoint and related configurations.
  • mode: Quantization technique to apply. You can use fp8 or int8.
  • pedantic-level: Controls the accuracy-latency trade-off. A higher pedantic level ensures a more accurate representation of the model but increases the quantization processing time. Defaults to 1.
  • device: Device to run the quantization process. Defaults to "cuda:0".
  • offload: When enabled, this option significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.

Example: Run FP8 quantization with Meta-Llama-3-8B-Instruct

export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3-8B-Instruct"
export OUTPUT_DIR="./"

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "fp8" \
--device "cuda:0"
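
For large models such as LLaMA-3.1-70B, which may not fit in a single GPU's memory, you can combine the same command with the offload option described above (a sketch; the model name is illustrative):

export MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-70B-Instruct"
export OUTPUT_DIR="./"

fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "fp8" \
--device "cuda:0" \
--offload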

Once your optimized model is ready, you can serve the model with Friendli Engine. Please check out our official documentation to learn more!

How to improve optimized model quality with a calibration dataset

Using a calibration dataset that closely resembles the data to be generated during deployment can improve the quality of the quantized model when deployed.

Currently, we use the default calibration dataset with the following specifications, which serve as a great starting point for calibration:

  • Dataset: cnn_dailymail (version 3.0.0)
  • Split Name of Dataset: test
  • Column Name of Dataset: article
  • Number of samples: 512
  • Sequence length: 1024

These settings offer a solid foundation. However, further tuning may be necessary based on your specific needs.
For instance, consider using a custom dataset in the following scenarios:

  • If the generated text is primarily in a language other than English, the optimization results may still be acceptable, but including text in the primary language in the calibration dataset is good practice.

  • If the generated texts are highly structured (e.g., JSON, XML) rather than plain text, using a custom dataset that better matches this structure can lead to improved performance.

[!TIP]

If the optimized model continues to experience significant accuracy drops, you may try increasing the sample size or extending the sequence length to enhance performance.
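
The default calibration data described above can be inspected, or used as a template for a custom dataset, with the Hugging Face datasets library. The sketch below only loads and previews that data; how a custom calibration dataset is supplied to FMO is not covered in this README, so refer to the official documentation for the relevant options:

# A minimal sketch: load the default calibration source (cnn_dailymail v3.0.0,
# "test" split, "article" column). Requires `pip install datasets`.
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")

# The default setup uses 512 samples with a sequence length of 1024 tokens.
samples = dataset.select(range(512))["article"]
print(samples[0][:200])  # preview the first calibration sample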

Support & Issues

If you have any questions or issues, please feel free to open an issue in this repository.