Friendli Model Optimizer (FMO) is a tool that provides model optimizations for efficient generative AI serving with Friendli Engine. The optimizations improve generative AI serving performance without compromising task accuracy.
FMO is designed to work with the Hugging Face transformers library. You can use FMO to optimize models hosted on the Hugging Face Model Hub.
FMO currently supports the PTQ (Post-Training Quantization) algorithms FP8, INT8, and AWQ.
[!NOTE] FMO currently uses a single GPU to run optimizations, yet it can still generate optimized checkpoints for large models such as LLaMA-3.1-70B and LLaMA-3.1-405B. Additionally, FP8 quantization does not require a GPU with native FP8 support.
- Further optimizations for running FP8 and INT8 quantization.
- Support automatic search of the calibration dataset batch size when running FMO.
- Support AWQ (Activation-aware Weight Quantization).
- Support ExaoneForCausalLM.
pip install friendli-model-optimizer
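To make sure the environment is ready, you can check that the package is installed and that a CUDA-capable GPU is visible. The commands below are a minimal sanity check and assume the NVIDIA driver utilities are installed:

```sh
# Confirm the package is installed and show its version
pip show friendli-model-optimizer

# Confirm that a CUDA-capable GPU is visible on this machine
nvidia-smi
```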
FMO currently supports the following PTQ (Post-Training Quantization) techniques:
FP8 is an 8-bit floating-point format that offers a higher dynamic range than INT8, making it better suited for quantizing both weights and activations. This leads to increased throughput and reduced latency while maintaining high output quality with minimal degradation.
FMO offers a pedantic level setting, which controls the trade-off between accuracy and processing time for FP8. Higher pedantic levels produce a more accurate model but increase the time required to generate the quantized checkpoint, and may sometimes slow down inference. Lower pedantic levels allow for faster quantization, though they may reduce model accuracy. Each quantization mode supports a different range of pedantic levels.
FP8 supports pedantic levels 1-2. Defaults to 1.
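For illustration, the sketch below raises the pedantic level to 2 for an FP8 run using the `fmo quantize` CLI described in the usage section further below; the model name and output directory are placeholders:

```sh
# Sketch: FP8 quantization with the stricter pedantic level (2).
# Model name and output directory are placeholders; adjust them to your setup.
fmo quantize \
  --model-name-or-path "meta-llama/Meta-Llama-3-8B-Instruct" \
  --output-dir "./fp8-checkpoint" \
  --mode "fp8" \
  --pedantic-level 2 \
  --device "cuda:0"
```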
[!IMPORTANT]
FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
[!NOTE]
For now, we only support the E4M3 (4-bit exponent and 3-bit mantissa) encoding format.
CohereForCausalLM
ExaoneForCausalLM
Gemma2ForCausalLM
LlamaForCausalLM
MistralForCausalLM
MixtralForCausalLM
MptForCausalLM
Phi3ForCausalLM
Qwen2ForCausalLM
INT8 Quantization represents weights and activations using the INT8 format with acceptable accuracy drops. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly during runtime.
CohereForCausalLM
ExaoneForCausalLM
Gemma2ForCausalLM
LlamaForCausalLM
MistralForCausalLM
MixtralForCausalLM
MptForCausalLM
Phi3ForCausalLM
Qwen2ForCausalLM
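As a sketch, an INT8 run uses the same CLI with `--mode "int8"`; the model name and output directory below are placeholders:

```sh
# Sketch: INT8 quantization; activation scales are computed dynamically
# at serving time by Friendli Engine.
fmo quantize \
  --model-name-or-path "mistralai/Mistral-7B-Instruct-v0.2" \
  --output-dir "./int8-checkpoint" \
  --mode "int8" \
  --device "cuda:0"
```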
Activation-Aware Weight Quantization (AWQ) is a technique that optimizes neural networks for efficiency without compromising accuracy. Unlike traditional weight quantization methods, AWQ leverages a deep understanding of the data distribution within neural networks during inference.
To learn more about AWQ, refer to this article.
CohereForCausalLM
ExaoneForCausalLM
Gemma2ForCausalLM
LlamaForCausalLM
MistralForCausalLM
MixtralForCausalLM
MptForCausalLM
Phi3ForCausalLM
Qwen2ForCausalLM
You can run the quantization processes with the command below:
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--pedantic-level $PEDANTIC_LEVEL \
--device $DEVICE \
--offload
The command-line arguments mean the following:
- `model-name-or-path`: Hugging Face pretrained model name or directory path of the saved model checkpoint.
- `output-dir`: Directory path to save the quantized checkpoint and related configurations.
- `mode`: Quantization technique to apply. You can use `fp8` or `int8`.
- `pedantic-level`: Represents the accuracy-latency trade-off. A higher pedantic level ensures a more accurate representation of the model but increases the quantization processing time. Defaults to 1.
- `device`: Device to run the quantization process on. Defaults to `"cuda:0"`.
- `offload`: When enabled, significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False. (An example with `--offload` enabled is shown below.)
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3-8B-Instruct"
export OUTPUT_DIR="./"
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode "fp8" \
--device "cuda:0" \
Once your optimized model is ready, you can serve the model with Friendli Engine. Please check out our official documentation to learn more!
Using a calibration dataset that closely resembles the data to be generated during deployment can improve the quality of the quantized model when deployed.
Currently, we use the default calibration dataset with the following specifications, which serve as a great starting point for calibration:
- Dataset: `cnn_dailymail` (version 3.0.0)
- Split name of dataset: `test`
- Column name of dataset: `article`
- Number of samples: 512
- Sequence length: 1024
These settings offer a solid foundation. However, further tuning may be necessary based on your specific needs.
For instance, consider using a custom dataset in the following scenarios:
- If the generated text is primarily in a language other than English, the optimization results may still be acceptable, but including texts in the primary language is good practice.
- If the generated texts are highly structured (e.g., JSON, XML) rather than plain text, using a custom dataset that better matches this structure can lead to improved performance.
[!TIP]
If the optimized model continues to experience significant accuracy drops, you may try increasing the sample size or extending the sequence length to enhance performance.
If you have any questions or issues, please feel free to open an issue in this repository.