diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst index bbc09ccd4b5fbb..046dde9661c3bb 100644 --- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst +++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst @@ -6,38 +6,36 @@ LLM Weight Compression :hidden: weight-compression/microscaling-quantization + weight-compression/4-bit-weight-quantization -Weight compression is a technique for enhancing the efficiency of models, -especially those with large memory requirements. This method reduces the model's -memory footprint, a crucial factor for Large Language Models (LLMs). +Weight compression enhances the efficiency of models by reducing their memory footprint, +a crucial factor for Large Language Models (LLMs). It is especially effective for networks with high memory requirements. -Unlike full model quantization, where weights and activations are quantized, -weight compression in `Neural Network Compression Framework (NNCF) `__ -only targets the model's weights. This approach allows the activations to remain as -floating-point numbers, preserving most of the model's accuracy while improving its -speed and reducing its size. +Unlike full model quantization, where both weights and activations are quantized, it +only targets weights, keeping activations as floating-point numbers. This means preserving most +of the model's accuracy while improving its +speed and reducing its size. The reduction in size is especially noticeable with larger models. +For instance the 7 billion parameter Llama 2 model can be reduced +from about 25GB to 4GB using 4-bit weight compression. -The reduction in size is especially noticeable with larger models, -for instance the 7 billion parameter Llama 2 model can be reduced -from about 25GB to 4GB using 4-bit weight compression. With smaller models (i.e. less -than 1B parameters), weight compression may result in more accuracy reduction than -with larger models. +.. note:: + + With smaller language models (i.e. less than 1B parameters), weight + compression may result in more accuracy reduction than with larger models. + Therefore, weight compression is recommended for use with LLMs only. -LLMs and other models that require +LLMs and other GenAI models that require extensive memory to store the weights during inference can benefit from weight compression as it: * enables inference of exceptionally large models that cannot be accommodated in the device memory; - * reduces storage and memory overhead, making models more lightweight and less resource intensive for deployment; - * improves inference speed by reducing the latency of memory access when computing the operations with weights, for example, Linear layers. The weights are smaller and thus faster to load from memory; - * unlike quantization, does not require sample data to calibrate the range of activation values. @@ -46,197 +44,228 @@ provides weight quantization to 8 and 4-bit integer data types as a compression method primarily designed to optimize LLMs. +Compression Methods (8-bit vs. 4-bit) +##################################### + +For models that come from `Hugging Face `__ and are supported +by Optimum, it is recommended to use the **Optimum Intel API**, which employs NNCF weight +compression capabilities to optimize various large Transformer models. 
+ +The NNCF ``nncf.compress_weights()`` API, with most of its options, is exposed in the +``.from_pretrained()`` method of Optimum Intel classes. Optimum also has several datasets +for data-aware quantization available out-of-the-box. -Compress Model Weights -###################### +You can use the examples below to perform data-free 8-bit or 4-bit weight quantization. +Before you start, make sure Optimum Intel is installed in your environment +by running the following command: -**8-bit weight quantization** method offers a balance between model size reduction and -maintaining accuracy, which usually leads to significant performance improvements for -Transformer-based models. Models with 8-bit compressed weights are performant on the -vast majority of supported CPU and GPU platforms. By default, weights are compressed -asymmetrically to "INT8_ASYM" mode. +.. code-block:: python + pip install optimum[openvino] -The code snippet below shows how to do asymmetrical 8-bit quantization of the model weights -represented in OpenVINO IR using NNCF: +**8-bit weight quantization** offers a good balance between reducing the size and lowering the +accuracy of a model. It usually results in significant improvements for transformer-based models +and guarantees good model performance for a vast majority of supported CPU and GPU platforms. +By default, weights are compressed asymmetrically to "INT8_ASYM" mode. .. tab-set:: - .. tab-item:: OpenVINO - :sync: openvino + .. tab-item:: Compression with Optimum-Intel + :sync: optimum - .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py - :language: python - :fragment: [compression_8bit] + Load a pre-trained Hugging Face model, compress it to INT8_ASYM, using the + Optimum Intel API, and then execute inference with a text phrase: + Simply use the optimum-cli command line tool: -Now, the model is ready for compilation and inference. -It can be also saved into a compressed format, resulting in a smaller binary file. + .. code-block:: console -**4-bit weight quantization** method stands for an INT4-INT8 mixed-precision weight quantization, -where INT4 is considered as the primary precision and asymmetric INT8 is the backup one. -It usually results in a smaller model size and lower inference latency, although the accuracy -degradation could be higher, depending on the model. + optimum-cli export openvino --model microsoft/Phi-3.5-mini-instruct --weight-format int8 ov_phi-3.5-mini-instruct -The code snippet below shows how to do 4-bit quantization of the model weights represented -in OpenVINO IR using NNCF: + You can also use the code sample to the same effect: -.. tab-set:: + .. code-block:: python - .. tab-item:: OpenVINO - :sync: openvino + from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig + from transformers import AutoTokenizer, pipeline - .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py - :language: python - :fragment: [compression_4bit] + # Load and compress a model from Hugging Face. 
+ model_id = "microsoft/Phi-3.5-mini-instruct" + model = OVModelForCausalLM.from_pretrained( + model_id, + export=True, + quantization_config=OVWeightQuantizationConfig(bits=8) + ) + # Inference + tokenizer = AutoTokenizer.from_pretrained(model_id) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + phrase = "The weather is" + results = pipe(phrase) + print(results) -The table below summarizes the benefits and trade-offs for each compression type in terms of -memory reduction, speed gain, and accuracy loss. + For more details, refer to the article on how to + :doc:`infer LLMs using Optimum Intel <../../learn-openvino/llm_inference_guide/llm-inference-hf>`. -.. list-table:: - :widths: 25 20 20 20 - :header-rows: 1 + .. tab-item:: Compression with NNCF + :sync: nncf - * - - - Memory Reduction - - Latency Improvement - - Accuracy Loss - * - INT8 Asymmetric - - Low - - Medium - - Low - * - INT4 Symmetric - - High - - High - - High - * - INT4 Asymmetric - - High - - Medium - - Medium + Load a pre-trained Hugging Face model, using the Optimum Intel API, + compress it to INT8_ASYM, using NNCF, and then execute inference with a text phrase: + .. code-block:: python + from nncf import compress_weights, CompressWeightsMode + from optimum.intel.openvino import OVModelForCausalLM + from transformers import AutoTokenizer, pipeline -The INT4 method has several parameters that can provide different performance-accuracy -trade-offs after optimization: + # Load a model and compress it with NNCF. + model_id = "microsoft/Phi-3.5-mini-instruct" + model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False, compile=False) + model.model = compress_weights(model.model, mode=CompressWeightsMode.INT8_ASYM) -* ``mode`` - there are two optimization modes: symmetric and asymmetric. + # Inference + model.compile() + tokenizer = AutoTokenizer.from_pretrained(model_id) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + phrase = "The weather is" + results = pipe(phrase) + print(results) - **Symmetric Compression** - ``INT4_SYM`` - INT4 Symmetric mode involves quantizing weights to a signed 4-bit integer - symmetrically without zero point. This mode is faster than the INT8_ASYM, making - it ideal for situations where **speed and size reduction are prioritized over accuracy**. +Here is an example of code using NNCF to perform asymmetrical 8-bit weight quantization of +a model in the OpenVINO IR format: - .. code-block:: python +.. tab-set:: - from nncf import compress_weights - from nncf import CompressWeightsMode + .. tab-item:: OpenVINO + :sync: openvino - compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) + .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py + :language: python + :fragment: [compression_8bit] - **Asymmetric Compression** - ``INT4_ASYM`` - INT4 Asymmetric mode also uses an unsigned 4-bit integer but quantizes weights - asymmetrically with a non-fixed zero point. This mode slightly compromises speed in - favor of better accuracy compared to the symmetric mode. This mode is useful when - **minimal accuracy loss is crucial**, but a faster performance than INT8 is still desired. +**4-bit weight quantization** is actually a mixed-precision compression, +primarily INT4 and a backup asymmetric INT8 precisions. It produces a smaller model, +offering lower inference latency but potentially noticeable accuracy degradation, +depending on the model. - .. code-block:: python +.. 
tab-set:: - from nncf import compress_weights - from nncf import CompressWeightsMode + .. tab-item:: Compression with Optimum-Intel + :sync: optimum - compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM) + Load a pre-trained Hugging Face model, compress it to INT4, using the + Optimum Intel API, and then execute inference with a text phrase: -* ``group_size`` controls the size of the group of weights that share the same - quantization parameters. Shared quantization parameters help to speed up the - calculation of activation values as they are dequantized and quantized between - layers. However, they can reduce accuracy. The following group sizes are - recommended: ``128``, ``64``, ``32`` (``128`` is default value). + Simply use the optimum-cli command line tool: - `Smaller Group Size`: Leads to a more accurate model but increases the model's - footprint and reduces inference speed. + .. code-block:: console - `Larger Group Size`: Results in faster inference and a smaller model, but might - compromise accuracy. + optimum-cli export openvino --model microsoft/Phi-3.5-mini-instruct --weight-format int4 --awq --scale-estimation --dataset wikitext2 --group-size 64 --ratio 1.0 ov_phi-3.5-mini-instruct -* ``ratio`` controls the ratio between the layers compressed to the precision defined - by ``mode`` and the rest of the layers that will be kept in the ``backup_mode`` in the optimized model. - Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be - compressed to the precision defined by ``mode``, while the rest will be compressed to - ``backup_mode`` precision. The default value for ratio is 1. + You can also use the code sample to the same effect: - `Higher Ratio (more layers set to mode precision)`: Reduces the model size and increase inference speed but - might lead to higher accuracy degradation. + .. code-block:: python - `Lower Ratio (more layers set to backup_mode precision)`: Maintains better accuracy but results in a larger model size - and potentially slower inference. + from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig + from transformers import AutoTokenizer, pipeline - In this example, 90% of the model's layers are quantized to INT4 asymmetrically with - a group size of 64: + # Load and compress a model from Hugging Face. + model_id = "microsoft/Phi-3.5-mini-instruct" + model = OVModelForCausalLM.from_pretrained( + model_id, + export=True, + quantization_config=OVWeightQuantizationConfig( + bits=4, + quant_method="awq", + scale_estimation=True, + dataset="wikitext2", + group_size=64, + ratio=1.0 + ) + ) - .. code-block:: python + # Inference + tokenizer = AutoTokenizer.from_pretrained(model_id) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + phrase = "The weather is" + results = pipe(phrase) + print(results) - from nncf import compress_weights, CompressWeightsMode + .. tab-item:: Compression with NNCF + :sync: nncf - # Example: Compressing weights with INT4_ASYM mode, group size of 64, and 90% INT4 ratio - compressed_model = compress_weights( - model, - mode=CompressWeightsMode.INT4_ASYM, - group_size=64, - ratio=0.9, - ) + Load a pre-trained Hugging Face model, using the Optimum Intel API, + compress it to INT4 using NNCF, and then execute inference with a text phrase: -* ``scale_estimation`` - boolean parameter that enables more accurate estimation of - quantization scales. Especially helpful when the weights of all layers are quantized to - 4 bits. Requires dataset. 
+ .. code-block:: python -* ``awq`` - boolean parameter that enables the AWQ method for more accurate INT4 weight - quantization. Especially helpful when the weights of all the layers are quantized to - 4 bits. The method can sometimes result in reduced accuracy when used with - Dynamic Quantization of activations. Requires dataset. + from nncf import compress_weights, CompressWeightsMode + from optimum.intel.openvino import OVModelForCausalLM + from transformers import AutoTokenizer, pipeline -* ``gptq`` - boolean parameter that enables the GPTQ method for more accurate INT4 weight - quantization. Requires dataset. + # Load a model and compress it with NNCF. + model_id = "microsoft/Phi-3.5-mini-instruct" + model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False, compile=False) + model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM) -* ``dataset`` - calibration dataset for data-aware weight compression. It is required - for some compression options, for example, ``scale_estimation``, ``gptq`` or ``awq``. Some types - of ``sensitivity_metric`` can use data for precision selection. + # Inference + model.compile() + tokenizer = AutoTokenizer.from_pretrained(model_id) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + phrase = "The weather is" + results = pipe(phrase) + print(results) -* ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing - layers in the bit-width selection algorithm. Some of the metrics require dataset to be - provided. The following types are supported: - * ``nncf.SensitivityMetric.WEIGHT_QUANTIZATION_ERROR`` - data-free metric computed as - the inverted 8-bit quantization noise. Weights with highest value of this metric can - be accurately quantized channel-wise to 8-bit. The idea is to leave these weights in - 8 bit, and quantize the rest of layers to 4-bit group-wise. Since group-wise is more - accurate than per-channel, accuracy should not degrade. + For more details, refer to the article on how to + :doc:`infer LLMs using Optimum Intel <../../../learn-openvino/llm_inference_guide/llm-inference-hf>`. - * ``nncf.SensitivityMetric.HESSIAN_INPUT_ACTIVATION`` - requires dataset. The average - Hessian trace of weights with respect to the layer-wise quantization error multiplied - by L2 norm of 8-bit quantization noise. +The code snippet below shows how to do 4-bit quantization of the model weights represented +in OpenVINO IR using NNCF: - * ``nncf.SensitivityMetric.MEAN_ACTIVATION_VARIANCE`` - requires dataset. The mean - variance of the layers' inputs multiplied by inverted 8-bit quantization noise. +.. tab-set:: - * ``nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE`` - requires dataset. The maximum - variance of the layers' inputs multiplied by inverted 8-bit quantization noise. + .. tab-item:: OpenVINO + :sync: openvino - * ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires dataset. The mean - magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise. + .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py + :language: python + :fragment: [compression_4bit] + +Refer to the article about +:doc:`4-bit weight quantization <./weight-compression/4-bit-weight-quantization>` +for more details. -* ``all_layers`` - boolean parameter that enables INT4 weight quantization of all - Fully-Connected and Embedding layers, including the first and last layers in the model. 
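+
+If the model is already in the OpenVINO IR format, the same data-free 4-bit compression
+can be sketched roughly as follows (the model path and parameter values here are
+illustrative only; see the linked article for what each parameter does):
+
+.. code-block:: python
+
+   import openvino as ov
+   from nncf import compress_weights, CompressWeightsMode
+
+   # Read a model in the OpenVINO IR format (illustrative path).
+   model = ov.Core().read_model("model.xml")
+
+   # Compress weights to 4-bit: INT4 symmetric primary precision,
+   # group size of 128, 80% of layers in INT4, the rest in the backup precision.
+   compressed_model = compress_weights(
+       model,
+       mode=CompressWeightsMode.INT4_SYM,
+       group_size=128,
+       ratio=0.8,
+   )
+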
+Once the model has been optimized, it is ready for compilation and inference. The model can +also be :ref:`saved into a compressed format `, resulting in a +smaller binary file. + +The table below summarizes the benefits and trade-offs for each compression type in terms of +memory reduction, speed gain, and accuracy loss. -* ``lora_correction`` - boolean parameter that enables the LoRA Correction Algorithm - to further improve the accuracy of INT4 compressed models on top of other - algorithms - AWQ and Scale Estimation. +.. list-table:: + :widths: 25 20 20 20 + :header-rows: 1 -* ``backup_mode`` - defines a backup precision for mixed-precision weight compression. - There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains - the original floating-point precision of the model weights (``INT8_ASYM`` is default value). + * - + - Memory Reduction + - Latency Improvement + - Accuracy Loss + * - INT8 Asymmetric + - Low + - Medium + - Low + * - INT4 Symmetric + - High + - High + - High + * - INT4 Asymmetric + - High + - Medium + - Medium **Use synthetic data for LLM weight compression** @@ -268,8 +297,8 @@ for details of the usage. # Synthetic-based compression synthetic_dataset = nncf.data.generate_text_data(hf_model, tokenizer, dataset_size=100) quantization_dataset = nncf.Dataset( - synthetic_dataset, - transform_fn # see example in NNCF repo how to make transform_fn + synthetic_dataset, + transform_fn # See the example in NNCF repo to learn how to make transform_fn. ) model = compress_weights( @@ -280,58 +309,16 @@ for details of the usage. dataset=quantization_dataset, awq=True, scale_estimation=True - ) # model is openvino.Model + ) # The model is openvino.Model. For data-aware weight compression refer to the following `example `__. .. note:: - Some methods can be stacked on top of one another to achieve a better - accuracy-performance trade-off after weight quantization. For example, the **Scale Estimation** - method can be applied along with **AWQ** and mixed-precision quantization (the ``ratio`` parameter). - - -**Hugging Face Optimum-Intel API** - -Hugging Face Optimum-Intel provides an easy way to use NNCF Weight Compression capabilities to optimize -various large Transformer models. Most of the options of the NNCF ``nncf.compress_weights()`` API are -exposed in the ``.from_pretrained()`` method of Optimum-Intel classes. Optimum also has several datasets -for data-aware quantization available out-of-the-box. -The example below shows data-free 4-bit weight quantization -applied on top of OpenVINO IR. Before trying the example, make sure Optimum Intel -is installed in your environment by running the following command: - -.. code-block:: python - - pip install optimum[openvino] - -.. 
code-block:: python - - from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig - from transformers import AutoTokenizer, pipeline - - # Load and compress model from Hugging Face - model_id = "microsoft/Phi-3.5-mini-instruct" - model = OVModelForCausalLM.from_pretrained( - model_id, - export=True, - quantization_config=OVWeightQuantizationConfig( - bits=4, - quant_method="awq", - scale_estimation=True, - dataset="wikitext2", - group_size=64, - ratio=1.0 - ) - ) - - # Inference - tokenizer = AutoTokenizer.from_pretrained(model_id) - pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) - phrase = "The weather is" - results = pipe(phrase) - print(results) + Some methods can be stacked on top of one another to achieve a better + accuracy-performance trade-off after weight quantization. For example, the **Scale Estimation** + method can be applied along with **AWQ** and mixed-precision quantization (the ``ratio`` parameter). Exporting and Loading Compressed Models @@ -344,179 +331,157 @@ so it is preferable to compress the model once, save it, and then load the compressed model later for faster time to first inference. .. code-block:: python + :name: save_pretrained - # Save compressed model for faster loading later - model.save_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") - tokenizer.save_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") - - # Load a saved model - model = OVModelForCausalLM.from_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") - tokenizer = AutoTokenizer.from_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") - -GPTQ Models -############ + # Save compressed model for faster loading later + model.save_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") + tokenizer.save_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") -OpenVINO also supports 4-bit models from Hugging Face -`Transformers `__ library optimized -with `GPTQ `__. In this case, there is no -need for an additional model optimization step because model conversion will -automatically preserve the INT4 optimization results, allowing model inference to benefit from it. + # Load a saved model + model = OVModelForCausalLM.from_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") + tokenizer = AutoTokenizer.from_pretrained("Phi-3.5-mini-instruct-int4-sym-ov") -A compression example using a GPTQ model is shown below. -Make sure to install GPTQ dependencies by running the following command: +.. tip:: -.. code-block:: python - - pip install optimum[openvino] auto-gptq - -.. code-block:: python + Models optimized with with NNCF or Optimum Intel can be used with + :doc:`OpenVINO GenAI <../../learn-openvino/llm_inference_guide/genai-guide>`. - from optimum.intel.openvino import OVModelForCausalLM - from transformers import AutoTokenizer, pipeline - # Load model from Hugging Face already optimized with GPTQ - model_id = "TheBloke/Llama-2-7B-Chat-GPTQ" - model = OVModelForCausalLM.from_pretrained(model_id, export=True) +Auto-tuning of Weight Compression Parameters +############################################ - # Inference - tokenizer = AutoTokenizer.from_pretrained(model_id) - pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) - phrase = "The weather is" - results = pipe(phrase) - print(results) +To find the optimal weight compression parameters for a particular model, refer to the +`example `__ , +where weight compression parameters are being searched from the subset of values. +To speed up the search, a self-designed validation pipeline called +`WhoWhatBench `__ +is used. 
The pipeline can quickly evaluate the changes in the accuracy of the optimized +model compared to the baseline. -An `example of a model `__ -that has been optimized using GPTQ. Compression Metrics Examples -######################################## +############################ -The table below shows examples of text-generation Language Models with different +Below you will find examples of text-generation Language Models with different optimization settings in a data-free setup, where no dataset is used at the optimization step. The Perplexity metric is a measurement of response accuracy, where a higher complexity score indicates a lower accuracy. It is measured on the `Lambada OpenAI dataset `__. -.. list-table:: - :widths: 40 55 25 25 - :header-rows: 1 - - * - Model - - Optimization - - Perplexity\* - - Model Size (Gb) - * - databricks/dolly-v2-3b - - FP32 - - 5.01 - - 10.3 - * - databricks/dolly-v2-3b - - INT8_ASYM - - 5.07 - - 2.6 - * - databricks/dolly-v2-3b - - INT4_ASYM,group_size=32,ratio=0.5 - - 5.28 - - 2.2 - * - facebook/opt-6.7b - - FP32 - - 4.25 - - 24.8 - * - facebook/opt-6.7b - - INT8_ASYM - - 4.27 - - 6.2 - * - facebook/opt-6.7b - - INT4_ASYM,group_size=64,ratio=0.8 - - 4.32 - - 4.1 - * - meta-llama/Llama-2-7b-chat-hf - - FP32 - - 3.28 - - 25.1 - * - meta-llama/Llama-2-7b-chat-hf - - INT8_ASYM - - 3.29 - - 6.3 - * - meta-llama/Llama-2-7b-chat-hf - - INT4_ASYM,group_size=128,ratio=0.8 - - 3.41 - - 4.0 - * - togethercomputer/RedPajama-INCITE-7B-Instruct - - FP32 - - 4.15 - - 25.6 - * - togethercomputer/RedPajama-INCITE-7B-Instruct - - INT8_ASYM - - 4.17 - - 6.4 - * - togethercomputer/RedPajama-INCITE-7B-Instruct - - INT4_ASYM,group_size=128,ratio=1.0 - - 4.17 - - 3.6 - * - meta-llama/Llama-2-13b-chat-hf - - FP32 - - 2.92 - - 48.5 - * - meta-llama/Llama-2-13b-chat-hf - - INT8_ASYM - - 2.91 - - 12.1 - * - meta-llama/Llama-2-13b-chat-hf - - INT4_SYM,group_size=64,ratio=0.8 - - 2.98 - - 8.0 - - -The following table shows accuracy metric in a data-aware 4-bit weight quantization -setup measured on the `Wikitext dataset `__. - -.. list-table:: - :widths: 40 55 25 25 - :header-rows: 1 - - * - Model - - Optimization - - Word perplexity\* - - Model Size (Gb) - * - meta-llama/llama-7b-chat-hf - - FP32 - - 11.57 - - 12.61 - * - meta-llama/llama-7b-chat-hf - - INT4_SYM,group_size=128,ratio=1.0,awq=True - - 12.34 - - 2.6 - * - stabilityai_stablelm-3b-4e1t - - FP32 - - 10.17 - - 10.41 - * - stabilityai_stablelm-3b-4e1t - - INT4_SYM,group_size=64,ratio=1.0,awq=True - - 10.89 - - 2.6 - * - HuggingFaceH4/zephyr-7b-beta - - FP32 - - 9.82 - - 13.99 - * - HuggingFaceH4/zephyr-7b-beta - - INT4_SYM,group_size=128,ratio=1.0 - - 10.32 - - 2.6 +.. dropdown:: Perplexity\* in data-free optimization + + .. 
list-table:: + :widths: 40 55 25 25 + :header-rows: 1 + + * - Model + - Optimization + - Perplexity\* + - Model Size (Gb) + * - databricks/dolly-v2-3b + - FP32 + - 5.01 + - 10.3 + * - databricks/dolly-v2-3b + - INT8_ASYM + - 5.07 + - 2.6 + * - databricks/dolly-v2-3b + - INT4_ASYM,group_size=32,ratio=0.5 + - 5.28 + - 2.2 + * - facebook/opt-6.7b + - FP32 + - 4.25 + - 24.8 + * - facebook/opt-6.7b + - INT8_ASYM + - 4.27 + - 6.2 + * - facebook/opt-6.7b + - INT4_ASYM,group_size=64,ratio=0.8 + - 4.32 + - 4.1 + * - meta-llama/Llama-2-7b-chat-hf + - FP32 + - 3.28 + - 25.1 + * - meta-llama/Llama-2-7b-chat-hf + - INT8_ASYM + - 3.29 + - 6.3 + * - meta-llama/Llama-2-7b-chat-hf + - INT4_ASYM,group_size=128,ratio=0.8 + - 3.41 + - 4.0 + * - togethercomputer/RedPajama-INCITE-7B-Instruct + - FP32 + - 4.15 + - 25.6 + * - togethercomputer/RedPajama-INCITE-7B-Instruct + - INT8_ASYM + - 4.17 + - 6.4 + * - togethercomputer/RedPajama-INCITE-7B-Instruct + - INT4_ASYM,group_size=128,ratio=1.0 + - 4.17 + - 3.6 + * - meta-llama/Llama-2-13b-chat-hf + - FP32 + - 2.92 + - 48.5 + * - meta-llama/Llama-2-13b-chat-hf + - INT8_ASYM + - 2.91 + - 12.1 + * - meta-llama/Llama-2-13b-chat-hf + - INT4_SYM,group_size=64,ratio=0.8 + - 2.98 + - 8.0 + + +.. dropdown:: Perplexity\* in data-aware optimization + + The following table shows accuracy metric in a data-aware 4-bit weight quantization + setup measured on the `Wikitext dataset `__. + + .. list-table:: + :widths: 40 55 25 25 + :header-rows: 1 + + * - Model + - Optimization + - Word perplexity\* + - Model Size (Gb) + * - meta-llama/llama-7b-chat-hf + - FP32 + - 11.57 + - 12.61 + * - meta-llama/llama-7b-chat-hf + - INT4_SYM,group_size=128,ratio=1.0,awq=True + - 12.34 + - 2.6 + * - stabilityai_stablelm-3b-4e1t + - FP32 + - 10.17 + - 10.41 + * - stabilityai_stablelm-3b-4e1t + - INT4_SYM,group_size=64,ratio=1.0,awq=True + - 10.89 + - 2.6 + * - HuggingFaceH4/zephyr-7b-beta + - FP32 + - 9.82 + - 13.99 + * - HuggingFaceH4/zephyr-7b-beta + - INT4_SYM,group_size=128,ratio=1.0 + - 10.32 + - 2.6 \*Perplexity metric in both tables was measured without the Dynamic Quantization feature enabled in the OpenVINO runtime. -Auto-tuning of Weight Compression Parameters -############################################ - -To find the optimal weight compression parameters for a particular model, refer to the -`example `__ , -where weight compression parameters are being searched from the subset of values. -To speed up the search, a self-designed validation pipeline called -`WhoWhatBench `__ -is used. The pipeline can quickly evaluate the changes in the accuracy of the optimized -model compared to the baseline. Additional Resources #################### diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst new file mode 100644 index 00000000000000..ae9bc7d7b8b4a3 --- /dev/null +++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst @@ -0,0 +1,175 @@ +4-bit Weight Quantization +========================= + +The 4-bit weight quantization method results in significant reduction in model size and +memory usage, making LLMs more accessible to less performant devices. +It also usually offers lower inference latency, however, depending on specific models, +it may potentially impact the accuracy. 
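+
+Before going through the individual parameters, here is an illustrative sketch of a
+data-aware 4-bit setup that combines several of them (the ``model`` object, the
+calibration samples, and ``transform_fn`` are placeholders you need to provide yourself):
+
+.. code-block:: python
+
+   import nncf
+   from nncf import compress_weights, CompressWeightsMode
+
+   # Wrap calibration samples; transform_fn converts a sample into model inputs.
+   calibration_dataset = nncf.Dataset(calibration_samples, transform_fn)
+
+   compressed_model = compress_weights(
+       model,                               # openvino.Model with floating-point weights
+       mode=CompressWeightsMode.INT4_SYM,   # primary 4-bit precision
+       group_size=64,                       # weights sharing one quantization scale
+       ratio=0.8,                           # 80% of layers in INT4, the rest in backup precision
+       awq=True,                            # data-aware methods for better accuracy
+       scale_estimation=True,
+       dataset=calibration_dataset,
+   )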
+ +Nevertheless, the INT4 method has several parameters that can provide different performance-accuracy +trade-offs after optimization: + +* ``mode`` - there are two optimization modes: symmetric and asymmetric. + + .. tab-set:: + + .. tab-item:: Symmetric Compression + :sync: int4-sym + + INT4 Symmetric mode (``INT4_SYM``) involves quantizing weights to a signed 4-bit integer + symmetrically without zero point. This mode is faster than the INT8_ASYM, making + it ideal for situations where **speed and size reduction are prioritized over accuracy**. + + .. code-block:: python + + from nncf import compress_weights + from nncf import CompressWeightsMode + + compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) + + .. tab-item:: Asymmetric Compression + :sync: int4-asym + + INT4 Asymmetric mode (``INT4_ASYM``) also uses an unsigned 4-bit integer but quantizes weights + asymmetrically with a non-fixed zero point. This mode slightly compromises speed in + favor of better accuracy compared to the symmetric mode. This mode is useful when + **minimal accuracy loss is crucial**, but a faster performance than INT8 is still desired. + + .. code-block:: python + + from nncf import compress_weights + from nncf import CompressWeightsMode + + compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM) + +* ``group_size`` controls the size of the group of weights that share the same + quantization parameters. Shared quantization parameters help to speed up the + calculation of activation values as they are dequantized and quantized between + layers. However, they can reduce accuracy. The following group sizes are + recommended: ``128``, ``64``, ``32`` (``128`` is default value). + + `Smaller Group Size`: Leads to a more accurate model but increases the model's + footprint and reduces inference speed. + + `Larger Group Size`: Results in faster inference and a smaller model, but might + compromise accuracy. + +* ``ratio`` controls the ratio between the layers compressed to the precision defined + by ``mode`` and the rest of the layers that will be kept in the ``backup_mode`` in the optimized model. + Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be + compressed to the precision defined by ``mode``, while the rest will be compressed to + ``backup_mode`` precision. The default value for ratio is 1. + + | **Higher Ratio (more layers set to mode precision)**: + | Reduces the model size and increase inference speed but + might lead to higher accuracy degradation. + + | **Lower Ratio (more layers set to backup_mode precision)**: + | Maintains better accuracy but results in a larger model size + and potentially slower inference. + + In the example below, 90% of the model's layers are quantized to INT4 asymmetrically with + a group size of 64: + + .. code-block:: python + + from nncf import compress_weights, CompressWeightsMode + + # Example: Compressing weights with INT4_ASYM mode, group size of 64, and 90% INT4 ratio + compressed_model = compress_weights( + model, + mode=CompressWeightsMode.INT4_ASYM, + group_size=64, + ratio=0.9, + ) + +* ``scale_estimation`` - a boolean parameter that enables more accurate estimation of + quantization scales. Especially helpful when the weights of all layers are quantized to + 4 bits. Requires dataset. + +* ``awq`` - a boolean parameter that enables the AWQ method for more accurate INT4 weight + quantization. Especially helpful when the weights of all the layers are quantized to + 4 bits. 
The method can sometimes result in reduced accuracy when used with + Dynamic Quantization of activations. Requires dataset. + +* ``gptq`` - a boolean parameter that enables the GPTQ method for more accurate INT4 weight + quantization. Requires dataset. + +* ``dataset`` - a calibration dataset for data-aware weight compression. It is required + for some compression options, for example, ``scale_estimation``, ``gptq`` or ``awq``. Some types + of ``sensitivity_metric`` can use data for precision selection. + +* ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing + layers in the bit-width selection algorithm. Some of the metrics require dataset to be + provided. The following types are supported: + + * ``nncf.SensitivityMetric.WEIGHT_QUANTIZATION_ERROR`` - a data-free metric computed as + the inverted 8-bit quantization noise. Weights with highest value of this metric can + be accurately quantized channel-wise to 8-bit. The idea is to leave these weights in + 8 bit, and quantize the rest of layers to 4-bit group-wise. Since group-wise is more + accurate than per-channel, accuracy should not degrade. + + * ``nncf.SensitivityMetric.HESSIAN_INPUT_ACTIVATION`` - requires a dataset. The average + Hessian trace of weights with respect to the layer-wise quantization error multiplied + by L2 norm of 8-bit quantization noise. + + * ``nncf.SensitivityMetric.MEAN_ACTIVATION_VARIANCE`` - requires a dataset. The mean + variance of the layers' inputs multiplied by inverted 8-bit quantization noise. + + * ``nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE`` - requires a dataset. The maximum + variance of the layers' inputs multiplied by inverted 8-bit quantization noise. + + * ``nncf.SensitivityMetric.MEAN_ACTIVATION_MAGNITUDE`` - requires a dataset. The mean + magnitude of the layers' inputs multiplied by inverted 8-bit quantization noise. + +* ``all_layers`` - a boolean parameter that enables INT4 weight quantization of all + Fully-Connected and Embedding layers, including the first and last layers in the model. + +* ``lora_correction`` - a boolean parameter that enables the LoRA Correction Algorithm + to further improve the accuracy of INT4 compressed models on top of other + algorithms - AWQ and Scale Estimation. + +* ``backup_mode`` - defines a backup precision for mixed-precision weight compression. + There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains + the original floating-point precision of the model weights (``INT8_ASYM`` is default value). + +| + +4-bit Weight Quantization with GPTQ +################################### + +You can use models from Hugging Face +`Transformers `__ library, which are quantized +with `GPTQ `__ algorithm. Such models do not require +additional optimization step because the conversion will automatically preserve +the INT4 optimization results, and model inference will eventually benefit from it. + +See the `example of a model `__ +that has been optimized with GPTQ. + +You can also refer to the code sample below which shows how to load a 4-bit +GPTQ model and run inference. + +.. dropdown:: Using a GPTQ model. + + Make sure to install GPTQ dependencies by running the following command: + + .. code-block:: python + + pip install optimum[openvino] auto-gptq + + .. 
code-block:: python + + from optimum.intel.openvino import OVModelForCausalLM + from transformers import AutoTokenizer, pipeline + + # Load model from Hugging Face already optimized with GPTQ + model_id = "TheBloke/Llama-2-7B-Chat-GPTQ" + model = OVModelForCausalLM.from_pretrained(model_id, export=True) + + # Inference + tokenizer = AutoTokenizer.from_pretrained(model_id) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + phrase = "The weather is" + results = pipe(phrase) + print(results)
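+
+   As with other models loaded through Optimum Intel, the converted model can be saved
+   locally so that later loads skip the conversion step. A brief sketch (the output
+   directory name is illustrative):
+
+   .. code-block:: python
+
+      # Save the converted model and tokenizer for faster loading later.
+      model.save_pretrained("Llama-2-7B-Chat-GPTQ-ov")
+      tokenizer.save_pretrained("Llama-2-7B-Chat-GPTQ-ov")
+
+      # Load the saved model and tokenizer.
+      model = OVModelForCausalLM.from_pretrained("Llama-2-7B-Chat-GPTQ-ov")
+      tokenizer = AutoTokenizer.from_pretrained("Llama-2-7B-Chat-GPTQ-ov")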