Support backup precision option for WC #2978

Merged · 13 commits · Oct 7, 2024
@@ -11,8 +11,8 @@ The Weights Compression algorithm is aimed at compressing the weights of the mod
By default, weights are compressed asymmetrically to the 8-bit integer data type - the "INT8_ASYM" mode.
The OpenVINO backend also supports 4 modes of mixed-precision weight quantization with a 4-bit data type as the primary precision - INT4_SYM, INT4_ASYM, NF4, E2M1. In the INT4_SYM mode, the primary precision is a signed 4-bit integer, and weights are quantized to it [symmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) without a zero point. In the INT4_ASYM mode, it is an unsigned 4-bit integer, and weights are quantized to it [asymmetrically](/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization) with a typical non-fixed zero point. In the NF4 mode, it is the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without a zero point. In the E2M1 mode, it is the [e2m1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data type without a zero point and with an 8-bit [E8M0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) scale.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
All embeddings, convolutions and last linear layers are always compressed to a backup mode, which is "INT8_ASYM" by default. To quantize embeddings and last linear layers to 4-bit, use `all_layers=True`.
The percentage of the remaining layers compressed to 4-bit can be configured by the "ratio" parameter. E.g., ratio=0.9 means 90% of layers are compressed to the corresponding 4-bit data type and the rest to the backup mode. The OpenVINO backend supports 3 backup modes: INT8_SYM, INT8_ASYM, and NONE, which retains the original floating-point precision of the model weights. The backup mode is supported only for mixed-precision weight quantization.
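For illustration, a minimal sketch of how `mode`, `ratio`, `group_size` and `backup_mode` combine in a single `compress_weights` call (the `model` variable and the chosen values are assumptions for the example, not recommendations):

```python
from nncf import BackupMode, CompressWeightsMode, compress_weights

# ~90% of eligible layers are compressed to 4-bit symmetric integer with groups of 128 weights;
# the remaining layers, as well as embeddings, convolutions and last linear layers,
# fall back to the backup mode (8-bit symmetric integer here).
compressed_model = compress_weights(
    model,  # model is an openvino.Model object
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.9,
    group_size=128,
    backup_mode=BackupMode.INT8_SYM,
)
```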

### User guide

@@ -37,6 +37,13 @@ from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM) # model is openvino.Model object
```

- Compress weights to NF4 with group size = 128, except embeddings, convolutions and last linear layers, which remain in the original floating-point precision.

```python
from nncf import compress_weights, BackupMode, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, backup_mode=BackupMode.NONE) # model is openvino.Model object
```

- Generally, the `INT4_SYM` mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase.
Compressing weights asymmetrically (the `INT4_ASYM` mode) is the way to increase accuracy; however, in turn, it slows down inference a bit.
If the accuracy or perplexity is still not satisfying, there are 2 more hyper-parameters to tune: `group_size` and `ratio`. Please refer to the [example](https://github.com/openvinotoolkit/nncf/blob/develop/examples/llm_compression/openvino/tiny_llama_find_hyperparams) for how to automatically tune these parameters. A rough manual starting point is sketched below.
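
As a hedged sketch of such manual tuning (the `model` object and the specific values are illustrative assumptions, not tuned settings):

```python
from nncf import CompressWeightsMode, compress_weights

# Smaller groups and a lower 4-bit ratio typically recover accuracy
# at the cost of a larger footprint and slightly slower inference.
compressed_model = compress_weights(
    model,  # model is an openvino.Model object
    mode=CompressWeightsMode.INT4_ASYM,
    group_size=64,  # finer-grained quantization scales than the default 128
    ratio=0.8,      # keep 20% of eligible layers in the backup precision
)
```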
1 change: 1 addition & 0 deletions nncf/__init__.py
@@ -32,6 +32,7 @@
from nncf.errors import UnsupportedModelError as UnsupportedModelError
from nncf.errors import UnsupportedVersionError as UnsupportedVersionError
from nncf.errors import ValidationError as ValidationError
from nncf.parameters import BackupMode as BackupMode
from nncf.parameters import CompressWeightsMode as CompressWeightsMode
from nncf.parameters import DropType as DropType
from nncf.parameters import ModelType as ModelType
3 changes: 3 additions & 0 deletions nncf/experimental/torch/fx/quantization/quantize_model.py
@@ -29,6 +29,7 @@
from nncf.experimental.torch.fx.transformations import apply_quantization_transformations
from nncf.experimental.torch.fx.transformations import revert_quantization_transformations
from nncf.experimental.torch.fx.transformations import shared_constants_unification_transformation
from nncf.parameters import BackupMode
from nncf.parameters import CompressWeightsMode
from nncf.parameters import ModelType
from nncf.parameters import QuantizationMode
@@ -124,6 +125,7 @@ def compress_weights_impl(
scale_estimation: bool,
gptq: bool,
lora_correction: bool,
backup_mode: BackupMode,
advanced_parameters: Optional[AdvancedCompressionParameters] = None,
) -> torch.fx.GraphModule:
"""
@@ -142,6 +144,7 @@
scale_estimation,
gptq,
lora_correction,
backup_mode,
advanced_parameters,
)
shared_constants_unification_transformation(model)
3 changes: 3 additions & 0 deletions nncf/openvino/quantization/quantize_model.py
@@ -29,6 +29,7 @@
from nncf.openvino.quantization.backend_parameters import is_weight_compression_needed
from nncf.openvino.quantization.quantize_ifmodel import apply_algorithm_if_bodies
from nncf.openvino.rt_info import dump_parameters
from nncf.parameters import BackupMode
from nncf.parameters import CompressWeightsMode
from nncf.parameters import DropType
from nncf.parameters import ModelType
@@ -379,6 +380,7 @@ def compress_weights_impl(
scale_estimation: bool,
gptq: bool,
lora_correction: bool,
backup_mode: BackupMode,
advanced_parameters: Optional[AdvancedCompressionParameters] = None,
) -> ov.Model:
"""
@@ -398,6 +400,7 @@
scale_estimation,
gptq,
lora_correction,
backup_mode,
advanced_parameters,
)
graph = NNCFGraphFactory.create(model)
17 changes: 17 additions & 0 deletions nncf/parameters.py
@@ -96,6 +96,23 @@ class CompressWeightsMode(StrEnum):
E2M1 = "e2m1"


@api(canonical_alias="nncf.BackupMode")
class BackupMode(StrEnum):
"""
Defines a backup mode for weight compression.
:param NONE: Stands for original floating-point precision of the model weights.
In this mode, weights are retained in their original precision without any quantization.
:param INT8_SYM: Stands for 8-bit integer symmetric quantization without zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization
:param INT8_ASYM: Stands for 8-bit integer asymmetric quantization with a typical non-fixed zero point.
https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization
"""

NONE = "none"
INT8_SYM = "int8_sym"
INT8_ASYM = "int8_asym"


@api(canonical_alias="nncf.SensitivityMetric")
class SensitivityMetric(StrEnum):
"""