From 9fa0b6d2dfc2fdaa77fa10d72c844d7cc614f8f3 Mon Sep 17 00:00:00 2001 From: Alexander Dokuchaev Date: Wed, 4 Jun 2025 10:03:45 +0300 Subject: [PATCH 01/22] release notes template --- ReleaseNotes.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 382e2072d99..384d8873918 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -1,5 +1,53 @@ # Release Notes +## New in Release 2.17.0 + +Post-training Quantization: + +- Breaking changes: + - ... +- General: + - ... +- Features: + - ... +- Fixes: + - ... +- Improvements: + - ... +- Deprecations/Removals: + - ... +- Tutorials: + - ... +- Known issues: + - ... + +Compression-aware training: + +- Breaking changes: + - ... +- General: + - ... +- Features: + - ... +- Fixes: + - ... +- Improvements: + - ... +- Deprecations/Removals: + - ... +- Tutorials: + - ... +- Known issues: + - ... + +Deprecations/Removals: + +- ... + +Requirements: + +- ... + ## New in Release 2.16.0 Post-training Quantization: From 5796111d85aea88b4381cb3cf343f2e9ac86ad4f Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Wed, 4 Jun 2025 10:13:38 +0100 Subject: [PATCH 02/22] Update ReleaseNotes.md --- ReleaseNotes.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 384d8873918..a6e1277d42c 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -7,9 +7,11 @@ Post-training Quantization: - Breaking changes: - ... - General: - - ... + - (ONNX) Upgraded ONNX Runtime to version 1.21.1. + - (ONNX) Added an example for LLM weight compression in the ONNX backend. This example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - Features: - - ... + - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. + - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. 
This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - Fixes: - ... - Improvements: From 3316cd0ddc695447e30cbc99c98a327b252e5d9e Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Wed, 4 Jun 2025 10:17:47 +0100 Subject: [PATCH 03/22] Update ReleaseNotes.md --- ReleaseNotes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index a6e1277d42c..8e6dea77bfb 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -11,7 +11,7 @@ Post-training Quantization: - (ONNX) Added an example for LLM weight compression in the ONNX backend. This example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - Features: - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. - - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. + - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. 
This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - Fixes: - ... - Improvements: From fcd41f7166bc784a46ee686eb4f1da552c6d43fe Mon Sep 17 00:00:00 2001 From: Daniil Lyakhov Date: Wed, 4 Jun 2025 13:31:27 +0200 Subject: [PATCH 04/22] Update ReleaseNotes.md --- ReleaseNotes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 8e6dea77bfb..e39923c1320 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -13,7 +13,7 @@ Post-training Quantization: - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - Fixes: - - ... + - Fixed BiasCorrection failures with models without a batch dimention. - Improvements: - ... 
- Deprecations/Removals: From 4a381617a2ad499bd206b8b7bcf834774fe0e4e4 Mon Sep 17 00:00:00 2001 From: Aamir Nazir Date: Wed, 4 Jun 2025 15:42:15 +0400 Subject: [PATCH 05/22] Update ReleaseNotes.md --- ReleaseNotes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index e39923c1320..fa979b3aa20 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -12,6 +12,7 @@ Post-training Quantization: - Features: - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. + - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. - Fixes: - Fixed BiasCorrection failures with models without a batch dimention. - Improvements: From b122e98136e27caee2d19cde0d66d0ab98a89dbc Mon Sep 17 00:00:00 2001 From: Lyalyushkin Nikolay Date: Wed, 4 Jun 2025 22:17:28 +0200 Subject: [PATCH 06/22] Update ReleaseNotes.md Added NLS to the release notes --- ReleaseNotes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index fa979b3aa20..08d336d5b92 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -31,7 +31,7 @@ Compression-aware training: - General: - ... - Features: - - ... 
+ - (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available [here](examples/llm_compression/torch/downstream_qat_with_nls). - Fixes: - ... - Improvements: From 727d7d45debbc56bb77db796f150c0cd3f6c14b5 Mon Sep 17 00:00:00 2001 From: Liubov Talamanova Date: Thu, 5 Jun 2025 15:11:47 +0100 Subject: [PATCH 07/22] Add list of OV notebooks with NNCF to release notes --- ReleaseNotes.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 08d336d5b92..963f67014f2 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -20,7 +20,14 @@ Post-training Quantization: - Deprecations/Removals: - ... - Tutorials: - - ... 
+ - [Post-Training Optimization of MiniCPM-o 2.6 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/minicpm-o-omnimodal-chatbot/minicpm-o-omnimodal-chatbot.ipynb) + - [Post-Training Optimization of Qwen2.5-Omni Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2.5-omni-chatbot/qwen2.5-omni-chatbot.ipynb) + - [Post-Training Optimization of InternVideo2 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/intern-video2-classiciation/intern-video2-classification.ipynb) + - [Post-Training Optimization of OpenVoice2 and MeloTTS Models](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/openvoice2-and-melotts/openvoice2-and-melotts.ipynb) + - [Post-Training Optimization of Flex.2 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/flex.2-image-generation/flex.2-image-generation.ipynb) + - [Post-Training Optimization of Wan2.1 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/wan2.1-text-to-video/wan2.1-text-to-video.ipynb) + - [Post-Training Optimization of Phi-4-mini Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/supplementary_materials/phi4-agent/phi4_agent.py) + - [Post-Training Optimization of Torch.FX Stable Diffusion v3 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/stable-diffusion-v3-torch-fx/stable-diffusion-v3-torch-fx.ipynb) - Known issues: - ... 
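The `BackendParameters.EXTERNAL_DATA_DIR` behaviour described in the patches above boils down to a simple lookup rule: all external data files are searched in one directory, which defaults to the process's current working directory. A minimal sketch of that rule — the helper name `resolve_external_data_path` is hypothetical, and NNCF's real resolution logic may differ:

```python
import os
from typing import Optional


def resolve_external_data_path(location: str, external_data_dir: Optional[str] = None) -> str:
    """Resolve where an ONNX tensor's external data file is expected to live.

    Mirrors the documented rule: when EXTERNAL_DATA_DIR is given, every
    external data file must be in that single directory; when it is omitted,
    the current working directory of the process is assumed.
    """
    base = external_data_dir if external_data_dir is not None else os.getcwd()
    # `location` is the relative file name recorded in the tensor's external_data field.
    return os.path.join(base, os.path.basename(location))
```

Under this rule, the parameter is only needed when the model was loaded with `onnx.load("model.onnx", load_external_data=False)` and the data files sit outside the current working directory.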
From 619f5317740a6f258ab447168b3efaf24f493334 Mon Sep 17 00:00:00 2001 From: andreyanufr Date: Tue, 10 Jun 2025 09:17:50 +0200 Subject: [PATCH 08/22] Update ReleaseNotes.md https://github.com/openvinotoolkit/nncf/pull/3485 https://github.com/openvinotoolkit/nncf/pull/3315 --- ReleaseNotes.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 963f67014f2..c4dd0955ccc 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -13,6 +13,8 @@ Post-training Quantization: - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. + - (OpenVINO, PyTorch) Added data-free AWQ based on the per-column magnitudes of the weights. + - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8. - Fixes: - Fixed BiasCorrection failures with models without a batch dimention. 
- Improvements: From c095bd34df91c3b36816409b007b144747ba2553 Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Tue, 10 Jun 2025 09:29:29 +0100 Subject: [PATCH 09/22] Update ReleaseNotes.md --- ReleaseNotes.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index c4dd0955ccc..b4de3886d41 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -7,10 +7,9 @@ Post-training Quantization: - Breaking changes: - ... - General: - - (ONNX) Upgraded ONNX Runtime to version 1.21.1. - - (ONNX) Added an example for LLM weight compression in the ONNX backend. This example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. + - ... - Features: - - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. + - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. This example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. - (OpenVINO, PyTorch) Added data-free AWQ based on the per-column magnitudes of the weights. 
@@ -57,6 +56,7 @@ Deprecations/Removals: - ... Requirements: + - (ONNX) Upgraded ONNX Runtime to version 1.21.1. - ... From 4ce2fc0565d68c6810073b3fb733ad4caeb8c24d Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Tue, 10 Jun 2025 09:33:29 +0100 Subject: [PATCH 10/22] Update ReleaseNotes.md --- ReleaseNotes.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index b4de3886d41..1afe78345e8 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -52,14 +52,11 @@ Compression-aware training: - ... Deprecations/Removals: - - ... Requirements: - (ONNX) Upgraded ONNX Runtime to version 1.21.1. -- ... - ## New in Release 2.16.0 Post-training Quantization: From 19f81e75a739fcc3ffc7b37576b0b004cb541568 Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Tue, 10 Jun 2025 09:37:05 +0100 Subject: [PATCH 11/22] Update ReleaseNotes.md --- ReleaseNotes.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 1afe78345e8..20998586efa 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -52,10 +52,12 @@ Compression-aware training: - ... Deprecations/Removals: + - ... Requirements: - - (ONNX) Upgraded ONNX Runtime to version 1.21.1. + +- (ONNX) Upgraded ONNX Runtime to version 1.21.1. ## New in Release 2.16.0 From fb61a8a96f753f14881794a379ad24e642e36872 Mon Sep 17 00:00:00 2001 From: Andrey Churkin Date: Tue, 10 Jun 2025 10:00:03 +0100 Subject: [PATCH 12/22] Update ReleaseNotes.md --- ReleaseNotes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 20998586efa..dbe96188cb2 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -9,7 +9,7 @@ Post-training Quantization: - General: - ... - Features: - - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. 
This example showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. + - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. [This example](examples/llm_compression/onnx/tiny_llama) showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. - (OpenVINO, PyTorch) Added data-free AWQ based on the per-column magnitudes of the weights. From b9a2428281ee8b802dd14ce9dc62b33c8ed63e19 Mon Sep 17 00:00:00 2001 From: Alexander Dokuchaev Date: Tue, 10 Jun 2025 22:15:38 +0300 Subject: [PATCH 13/22] Update ReleaseNotes.md --- ReleaseNotes.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index dbe96188cb2..e9b6c503067 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -5,19 +5,24 @@ Post-training Quantization: - Breaking changes: - - ... + - (PyTorch) Updated the model tracing mechanism to use TorchFunctionHook instead of patching torch namespace. 
As a result, layer names have changed, which may require updating the ignored scopes if they rely on specific layer names. - General: - - ... + - (PyTorch) Moved function_hook module from experimental to nncf.torch namespace. The function_hook module is now the default mechanism for model tracing in NNCF. - Features: - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. [This example](examples/llm_compression/onnx/tiny_llama) showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. + - (ONNX) Speed up weight compression for opset<21. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. - (OpenVINO, PyTorch) Added data-free AWQ based on the per-column magnitudes of the weights. - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8. - Fixes: - - Fixed BiasCorrection failures with models without a batch dimention. + - Fixed BiasCorrection failures with models without a batch dimension. + - Aligned quantile centers for NF4 with OpenVINO implementation. - Improvements: - - ... + - (TorchFX) The `nncf.torch.disable_patching()` context manager is no longer required. 
+ - (OpenVINO) Add version of nncf to rt_info. + - Optimized weight compression for NF4. + - Support `transformer>4.52` by `nncf.data.generate_text_data`. - Deprecations/Removals: - ... - Tutorials: @@ -57,7 +62,9 @@ Deprecations/Removals: Requirements: -- (ONNX) Upgraded ONNX Runtime to version 1.21.1. +- Upgraded ONNX Runtime to version 1.21.1. +- Updated PyTorch (2.7.1) and Torchvision (0.22.1) versions. +- Removed jstyleson from requirements. ## New in Release 2.16.0 From f0377fa24cfac7236b60560115b71faab04cfa09 Mon Sep 17 00:00:00 2001 From: Lyalyushkin Nikolay Date: Wed, 11 Jun 2025 17:35:41 +0200 Subject: [PATCH 14/22] clarified positioning two QAT approaches with absorbable LoRA --- ReleaseNotes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index e9b6c503067..4364e1bc42f 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -44,7 +44,7 @@ Compression-aware training: - General: - ... - Features: - - (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available [here](examples/llm_compression/torch/downstream_qat_with_nls). + - (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). 
This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available [here](examples/llm_compression/torch/downstream_qat_with_nls). Building on our previous work with absorbable LoRA adapters, this new pipeline is specifically designed for downstream tasks. In contrast, the pipeline from the previous release was tailored to enhance general accuracy through knowledge distillation using static rank settings. For a more comprehensive understanding of both approaches, please refer to ["Weight-Only Quantization Aware Training with LoRA and NLS"](/docs/usage/training_time_compression/quantization_aware_training_lora/Usage.md) in the ["Training-Time Compression Algorithms"](/README.md#Training-Time-Compression-Algorithms) section of the main README in the repository. - Fixes: - ... - Improvements: From bb59beed834be2501245226ececbe2808e02a898 Mon Sep 17 00:00:00 2001 From: Lyalyushkin Nikolay Date: Wed, 11 Jun 2025 18:01:52 +0200 Subject: [PATCH 15/22] fixes and improvements in QAT + absorbable LoRA --- ReleaseNotes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 4364e1bc42f..71fa2ed3b96 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -46,9 +46,9 @@ Compression-aware training: - Features: - (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). 
This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available [here](examples/llm_compression/torch/downstream_qat_with_nls). Building on our previous work with absorbable LoRA adapters, this new pipeline is specifically designed for downstream tasks. In contrast, the pipeline from the previous release was tailored to enhance general accuracy through knowledge distillation using static rank settings. For a more comprehensive understanding of both approaches, please refer to ["Weight-Only Quantization Aware Training with LoRA and NLS"](/docs/usage/training_time_compression/quantization_aware_training_lora/Usage.md) in the ["Training-Time Compression Algorithms"](/README.md#Training-Time-Compression-Algorithms) section of the main README in the repository. - Fixes: - - ... + - (PyTorch) Minimized the disparity in accuracy between the Torch model and its exported OpenVINO equivalent for ["Weight-Only Quantization Aware Training with LoRA and NLS"](/docs/usage/training_time_compression/quantization_aware_training_lora/Usage.md). - Improvements: - - ... + - (PyTorch) The evaluation and selection process for the best checkpoint in "QAT + absorbable LoRA" with knowledge distillation has been revised. The tuned Torch model is now evaluated using the validation split of Wikitext, while the final results are measured on the test split with the OpenVINO model. The [results table for Wikitext](/examples/llm_compression/torch/distillation_qat_with_lora/README.md#results-on-wikitext) has been updated accordingly and now includes three additional models.
- Deprecations/Removals: - ... - Tutorials: From 8481229c3167377866f1c21a37599669398fab3c Mon Sep 17 00:00:00 2001 From: Alexander Suslov Date: Thu, 12 Jun 2025 10:05:36 +0400 Subject: [PATCH 16/22] Update ReleaseNotes.md --- ReleaseNotes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 71fa2ed3b96..42795fb0c3c 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -18,6 +18,7 @@ Post-training Quantization: - Fixes: - Fixed BiasCorrection failures with models without a batch dimension. - Aligned quantile centers for NF4 with OpenVINO implementation. + - Weights compression statistics collection has been fixed to show the data types of ignored weights. - Improvements: - (TorchFX) The `nncf.torch.disable_patching()` context manager is no longer required. - (OpenVINO) Add version of nncf to rt_info. From bcf868e2b37451506dc7390ca3965598a9e538b4 Mon Sep 17 00:00:00 2001 From: Alexander Dokuchaev Date: Thu, 12 Jun 2025 15:39:48 +0300 Subject: [PATCH 17/22] Update ReleaseNotes.md --- ReleaseNotes.md | 34 +++++++--------------------------- 1 file changed, 7 insertions(+), 27 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 42795fb0c3c..519883fcde5 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -4,28 +4,24 @@ Post-training Quantization: -- Breaking changes: - - (PyTorch) Updated the model tracing mechanism to use TorchFunctionHook instead of patching torch namespace. As a result, layer names have changed, which may require updating the ignored scopes if they rely on specific layer names. - General: - (PyTorch) Moved function_hook module from experimental to nncf.torch namespace. The function_hook module is now the default mechanism for model tracing in NNCF. + - (PyTorch) The function_hook module is now the default mechanism for model tracing. It has moved out from experimental status and has been moved to the core nncf.torch namespace.
- Features: + - (OpenVINO, PyTorch) Added 4-bit data-free AWQ based on the per-column magnitudes of the weights.. + - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8. - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. [This example](examples/llm_compression/onnx/tiny_llama) showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API. - - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model’s external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - - (ONNX) Speed up weight compression for opset<21. + - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model's external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. 
- - (OpenVINO, PyTorch) Added data-free AWQ based on the per-column magnitudes of the weights. - - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8. - Fixes: + - (TorchFX) The `nncf.torch.disable_patching()` context manager is no longer required. - Fixed BiasCorrection failures with models without a batch dimension. - Aligned quantile centers for NF4 with OpenVINO implementation. - Weights compression statistics collection have been fixed to show the data types of ignored weights. - Improvements: - - (TorchFX) The `nncf.torch.disable_patching()` context manager is no longer required. - (OpenVINO) Add version of nncf to rt_info. - - Optimized weight compression for NF4. + - Optimized weight compression for NF4 (up to 10x speed up). - Support `transformer>4.52` by `nncf.data.generate_text_data`. -- Deprecations/Removals: - - ... - Tutorials: - [Post-Training Optimization of MiniCPM-o 2.6 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/minicpm-o-omnimodal-chatbot/minicpm-o-omnimodal-chatbot.ipynb) - [Post-Training Optimization of Qwen2.5-Omni Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2.5-omni-chatbot/qwen2.5-omni-chatbot.ipynb) @@ -35,35 +31,19 @@ Post-training Quantization: - [Post-Training Optimization of Wan2.1 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/wan2.1-text-to-video/wan2.1-text-to-video.ipynb) - [Post-Training Optimization of Phi-4-mini Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/supplementary_materials/phi4-agent/phi4_agent.py) - [Post-Training Optimization of Torch.FX Stable Diffusion v3 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/stable-diffusion-v3-torch-fx/stable-diffusion-v3-torch-fx.ipynb) -- Known issues: - - ... Compression-aware training: -- Breaking changes: - - ... -- General: - - ... 
- Features: - (PyTorch) For downstream tasks, we introduce Quantization-Aware Training (QAT) with absorbable elastic LoRA adapters and neural low-rank search (NLS). This novel weight compression method enhances the accuracy of Large Language Models (LLMs) with int4 weights on downstream tasks, achieving a reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA_NLS`. A sample QAT compression pipeline with preview support is available [here](examples/llm_compression/torch/downstream_qat_with_nls). Building on our previous work with absorbable LoRA adapters, this new pipeline is specifically designed for downstream tasks. In contrast, the pipeline from the previous release was tailored to enhance general accuracy through knowledge distillation using static rank settings. For a more comprehensive understanding of both approaches, please refer to ["Weight-Only Quantization Aware Training with LoRA and NLS"](/docs/usage/training_time_compression/quantization_aware_training_lora/Usage.md) in the ["Training-Time Compression Algorithms"](/README.md#Training-Time-Compression-Algorithms) section of the main README in the repository. - Fixes: - (PyTorch) Minimized the disparity in accuracy between the Torch model and its exported OpenVINO equivalent for ["Weight-Only Quantization Aware Training with LoRA and NLS"](/docs/usage/training_time_compression/quantization_aware_training_lora/Usage.md). - Improvements: - (Pytorch) The evaluation and selection process for the best checkpoint in "QAT + absorbable LoRA" with knowledge distillation has been revised. The tuned Torch model is now evaluated using the validation split of Wikitext, while the final results are measured on the test split with the OpenVINO model. 
The [results table for Wikitext](/examples/llm_compression/torch/distillation_qat_with_lora/README.md#results-on-wikitext) has been updated accordingly and now includes three additional models.
-- Deprecations/Removals:
-  - ...
-- Tutorials:
-  - ...
-- Known issues:
-  - ...
-
-Deprecations/Removals:
-
-- ...

Requirements:

-- Upgraded ONNX Runtime to version 1.21.1.
+- Updated ONNX Runtime to version 1.21.1.
- Updated PyTorch (2.7.1) and Torchvision (0.22.1) versions.
- Removed jstyleson from requirements.

From d96a1fd1311e2df0243c652a78de2e2ea4159713 Mon Sep 17 00:00:00 2001
From: Maksim Proshin
Date: Sat, 14 Jun 2025 13:43:32 +0400
Subject: [PATCH 18/22] Update ReleaseNotes.md

---
 ReleaseNotes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ReleaseNotes.md b/ReleaseNotes.md
index 519883fcde5..e9eb9d534ae 100644
--- a/ReleaseNotes.md
+++ b/ReleaseNotes.md
@@ -8,7 +8,7 @@ Post-training Quantization:
 - (PyTorch) Moved function_hook module from experimental to nncf.torch namespace. The function_hook module is now the default mechanism for model tracing in NNCF.
 - (PyTorch) The function_hook module is now the default mechanism for model tracing. It has moved out from experimental status and has been moved to the core nncf.torch namespace.
- Features:
-  - (OpenVINO, PyTorch) Added 4-bit data-free AWQ based on the per-column magnitudes of the weights..
+  - (OpenVINO, PyTorch, TorchFX) Added 4-bit data-free AWQ (Activation-aware Weight Quantization) based on the per-column magnitudes of the weights, making it possible to apply AWQ without a dataset for more accurate compression.
  - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8.
  - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend.
[This example](examples/llm_compression/onnx/tiny_llama) showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API.
  - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model's external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process.

From 3e75ab3015795b4d16583ec0ce2ecbc501b3c6d6 Mon Sep 17 00:00:00 2001
From: Maksim Proshin
Date: Sat, 14 Jun 2025 13:55:07 +0400
Subject: [PATCH 19/22] Update ReleaseNotes.md

---
 ReleaseNotes.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ReleaseNotes.md b/ReleaseNotes.md
index e9eb9d534ae..241fe627441 100644
--- a/ReleaseNotes.md
+++ b/ReleaseNotes.md
@@ -19,9 +19,9 @@ Post-training Quantization:
  - Aligned quantile centers for NF4 with OpenVINO implementation.
  - Weights compression statistics collection have been fixed to show the data types of ignored weights.
- Improvements:
-  - (OpenVINO) Add version of nncf to rt_info.
+  - (OpenVINO) Added the version of NNCF to rt_info.
  - Optimized weight compression for NF4 (up to 10x speed up).
-  - Support `transformer>4.52` by `nncf.data.generate_text_data`.
+  - Support for `transformers>4.52` by `nncf.data.generate_text_data`.
- Tutorials: - [Post-Training Optimization of MiniCPM-o 2.6 Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/minicpm-o-omnimodal-chatbot/minicpm-o-omnimodal-chatbot.ipynb) - [Post-Training Optimization of Qwen2.5-Omni Model](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2.5-omni-chatbot/qwen2.5-omni-chatbot.ipynb) From a02d6b2243f2881fd8178193f46aba675186b08b Mon Sep 17 00:00:00 2001 From: Maksim Proshin Date: Sat, 14 Jun 2025 14:14:48 +0400 Subject: [PATCH 20/22] Update ReleaseNotes.md --- ReleaseNotes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 241fe627441..379fabe2012 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -14,7 +14,7 @@ Post-training Quantization: - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model's external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)`, and the external data files are not in the current working directory of the process. This parameter can be omitted if the external data files are located in the current working directory of the process. - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss. - Fixes: - - (TorchFX) The `nncf.torch.disable_patching()` context manager is no longer required. + - (TorchFX, Experimental) The `nncf.torch.disable_patching()` context manager is no longer required. - Fixed BiasCorrection failures with models without a batch dimension. - Aligned quantile centers for NF4 with OpenVINO implementation. - Weights compression statistics collection have been fixed to show the data types of ignored weights. 
@@ -43,7 +43,7 @@ Compression-aware training: Requirements: -- Updated ONNX Runtime to version 1.21.1. +- Updated ONNX Runtime (1.21.1). - Updated PyTorch (2.7.1) and Torchvision (0.22.1) versions. - Removed jstyleson from requirements. From c58ddcec598d0a3fbe31f154abf6a3b2a4efea47 Mon Sep 17 00:00:00 2001 From: Alexander Suslov Date: Mon, 16 Jun 2025 11:39:02 +0400 Subject: [PATCH 21/22] Update ReleaseNotes.md --- ReleaseNotes.md | 1 - 1 file changed, 1 deletion(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 379fabe2012..9f2cd9ccd71 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -5,7 +5,6 @@ Post-training Quantization: - General: - - (PyTorch) Moved function_hook module from experimental to nncf.torch namespace. The function_hook module is now the default mechanism for model tracing in NNCF. - (PyTorch) The function_hook module is now the default mechanism for model tracing. It has moved out from experimental status and has been moved to the core nncf.torch namespace. - Features: - (OpenVINO, PyTorch, TorchFX) Added 4-bit data-free AWQ (Activation-aware Weight Quantization) based on the per-column magnitudes of the weights making it possible to apply AWQ without a dataset for more accurate compression. From bfd8465fba2c73c97df2796c7dc904ef97310ceb Mon Sep 17 00:00:00 2001 From: Alexander Suslov Date: Mon, 16 Jun 2025 11:49:01 +0400 Subject: [PATCH 22/22] Update ReleaseNotes.md --- ReleaseNotes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ReleaseNotes.md b/ReleaseNotes.md index 9f2cd9ccd71..1316d854263 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -8,12 +8,12 @@ Post-training Quantization: - (PyTorch) The function_hook module is now the default mechanism for model tracing. It has moved out from experimental status and has been moved to the core nncf.torch namespace. 
- Features:
  - (OpenVINO, PyTorch, TorchFX) Added 4-bit data-free AWQ (Activation-aware Weight Quantization) based on the per-column magnitudes of the weights, making it possible to apply AWQ without a dataset for more accurate compression.
-  - (OpenVINO) Added support for quantizing of the V/V_proj input for ScaledDotProductAttention for FP8.
+  - (OpenVINO) Added support for quantizing the value input of ScaledDotProductAttention for FP8.
  - (ONNX) Added support for data-free weight compression using INT4 (INT8) in the ONNX backend. Added an example for LLM weight compression in the ONNX backend. [This example](examples/llm_compression/onnx/tiny_llama) showcases the optimization of the `TinyLlama-1.1B-Chat-v0.3` model in ONNX format using the NNCF weight compression API.
  - (ONNX) Added the `BackendParameters.EXTERNAL_DATA_DIR` parameter for the ONNX backend. This parameter specifies the absolute path to the directory where the model's external data files are stored. All external data files must be located in the same directory. It should be used when the model is loaded without external data using `onnx.load("model.onnx", load_external_data=False)` and the external data files are not in the current working directory of the process; it can be omitted if they are.
  - (TorchFX, Experimental) Added support for 4-bit weight compression with AWQ and Scale Estimation data-aware methods to reduce accuracy loss.
- Fixes:
-  - (TorchFX, Experimental) The `nncf.torch.disable_patching()` context manager is no longer required.
+  - (TorchFX, Experimental) To simplify usage, the `nncf.torch.disable_patching()` context manager is no longer required ([example](/examples/post_training_quantization/torch_fx/resnet18/README.md)).
  - Fixed BiasCorrection failures with models without a batch dimension.
  - Aligned quantile centers for NF4 with OpenVINO implementation.
  - Weights compression statistics collection has been fixed to show the data types of ignored weights.