do_sample=False for NPU in chat_sample, add NPU to README (#1637)
- make chat_sample work out of the box on NPU by forcing `do_sample=False`
when the device is NPU (see the sketch after this list)
- add NPU information to the text_generation samples READMEs

and a small unrelated change:

- change the `pip install` command for models that are already exported and
available on huggingface-hub. There is no need to install all of PyTorch and
transformers if you only need to download a model.
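
A minimal sketch of the chat_sample change, assuming the `openvino_genai` Python API (the real chat_sample takes the model directory and device as command-line arguments; the names below are placeholders):

```python
import openvino_genai

model_dir, device = "<model_dir>", "NPU"  # placeholders
pipe = openvino_genai.LLMPipeline(model_dir, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
# Sampling is not supported on NPU, so force greedy decoding on that device.
if device == "NPU":
    config.do_sample = False

pipe.start_chat()
prompt = input("question:\n")
print(pipe.generate(prompt, config))
pipe.finish_chat()
```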
helena-intel authored Jan 28, 2025
1 parent 3b016df commit 4521bb6
Showing 2 changed files with 24 additions and 2 deletions.
13 changes: 12 additions & 1 deletion samples/cpp/text_generation/README.md
@@ -19,7 +19,7 @@ optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
-pip install --upgrade-strategy eager -r ../../export-requirements.txt
+pip install huggingface-hub
huggingface-cli download <model> --local-dir <output_folder>
```
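
For scripted downloads, a similar result can be had from Python with the `huggingface_hub` API; a small sketch, with the model ID and target folder as placeholders:

```python
from huggingface_hub import snapshot_download

# Download an already-converted OpenVINO IR model from the Hugging Face Hub.
# Replace the placeholders with a model from the collection and a local folder.
snapshot_download(repo_id="<model>", local_dir="<output_folder>")
```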

@@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```
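
One way to apply this default is to write it into the model's `tokenizer_config.json`; a sketch using only the Python standard library, with the model folder as a placeholder:

```python
import json
from pathlib import Path

# Placeholder: path to the exported model folder.
config_path = Path("<output_folder>") / "tokenizer_config.json"
config = json.loads(config_path.read_text(encoding="utf-8"))

# The default ChatML-style template from above, as one Python string.
config["chat_template"] = (
    "{% for message in messages %}{% if (message['role'] == 'user') %}"
    "{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}"
    "{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}"
    "{% endif %}{% endfor %}"
)
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```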

#### NPU support

The NPU device is supported with some limitations; see the [NPU inference of LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular:

- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>`).
  For models with more than 4B parameters, channel-wise quantization should be used (`--group-size -1`).
- Beam search and parallel sampling are not supported.
- Use OpenVINO 2025.0 or later (installed via `deployment-requirements.txt`; see the "Common information" section) and the latest NPU driver.


### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
Expand Down
13 changes: 12 additions & 1 deletion samples/python/text_generation/README.md
@@ -19,7 +19,7 @@ optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
-pip install --upgrade-strategy eager -r ../../export-requirements.txt
+pip install huggingface-hub
huggingface-cli download <model> --local-dir <output_folder>
```

@@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly wi
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```

#### NPU support

The NPU device is supported with some limitations; see the [NPU inference of LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation (a usage sketch follows the list below). In particular:

- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>`).
  For models with more than 4B parameters, channel-wise quantization should be used (`--group-size -1`).
- Beam search and parallel sampling are not supported.
- Use OpenVINO 2025.0 or later (installed via `deployment-requirements.txt`; see the "Common information" section) and the latest NPU driver.
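
A minimal usage sketch under these constraints, assuming the `openvino_genai` Python API and a symmetric-INT4 export in `<output_folder>`:

```python
import openvino_genai

# Placeholder: a model exported with `--weight-format int4 --sym`.
pipe = openvino_genai.LLMPipeline("<output_folder>", "NPU")
# Beam search and parallel sampling are unsupported on NPU, so keep the
# default greedy decoding.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```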


### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
Expand Down
