Merge remote-tracking branch 'upstream/master' into mitruska/glu_shap…
…e_infer
mitruska committed Dec 2, 2024
2 parents 6891c09 + 0d9d14d commit 2e4971d
Showing 358 changed files with 12,927 additions and 7,058 deletions.
6 changes: 3 additions & 3 deletions .github/actions/cache/package-lock.json

4 changes: 2 additions & 2 deletions docs/articles_en/about-openvino/performance-benchmarks.rst
Original file line number Diff line number Diff line change
@@ -64,7 +64,7 @@ implemented in your solutions. Click the buttons below to see the chosen benchma
:outline:
:expand:

:material-regular:`bar_chart;1.4em` OVMS for GenAI (coming soon)
:material-regular:`bar_chart;1.4em` OVMS for GenAI



@@ -163,7 +163,7 @@ For a listing of all platforms and configurations used for testing, refer to the
2024.5, as of November 20, 2024.

* OpenVINO Model Server performance results are based on release
2024.4, as of Sept. 30, 2024.
2024.5, as of November 20, 2024.

The results may not reflect all publicly available updates. Intel technologies' features and
benefits depend on system configuration and may require enabled hardware, software, or service
Expand Down
6 changes: 3 additions & 3 deletions docs/articles_en/about-openvino/release-notes-openvino.rst
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@ What's new

* New models supported: Llama 3.2 (1B & 3B), Gemma 2 (2B & 9B), and YOLO11.
* LLM support on NPU: Llama 3 8B, Llama 2 7B, Mistral-v0.2-7B, Qwen2-7B-Instruct and Phi-3
Mini-Instruct.
Mini-Instruct.
* Noteworthy notebooks added: Sam2, Llama3.2, Llama3.2 - Vision, Wav2Lip, Whisper, and Llava.
* Preview: support for Flax, a high-performance Python neural network library based on JAX.
Its modular design allows for easy customization and accelerated inference on GPUs.
@@ -87,8 +87,8 @@ Common
* A new constant constructor has been added, enabling constants to be created from data pointer
as shared memory. Additionally, it can take ownership of a shared, or other, object, avoiding
a two-step process to wrap memory into ``ov::Tensor``.
* Files are now read via the async ReadFile API, reducing the bottleneck for LLM model load
times on GPU.
* Asynchronous file reading with mmap library has been implemented, reducing loading times for
model files, especially for LLMs.
* CPU implementation of SliceScatter operator is now available, used for models such as Gemma,
supporting increased LLM performance.

Expand Down
Original file line number Diff line number Diff line change
@@ -37,7 +37,7 @@ CPU
* Ubuntu 20.04 long-term support (LTS), 64-bit (Kernel 5.15+)
* macOS 12.6 and above, 64-bit and ARM64
* CentOS 7
* Red Hat Enterprise Linux 9.3-9.4, 64-bit
* Red Hat Enterprise Linux (RHEL) 8 and 9, 64-bit
* openSUSE Tumbleweed, 64-bit and ARM64
* Ubuntu 20.04 ARM64

@@ -65,7 +65,7 @@ GPU
* Ubuntu 22.04 long-term support (LTS), 64-bit
* Ubuntu 20.04 long-term support (LTS), 64-bit
* CentOS 7
* Red Hat Enterprise Linux 9.3-9.4, 64-bit
* Red Hat Enterprise Linux (RHEL) 8 and 9, 64-bit

.. tab-item:: Additional considerations

Expand Down
2 changes: 1 addition & 1 deletion docs/articles_en/get-started/install-openvino.rst
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@ Install OpenVINO™ 2024.5

<script type="module" crossorigin src="../_static/selector-tool/assets/index-Codcw3jz.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<iframe id="selector" src="../_static/selector-tool/selector-451bede.html" style="width: 100%; border: none" title="Download Intel® Distribution of OpenVINO™ Toolkit"></iframe>
<iframe id="selector" src="../_static/selector-tool/selector-2a63478.html" style="width: 100%; border: none" title="Download Intel® Distribution of OpenVINO™ Toolkit"></iframe>

OpenVINO 2024.5, described here, is not a Long-Term-Support version!
All currently supported versions are:
Expand Down
4 changes: 2 additions & 2 deletions docs/articles_en/learn-openvino.rst
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ Learn OpenVINO

Interactive Tutorials (Python) <learn-openvino/interactive-tutorials-python>
Sample Applications (Python & C++) <learn-openvino/openvino-samples>
Large Language Model Inference Guide <learn-openvino/llm_inference_guide>
Generative AI workflow <learn-openvino/llm_inference_guide>



@@ -29,5 +29,5 @@ as well as an experienced user.
| :doc:`OpenVINO Samples <learn-openvino/openvino-samples>`
| The OpenVINO samples (Python and C++) are simple console applications that show how to use specific OpenVINO API features. They can assist you in executing tasks such as loading a model, running inference, querying particular device capabilities, etc.
| :doc:`Large Language Models in OpenVINO <learn-openvino/llm_inference_guide>`
| :doc:`Generative AI workflow <learn-openvino/llm_inference_guide>`
| Detailed information on how OpenVINO accelerates Generative AI use cases and what models it supports. This tutorial provides instructions for running Generative AI models using Hugging Face Optimum Intel and Native OpenVINO APIs.
214 changes: 90 additions & 124 deletions docs/articles_en/learn-openvino/llm_inference_guide.rst
Original file line number Diff line number Diff line change
@@ -1,140 +1,106 @@
Large Language Model Inference Guide
Generative AI workflow
========================================

.. meta::
:description: Explore learning materials, including interactive
Python tutorials and sample console applications that explain
how to use OpenVINO features.
:description: Learn how to use OpenVINO to run generative AI models.


.. toctree::
:maxdepth: 1
:hidden:

Run LLMs with Optimum Intel <llm_inference_guide/llm-inference-hf>
Run LLMs on OpenVINO GenAI Flavor <llm_inference_guide/genai-guide>
Run LLMs on Base OpenVINO <llm_inference_guide/llm-inference-native-ov>
Inference with OpenVINO GenAI <llm_inference_guide/genai-guide>
Inference with Optimum Intel <llm_inference_guide/llm-inference-hf>
Generative AI with Base OpenVINO (not recommended) <llm_inference_guide/llm-inference-native-ov>
OpenVINO Tokenizers <llm_inference_guide/ov-tokenizers>

Large Language Models (LLMs) like GPT are transformative deep learning networks capable of a
broad range of natural language tasks, from text generation to language translation. OpenVINO
optimizes the deployment of these models, enhancing their performance and integration into
various applications. This guide shows how to use LLMs with OpenVINO, from model loading and
conversion to advanced use cases.


Generative AI is a specific area of Deep Learning models used for producing new and “original”
data, based on input in the form of image, sound, or natural language text. Due to their
complexity and size, generative AI pipelines are more difficult to deploy and run efficiently.
OpenVINO simplifies the process and ensures high-performance integrations, with the following
options:

.. tab-set::

.. tab-item:: OpenVINO GenAI

| - Suggested for production deployment for the supported use cases.
| - Smaller footprint and fewer dependencies.
| - More optimization and customization options.
| - Available in both Python and C++.
| - A limited set of supported use cases.
:doc:`Install the OpenVINO GenAI package <../get-started/install-openvino/install-openvino-genai>`
and run generative models out of the box. With custom
API and tokenizers, among other components, it manages the essential tasks such as the
text generation loop, tokenization, and scheduling, offering ease of use and high
performance.

.. tab-item:: Hugging Face integration

| - Suggested for prototyping and, if the use case is not covered by OpenVINO GenAI, production.
| - Bigger footprint and more dependencies.
| - Limited customization due to Hugging Face dependency.
| - Not usable for C++ applications.
| - A very wide range of supported models.
Using Optimum Intel is a great way to experiment with different models and scenarios,
thanks to a simple interface for the popular API and infrastructure offered by Hugging Face.
It also enables weight compression with
`Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
as well as conversion on the fly. For integration with the final product it may offer
lower performance, though.

`Check out the GenAI Quick-start Guide [PDF] <https://docs.openvino.ai/2024/_static/download/GenAI_Quick_Start_Guide.pdf>`__
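
As a rough illustration of the two options in the tabs above, the following Python sketch runs text generation first through OpenVINO GenAI and then through Optimum Intel. It assumes the ``openvino-genai``, ``optimum-intel``, and ``transformers`` packages are installed. The function names, the model directory, and the model ID are placeholders chosen for this example and do not come from the original guide.

```python
def generate_with_genai(model_dir, prompt, device="CPU", max_new_tokens=100):
    """Generate text with OpenVINO GenAI.

    model_dir is assumed to contain a model already exported to the
    OpenVINO format together with its converted tokenizer.
    """
    # Imported lazily so the module can be loaded without the package.
    import openvino_genai as ov_genai

    # The pipeline handles tokenization and the generation loop internally.
    pipe = ov_genai.LLMPipeline(model_dir, device)
    return str(pipe.generate(prompt, max_new_tokens=max_new_tokens))


def generate_with_optimum(model_id, prompt, max_new_tokens=100):
    """Generate text with Optimum Intel, converting the model on the fly.

    model_id is assumed to be a Hugging Face Hub ID or a local path.
    """
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # export=True converts the source model to the OpenVINO format at load time.
    model = OVModelForCausalLM.from_pretrained(model_id, export=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call such as ``generate_with_genai("./my-exported-model", "What is OpenVINO?")`` would return the generated text, where ``./my-exported-model`` is a hypothetical path. The GenAI variant needs no tokenizer code in the application, which reflects the smaller footprint described in the first tab.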

The advantages of using OpenVINO for LLM deployment:

* **OpenVINO offers optimized LLM inference**:
provides a full C/C++ API, leading to faster operation than Python-based runtimes; includes a
Python API for rapid development, with the option for further optimization in C++.
* **Compatible with diverse hardware**:
supports CPUs, GPUs, and neural accelerators across ARM and x86/x64 architectures, integrated
Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data
Center GPU Flex Series; features automated optimization to maximize performance on target
hardware.
* **Requires fewer dependencies**:
than frameworks like Hugging Face and PyTorch, resulting in a smaller binary size and reduced
memory footprint, making deployments easier and updates more manageable.
* **Provides compression and precision management techniques**:
such as 8-bit and 4-bit weight compression, including embedding layers, and storage format
reduction. This includes fp16 precision for non-compressed models and int8/int4 for compressed
models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.
* **Supports a wide range of deep learning models and architectures**:
including text, image, and audio generative models like Llama 2, MPT, OPT, Stable Diffusion,
Stable Diffusion XL. This enables the development of multimodal applications, allowing for
write-once, deploy-anywhere capabilities.
* **Enhances inference capabilities**:
fused inference primitives such as Scaled Dot Product Attention, Rotary Positional Embedding,
Group Query Attention, and Mixture of Experts. It also offers advanced features like in-place
KV-cache, dynamic quantization, KV-cache quantization and encapsulation, dynamic beam size
configuration, and speculative sampling.
* **Provides stateful model optimization**:
models from the Hugging Face Transformers are converted into a stateful form, optimizing
inference performance and memory usage in long-running text generation tasks by managing past
KV-cache tensors more efficiently internally. This feature is automatically activated for many
supported models, while unsupported ones remain stateless. Learn more about the
:doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.

OpenVINO offers three main paths for Generative AI use cases:

* **Hugging Face**: use OpenVINO as a backend for Hugging Face frameworks (transformers,
diffusers) through the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__
extension.
* **OpenVINO GenAI Flavor**: use OpenVINO GenAI APIs (Python and C++).
* **Base OpenVINO**: use OpenVINO native APIs (Python and C++) with
`custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.

In both cases, the OpenVINO runtime is used for inference, and OpenVINO tools are used for
optimization. The main differences are in footprint size, ease of use, and customizability.

The Hugging Face API is easy to learn, provides a simple interface and hides the complexity of
model initialization and text generation for a better developer experience. However, it has more
dependencies, less customization, and cannot be ported to C/C++.

The OpenVINO GenAI Flavor reduces the complexity of LLMs implementation by
automatically managing essential tasks like the text generation loop, tokenization,
and scheduling. The Native OpenVINO API provides a more hands-on experience,
requiring manual setup of these functions. Both methods are designed to minimize dependencies
and the overall application footprint and enable the use of generative models in C++ applications.

It is recommended to start with Hugging Face frameworks to experiment with different models and
scenarios. Then the model can be used with OpenVINO APIs if it needs to be optimized
further. Optimum Intel provides interfaces that enable model optimization (weight compression)
using `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
and export models to the OpenVINO model format for use in native API applications.

Proceed to run LLMs with:
.. dropdown:: Fewer dependencies and smaller footprint
:animate: fade-in-slide-down
:color: secondary

OpenVINO has fewer dependencies than frameworks such as Hugging Face and PyTorch. Its smaller
binary size and reduced memory footprint make deployments easier and updates more manageable.

.. dropdown:: Compression and precision management
:animate: fade-in-slide-down
:color: secondary

Techniques such as 8-bit and 4-bit weight compression, including embedding layers, and storage
format reduction. This includes fp16 precision for non-compressed models and int8/int4 for
compressed models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.

.. dropdown:: Enhanced inference capabilities
:animate: fade-in-slide-down
:color: secondary

Advanced features are available, such as in-place KV-cache, dynamic quantization, KV-cache
quantization and encapsulation, dynamic beam size configuration, and speculative sampling.

.. dropdown:: Stateful model optimization
:animate: fade-in-slide-down
:color: secondary

Models from the Hugging Face Transformers are converted into a stateful form, optimizing
inference performance and memory usage in long-running text generation tasks by managing past
KV-cache tensors more efficiently internally. This feature is automatically activated for
many supported models, while unsupported ones remain stateless. Learn more about the
:doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.

.. dropdown:: Optimized LLM inference
:animate: fade-in-slide-down
:color: secondary

Includes a Python API for rapid development and C++ for further optimization, offering
better performance than Python-based runtimes.
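
The 4-bit weight compression mentioned in the dropdowns above can be sketched with NNCF roughly as follows. This is a minimal example under stated assumptions: the ``nncf`` and ``openvino`` packages are installed, the function name ``compress_weights_int4`` and both file paths are hypothetical, and the default values for ``group_size`` and ``ratio`` are illustrative rather than recommended settings.

```python
def compress_weights_int4(model_xml, output_xml, group_size=128, ratio=1.0):
    """Compress the weights of an OpenVINO model to symmetric INT4.

    model_xml and output_xml are hypothetical paths to OpenVINO IR files.
    """
    # Imported lazily so the module can be loaded without the packages.
    import nncf
    import openvino as ov

    model = ov.Core().read_model(model_xml)

    # ratio=1.0 requests INT4 for all eligible weights; a lower ratio
    # keeps part of the layers in a higher precision instead.
    compressed = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        group_size=group_size,
        ratio=ratio,
    )

    ov.save_model(compressed, output_xml)
    return output_xml
```

Passing ``group_size=-1`` would switch from group-wise to channel-wise quantization.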


Proceed to guides on:

* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`
* :doc:`OpenVINO GenAI Flavor <./llm_inference_guide/genai-guide>`
* :doc:`Native OpenVINO API <./llm_inference_guide/llm-inference-native-ov>`

The table below summarizes the differences between Hugging Face and the native OpenVINO API
approaches.

.. dropdown:: Differences between Hugging Face and the native OpenVINO API

.. list-table::
:widths: 20 25 55
:header-rows: 1

* -
- Hugging Face through OpenVINO
- OpenVINO Native API
* - Model support
- Supports transformer-based models such as LLMs
- Supports all model architectures from most frameworks
* - APIs
- Python (Hugging Face API)
- Python, C++ (OpenVINO API)
* - Model Format
- Source Framework / OpenVINO
- Source Framework / OpenVINO
* - Inference code
- Hugging Face based
- Custom inference pipelines
* - Additional dependencies
- Many Hugging Face dependencies
- Lightweight (e.g. numpy, etc.)
* - Application footprint
- Large
- Small
* - Pre/post-processing and glue code
- Provided through high-level Hugging Face APIs
- Must be custom implemented (see OpenVINO samples and notebooks)
* - Performance
- Good, but less efficient compared to native APIs
- Inherent speed advantage with C++, but requires hands-on optimization
* - Flexibility
- Constrained to Hugging Face API
- High flexibility with Python and C++; allows custom coding
* - Learning Curve and Effort
- Lower learning curve; quick to integrate
- Higher learning curve; requires more effort in integration
* - Ideal Use Case
- Ideal for quick prototyping and Python-centric projects
- Best suited for high-performance, resource-optimized production environments
* - Model Serving
- Paid service, based on CPU/GPU usage with Hugging Face
- Free code solution, run script for own server; costs may incur for cloud services
like AWS but generally cheaper than Hugging Face rates
* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`


Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
Run LLMs with OpenVINO GenAI Flavor on NPU
Inference with OpenVINO GenAI
==========================================

.. meta::
@@ -20,21 +20,22 @@ Install required dependencies:
pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
NOTE that for systems based on Intel® Core Ultra Processors Series 2 and 16 GB of RAM,
prompts longer than 1024 characters will not work with a model of 7B or more parameters,
Note that for systems based on Intel® Core Ultra Processors Series 2, more than 16GB of RAM
may be required to run prompts over 1024 tokens on models exceeding 7B parameters,
such as Llama-2-7B, Mistral-0.2-7B, and Qwen-2-7B.

Export an LLM model via Hugging Face Optimum-Intel
##################################################

Since **symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU**, make sure to export
the model with the proper conversion and optimization settings.
Since **symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU**, make
sure to export the model with the proper conversion and optimization settings.

| You may export LLMs via Optimum-Intel, using one of two compression methods:
| **group quantization** - for both smaller and larger models,
| **channel-wise quantization** - remarkably effective, but only for models exceeding 1 billion parameters.
You select one of the methods by setting the ``--group-size`` parameter to either ``128`` or ``-1``, respectively. See the following examples:
You select one of the methods by setting the ``--group-size`` parameter to either ``128`` or
``-1``, respectively. See the following examples:

.. tab-set::

Expand Down
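
To make the ``--group-size`` choice described above concrete, here is a small Python sketch that assembles the ``optimum-cli export openvino`` command for either method. The helper name ``build_export_command``, the model ID, and the output directory are assumptions made for this example. The ``--weight-format int4``, ``--sym``, and ``--ratio 1.0`` options are one plausible set of settings for symmetric INT4 export and should be checked against the Optimum Intel documentation for the installed version.

```python
import shlex

# Mapping taken from the text above: 128 selects group quantization,
# -1 selects channel-wise quantization.
GROUP_SIZES = {"group": 128, "channel-wise": -1}


def build_export_command(model_id, output_dir, method="group"):
    """Return the optimum-cli argument list for a symmetric INT4 export."""
    if method not in GROUP_SIZES:
        raise ValueError(f"unknown method {method!r}, expected one of {sorted(GROUP_SIZES)}")
    return [
        "optimum-cli", "export", "openvino",
        "-m", model_id,
        "--weight-format", "int4",
        "--sym",
        "--ratio", "1.0",
        "--group-size", str(GROUP_SIZES[method]),
        output_dir,
    ]


if __name__ == "__main__":
    # Hypothetical model ID and output directory, for illustration only.
    for method in GROUP_SIZES:
        cmd = build_export_command("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "TinyLlama-int4", method)
        print(shlex.join(cmd))
```

The returned list can be passed directly to ``subprocess.run`` once ``optimum-intel`` is installed. Building it as a list rather than a single string avoids shell quoting issues with the negative group size.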