[DOCS] GenAI tweaks (#28054)
kblaszczak-intel authored Dec 16, 2024
1 parent 97dcb16 commit ab3dcfb
Showing 4 changed files with 35 additions and 44 deletions.
63 changes: 25 additions & 38 deletions docs/articles_en/learn-openvino/llm_inference_guide.rst
Original file line number Diff line number Diff line change
@@ -20,12 +20,12 @@ Generative AI workflow
Generative AI is a specific area of Deep Learning models used for producing new and “original”
data, based on input in the form of image, sound, or natural language text. Due to their
complexity and size, generative AI pipelines are more difficult to deploy and run efficiently.
OpenVINO simplifies the process and ensures high-performance integrations, with the following
options:

.. tab-set::

.. tab-item:: OpenVINO GenAI

| - Suggested for production deployment for the supported use cases.
| - Smaller footprint and fewer dependencies.
@@ -39,6 +39,8 @@ options:
text generation loop, tokenization, and scheduling, offering ease of use and high
performance.

`Check out the OpenVINO GenAI Quick-start Guide [PDF] <https://docs.openvino.ai/2024/_static/download/GenAI_Quick_Start_Guide.pdf>`__

.. tab-item:: Hugging Face integration

| - Suggested for prototyping and, if the use case is not covered by OpenVINO GenAI, production.
@@ -54,49 +56,34 @@ options:
as well as conversion on the fly. For integration with the final product it may offer
lower performance, though.

`Check out the GenAI Quick-start Guide [PDF] <https://docs.openvino.ai/2024/_static/download/GenAI_Quick_Start_Guide.pdf>`__

The advantages of using OpenVINO for LLM deployment:

.. dropdown:: Fewer dependencies and smaller footprint
:animate: fade-in-slide-down
:color: secondary

Less bloated than frameworks such as Hugging Face and PyTorch, with a smaller binary size and reduced
memory footprint, makes deployments easier and updates more manageable.

.. dropdown:: Compression and precision management
:animate: fade-in-slide-down
:color: secondary

Techniques such as 8-bit and 4-bit weight compression, including embedding layers, and storage
format reduction. This includes fp16 precision for non-compressed models and int8/int4 for
compressed models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.

.. dropdown:: Enhanced inference capabilities
:animate: fade-in-slide-down
:color: secondary
The advantages of using OpenVINO for generative model deployment:

Advanced features like in-place KV-cache, dynamic quantization, KV-cache quantization and
encapsulation, dynamic beam size configuration, and speculative sampling, and more are
available.
| **Fewer dependencies and smaller footprint**
| Leaner than frameworks such as Hugging Face and PyTorch: a smaller binary size and reduced
memory footprint make deployments easier and updates more manageable.
.. dropdown:: Stateful model optimization
:animate: fade-in-slide-down
:color: secondary
| **Compression and precision management**
| OpenVINO supports 8-bit and 4-bit weight compression, including for embedding layers, as well as
storage format reduction. This includes fp16 precision for non-compressed models and int8/int4 for
compressed models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.
Models from the Hugging Face Transformers are converted into a stateful form, optimizing
inference performance and memory usage in long-running text generation tasks by managing past
KV-cache tensors more efficiently internally. This feature is automatically activated for
many supported models, while unsupported ones remain stateless. Learn more about the
:doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
| **Enhanced inference capabilities**
| Advanced features are available, including in-place KV-cache, dynamic quantization, KV-cache
quantization and encapsulation, dynamic beam size configuration, and speculative sampling.
.. dropdown:: Optimized LLM inference
:animate: fade-in-slide-down
:color: secondary
| **Stateful model optimization**
| Models from Hugging Face Transformers are converted into a stateful form, optimizing
inference performance and memory usage in long-running text generation tasks by managing past
KV-cache tensors more efficiently internally. This feature is automatically activated for
many supported models, while unsupported ones remain stateless. Learn more about the
:doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
Includes a Python API for rapid development and C++ for further optimization, offering
better performance than Python-based runtimes.
| **Optimized LLM inference**
| Includes a Python API for rapid development and C++ for further optimization, offering
better performance than Python-based runtimes.

Proceed to guides on:
@@ -28,6 +28,10 @@ make sure to :doc:`install OpenVINO with GenAI <../../get-started/install-openvi
.. dropdown:: Text-to-Image Generation

OpenVINO GenAI introduces the ``openvino_genai.Text2ImagePipeline`` for inference of text-to-image
models such as Stable Diffusion 1.5, 2.1, XL, LCM, Flex, and more.
See the following usage example for reference.

.. tab-set::

.. tab-item:: Python
@@ -579,8 +583,9 @@ compression is done by NNCF at the model export stage. The exported model contai
information necessary for execution, including the tokenizer/detokenizer and the generation
config, ensuring that its results match those generated by Hugging Face.

The `LLMPipeline` is the main object used for decoding and handles all the necessary steps.
You can construct it directly from the folder with the converted model.
The ``LLMPipeline`` is the main object used to set up the model for text generation. You can provide the
converted model to this object, specify the device for inference, and provide additional
parameters.
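
As a rough illustration of the three inputs named above (converted model, device, additional parameters), the sketch below wraps ``LLMPipeline`` in a small helper. The helper name ``generate_text``, its default values, and the early folder check are assumptions made for this example and are not part of the OpenVINO GenAI API; only ``openvino_genai.LLMPipeline`` and its ``generate`` method come from the library.

```python
from pathlib import Path


def generate_text(model_dir, prompt, device="CPU", max_new_tokens=100):
    """Run one prompt through a converted model.

    Illustrative helper only; it is not part of OpenVINO GenAI.
    """
    model_path = Path(model_dir)
    if not model_path.is_dir():
        # Fail early with a clear message before loading the runtime.
        raise FileNotFoundError(f"Converted model folder not found: {model_path}")

    # Imported lazily so the folder check above runs even if the
    # openvino-genai package is not installed.
    import openvino_genai

    # Provide the converted model folder and the inference device.
    pipe = openvino_genai.LLMPipeline(str(model_path), device)

    # Additional generation parameters are passed as keyword arguments.
    return pipe.generate(prompt, max_new_tokens=max_new_tokens)
```

A call could look like ``generate_text("my_converted_model", "The Sun is yellow because")``, where ``my_converted_model`` is a placeholder for the folder produced by the model export step.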


.. tab-set::
@@ -911,7 +916,7 @@ running the following code:
GenAI API
#######################################

The use case described here uses the following OpenVINO GenAI API methods:
The use case described here uses the following OpenVINO GenAI API classes:

* generation_config - defines a configuration class for text generation,
enabling customization of the generation process such as the maximum length of
@@ -921,7 +926,6 @@ The use case described here uses the following OpenVINO GenAI API methods:
text generation, and managing outputs with configurable options.
* streamer_base - an abstract base class for creating streamers.
* tokenizer - the tokenizer class for text encoding and decoding.
* visibility - controls the visibility of the GenAI library.

Learn more from the `GenAI API reference <https://docs.openvino.ai/2024/api/genai_api/api.html>`__.

@@ -7,8 +7,8 @@ Generative Model Preparation



Since generative AI models tend to be big and resource-heavy, it is advisable to store them
locally and optimize for efficient inference. This article will show how to prepare
Since generative AI models tend to be big and resource-heavy, it is advisable to
optimize them for efficient inference. This article will show how to prepare
LLMs for inference with OpenVINO by:

* `Downloading Models from Hugging Face <#download-generative-models-from-hugging-face-hub>`__
Binary file modified docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf
Binary file not shown.