String Tensors Basic Documentation (openvinotoolkit#22097) (openvinot…
…oolkit#22240)

port: openvinotoolkit#22097

Basic documentation of how to use string tensors
authored-by: Sergey Lyalin
kblaszczak-intel authored Jan 18, 2024
1 parent b0fe37f commit 45f6285
Showing 3 changed files with 218 additions and 6 deletions.
@@ -14,6 +14,7 @@ Running Inference with OpenVINO™
openvino_docs_OV_UG_ShapeInference
openvino_docs_OV_UG_DynamicShapes
openvino_docs_OV_UG_model_state_intro
openvino_docs_OV_UG_string_tensors
Optimize Inference <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>

.. meta::
@@ -15,12 +15,12 @@ Integrate OpenVINO™ with Your Application


.. meta::
:description: Learn how to implement a typical inference pipeline of OpenVINO™
Runtime in an application.


Following these steps, you can implement a typical OpenVINO™ Runtime inference
pipeline in your application. Before proceeding, make sure you have
:doc:`installed OpenVINO Runtime <openvino_docs_install_guides_overview>` and set environment variables (run ``<INSTALL_DIR>/setupvars.sh`` for Linux or ``setupvars.bat`` for Windows, otherwise, the ``OpenVINO_DIR`` variable won't be configured properly to pass ``find_package`` calls).


@@ -243,8 +243,8 @@ To learn how to change the device configuration, read the :doc:`Query device pro
Step 3. Create an Inference Request
###################################

``ov::InferRequest`` class provides methods for model inference in OpenVINO™ Runtime.
Create an infer request using the following code (see
:doc:`InferRequest detailed documentation <openvino_docs_OV_UG_Infer_request>` for more details):

.. tab-set::
@@ -299,6 +299,7 @@ You can use external memory to create ``ov::Tensor`` and use the ``ov::InferRequ
:language: cpp
:fragment: [part4]

See :doc:`additional materials <openvino_docs_OV_UG_string_tensors>` to learn how to handle textual data as a model input.

Step 5. Start Inference
#######################
@@ -329,7 +330,7 @@ OpenVINO™ Runtime supports inference in either synchronous or asynchronous mod
:fragment: [part5]


This section demonstrates a simple pipeline. To get more information about other ways to perform inference, read the dedicated
:doc:`"Run inference" section <openvino_docs_OV_UG_Infer_request>`.

Step 6. Process the Inference Results
@@ -360,6 +361,7 @@ Go over the output tensors and process the inference results.
:language: cpp
:fragment: [part6]

See :doc:`additional materials <openvino_docs_OV_UG_string_tensors>` to learn how to handle textual data as a model output.

Step 7. Release the allocated objects (only for C)
##################################################
@@ -440,5 +442,6 @@ Additional Resources

* See the :doc:`OpenVINO Samples <openvino_docs_OV_UG_Samples_Overview>` page or the `Open Model Zoo Demos <https://docs.openvino.ai/2023.3/omz_demos.html>`__ page for specific examples of how OpenVINO pipelines are implemented for applications like image classification, text prediction, and many others.
* :doc:`OpenVINO™ Runtime Preprocessing <openvino_docs_OV_UG_Preprocessing_Overview>`
* :doc:`String Tensors <openvino_docs_OV_UG_string_tensors>`
* :doc:`Using Encrypted Models with OpenVINO <openvino_docs_OV_UG_protecting_model_guide>`

@@ -0,0 +1,208 @@
.. {#openvino_docs_OV_UG_string_tensors}
String Tensors
==============


.. meta::
   :description: Learn how to pass text to and retrieve text from an OpenVINO model.

OpenVINO tensors can hold not only numerical data, like floating-point or integer numbers,
but also textual information, represented as one or multiple strings.
Such a tensor is called a string tensor and can be passed as input or retrieved as output of a text-processing model, such as
`tokenizers and detokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__.

This section describes the basic API for handling string tensors. More practical examples that use both
string tensors and OpenVINO tokenizers can be found in
`GenAI Samples <https://github.com/openvinotoolkit/openvino.genai/tree/master/text_generation/causal_lm/cpp>`__.


Representation
##############

String tensors are supported in C++ and Python APIs, represented as instances of the `ov::Tensor`
class with the `element_type` parameter equal to `ov::element::string`. Each element of a string tensor is a string
of arbitrary length, including an empty string, and can be set independently of other elements in the same tensor.

The underlying data type that represents a string when accessing tensor elements depends on the API used:

- in C++, `std::string` is used
- in Python, NumPy arrays populated with `numpy.str_` or `numpy.bytes_` elements are used, as a read-only copy of the underlying C++ content

The string tensor implementation imposes no limitations on string encoding, because the underlying `std::string` has none.
It can represent any valid UTF-8 text as well as any byte sequence that is not valid UTF-8.
Take extra care with arbitrary byte sequences when accessing tensor content as decoded UTF-8 text.
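
Whether a given byte sequence is valid UTF-8 can be checked before deciding how to read it.
The following is a minimal sketch that uses only the Python standard library and does not call OpenVINO;
the helper name `is_valid_utf8` and the sample values are hypothetical and chosen for illustration only.

.. code-block:: python

   def is_valid_utf8(raw: bytes) -> bool:
       """Return True if `raw` decodes as strict UTF-8, False otherwise."""
       try:
           raw.decode("utf-8")  # strict decoding raises on any invalid byte
       except UnicodeDecodeError:
           return False
       return True

   # Hypothetical element contents: ASCII, non-ASCII UTF-8, an empty string,
   # and a sequence where the lead byte 0xC3 is not followed by a continuation byte.
   samples = [b"text", "caf\u00e9".encode("utf-8"), b"", b"\xc3\x28"]
   validity = [is_valid_utf8(s) for s in samples]

A check of this kind helps decide whether content can safely be treated as text or should be kept as raw bytes.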

Because a string is a more complex object than, for example, a `float` or `int` value,
the memory behind a string tensor cannot be used without properly constructing and destroying string objects.
Also, in contrast to numerical data, C++ and Python do not share the same memory layout for strings, so tensor
content cannot be shared directly between the two APIs. Python provides only a NumPy-compatible copy of the data
allocated and held in the C++ core as an array of `std::string` objects.

Keep these restrictions in mind when writing code that uses string tensors:
do not treat the content as raw bytes, and do not treat the Python representation as a view of the data.

Create a String Tensor
######################

The following is an example of how to create a small 1D tensor pre-populated with three elements:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: py
         :force:

         import openvino as ov

         tensor = ov.Tensor(['text', 'more text', 'even more text'])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         #include <vector>
         #include <string>
         #include <openvino/openvino.hpp>

         std::vector<std::string> strings = {"text", "more text", "even more text"};
         ov::Tensor tensor(ov::element::string, ov::Shape{strings.size()}, &strings[0]);

The example demonstrates that, as with numerical tensors,
a tensor object can be created on top of existing memory in C++ by providing a pointer to a pre-allocated array of elements.
Here, a `std::vector` holding three `std::string` objects owns the memory,
so the `tensor` object in the C++ example shares memory with the `strings` vector.
The vector must therefore outlive the tensor and must not be resized while the tensor is in use.

Note that when `ov::Tensor` is initialized with a pointer, the memory must already hold valid `std::string` objects,
each created by one of the available `std::string` constructors, even for empty strings.
Passing uninitialized memory to this `ov::Tensor` constructor is undefined behavior.

In the Python version of the example above, a regular list of strings is used as an initializer.
No memory sharing is available this time, in contrast to C++,
and the strings from the initialization list are copied to a separately allocated storage underneath the `tensor` object.

Besides a plain Python list of strings, an initializer can be one of the supported `numpy` arrays initialized
with Unicode or byte strings:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         import numpy as np

         tensor = ov.Tensor(np.array(['text', 'more text', 'even more text']))
         tensor = ov.Tensor(np.array([b'text', b'more text', b'even more text']))

If `ov::Tensor` is created without providing initialization strings,
a tensor of a specified shape and empty strings as elements is created:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         tensor = ov.Tensor(dtype=str, shape=[3])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         ov::Tensor tensor(ov::element::string, ov::Shape{3});

`ov::Tensor` allocates and initializes the required number of `std::string` objects under the hood.


Accessing Elements
##################

The following code prints all elements of the 1D string tensor constructed above.
In C++, the same `.data` template method that is used for other data types
is called with the `std::string` type to access string data.
In Python, the dedicated `str_data` and `bytes_data` fields are used instead of the `data` field, which is used for numerical data.

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         data = tensor.str_data  # use tensor.bytes_data instead to access encoded strings as `bytes`
         for i in range(tensor.get_size()):
             print(data[i])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         #include <iostream>

         std::string* data = tensor.data<std::string>();
         for (size_t i = 0; i < tensor.get_size(); ++i)
             std::cout << data[i] << '\n';

In the case of Python, an object retrieved with `tensor.str_data` (or `tensor.bytes_data`) is a numpy array
with `numpy.str_` elements (or `numpy.bytes_` correspondingly). It is a copy of underlying data from
the `tensor` object and cannot be used for tensor content modification.
To set new values, the entire tensor content should be set as a list or as a `numpy` array, as demonstrated
below.

In contrast to Python, when using `tensor.data<std::string>()` in C++, a pointer to the underlying data
storage is returned and it can be used for tensor element modification:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python

         # Unicode strings:
         tensor.str_data = ['one', 'two', 'three']
         # Do NOT use tensor.str_data[i] to set a new value, it won't update the tensor content

         # Encoded strings:
         tensor.bytes_data = [b'one', b'two', b'three']
         # Do NOT use tensor.bytes_data[i] to set a new value, it won't update the tensor content

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         std::string new_content[] = {"one", "two", "three"};
         std::string* data = tensor.data<std::string>();
         for (size_t i = 0; i < tensor.get_size(); ++i)
             data[i] = new_content[i];

When reading or setting string tensor elements in Python, use `str` objects (or `numpy.str_` in a NumPy array)
when the underlying byte sequence is known to be a valid UTF-8 encoded string.
If arbitrary byte sequences are allowed, which may not be valid UTF-8,
use `bytes` objects (or `numpy.bytes_` correspondingly) instead.

Accessing tensor content through `str_data` implicitly applies UTF-8 decoding.
If parts of the byte stream cannot be decoded as valid Unicode characters,
the U+FFFD replacement character (�) is substituted to mark the invalid parts of the stream.
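
The difference between the two access modes can be sketched with Python's built-in `bytes.decode`, without calling OpenVINO.
This is only an illustration, under the assumption that the decoding applied by `str_data` behaves like
`errors="replace"`; the helper name `decode_lossy` is hypothetical and not part of the OpenVINO API.

.. code-block:: python

   def decode_lossy(raw: bytes) -> str:
       """Decode UTF-8, substituting U+FFFD for bytes that cannot be decoded."""
       # Assumption: this mirrors the replacement behavior described above.
       return raw.decode("utf-8", errors="replace")

   valid = "caf\u00e9".encode("utf-8")
   invalid = b"abc\xffdef"  # the byte 0xFF never occurs in valid UTF-8

   valid_text = decode_lossy(valid)      # decodes back to the original text
   invalid_text = decode_lossy(invalid)  # the bad byte becomes U+FFFD

   # Re-encoding the lossy text does not restore the original bytes,
   # so content that may be invalid UTF-8 is better read as bytes.
   round_trip_ok = invalid_text.encode("utf-8") == invalid

Because the substitution cannot be reversed, reading such content as `bytes` is the only way to preserve it exactly.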

Additional Resources
####################

* Learn about the :doc:`basic steps to integrate inference in your application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`.

* Use `OpenVINO tokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__ to produce models that use string tensors to handle text in pre- and post-processing for large language models.

* Check out `GenAI Samples <https://github.com/openvinotoolkit/openvino.genai/tree/master/text_generation/causal_lm/cpp>`__ to see how string tensors are used in real-life applications.
