forked from openvinotoolkit/openvino
String Tensors Basic Documentation (openvinotoolkit#22097) (openvinotoolkit#22240)
port: openvinotoolkit#22097 Basic documentation of how to use string tensors authored-by: Sergey Lyalin
1 parent b0fe37f, commit 45f6285
Showing 3 changed files with 218 additions and 6 deletions.
208 changes: 208 additions & 0 deletions
...rticles_en/openvino_workflow/running_inference_with_openvino/string_tensors.rst
@@ -0,0 +1,208 @@
.. {#openvino_docs_OV_UG_string_tensors}

String Tensors
==============


.. meta::
   :description: Learn how to pass text to and retrieve text from an OpenVINO model.

OpenVINO tensors can hold not only numerical data, such as floating-point or integer numbers,
but also textual information, represented as one or more strings.
Such a tensor is called a string tensor. It can be passed as input to, or retrieved as output from, a text-processing model, such as
`tokenizers and detokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__.

While this section describes the basic API for handling string tensors, more practical examples that use both
string tensors and OpenVINO tokenizers can be found in
`GenAI Samples <https://github.com/openvinotoolkit/openvino.genai/tree/master/text_generation/causal_lm/cpp>`__.


Representation
##############

String tensors are supported in the C++ and Python APIs and are represented as instances of the `ov::Tensor`
class with the `element_type` parameter equal to `ov::element::string`. Each element of a string tensor is a string
of arbitrary length, including an empty string, and can be set independently of the other elements in the same tensor.

The underlying data type that represents a string when accessing tensor elements depends on the API used:

- in C++, `std::string` is used
- in Python, NumPy arrays populated with `numpy.str_` or `numpy.bytes_` elements are used, as a read-only copy of the underlying C++ content

The string tensor implementation imposes no limitations on string encoding, because the underlying `std::string` has none.
It can represent all valid UTF-8 characters as well as any other byte sequence that falls outside the UTF-8 encoding standard.
Users should take extra care with arbitrary byte sequences when accessing tensor content as UTF-8 encoded symbols.

Because a string is a more complex object than, for example, a `float` or an `int`,
the memory underlying a string tensor cannot be handled without properly constructing and destroying string objects.
Also, in contrast to numerical data, C++ and Python do not share the same memory layout for strings, so tensor content
is not shared directly between the two APIs. Python provides only a NumPy-compatible copy of the data
allocated and held in the C++ core as an array of `std::string` objects.

A developer must consider these restrictions when writing code that uses string tensors and
avoid treating the content as raw bytes or, in Python, as a view of the data.

Create a String Tensor
######################

The following example creates a small 1D tensor pre-populated with three elements:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: py
         :force:

         import openvino as ov

         # The strings are copied into storage held by the tensor.
         tensor = ov.Tensor(['text', 'more text', 'even more text'])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         #include <vector>
         #include <string>
         #include <openvino/openvino.hpp>

         std::vector<std::string> strings = {"text", "more text", "even more text"};
         // The tensor is created on top of the memory held by `strings`.
         ov::Tensor tensor(ov::element::string, ov::Shape{strings.size()}, &strings[0]);

The example demonstrates that, as with tensors holding numerical data,
a tensor object can be created on top of existing memory in C++ by providing a pointer to a pre-allocated array of elements.
Here, an instance of `std::vector` holds the memory and consists of three `std::string` objects,
so the `tensor` object in the C++ example shares its memory with the `strings` vector.

Note that `ov::Tensor`, when initialized with a pointer, requires pre-initialized memory containing valid `std::string` objects,
each created by calling one of the available `std::string` constructors, even for an empty string. Passing
uninitialized memory to this `ov::Tensor` constructor results in undefined behavior.

In the Python version of the example above, a regular list of strings is used as an initializer.
In contrast to C++, no memory sharing is available here:
the strings from the initializer list are copied to separately allocated storage underneath the `tensor` object.

Besides a plain Python list of strings, an initializer can be one of the supported NumPy arrays initialized
with Unicode or byte strings:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         import numpy as np
         import openvino as ov

         # NumPy array of Unicode strings
         tensor = ov.Tensor(np.array(['text', 'more text', 'even more text']))
         # NumPy array of byte strings
         tensor = ov.Tensor(np.array([b'text', b'more text', b'even more text']))

If `ov::Tensor` is created without initialization strings,
a tensor of the specified shape with empty strings as elements is created:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         tensor = ov.Tensor(dtype=str, shape=[3])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         ov::Tensor tensor(ov::element::string, ov::Shape{3});

`ov::Tensor` allocates and initializes the required number of `std::string` objects under the hood.


Accessing Elements
##################

The following code prints all elements of the 1D string tensor constructed above.
In C++, the same `data` template method that is used for other data types is used,
and to access string data it should be called with the `std::string` type.
In Python, the dedicated `str_data` and `bytes_data` fields are used instead of the `data` field, which is used for numerical data.

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python
         :force:

         data = tensor.str_data  # use tensor.bytes_data instead to access encoded strings as `bytes`
         for i in range(tensor.get_size()):
             print(data[i])

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         #include <iostream>

         std::string* data = tensor.data<std::string>();
         for (size_t i = 0; i < tensor.get_size(); ++i)
             std::cout << data[i] << '\n';

In Python, an object retrieved with `tensor.str_data` (or `tensor.bytes_data`) is a NumPy array
with `numpy.str_` elements (or `numpy.bytes_` elements, respectively). It is a copy of the underlying data from
the `tensor` object and cannot be used to modify the tensor content.
To set new values, the entire tensor content should be set as a list or as a NumPy array, as demonstrated
below.

In contrast to Python, `tensor.data<std::string>()` in C++ returns a pointer to the underlying data
storage, which can be used to modify tensor elements:

.. tab-set::

   .. tab-item:: Python
      :sync: py

      .. code-block:: python

         # Unicode strings:
         tensor.str_data = ['one', 'two', 'three']
         # Do NOT use tensor.str_data[i] to set a new value, it won't update the tensor content

         # Encoded strings:
         tensor.bytes_data = [b'one', b'two', b'three']
         # Do NOT use tensor.bytes_data[i] to set a new value, it won't update the tensor content

   .. tab-item:: C++
      :sync: cpp

      .. code-block:: cpp

         std::string new_content[] = {"one", "two", "three"};
         std::string* data = tensor.data<std::string>();
         for (size_t i = 0; i < tensor.get_size(); ++i)
             data[i] = new_content[i];

When reading or setting string tensor elements in Python, it is recommended to use `str` objects (or `numpy.str_` in a NumPy array)
when the underlying byte sequence is known to form a valid UTF-8 encoded string.
Otherwise, if arbitrary byte sequences are allowed,
not necessarily valid UTF-8, use `bytes` objects (or `numpy.bytes_`, respectively) instead.

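The choice between `str` and `bytes` can be sketched with a small helper that relies only on the Python standard library. The name `choose_initializer` is hypothetical and not part of OpenVINO. The sketch assumes the text is available as a sequence of `bytes` objects and returns `str` objects only when every element decodes as strict UTF-8.

```python
from typing import List, Sequence, Union


def choose_initializer(raw: Sequence[bytes]) -> Union[List[str], List[bytes]]:
    """Return `str` items if every element is valid UTF-8, else the raw `bytes`.

    Hypothetical helper for illustration; it is not an OpenVINO API.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on the first invalid sequence.
        return [item.decode("utf-8") for item in raw]
    except UnicodeDecodeError:
        # At least one element is not valid UTF-8: keep everything as bytes.
        return list(raw)


# Every element is valid UTF-8, so the result is a list of `str`.
valid = choose_initializer([b"text", "caf\u00e9".encode("utf-8")])

# 0xFF can never appear in valid UTF-8, so the result stays a list of `bytes`.
mixed = choose_initializer([b"text", b"\xff\xfe"])
```

Depending on the element type of the returned list, it could then be assigned to `tensor.str_data` or `tensor.bytes_data`, as shown in the example above.
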
Accessing tensor content through `str_data` implicitly applies UTF-8 decoding.
If parts of the byte stream cannot be represented as valid Unicode symbols,
the � replacement symbol (U+FFFD) is used to signal errors in such invalid Unicode streams.

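The effect of such lossy decoding can be reproduced without OpenVINO by using Python's built-in `bytes.decode` with `errors="replace"`. This is only a standalone analogy, assumed to match the behavior described above; it is not the code that OpenVINO runs.

```python
# Standalone illustration using only the Python standard library.
# The bytes are valid UTF-8 for "café ", then a stray 0xFF byte, then "!".
raw = b"caf\xc3\xa9 \xff!"

# Invalid bytes are substituted with the U+FFFD replacement character.
decoded = raw.decode("utf-8", errors="replace")

# Re-encoding does not restore the original bytes: the 0xFF byte is lost.
round_trip = decoded.encode("utf-8")
```

Because the substitution is irreversible, the `bytes` representation is the safer choice whenever the original byte sequence must be preserved.
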
Additional Resources
####################

* Learn about the :doc:`basic steps to integrate inference in your application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`.

* Use `OpenVINO tokenizers <https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/custom_operations/user_ie_extensions/tokenizer/python>`__ to produce models that use string tensors to work with textual information as pre- and post-processing for large language models.

* Check out `GenAI Samples <https://github.com/openvinotoolkit/openvino.genai/tree/master/text_generation/causal_lm/cpp>`__ to see how string tensors are used in real-life applications.