convert Qwen/Qwen2.5-3B-Instruct tokens failed #301

luhuaei · 2024-10-23T17:26:44Z

Context

Hello. Here are the package versions in my venv.

➭ pip list | grep -e 'openvino'     
openvino                  2024.4.0
openvino-telemetry        2024.1.0
openvino-tokenizers       2024.4.0.0
transformers              4.45.2
optimum                   1.23.2
optimum-intel             1.20.0.dev0+227defe

Here is my shell history.

huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir Qwen2.5-3B-Instruct
➭ convert_tokenizer -o onnx Qwen2.5-3B-Instruct
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Traceback (most recent call last):
  File "/qwen-agent-app/venv/bin/convert_tokenizer", line 8, in <module>
    sys.exit(convert_hf_tokenizer())
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/cli.py", line 257, in convert_hf_tokenizer
    converted = convert_tokenizer(
                ^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/convert_tokenizer.py", line 75, in convert_tokenizer
    ov_tokenizers = convert_fast_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/hf_parser.py", line 473, in convert_fast_tokenizer
    ov_tokenizer = pipeline.get_tokenizer_ov_subgraph()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 1129, in get_tokenizer_ov_subgraph
    input_node = step.get_ov_subgraph(input_node)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 612, in get_ov_subgraph
    input_nodes.extend(self.create_string_constant_node(self.merges).outputs())
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 78, in create_string_constant_node
    ps = pack_strings(value)
         ^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/str_pack.py", line 39, in pack_strings
    symbols.write(byte_string)
TypeError: a bytes-like object is required, not 'list'
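The last frame of the traceback can be reproduced in isolation. The sketch below assumes that pack_strings encodes each str entry and writes it to an in-memory byte buffer; the traceback suggests this, but the actual implementation is not shown here. The variable names are illustrative only.

```python
import io

# One BPE merge in nested form: a [left, right] pair instead of "left right".
merges = [["Ġ", "Ġ"]]

symbols = io.BytesIO()
error_message = None
try:
    for entry in merges:
        # A str entry would be encoded first; a list entry reaches write() unchanged.
        data = entry.encode("utf-8") if isinstance(entry, str) else entry
        symbols.write(data)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the assumption holds, any nested entry in the merges list is enough to trigger the failure, regardless of the model.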

What needs to be done?

Add support for converting the Qwen/Qwen2.5-3B-Instruct model.

Example Pull Requests

I changed line 75 of /qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py:

- ps = pack_strings(value)
+ rrr = []
+ for r in value:
+     rrr += r
+ ps = pack_strings(rrr)

With this patch, the conversion to an OpenVINO tokenizer succeeds. However, when I load the result in OpenVINO Model Server (OVMS), it reports the following error.

Exception from src/inference/src/dev/plugin.cpp:58:
Check 'this->get_input_partial_shape(added_token_input).is_dynamic() || this->get_input_partial_shape(added_token_input) == this->get_input_partial_shape(added_token_input + 3)' failed at /openvino_tokenizers/src/bpe_tokenizer.cpp:38:
Expected equal number of added tokens and added token indices.

Resources

Contribution guide - start here!
Intel DevHub Discord channel - engage in discussions, ask questions and talk to OpenVINO developers

Contact points

Thanks.

Ticket

No response


Misaka-sister · 2024-10-24T05:59:39Z

I have the same problem:
> convert_tokenizer Qwen/Qwen2.5-3B-Instruct --with-detokenizer -o llm\Qwen2.5-3B-openvino
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "intel-llm\Scripts\convert_tokenizer.exe\__main__.py", line 7, in <module>
  File "intel-llm\Lib\site-packages\openvino_tokenizers\cli.py", line 257, in convert_hf_tokenizer
    converted = convert_tokenizer(
                ^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\convert_tokenizer.py", line 75, in convert_tokenizer
    ov_tokenizers = convert_fast_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\hf_parser.py", line 473, in convert_fast_tokenizer
    ov_tokenizer = pipeline.get_tokenizer_ov_subgraph()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 1126, in get_tokenizer_ov_subgraph
    input_node = step.get_ov_subgraph(input_node)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 609, in get_ov_subgraph
    input_nodes.extend(self.create_string_constant_node(self.merges).outputs())
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 75, in create_string_constant_node
    ps = pack_strings(value)
         ^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\str_pack.py", line 39, in pack_strings
    symbols.write(byte_string)
TypeError: a bytes-like object is required, not 'list'

luhuaei · 2024-10-26T04:21:37Z

After analyzing this issue in depth, I found the root cause: in transformers > 4.45.0, BPE token merges are parsed into a 2D array, that is, a list of [left, right] pairs.

"merges": [
  [
    "Ġ",
    "Ġ"
  ]
]

However, in openvino_tokenizers, pack_strings and the related functions currently support only 1D arrays. As a workaround, run pip install transformers==4.43.0 or install a lower version.

In transformers <= 4.43.0, merges are parsed into a 1D array of strings, with the two tokens of each merge separated by a space.

"merges": [
    "Ġ Ġ"
]
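As an alternative to downgrading, the two merge formats shown above could be normalized to the 1D form before packing. This is a minimal sketch, not the openvino_tokenizers implementation; normalize_merges is a hypothetical helper, and where it would be called inside the library is left open.

```python
from typing import List, Sequence, Union

# A merge is either "left right" (older format) or [left, right] (newer format).
MergeEntry = Union[str, Sequence[str]]


def normalize_merges(merges: Sequence[MergeEntry]) -> List[str]:
    """Return merges as 1D "left right" strings, whichever format was given."""
    normalized = []
    for entry in merges:
        if isinstance(entry, str):
            # Already in the 1D form; keep it unchanged.
            normalized.append(entry)
        else:
            # Nested form: join the pair with a single space.
            left, right = entry
            normalized.append(f"{left} {right}")
    return normalized


new_style = [["Ġ", "Ġ"], ["Ġ", "t"]]
old_style = ["Ġ Ġ", "Ġ t"]
print(normalize_merges(new_style))
print(normalize_merges(old_style))
```

Unlike flattening the pairs into one long list, joining each pair keeps one entry per merge. One caveat: if a token itself contains a space, the joined form becomes ambiguous, so this sketch only suits vocabularies without such tokens.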

@Misaka-sister

luhuaei added the good first issue Good for newcomers label Oct 23, 2024

luhuaei changed the title ~~[Good First Issue]: convert Qwen/Qwen2.5-3B-Instruct tokens failed~~ convert Qwen/Qwen2.5-3B-Instruct tokens failed Oct 23, 2024
