convert Qwen/Qwen2.5-3B-Instruct tokens failed #301

Open
luhuaei opened this issue Oct 23, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@luhuaei

luhuaei commented Oct 23, 2024

Context

Hello, here are the package versions in my venv.

➭ pip list | grep -e 'openvino'     
openvino                  2024.4.0
openvino-telemetry        2024.1.0
openvino-tokenizers       2024.4.0.0
transformers              4.45.2
optimum                   1.23.2
optimum-intel             1.20.0.dev0+227defe

Here is my command history.

huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir Qwen2.5-3B-Instruct
➭ convert_tokenizer -o onnx Qwen2.5-3B-Instruct
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Traceback (most recent call last):
  File "/qwen-agent-app/venv/bin/convert_tokenizer", line 8, in <module>
    sys.exit(convert_hf_tokenizer())
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/cli.py", line 257, in convert_hf_tokenizer
    converted = convert_tokenizer(
                ^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/convert_tokenizer.py", line 75, in convert_tokenizer
    ov_tokenizers = convert_fast_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/hf_parser.py", line 473, in convert_fast_tokenizer
    ov_tokenizer = pipeline.get_tokenizer_ov_subgraph()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 1129, in get_tokenizer_ov_subgraph
    input_node = step.get_ov_subgraph(input_node)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 612, in get_ov_subgraph
    input_nodes.extend(self.create_string_constant_node(self.merges).outputs())
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py", line 78, in create_string_constant_node
    ps = pack_strings(value)
         ^^^^^^^^^^^^^^^^^^^
  File "/qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/str_pack.py", line 39, in pack_strings
    symbols.write(byte_string)
TypeError: a bytes-like object is required, not 'list'
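
The last frame suggests that pack_strings received a list where it expected a string or bytes. A minimal sketch that reproduces the same TypeError, assuming a simplified packing loop (`pack_strings_sketch` and `try_pack` are hypothetical stand-ins, not the library's actual implementation):

```python
from io import BytesIO


def pack_strings_sketch(strings):
    """Hypothetical stand-in: concatenate UTF-8 encoded strings into one buffer."""
    symbols = BytesIO()
    for string in strings:
        # str items are encoded; anything else is written as-is.
        byte_string = string.encode("utf-8") if isinstance(string, str) else string
        symbols.write(byte_string)
    return symbols.getvalue()


def try_pack(strings):
    """Return the packed bytes, or the error message if packing fails."""
    try:
        return pack_strings_sketch(strings)
    except TypeError as error:
        return str(error)


flat_result = try_pack(["Ġ Ġ"])         # 1D merges: each item is a str
nested_result = try_pack([["Ġ", "Ġ"]])  # 2D merges: each item is a list
print(flat_result)
print(nested_result)
```

In this sketch the 1D input packs without error, while the nested input fails when a list reaches `BytesIO.write`.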

What needs to be done?

Add support for converting the Qwen/Qwen2.5-3B-Instruct tokenizer.

Example Pull Requests

I changed line 75 of /qwen-agent-app/venv/lib/python3.12/site-packages/openvino_tokenizers/tokenizer_pipeline.py:

- ps = pack_strings(value)
+ rrr = []
+ for r in value:
+     rrr += r
+ ps = pack_strings(rrr) 

With this patch, the conversion to an OpenVINO tokenizer succeeds. However, when I load the converted tokenizer in OpenVINO Model Server (OVMS), it reports the following error:

Exception from src/inference/src/dev/plugin.cpp:58:
Check 'this->get_input_partial_shape(added_token_input).is_dynamic() || this->get_input_partial_shape(added_token_input) == this->get_input_partial_shape(added_token_input + 3)' failed at /openvino_tokenizers/src/bpe_tokenizer.cpp:38:
Expected equal number of added tokens and added token indices.

Resources

Contact points

Thanks.

Ticket

No response

@luhuaei luhuaei added the good first issue Good for newcomers label Oct 23, 2024
@luhuaei luhuaei changed the title [Good First Issue]: convert Qwen/Qwen2.5-3B-Instruct tokens failed convert Qwen/Qwen2.5-3B-Instruct tokens failed Oct 23, 2024
@Misaka-sister

I have the same problem:
> convert_tokenizer Qwen/Qwen2.5-3B-Instruct --with-detokenizer -o llm\Qwen2.5-3B-openvino
Loading Huggingface Tokenizer...
Converting Huggingface Tokenizer to OpenVINO...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "intel-llm\Scripts\convert_tokenizer.exe\__main__.py", line 7, in <module>
  File "intel-llm\Lib\site-packages\openvino_tokenizers\cli.py", line 257, in convert_hf_tokenizer
    converted = convert_tokenizer(
                ^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\convert_tokenizer.py", line 75, in convert_tokenizer
    ov_tokenizers = convert_fast_tokenizer(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\hf_parser.py", line 473, in convert_fast_tokenizer
    ov_tokenizer = pipeline.get_tokenizer_ov_subgraph()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 1126, in get_tokenizer_ov_subgraph
    input_node = step.get_ov_subgraph(input_node)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 609, in get_ov_subgraph
    input_nodes.extend(self.create_string_constant_node(self.merges).outputs())
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\tokenizer_pipeline.py", line 75, in create_string_constant_node
    ps = pack_strings(value)
         ^^^^^^^^^^^^^^^^^^^
  File "intel-llm\Lib\site-packages\openvino_tokenizers\str_pack.py", line 39, in pack_strings
    symbols.write(byte_string)
TypeError: a bytes-like object is required, not 'list'

@luhuaei
Author

luhuaei commented Oct 26, 2024

After digging into this issue, I found the root cause: in transformers > 4.45.0, BPE merges are parsed into a 2D array, where each merge is a pair of tokens.

"merges": [
  [
    "Ġ",
    "Ġ"
  ]
]

However, the pack_strings-related functions in openvino_tokenizers currently support only 1D arrays of strings. As a workaround, install transformers==4.43.0 or a lower version with pip.

In transformers <= 4.43.0, merges are parsed into a 1D array of strings, with the two tokens of each merge separated by a space.

"merges": [
    "Ġ Ġ"
]
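
Based on the two formats above, one possible alternative to downgrading is to normalize the merges to the 1D form before they are packed. A minimal sketch, assuming each 2D entry is a pair of token strings and that no token contains a space; `normalize_merges` is a hypothetical helper, not part of openvino_tokenizers:

```python
def normalize_merges(merges):
    """Return merges as a list of "left right" strings (the older 1D format)."""
    normalized = []
    for merge in merges:
        if isinstance(merge, str):
            # Already in the 1D format: keep as-is.
            normalized.append(merge)
        else:
            # 2D format: join the token pair with a single space.
            normalized.append(" ".join(merge))
    return normalized


new_style = [["Ġ", "Ġ"], ["Ġ", "t"]]  # illustrative 2D merges
old_style = ["Ġ Ġ", "Ġ t"]            # the same merges in the 1D format

# For contrast: flattening the pairs doubles the number of entries.
flattened = [token for pair in new_style for token in pair]

print(normalize_merges(new_style))
print(len(flattened), len(normalize_merges(new_style)))
```

Joining keeps one entry per merge rule, whereas flattening the pairs into a single list produces one entry per token. If a token could itself contain a space, the joined form would be ambiguous, so this sketch would not be safe in that case.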

