
Add GLM-4v Multimodal Model support for SGLang #1641

Open · sixsixcoder wants to merge 10 commits into main

Conversation

@sixsixcoder (Contributor) commented Oct 12, 2024

Motivation

GLM-4v is a widely used multimodal model developed by THUDM. This PR adds GLM-4v support to SGLang; we hope to adapt the model to this excellent fast serving framework.

Modifications

  1. Migrate the chatglm.py file from vLLM.
  2. Add the GLM-4 vision encoder in python/sglang/srt/models/glm4_vision_encoder.py.
  3. Add an optional vision module to ChatGLMModel, making ChatGLMForCausalLM multimodal-capable (see the sketch after this list).
  4. Add the model to the test suite test_generation_models.py.
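
A rough sketch of item 3 (illustrative only: EVA2CLIPModel is the encoder class in the migrated glm4_vision_encoder.py, and keying on a vision_config attribute is an assumption, not necessarily this PR's exact code):

```python
import torch.nn as nn

from sglang.srt.models.glm4_vision_encoder import EVA2CLIPModel


class ChatGLMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.padded_vocab_size, config.hidden_size)
        # ... decoder layers elided ...
        # Build the ViT only when the checkpoint config carries vision
        # settings, so text-only GLM-4 checkpoints keep loading unchanged.
        self.vision = (
            EVA2CLIPModel(config) if hasattr(config, "vision_config") else None
        )
```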

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Oct 12, 2024

Wow, that's cool. Thank you and Zhipu AI for your contribution!

@merrymercy (Contributor) commented Oct 12, 2024

Thanks for the contribution.

  1. Could you fix the lint error? https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md
  2. Can you test the OpenAI vision API? You probably need to update the chat template (see the sketch after this list):

     cls.model = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
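
A sketch of what that override could look like in test_vision_openai_server.py (hypothetical: the model path and the "glm-4v" chat-template name are assumptions, and such a template would first need to be registered in python/sglang/srt/conversation.py):

```python
# Hypothetical GLM-4v variant of the vision test. TestOpenAIVisionServer is
# the class defined in this same test file; the helpers come from sglang's
# test utilities. The chat-template name is an assumption.
from sglang.test.test_utils import (
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    popen_launch_server,
)


class TestGLM4vServer(TestOpenAIVisionServer):
    @classmethod
    def setUpClass(cls):
        cls.model = "THUDM/glm-4v-9b"  # placeholder checkpoint path
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--chat-template", "glm-4v"],  # assumed template name
        )
```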

@sixsixcoder (Contributor, Author)

> Thanks for the contribution.
>
> 1. Could you fix the lint error? https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md
> 2. Can you test the OpenAI vision API? You probably need to update the chat template:
>
>    cls.model = "lmms-lab/llava-onevision-qwen2-0.5b-ov"

When I execute this test file, the following error occurs:

File "/root/anaconda3/envs/sglang/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1741, in setattr
if isinstance(value, Parameter):
File "/root/anaconda3/envs/sglang/lib/python3.9/site-packages/torch/nn/parameter.py", line 10, in instancecheck
isinstance(instance, torch.Tensor) and getattr(instance, '_is_param', False))
RecursionError: maximum recursion depth exceeded

Do you have any solution?

@merrymercy (Contributor)

Can you share your command and more traceback?

I can run it successfully on an H100.

```
>>> python3 test_vision_openai_server.py
...
he iPod[07:36:06 TP0] Decode batch. #running-req: 1, #token: 6424, token usage: 0.00, gen throughput (token/s): 437.29, #queue-req: 0
 securely. The video does not contain any text or subtitles.------------------------------
.
----------------------------------------------------------------------
Ran 5 tests in 61.104s

OK
```

@sixsixcoder (Contributor, Author)

> Can you share your command and more traceback?
>
> I can run it successfully on an H100.
>
> ```
> >>> python3 test_vision_openai_server.py
> ...
> he iPod[07:36:06 TP0] Decode batch. #running-req: 1, #token: 6424, token usage: 0.00, gen throughput (token/s): 437.29, #queue-req: 0
>  securely. The video does not contain any text or subtitles.------------------------------
> .
> ----------------------------------------------------------------------
> Ran 5 tests in 61.104s
>
> OK
> ```

It may be a problem with model registration, which leads to infinite recursion and then an error once GPU memory is exhausted. Where should I modify the model registration? Is my EntryClass written in a standard way? (My understanding of the expected pattern is sketched below.)

```
ValueError: Unsupported architectures: ChatGLMModel. Supported list: ['BaichuanForCausalLM', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPTBigCodeForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'LlamaForClassification', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'OlmoeForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM']
```
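
For reference, the registration pattern I understand sglang to expect (a sketch, not this PR's exact code; it assumes the loader registers each class exported via EntryClass under its __name__ and matches that against the "architectures" field of the checkpoint's config.json):

```python
# Bottom of python/sglang/srt/models/chatglm.py (sketch).
# GLM-4v checkpoints declare "ChatGLMModel" in config.json, so a class with
# exactly that name must be exported via EntryClass, or loading fails with
# the "Unsupported architectures" error above.
class ChatGLMModel(ChatGLMForConditionalGeneration):
    pass


EntryClass = [ChatGLMForConditionalGeneration, ChatGLMModel]
```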

@merrymercy (Contributor)

Your usage seems good.

  1. Can you print the full traceback?
  2. You can search for "ChatGLM" in the whole repo to see how the model/config is used. Is this related: https://github.com/xai-org/sglang-private/blob/b9e6afc62fbc0ed1265ea1df4badd3f592cf77df/python/sglang/srt/hf_transformers_utils.py#L41?
  3. See also https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md
  4. Try running test_vision_openai_server.py on the main branch; it should work. Then incrementally update your code to see which line introduces the problem.

@sixsixcoder (Contributor, Author)

> Your usage seems good.
>
> 1. Can you print the full traceback?
> 2. You can search for "ChatGLM" in the whole repo to see how the model/config is used. Is this related: https://github.com/xai-org/sglang-private/blob/b9e6afc62fbc0ed1265ea1df4badd3f592cf77df/python/sglang/srt/hf_transformers_utils.py#L41?
> 3. See also https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md
> 4. Try running test_vision_openai_server.py on the main branch; it should work. Then incrementally update your code to see which line introduces the problem.

The previous problem has been solved, but when I execute test_vision_openai_server.py, the test fails:

```
[08:01:44 TP0] max_total_num_tokens=1088842, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
INFO:     Started server process [1833860]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:2157 (Press CTRL+C to quit)
INFO:     127.0.0.1:45942 - "GET /get_model_info HTTP/1.1" 200 OK
[08:01:45 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:45946 - "POST /generate HTTP/1.1" 200 OK
[08:01:46] The server is fired up and ready to roll!
INFO:     127.0.0.1:56048 - "GET /v1/models HTTP/1.1" 200 OK
[08:01:49 TP0] Prefill batch. #new-seq: 1, #new-token: 52, #cached-token: 2, cache hit rate: 3.23%, token usage: 0.00, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:56064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
F
======================================================================
FAIL: test_chat_completion (test_vision_openai_server.TestOpenAIVisionServer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/xxxx/sglang-master/test/srt/test_vision_openai_server.py", line 73, in test_chat_completion
    assert "man" in text or "cab" in text, text
AssertionError: The image depicts a serene landscape with a clear blue sky, fluffy white clouds, and a green field with a few trees scattered across it.

----------------------------------------------------------------------
Ran 1 test in 51.664s

FAILED (failures=1)
```

@merrymercy (Contributor) commented Oct 15, 2024

It seems the model did not see the image and started to hallucinate. Did you pass in the images correctly?

@sixsixcoder (Contributor, Author)

> It seems the model did not see the image and started to hallucinate. Did you pass in the images correctly?

Where does sglang receive and process multimodal input?

@merrymercy (Contributor)

You can look at llava for an example:

```python
if need_vision.any():
    pixel_values = [
        image_inputs[i].pixel_values for i in range(bs) if need_vision[i]
    ]
    image_sizes = [
        image_inputs[i].image_sizes for i in range(bs) if need_vision[i]
    ]
    image_offsets = [
        image_inputs[i].image_offsets for i in range(bs) if need_vision[i]
    ]
    ########## Encode Image ########
```

Then run https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py to understand the code path.
You can also see some related PRs: #1551 #1546

zhyncs mentioned this pull request on Oct 17, 2024.
@sixsixcoder (Contributor, Author)

> You can look at llava for an example:
>
> ```python
> if need_vision.any():
>     pixel_values = [
>         image_inputs[i].pixel_values for i in range(bs) if need_vision[i]
>     ]
>     image_sizes = [
>         image_inputs[i].image_sizes for i in range(bs) if need_vision[i]
>     ]
>     image_offsets = [
>         image_inputs[i].image_offsets for i in range(bs) if need_vision[i]
>     ]
>     ########## Encode Image ########
> ```
>
> Then run https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py to understand the code path.
> You can also see some related PRs: #1551 #1546

What is a minimal example of running a multimodal model, i.e. receiving a prompt and an image and then performing inference? Is the sketch below roughly the right shape?
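
(A sketch using the OpenAI-compatible endpoint; the model path, port, and image URL are placeholders, and the server is assumed to have been launched separately.)

```python
# Minimal multimodal round trip against a running sglang server, e.g. one
# started with:
#   python -m sglang.launch_server --model-path <multimodal-model> --port 30000
# Model name, port, and image URL are placeholders.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.png"},
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```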

@merrymercy (Contributor)

@zhyncs (Member) commented Oct 20, 2024

Hi @sixsixcoder, the code for Qwen2-VL has already been merged into the main branch. Its triton-related kernels, completed by @ispobock, are more efficient than the torch implementation and can be reused in GLM-4v. You may consider switching to them in this PR. Thanks!

zhyncs requested a review from ispobock on October 20, 2024 at 05:15.
@merrymercy (Contributor)

@sixsixcoder please rebase and add the test for GLM-4v. Thanks!
