
Add GLM-4v Multimodal Model support for SGLang #1641

Open · sixsixcoder wants to merge 10 commits into main

Conversation

@sixsixcoder (Contributor) commented Oct 12, 2024

Motivation

GLM-4v is a widely used multimodal model developed by THUDM. This PR adds GLM-4v support to SGLang; we hope to adapt the model to this excellent fast serving framework.

Modifications

  1. Migrate the chatglm.py file from vLLM.
  2. Add the GLM-4 vision encoder in python/sglang/srt/models/glm4_vision_encoder.py.
  3. Add an optional vision module to ChatGLMModel, making ChatGLMForCausalLM multimodal-capable (see the sketch after this list).
  4. Add the model to the test suite test_generation_models.py.
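
A rough sketch of item 3 (illustrative only: EVA2CLIPModel is the encoder class in the migrated glm4_vision_encoder.py, and keying on a vision_config attribute is an assumption, not necessarily this PR's exact code):

```python
import torch.nn as nn

from sglang.srt.models.glm4_vision_encoder import EVA2CLIPModel


class ChatGLMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.padded_vocab_size, config.hidden_size)
        # ... decoder layers elided ...
        # Build the ViT only when the checkpoint config carries vision
        # settings, so text-only GLM-4 checkpoints keep loading unchanged.
        self.vision = (
            EVA2CLIPModel(config) if hasattr(config, "vision_config") else None
        )
```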

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Oct 12, 2024

Wow, that's cool. Thank you and Zhipu AI for your contribution!

@merrymercy (Contributor) commented Oct 12, 2024

Thanks for the contribution.

  1. Could you fix the lint error? https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md
  2. Can you test the OpenAI vision API? You probably need to update the chat template (see the sketch after this list):

     cls.model = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
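
A sketch of what that override could look like in test_vision_openai_server.py (hypothetical: the model path and the "glm-4v" chat-template name are assumptions, and such a template would first need to be registered in python/sglang/srt/conversation.py):

```python
# Hypothetical GLM-4v variant of the vision test. TestOpenAIVisionServer is
# the class defined in this same test file; the helpers come from sglang's
# test utilities. The chat-template name is an assumption.
from sglang.test.test_utils import (
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    popen_launch_server,
)


class TestGLM4vServer(TestOpenAIVisionServer):
    @classmethod
    def setUpClass(cls):
        cls.model = "THUDM/glm-4v-9b"  # placeholder checkpoint path
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--chat-template", "glm-4v"],  # assumed template name
        )
```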

@sixsixcoder (Contributor, Author)

> Thanks for the contribution.
>
> 1. Could you fix the lint error? https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md
> 2. Can you test the OpenAI vision API? You probably need to update the chat template:
>
>    cls.model = "lmms-lab/llava-onevision-qwen2-0.5b-ov"

When I execute this test file, the following error occurs:

File "/root/anaconda3/envs/sglang/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1741, in setattr
if isinstance(value, Parameter):
File "/root/anaconda3/envs/sglang/lib/python3.9/site-packages/torch/nn/parameter.py", line 10, in instancecheck
isinstance(instance, torch.Tensor) and getattr(instance, '_is_param', False))
RecursionError: maximum recursion depth exceeded

Do you have any solution?

@merrymercy (Contributor)

Can you share your command and more traceback?

I can run it successfully on an H100.

```
>>> python3 test_vision_openai_server.py
...
he iPod[07:36:06 TP0] Decode batch. #running-req: 1, #token: 6424, token usage: 0.00, gen throughput (token/s): 437.29, #queue-req: 0
 securely. The video does not contain any text or subtitles.------------------------------
.
----------------------------------------------------------------------
Ran 5 tests in 61.104s

OK
```

@sixsixcoder (Contributor, Author)

> Can you share your command and more traceback?
>
> I can run it successfully on an H100.
>
> ```
> >>> python3 test_vision_openai_server.py
> ...
> he iPod[07:36:06 TP0] Decode batch. #running-req: 1, #token: 6424, token usage: 0.00, gen throughput (token/s): 437.29, #queue-req: 0
>  securely. The video does not contain any text or subtitles.------------------------------
> .
> ----------------------------------------------------------------------
> Ran 5 tests in 61.104s
>
> OK
> ```

It may be a problem with model registration, which leads to infinite recursion and then an error once GPU memory is exhausted. Where should I modify the model registration? Is my EntryClass written in a standard way? (My understanding of the expected pattern is sketched below.)

```
ValueError: Unsupported architectures: ChatGLMModel. Supported list: ['BaichuanForCausalLM', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPTBigCodeForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'LlamaForClassification', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'OlmoeForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM']
```
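
For reference, the registration pattern I understand sglang to expect (a sketch, not this PR's exact code; it assumes the loader registers each class exported via EntryClass under its __name__ and matches that against the "architectures" field of the checkpoint's config.json):

```python
# Bottom of python/sglang/srt/models/chatglm.py (sketch).
# GLM-4v checkpoints declare "ChatGLMModel" in config.json, so a class with
# exactly that name must be exported via EntryClass, or loading fails with
# the "Unsupported architectures" error above.
class ChatGLMModel(ChatGLMForConditionalGeneration):
    pass


EntryClass = [ChatGLMForConditionalGeneration, ChatGLMModel]
```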

@merrymercy (Contributor)

Your usage seems good.

  1. Can you print the full traceback?
  2. You can search for "ChatGLM" in the whole repo to see how the model/config is used. Is this related: https://github.com/xai-org/sglang-private/blob/b9e6afc62fbc0ed1265ea1df4badd3f592cf77df/python/sglang/srt/hf_transformers_utils.py#L41?
  3. See also https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md
  4. Try running test_vision_openai_server.py on the main branch; it should work. Then incrementally update your code to see which line introduces the problem.

@sixsixcoder (Contributor, Author)

> Your usage seems good.
>
> 1. Can you print the full traceback?
> 2. You can search for "ChatGLM" in the whole repo to see how the model/config is used. Is this related: https://github.com/xai-org/sglang-private/blob/b9e6afc62fbc0ed1265ea1df4badd3f592cf77df/python/sglang/srt/hf_transformers_utils.py#L41?
> 3. See also https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md
> 4. Try running test_vision_openai_server.py on the main branch; it should work. Then incrementally update your code to see which line introduces the problem.

The previous problem has been solved, but when I execute test_vision_openai_server.py, the test fails:

```
[08:01:44 TP0] max_total_num_tokens=1088842, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
INFO:     Started server process [1833860]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:2157 (Press CTRL+C to quit)
INFO:     127.0.0.1:45942 - "GET /get_model_info HTTP/1.1" 200 OK
[08:01:45 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:45946 - "POST /generate HTTP/1.1" 200 OK
[08:01:46] The server is fired up and ready to roll!
INFO:     127.0.0.1:56048 - "GET /v1/models HTTP/1.1" 200 OK
[08:01:49 TP0] Prefill batch. #new-seq: 1, #new-token: 52, #cached-token: 2, cache hit rate: 3.23%, token usage: 0.00, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:56064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
F
======================================================================
FAIL: test_chat_completion (test_vision_openai_server.TestOpenAIVisionServer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/xxxx/sglang-master/test/srt/test_vision_openai_server.py", line 73, in test_chat_completion
    assert "man" in text or "cab" in text, text
AssertionError: The image depicts a serene landscape with a clear blue sky, fluffy white clouds, and a green field with a few trees scattered across it.

----------------------------------------------------------------------
Ran 1 test in 51.664s

FAILED (failures=1)
```

@merrymercy (Contributor) commented Oct 15, 2024

It seems the model did not see the image and started to hallucinate. Did you pass in the images correctly?

@sixsixcoder (Contributor, Author)

> It seems the model did not see the image and started to hallucinate. Did you pass in the images correctly?

Where does sglang receive and process multimodal input?

@merrymercy (Contributor)

You can look at llava for an example:

```python
if need_vision.any():
    pixel_values = [
        image_inputs[i].pixel_values for i in range(bs) if need_vision[i]
    ]
    image_sizes = [
        image_inputs[i].image_sizes for i in range(bs) if need_vision[i]
    ]
    image_offsets = [
        image_inputs[i].image_offsets for i in range(bs) if need_vision[i]
    ]
    ########## Encode Image ########
```

Then run https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py to understand the code path.
You can also see some related PRs: #1551 #1546

zhyncs mentioned this pull request on Oct 17, 2024.
@sixsixcoder (Contributor, Author)

> You can look at llava for an example:
>
> ```python
> if need_vision.any():
>     pixel_values = [
>         image_inputs[i].pixel_values for i in range(bs) if need_vision[i]
>     ]
>     image_sizes = [
>         image_inputs[i].image_sizes for i in range(bs) if need_vision[i]
>     ]
>     image_offsets = [
>         image_inputs[i].image_offsets for i in range(bs) if need_vision[i]
>     ]
>     ########## Encode Image ########
> ```
>
> Then run https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py to understand the code path.
> You can also see some related PRs: #1551 #1546

What is a minimal example of running a multimodal model, i.e. receiving a prompt and an image and then performing inference? Is the sketch below roughly the right shape?
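
(A sketch using the OpenAI-compatible endpoint; the model path, port, and image URL are placeholders, and the server is assumed to have been launched separately.)

```python
# Minimal multimodal round trip against a running sglang server, e.g. one
# started with:
#   python -m sglang.launch_server --model-path <multimodal-model> --port 30000
# Model name, port, and image URL are placeholders.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.png"},
                },
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```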

@merrymercy (Contributor)

@zhyncs (Member) commented Oct 20, 2024

Hi @sixsixcoder, the code for Qwen2-VL has already been merged into the main branch. Its triton-related kernels, completed by @ispobock, are more efficient than the torch implementation and can be reused in GLM-4v. You may consider switching to them in this PR. Thanks!

zhyncs requested a review from ispobock on October 20, 2024 at 05:15.
@merrymercy (Contributor)

@sixsixcoder please rebase and add the test for GLM-4v. Thanks!
