
feat(vLLM): support async generator #746

Open
wants to merge 2 commits into dev

Conversation

@fengyizhu fengyizhu commented Sep 9, 2024

I have provided a draft version of the vLLM integration, with the following key changes:

  1. Protocol changes: aligned with the OpenAI /v1/audio/speech endpoint (a request sketch follows below).
  2. Removed the custom inference part, keeping the inference logic consistent between streaming and non-streaming.
  3. Addressed some potential inconsistencies caused by streaming inference.

These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.
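
For reference, a minimal sketch of what an OpenAI-style /v1/audio/speech request against this API could look like (host, port, and field values are placeholders; the exact fields accepted by examples/api may differ):

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "chattts",            # placeholder model name
        "input": "Hello from ChatTTS!",
        "voice": "default",            # placeholder voice id
        "response_format": "wav",
    },
    stream=True,
)
with open("speech.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        f.write(chunk)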

@github-actions github-actions bot changed the base branch from main to dev September 9, 2024 02:09
Member

@fumiama fumiama left a comment

  • Please note that vLLM cannot and will never replace the original infer code.
  • You should open a separate PR for the changes to examples/api instead of merging them together here.

@fumiama fumiama added the enhancement (New feature or request), algorithm (Algorithm improvements & issues), and performance (Running speed & quality) labels Sep 9, 2024
Member

@fumiama fumiama left a comment

Please do not change the default behavior unless you must. If you have a reason for changing it, please explain.

Note: you shouldn't refer to content written in the vLLM code and port it into main, because the vLLM code is contributed by the community and hasn't been fully checked.

@IrisSally
Contributor

This version is amazing. Testing on a 4090, the streaming API returns the first chunk in 65 ms and the voice timbre stays fixed. It's so fast, so fast, that I suspected my computer was broken.

@ZaymeShaw
Contributor

Thanks for sharing. Can this version keep the same voice timbre as the original version? This was mentioned in issue #640.

@fengyizhu
Author

fengyizhu commented Sep 9, 2024

Thanks for sharing. Can this version keep the same voice timbre as the original version? This was mentioned in issue #640.

It is fully compatible in principle.

@LLongIsland

This version is amazing. Testing on a 4090, the streaming API returns the first chunk in 65 ms and the voice timbre stays fixed. It's so fast, so fast, that I suspected my computer was broken.

On a 4090, for text that takes 3.3 s with compile=True, the main branch's vLLM acceleration takes 1.6 s but cannot adjust the timbre; this PR's version can adjust the timbre, but takes about 2.6 s.

Member

@fumiama fumiama left a comment

Let me say it again: your modification to examples/api concerns a different feature, so you SHOULD move those changes to ANOTHER PR. Thanks for your understanding.

ChatTTS/core.py
# Filter both rows and columns using slicing
yield new_wavs[:][:, keep_cols]
# Hack: check if there are any silent segments; if so, take the last segment; otherwise, wait for another loop.
import librosa
Member

We don't want to introduce librosa. Modifying the original code makes sense, e.g. replace

keep_cols = np.sum(new_wavs != 0, axis=0) > 0

with

# pseudo code without testing, just a hint
keep_cols = np.sum(abs(new_wavs) > 1e-6, axis=0) > 0
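
Expanding that hint into a self-contained sketch (function and variable names here are illustrative, not the actual ChatTTS/core.py code):

import numpy as np

def drop_silent_columns(new_wavs: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # keep a column (sample position) if any row in the batch is above the amplitude threshold
    keep_cols = np.sum(np.abs(new_wavs) > eps, axis=0) > 0
    return new_wavs[:, keep_cols]

# example: the all-silent column of a (batch=2, samples=4) array is dropped
wavs = np.array([[0.1, 0.2, 0.0, -0.1],
                 [0.0, 0.0, 0.0, 0.3]], dtype=np.float32)
print(drop_silent_columns(wavs).shape)  # (2, 3)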

ChatTTS/core.py
),
]

emb = self.embed(input_ids, text_mask)
Member

If you move a line up, don't rename the variable when keeping the original name is fine; renaming makes it harder to see your changes.

Member

@fumiama fumiama left a comment

I noticed that you had some trouble moving your modifications to examples/api and tools/audio into another PR 😂. Here are brief instructions.

  1. Add a new commit in this PR removing your changes to examples/api/main.py and tools/audio/np.py.
  2. Add a new branch based on the dev branch of THIS MAIN REPO.
  3. Apply your modifications to examples/api/main.py and tools/audio/np.py on this new branch.
  4. Create a new PR from this branch to dev.

@fumiama fumiama changed the title feat:support async vllm generator feat(vLLM): support async generator Sep 12, 2024
@fumiama fumiama mentioned this pull request Sep 15, 2024
@niuzheng168
Contributor

niuzheng168 commented Sep 19, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself.
Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.

also @fumiama for this question

@fumiama
Member

fumiama commented Sep 19, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself.

Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.

also @fumiama for this question

This part of the code was contributed by the community, so you can ask @ylzz1997 about that. I'm sorry, but I'm not familiar with vLLM 😂.

@ylzz1997
Contributor

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.

also @fumiama for this question

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

For the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all

@fumiama
Member

fumiama commented Sep 20, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.
also @fumiama for this question

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

For the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all

Thanks for your explanation.

@niuzheng168
Contributor

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.
also @fumiama for this question

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

For the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all

Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here; I am new to Torch and Python, so feel free to leave any comments.
It covers only the LLaMA part:
Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py
Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py

For the issues you listed above:

  • We can create multiple lm_heads.
  • I noticed it runs slowly when we run sampling N times, so I made a lite version of the multi-head sampler.
  • This is already supported in vLLM by setting detokenize=False (see the sketch below).
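
A minimal sketch of that setting, assuming a recent vLLM release (the model path and prompt are placeholders):

from vllm import LLM, SamplingParams

# detokenize=False skips decoding back to text, so only token ids are returned
params = SamplingParams(max_tokens=32, temperature=0.8, detokenize=False)
llm = LLM(model="path/to/model")  # placeholder model path
outputs = llm.generate(["hello"], params)
print(outputs[0].outputs[0].token_ids)  # raw token ids; no tokenizer needed downstream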

One of the main challenges is that vLLM assumes the model output at each step is a single token, which is just an int value. However, a TTS system, whether ChatTTS or FishTTS, generates multi-head tokens in one decoding step. This means the model output is a token list, which breaks that fundamental design. I had to use many if/else statements to keep the whole pipeline working.
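
As a rough illustration of what "multi-head tokens in one decoding step" means (hypothetical sizes; this is not the actual ChatTTS or vLLM code):

import torch
import torch.nn as nn

class MultiCodebookHead(nn.Module):
    # one lm_head per codebook, instead of the single lm_head vLLM expects
    def __init__(self, hidden_size: int = 768, num_codebooks: int = 4, codebook_size: int = 626):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size, bias=False) for _ in range(num_codebooks)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) for the current decoding step
        logits = torch.stack([head(hidden) for head in self.heads], dim=1)  # (batch, num_codebooks, codebook_size)
        return logits.argmax(dim=-1)  # greedy for brevity: a (batch, num_codebooks) token list per step, not a single int

tokens = MultiCodebookHead()(torch.randn(2, 768))
print(tokens.shape)  # torch.Size([2, 4])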

Overall, compared to moving the vLLM code here, implementing the model in vLLM will save effort on features other than core model inference, like sampling, scheduling, continuous batch processing, etc.
I have also replied in the vLLM roadmap thread to see whether vLLM can support the model officially. I believe more and more multi-modal models will use a similar model architecture, especially GPT-4o-like models.

@fumiama
Member

fumiama commented Oct 1, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features like streaming generation and continuous batching.
also @fumiama for this question

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

For the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.
That's all

Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here; I am new to Torch and Python, so feel free to leave any comments. It covers only the LLaMA part: Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py

For the issues you listed above:

  • We can create multiple lm_heads.
  • I noticed it runs slowly when we run sampling N times, so I made a lite version of the multi-head sampler.
  • This is already supported in vLLM by setting detokenize=False.

One of the main challenges is that vLLM assumes the model output at each step is a single token, which is just an int value. However, a TTS system, whether ChatTTS or FishTTS, generates multi-head tokens in one decoding step. This means the model output is a token list, which breaks that fundamental design. I had to use many if/else statements to keep the whole pipeline working.

Overall, compared to moving the vLLM code here, implementing the model in vLLM will save effort on features other than core model inference, like sampling, scheduling, continuous batch processing, etc. I have also replied in the vLLM roadmap thread to see whether vLLM can support the model officially. I believe more and more multi-modal models will use a similar model architecture, especially GPT-4o-like models.

Thanks for your great effort. We welcome you to contribute your code to this repo if you like.
