is it time to rerun the benchmarks? #1639
Hi @stas00, first of all, thank you for your issue. Your description raises several points, which I will address here. If you have any questions, we can continue the discussion.

Regarding the figure in the README, you can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2, which provides a detailed description of the versions and reproduction methods.

Regarding the performance improvement of vLLM v0.6, we have also conducted a benchmark, which can be found at https://github.com/sgl-project/sglang/tree/main/benchmark/benchmark_vllm_060. vLLM v0.6 has indeed improved significantly, but there are some limitations.

Meanwhile, there are some issues with the parameters you used when benchmarking SGLang; for your testing scenario, different settings are needed (more on this below).

Overall, there are many aspects to consider with benchmarking. Both the configuration of the benchmark and the configuration of the server itself can significantly impact the results. We need to focus on the overall performance metrics rather than local ones.
Thank you for your reply, Yineng, and thank you for sharing the vllm==0.6.0 vs sglang benchmark. This is great, and it fits right into the OP.

Your front page shows vLLM throughput being much, much worse than SGLang's, while the benchmark you have shared shows that vLLM is only slightly worse, which is a very different situation. That's why I was suggesting that a new visual is needed to show the updated reality. Please note the first results table I shared doesn't use

But let's finish the vLLM vs SGLang discussion, as I wasn't seeking to provoke - I was just hoping for a fair representation of vLLM, which currently appears to be very inferior in that plot you published many months ago.

===============================

If I get the resources, my intention is to support multiple inference backends in our team's inference framework and switch between them depending on which backend performs better in each particular use case, or which is more stable.

Let's move on to how I can make SGLang shine. Thank you for sharing the tips on the flags I should add.

It also sounds like a very low TTFT is one of the main objectives of SGLang, correct? We currently do mainly offline generation, so TTFT doesn't matter, but it will become hugely important later when we are facing the user. That's why I was benchmarking throughput. And I'm excited to use SGLang when very low TTFT is crucial.

One other thing I was puzzling over is how I could do
I had a chance to rerun the benchmark with sglang==0.3.2
sglang==0.3.2 + --disable-radix --enable-torch-compile
the baseline command was:
It also took forever to start with --enable-torch-compile. Thoughts?
@stas00 If you are concerned about offline scenarios and focusing on throughput, you should maximize the batch size and make full use of VRAM, meaning the KV cache usage should be as high as possible. However, in your benchmark command, only 50 requests were made. Running Llama 3.1 8B on two H100 devices with just 50 requests is far from reaching the true throughput limit.
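For illustration, here is a sketch of what a higher-load throughput run could look like (the model path, flag values, and request count are my assumptions, not the exact commands used in this thread):

```bash
# Server: give the KV cache as much VRAM as possible (value is illustrative)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 2 \
  --mem-fraction-static 0.9

# Client: send far more than 50 requests so the batches can actually fill up
python -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 2000
```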
I conducted a simple benchmark with 2000 prompts, which you can use as a reference. The commands are generally consistent with what you provided above. vLLM is version 0.6.3 and SGLang is the latest main version. The startup command for vLLM dropped one of the flags.
This is very useful, @zhyncs
Here are the benchmark results (2000 concurrent requests benchmark):
sglang==0.3.2
sglang==0.3.2 + --disable-radix --enable-torch-compile
vllm==0.6.2
vllm==0.6.3.post1
You can see my results above, with either or both of those flags.
Hi @stas00, I would like to share with you another perspective on how I approach building products.
Running Llama 3.1 8B Instruct with TP 1 on an H100, using the open-source TensorRT-LLM v0.13.0, SGLang's latest commit, and vLLM 0.6.3.post1. The goal is a P99 TTFT of less than 200ms, benchmarking with 1k prompts; TensorRT-LLM can achieve a maximum request rate of 46 before the higher latency fails to meet the requirement. Using the same benchmark configuration for SGLang yielded similar results. In short, TensorRT-LLM indeed has certain advantages for latency-sensitive online scenarios. Users often need to meet specific latency requirements: for example, if their service's latency budget maxes out at 400ms, they typically allocate around a 200ms P99 TTFT for LLM inference serving (just a simple example, not necessarily accurate). Under conditions that satisfy the latency requirements, higher throughput is better.

Note: TensorRT-LLM v0.13.0, SGLang latest main, vLLM v0.6.3.post1
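To make the methodology concrete, here is a rough sketch of such a latency-bounded sweep (the script, flags, and values are my assumptions about a typical setup, not the exact commands behind the numbers above): raise the request rate until the reported P99 TTFT no longer stays under the 200ms budget, and keep the last rate that passed.

```bash
# Sweep the offered load; keep the highest rate whose P99 TTFT stays < 200 ms
for rate in 30 38 46 54; do
  python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 1000 \
    --request-rate "$rate"
done
```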
That's very cool, @zhyncs! Thank you for running the additional benchmarks. Adding --num-scheduler-steps 8 to vLLM should improve its numbers.
@stas00 Thanks for your advice. Cool, it does get better, and the performance is still worse than TensorRT-LLM and SGLang.
The TTFT is indeed much worse, but vLLM's throughput is higher than SGLang's according to the updated numbers you have shared. And circling back to the OP, it's hopefully very clear now that the plots on SGLang's front page need a refresh to bring them up to date with reality ;)
As described above in #1639 (comment), I use this benchmark config to test the online scenario. If we want to test the maximum throughput for the offline case, we should maximize the batch size and make full use of VRAM, meaning the KV cache usage should be as high as possible. Under those conditions, SGLang's throughput is also higher.
Hi SGLang team,
I have just tried SGLang for the first time, and it was probably one of the easiest projects to set up and launch - it literally took me a few minutes to go from zero to serving - awesome!!! Thank you for making it so easy on the user.
I have just benchmarked vllm==0.6.2 vs sglang==0.3.2 on 2 H100s with an 8B Llama 3 model and tp=2, and I get vLLM performing slightly faster than SGLang, yet the benchmark section shows a very different picture. Would it be possible to re-benchmark, and could you tell me whether I am missing some optimization flags needed to get the results you report? I'm just checking the baseline at the moment - so no quantization and such; I will get there a bit later. FWIW, I have just benchmarked vLLM and it had a massive throughput speedup in v0.6.2 over its v0.5 (https://x.com/StasBekman/status/1844886291378470966), which is probably why the benchmark on your site needs a refresher.
Thank you!
Below are the stats and command lines so that it's reproducible by others.
vllm=0.6.2 w/ normal
vllm=0.6.2 w/ --num-scheduler-steps 8
sglang==0.3.2
the servers
vllm:
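(The exact command is not captured above; a representative vLLM 0.6.2 launch for this setup might look like the following - the model path and flags are assumptions.)

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --disable-log-requests
# the second run additionally used: --num-scheduler-steps 8
```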
sglang:
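(Likewise, a representative sglang 0.3.2 launch might look like this - the model path and port are assumptions.)

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 2 \
  --port 30000
```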
the benchmark client
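(The client command is also not captured; something along these lines was presumably used - the script choice and request count are assumptions based on the 50-request figure mentioned above.)

```bash
# Point --backend/--port at whichever server is being measured
python -m sglang.bench_serving \
  --backend sglang \
  --port 30000 \
  --num-prompts 50
```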