<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2024-10-18T10:49:22-07:00</updated><id>/feed.xml</id><title type="html">vLLM Blog</title><author><name>© 2024. vLLM Team. All rights reserved.</name></author><entry><title type="html">How Speculative Decoding Boosts vLLM Performance by up to 2.8x</title><link href="/2024/10/17/spec-decode.html" rel="alternate" type="text/html" title="How Speculative Decoding Boosts vLLM Performance by up to 2.8x" /><published>2024-10-17T00:00:00-07:00</published><updated>2024-10-17T00:00:00-07:00</updated><id>/2024/10/17/spec-decode</id><content type="html" xml:base="/2024/10/17/spec-decode.html"><![CDATA[<p>Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.</p>
<p><em>This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates to optimize vLLM performance. You can <a href="https://docs.google.com/presentation/d/1wUoLmhfX6B7CfXy3o4m-MdodRL26WvY3/edit#slide=id.p1">view the session slides here</a>. If you prefer watching, you can <a href="https://youtu.be/eVJBFajJRIU?si=9BKjcFkhdOwRcIiy">view the full recording on YouTube</a>. We’d love to see you <a href="https://neuralmagic.com/community-office-hours/?utm_campaign=vLLM%20Office%20Hours&utm_source=vllm-blog">attend future sessions</a> - please register!</em></p>
<h2 id="an-introduction-to-speculative-decoding">An Introduction to Speculative Decoding</h2>
<p>Speculative decoding is a key technique in reducing latency during token generation in large language models (LLMs). This approach leverages smaller models to handle simpler token predictions while utilizing larger models to verify or adjust those predictions. By doing this, speculative decoding accelerates generation without sacrificing accuracy, making it a lossless yet highly efficient method for optimizing LLM performance.</p>
<p>Traditionally, LLMs generate tokens one at a time in an autoregressive manner. For example, given a prompt, the model generates T1, then T2, then T3, and so on, each requiring a separate forward pass. Speculative decoding transforms this process by allowing multiple tokens to be proposed and verified in one forward pass.</p>
<p>Here’s how the process works:</p>
<ol>
<li><strong>Draft Model</strong>: A smaller, more efficient model proposes tokens, such as T1, T2, and T3’.</li>
<li><strong>Target Model Verification</strong>: The larger model verifies these tokens in a single forward pass. It confirms correct tokens (T1, T2) and corrects any incorrect ones (replacing T3’ with T3).</li>
<li><strong>Multiple Tokens in One Pass</strong>: Instead of generating one token per pass, this method processes multiple tokens simultaneously, reducing latency.</li>
</ol>
<p>By using this approach, speculative decoding speeds up token generation, making it an effective method for both small-scale and large-scale language model deployments.</p>
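<p>To make the propose-and-verify loop concrete, here is a minimal sketch in plain Python. It is illustrative pseudologic only, not vLLM’s implementation: <code class="language-plaintext highlighter-rouge">draft_model</code>, <code class="language-plaintext highlighter-rouge">target_model</code>, and the simple greedy acceptance rule are hypothetical stand-ins.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch of one speculative decoding iteration (greedy acceptance).
# draft_model and target_model are hypothetical stand-ins, not vLLM APIs.
def speculative_step(draft_model, target_model, context, k=3):
    # 1. The draft model proposes k candidate tokens autoregressively (cheap).
    proposed = []
    draft_ctx = list(context)
    for _ in range(k):
        token = draft_model.next_token(draft_ctx)
        proposed.append(token)
        draft_ctx.append(token)

    # 2. The target model scores all k positions in a single forward pass
    #    (expensive, but done once instead of k times).
    target_tokens = target_model.next_tokens(context, proposed)

    # 3. Accept the longest prefix where draft and target agree; the first
    #    disagreement is replaced by the target's own token.
    accepted = []
    for drafted, verified in zip(proposed, target_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    return context + accepted
</code></pre></div></div>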
<h2 id="how-speculative-decoding-works-in-vllm">How Speculative Decoding Works in vLLM</h2>
<p>In vLLM, speculative decoding is integrated with the system’s <strong>continuous batching</strong> architecture, where different requests are processed together in a single batch, enabling higher throughput. vLLM uses two key components to implement this:</p>
<ul>
<li><strong>Draft Runner</strong>: This runner is responsible for executing the smaller model to propose candidate tokens.</li>
<li><strong>Target Runner</strong>: The target runner verifies the tokens by running the larger model.</li>
</ul>
<p>vLLM’s system is optimized to handle this process efficiently, allowing speculative decoding to work seamlessly with continuous batching, which increases the overall system performance.</p>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure1.png" width="80%" />
</picture><br />
Diagram illustrating how the draft and target runners interact within the vLLM batching system.
</p>
<p>To implement speculative decoding in vLLM, two crucial components had to be modified:</p>
<ol>
<li><strong>Scheduler</strong>: The scheduler was adjusted to handle multiple token slots within a single forward pass, enabling the simultaneous generation and verification of several tokens.</li>
<li><strong>Memory Manager</strong>: The memory manager now handles the KV cache for both the draft and target models, ensuring smooth processing during speculative decoding.</li>
</ol>
<h2 id="types-of-speculative-decoding-supported-in-vllm">Types of Speculative Decoding Supported in vLLM</h2>
<p>vLLM supports three types of speculative decoding, each tailored to different workloads and performance needs:</p>
<h3 id="draft-model-based-speculative-decoding">Draft Model-Based Speculative Decoding</h3>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure2.png" width="80%" />
</picture>
</p>
<p>This is the most commonly used form of speculative decoding, where a smaller model predicts the next tokens, and a larger model verifies them. A common example would be using a Llama 68M model to predict tokens for a Llama 2 70B model. This approach requires careful selection of the draft model to balance accuracy and overhead.</p>
<p>Choosing the correct draft model is essential for maximizing the efficiency of speculative decoding. The draft model needs to be small enough to avoid creating significant overhead but still accurate enough to provide a meaningful performance boost.</p>
<p>However, selecting the right draft model can be challenging. For example, in models like Llama 3, finding a suitable draft model is difficult due to differences in vocabulary size. Speculative decoding requires that the draft and target models share the same vocabulary, and in some cases, this can limit the use of speculative decoding.</p>
<h3 id="prompt-lookup-decoding">Prompt Lookup Decoding</h3>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure3.png" width="80%" />
</picture>
</p>
<p>Otherwise known as n-gram matching, this approach is effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer. Instead of using a small model to propose tokens, the system speculates based on the information already available in the prompt. This works particularly well when the large model repeats parts of the prompt in its answers.</p>
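<p>As a rough illustration of the n-gram lookup idea (not vLLM’s actual implementation; the function and its parameters below are hypothetical), the proposer can simply search the prompt for the most recent occurrence of the current suffix and propose the tokens that followed it:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative n-gram prompt lookup: propose the tokens that followed the most
# recent occurrence of the current suffix inside the prompt. Not vLLM's code.
def propose_from_prompt(prompt_tokens, generated_tokens, n=3, num_speculative=5):
    suffix = (prompt_tokens + generated_tokens)[-n:]
    # Scan the prompt from the end looking for the same n-gram.
    for start in range(len(prompt_tokens) - n, -1, -1):
        if prompt_tokens[start:start + n] == suffix:
            follow = prompt_tokens[start + n:start + n + num_speculative]
            if follow:
                return follow  # candidates to be verified by the target model
    return []  # no match: fall back to regular decoding for this step
</code></pre></div></div>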
<h3 id="medusaeaglemlpspeculator">Medusa/Eagle/MLPSpeculator</h3>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure4.png" width="60%" />
</picture><br />
<i>Picture from https://github.com/FasterDecoding/Medusa</i>
</p>
<p>In this method, additional layers (or heads) are added to the large model itself, allowing it to predict multiple tokens in a single forward pass. This reduces the need for a separate draft model, instead leveraging the large model’s own capacity for parallel token generation. Though preliminary, this method shows promise for improving efficiency as more optimized kernels are developed.</p>
<h2 id="rejection-sampler-and-speculative-decoding-worker">Rejection Sampler and Speculative Decoding Worker</h2>
<p>vLLM implements a rejection sampler as part of its speculative decoding framework. The sampler helps finalize which tokens are accepted and which are rejected, refining the overall accuracy of the process. Additionally, vLLM uses a speculative decoding worker to manage both the draft model and the target model’s token proposals and verifications, ensuring smooth operations during speculative decoding.</p>
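<p>For intuition, the token-level acceptance rule commonly used in the speculative sampling literature can be sketched as below. This is a simplified, single-token illustration assuming you already have the draft and target probability distributions; vLLM’s actual rejection sampler is batched and more involved.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# Simplified rejection sampling rule: accept a drafted token with probability
# min(1, p_target / q_draft); otherwise resample from the residual distribution.
# This mirrors the standard speculative sampling recipe, not vLLM's exact kernel.
def accept_or_resample(p_target, q_draft, drafted_token):
    ratio = p_target[drafted_token] / q_draft[drafted_token]
    if bool(torch.bernoulli(ratio.clamp(max=1.0))):
        return drafted_token
    residual = (p_target - q_draft).clamp(min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, num_samples=1))
</code></pre></div></div>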
<h2 id="speculative-decoding-performance-insights-speedups-and-trade-offs">Speculative Decoding Performance Insights: Speedups and Trade-offs</h2>
<p>Speculative decoding offers significant performance benefits in <strong>low-QPS (queries per second)</strong> environments. For example, in testing on the ShareGPT dataset, vLLM demonstrated up to a 1.5x speedup in token generation when using draft model-based speculative decoding. Similarly, prompt lookup decoding has shown speedups of up to 2.8x when applied to summarization datasets, such as CNN/DailyMail.</p>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure5.png" width="48%" />
</picture>
<picture>
<img src="/assets/figures/spec-decode/figure6.png" width="48%" />
</picture><br />
Performance comparison showing speculative decoding delivering up to a 1.5x speedup at QPS=1 for Llama3-70B on ShareGPT with 4xH100 using a draft model, and up to a 2.8x speedup at QPS=1 for Llama3-70B on CNN/DailyMail with 4xH100 using n-grams
</p>
<p>However, in <strong>high-QPS environments</strong>, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound, as seen when the number of requests per second increases. In such cases, the overhead of speculative decoding can outweigh its benefits, leading to reduced performance.</p>
<p align="center">
<picture>
<img src="/assets/figures/spec-decode/figure7.png" width="80%" />
</picture><br />
At high QPS, we see a 1.4x slowdown for Llama3-70B on ShareGPT with 4xH100, and a 1.8x slowdown for Llama3-70B on CNN/DailyMail with 4xH100
</p>
<h2 id="on-the-roadmap-dynamic-adjustments-for-better-performance">On the Roadmap: Dynamic Adjustments for Better Performance</h2>
<p>To overcome the limitations of speculative decoding in high-QPS settings, vLLM is working on implementing <strong>dynamic speculative decoding</strong>. Feel free to check out the <a href="https://arxiv.org/abs/2406.14066">paper</a> for more details. This is also one of the active research directions in vLLM. This feature will allow vLLM to adjust the number of speculative tokens based on system load and the accuracy of the draft model.</p>
<p>In the future, the system will be able to automatically modify the degree of speculation at each step, ensuring speculative decoding is always beneficial, regardless of the workload. This will allow users to activate speculative decoding without worrying about whether it will slow down their system.</p>
<h2 id="how-to-use-speculative-decoding-in-vllm">How to Use Speculative Decoding in vLLM</h2>
<p>Setting up speculative decoding in vLLM is straightforward. When launching the vLLM server, you simply need to include the necessary flags to specify the speculative model, the number of tokens, and the tensor parallel size.</p>
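<p>For online serving, the same options are passed as flags when launching the server. As a rough example (the flag names here mirror the offline engine arguments shown below and may differ slightly across vLLM versions):</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve facebook/opt-6.7b \
    --speculative-model facebook/opt-125m \
    --num-speculative-tokens 5
</code></pre></div></div>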
<p>The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">vllm</span> <span class="kn">import</span> <span class="n">LLM</span>
<span class="n">llm</span> <span class="o">=</span> <span class="nc">LLM</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">facebook/opt-6.7b</span><span class="sh">"</span><span class="p">,</span>
<span class="n">speculative_model</span><span class="o">=</span><span class="sh">"</span><span class="s">facebook/opt-125m</span><span class="sh">"</span><span class="p">,</span>
<span class="n">num_speculative_tokens</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="sh">"</span><span class="s">The future of AI is</span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">output</span> <span class="ow">in</span> <span class="n">outputs</span><span class="p">:</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Prompt: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">prompt</span><span class="si">!r}</span><span class="s">, Generated text: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="si">!r}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
<p>The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">vllm</span> <span class="kn">import</span> <span class="n">LLM</span>
<span class="n">llm</span> <span class="o">=</span> <span class="nc">LLM</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">facebook/opt-6.7b</span><span class="sh">"</span><span class="p">,</span>
<span class="n">speculative_model</span><span class="o">=</span><span class="sh">"</span><span class="s">[ngram]</span><span class="sh">"</span><span class="p">,</span>
<span class="n">num_speculative_tokens</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="n">ngram_prompt_lookup_max</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">ngram_prompt_lookup_min</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="sh">"</span><span class="s">The future of AI is</span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">output</span> <span class="ow">in</span> <span class="n">outputs</span><span class="p">:</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Prompt: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">prompt</span><span class="si">!r}</span><span class="s">, Generated text: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="si">!r}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
<p>At times, you may want the draft model to operate with a different tensor parallel size than the target model to improve efficiency. This allows the draft model to use fewer resources and incur less communication overhead, leaving the more resource-intensive computations to the target model. In vLLM, you can configure the draft model to use a tensor parallel size of 1, while the target model uses a size of 4, as demonstrated in the example below.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">vllm</span> <span class="kn">import</span> <span class="n">LLM</span>
<span class="n">llm</span> <span class="o">=</span> <span class="nc">LLM</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">meta-llama/Meta-Llama-3.1-70B-Instruct</span><span class="sh">"</span><span class="p">,</span>
<span class="n">tensor_parallel_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">speculative_model</span><span class="o">=</span><span class="sh">"</span><span class="s">ibm-fms/llama3-70b-accelerator</span><span class="sh">"</span><span class="p">,</span>
<span class="n">speculative_draft_tensor_parallel_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="sh">"</span><span class="s">The future of AI is</span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">output</span> <span class="ow">in</span> <span class="n">outputs</span><span class="p">:</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Prompt: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">prompt</span><span class="si">!r}</span><span class="s">, Generated text: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="si">!r}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
<!-- For draft-model-based decoding, users specify the draft model and the number of tokens to speculate. vLLM also supports **Ngram speculative decoding**, where users only need to specify the number of tokens to speculate. Soon, vLLM will include automatic token speculation, removing the need for manual configuration altogether. -->
<p>Future updates (<a href="https://arxiv.org/abs/2406.14066">paper</a>, <a href="https://github.com/vllm-project/vllm/issues/4565">RFC</a>) will allow vLLM to automatically choose the number of speculative tokens, removing the need for manual configuration and simplifying the process even further.</p>
<p>Follow our docs on <a href="https://docs.vllm.ai/en/v0.6.0/models/spec_decode.html">Speculative Decoding in vLLM</a> to get started. <a href="https://neuralmagic.com/community-office-hours/">Join our bi-weekly office hours to ask questions and give feedback</a>.</p>
<h2 id="conclusion-the-future-of-speculative-decoding-in-vllm">Conclusion: The Future of Speculative Decoding in vLLM</h2>
<p>Speculative decoding in vLLM delivers substantial performance improvements, especially in low-QPS environments. As dynamic adjustments are introduced, it will become a highly effective tool even in high-QPS settings, making it a versatile and essential feature for reducing latency and increasing efficiency in LLM inference.</p>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.]]></summary></entry><entry><title type="html">vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction</title><link href="/2024/09/05/perf-update.html" rel="alternate" type="text/html" title="vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction" /><published>2024-09-05T00:00:00-07:00</published><updated>2024-09-05T00:00:00-07:00</updated><id>/2024/09/05/perf-update</id><content type="html" xml:base="/2024/09/05/perf-update.html"><![CDATA[<p><strong>TL;DR:</strong> vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/llama8B_comparison.png" width="42%" />
</picture><picture>
<img src="/assets/figures/perf-v060/llama70B_comparison.png" width="42%" />
</picture><br />
Performance comparison between vLLM v0.5.3 and v0.6.0 for Llama 8B on 1xH100 and 70B on 4xH100 on ShareGPT dataset (500 prompts). TPOT measured at 32 QPS.
</p>
<p>A month ago, we released our <a href="https://blog.vllm.ai/2024/07/25/lfai-perf.html">performance roadmap</a> committing to performance as our top priority. Today, we released vLLM v0.6.0, with 1.8-2.7x throughput improvements compared to v0.5.3, reaching state-of-the-art performance while keeping rich features and great usability.</p>
<p>We will start by diagnosing the performance bottlenecks in previous versions of vLLM. Then we will describe the solution we implemented and landed in the past month. Finally, we will showcase benchmarks of the latest vLLM release, v0.6.0, against other inference engines.</p>
<h3 id="performance-diagnosis">Performance Diagnosis</h3>
<p>LLM inference requires tight collaboration between CPUs and GPUs. Although the major computation happens in GPUs, CPUs also play an important role in serving and scheduling requests. If CPUs cannot schedule fast enough, GPUs will sit idle to wait for CPUs, which eventually leads to inefficient GPU utilization and hinders inference performance.</p>
<p>One year ago, when vLLM was first released, we mainly optimized for relatively large models on GPUs with limited memory (e.g. Llama 13B on NVIDIA A100-40G). As faster GPUs with larger memory (like NVIDIA H100) become more available and models become more optimized for inference (e.g. with techniques like GQA and quantization), the time spent on other CPU parts of the inference engine becomes a significant bottleneck. Specifically, our profiling results show that for Llama 3 8B running on 1 H100 GPU:</p>
<ul>
<li>The HTTP API server takes 33% of the total execution time.</li>
<li>29% of the total execution time is spent on scheduling, including gathering the LLM results from the last step, scheduling the requests to run for the next step, and preparing these requests as inputs for the LLMs.</li>
<li>Finally, only 38% of the time was spent on the actual GPU execution for LLMs.</li>
</ul>
<p>We found two main issues in vLLM through the benchmark above:</p>
<ul>
<li><strong>High CPU overhead.</strong> The CPU components of vLLM take a surprisingly long time. To make vLLM’s code easy to understand and contribute, we keep most of vLLM in Python and use many Python native data structures (e.g., Python Lists and Dicts). This becomes a significant overhead that causes the scheduling and data preparation time to be high.</li>
<li><strong>Lack of asynchronicity among different components.</strong> In vLLM, many components (e.g., scheduler and output processor) execute in a synchronous manner that blocks GPU execution. This is mainly due to 1) our original assumption that model execution would be much slower than the CPU parts and 2) ease of implementation for many complicated scheduling situations (e.g., scheduling for beam search). However, this issue causes the GPU to wait for the CPU and reduces its utilization.</li>
</ul>
<p>To summarize, the performance bottleneck of vLLM is mainly caused by <em>the CPU overhead that blocks the GPU execution</em>. In vLLM v0.6.0, we introduce a series of optimizations to minimize these overheads.</p>
<h3 id="performance-enhancements">Performance Enhancements</h3>
<p>To make sure we can keep GPUs busy, we made several enhancements:</p>
<h4 id="separating-api-server-and-inference-engine-into-different-processes-pr-6883">Separating API server and inference engine into different processes (<a href="https://github.com/vllm-project/vllm/pull/6883">PR #6883</a>)</h4>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/illustration-api-server.png" width="90%" />
</picture>
Illustration of the serving process architecture before and after. We separated the HTTP serving component from the vLLM engine and connected them with a ZMQ socket. This architecture ensures both CPU-heavy components are isolated from each other.
</p>
<p>Through careful profiling, we found that managing network requests and formatting the response for OpenAI protocol can consume quite a bit of CPU cycles, especially under high load with token streaming enabled. For example, Llama3 8B can generate 1 token every 13 ms under light load. This translates to the frontend needing to stream back 76 objects per second, and this demand further increases with hundreds of concurrent requests. This posed a challenge for the previous version of vLLM, where the API server and the inference engine were running in the same process. As a result, the inference engine and API server coroutines had to compete for Python GIL, leading to CPU contention.</p>
<p>Our solution is to separate out the API server, which handles request validation, tokenization, and JSON formatting, from the engine, which manages request scheduling and model inference. We connect these two Python processes using ZMQ, which has low overhead. By eliminating GIL constraints, both components can operate more efficiently without CPU contention, leading to improved performance.</p>
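<p>As a rough sketch of this process split (illustrative only; the endpoint, message format, and <code class="language-plaintext highlighter-rouge">run_engine_step</code> below are hypothetical placeholders, not vLLM’s actual protocol), two Python processes can exchange work over a ZMQ socket like this:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import zmq

# Rough illustration of an API-server process talking to an engine process over
# a ZMQ socket. Endpoint, message format, and run_engine_step are hypothetical.
ENGINE_ENDPOINT = "ipc:///tmp/engine.sock"

def engine_loop(run_engine_step):
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(ENGINE_ENDPOINT)
    while True:
        request = sock.recv_json()              # only scheduling/inference work here
        sock.send_json(run_engine_step(request))

def api_server_submit(request):
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect(ENGINE_ENDPOINT)
    sock.send_json(request)                     # validation/tokenization stay in this process
    return sock.recv_json()
</code></pre></div></div>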
<p>Even after splitting these two processes, we find there’s still much room for improvement in terms of how we process requests in the engine and how we interact with HTTP requests. We are actively working on further improving the performance of the API server (<a href="https://github.com/vllm-project/vllm/pull/8157">PR #8157</a>), towards making it as efficient as offline batching inference in the near future.</p>
<h4 id="batch-scheduling-multiple-steps-ahead-pr-7000">Batch scheduling multiple steps ahead (<a href="https://github.com/vllm-project/vllm/pull/7000">PR #7000</a>)</h4>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/illustration-multi-step.png" width="90%" />
</picture>
<br />
Illustration of the multistep scheduling method in vLLM. By batching multiple scheduling steps at once, we keep the GPU busier than before, thereby reducing latency and improving throughput.
</p>
<p>We identified that the CPU overhead from vLLM’s scheduler and input preparation was leading to GPU underutilization, resulting in suboptimal throughput. To tackle this, we introduced <em>multi-step scheduling</em>, which performs scheduling and input preparation once and runs the model for <code class="language-plaintext highlighter-rouge">n</code> consecutive steps. By ensuring that the GPU can continue processing between the <code class="language-plaintext highlighter-rouge">n</code> steps without waiting for the CPU, this approach spreads the CPU overhead across multiple steps, significantly reducing GPU idle time and boosting overall performance.</p>
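<p>Conceptually, the change looks like the sketch below (a simplified illustration; <code class="language-plaintext highlighter-rouge">schedule</code>, <code class="language-plaintext highlighter-rouge">prepare_inputs</code>, and <code class="language-plaintext highlighter-rouge">run_model</code> are hypothetical placeholders for the real components):</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Before: CPU scheduling and input preparation run before every model step,
# leaving the GPU idle in between.
def engine_loop_single_step(schedule, prepare_inputs, run_model):
    while True:
        batch = schedule()              # CPU
        inputs = prepare_inputs(batch)  # CPU
        run_model(inputs)               # GPU

# After: schedule and prepare once, then run n consecutive GPU steps so the
# CPU overhead is amortized across all n steps.
def engine_loop_multi_step(schedule, prepare_inputs, run_model, n=10):
    while True:
        batch = schedule()              # CPU, once per n steps
        inputs = prepare_inputs(batch)  # CPU, once per n steps
        for _ in range(n):
            inputs = run_model(inputs)  # GPU stays busy for n steps in a row
</code></pre></div></div>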
<p>This improves the throughput of running Llama 70B models on 4xH100 by 28%.</p>
<h4 id="asynchronous-output-processing-pr-7049-7921-8050">Asynchronous output processing (<a href="https://github.com/vllm-project/vllm/pull/7049">PR #7049</a>, <a href="https://github.com/vllm-project/vllm/pull/7921">#7921</a>, <a href="https://github.com/vllm-project/vllm/pull/8050">#8050</a>)</h4>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/illustration-async-output-processing.png" width="90%" />
</picture>
<br />
Illustration of the asynchronous output processing in vLLM. By overlapping the CPU work for output data structure processing with the GPU computation, we reduced GPU idle time and improved throughput.
</p>
<p>Continuing our efforts to maximize GPU utilization, we also revamped how the model output is processed in vLLM.</p>
<p>Previously, after generating each token, vLLM moved the model output from GPU to CPU, checked the stopping criteria to determine if the request had finished, and then executed the next step. This output processing was often slow, involving de-tokenizing the generated token IDs and performing string matching, with the overhead increasing as batch sizes grew.</p>
<p>To address this inefficiency, we introduced <em>asynchronous output processing</em>, which overlaps the output processing with model execution. Instead of processing the output immediately, vLLM now delays it, performing the processing of the <code class="language-plaintext highlighter-rouge">n</code>-th step output while executing the <code class="language-plaintext highlighter-rouge">n+1</code>-th step. This approach assumes that no request from the <code class="language-plaintext highlighter-rouge">n</code>-th step has met the stopping criteria, incurring a slight overhead of executing one additional step per request. However, the significant boost in GPU utilization more than offsets this cost, leading to improved overall performance.</p>
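<p>A simplified picture of the overlap is sketched below (the function names are hypothetical placeholders; vLLM implements this inside the engine with proper synchronization):</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from concurrent.futures import ThreadPoolExecutor

# Simplified sketch: post-process the output of step n on the CPU while step n+1
# is already running on the GPU. Not vLLM's actual code.
def generate(run_model_step, process_output, num_steps):
    executor = ThreadPoolExecutor(max_workers=1)
    prev_output, pending = None, None
    for _ in range(num_steps):
        if prev_output is not None:
            # Kick off detokenization and stop-criteria checks for step n...
            pending = executor.submit(process_output, prev_output)
        prev_output = run_model_step()   # ...while step n+1 executes on the GPU
        if pending is not None:
            pending.result()             # join before scheduling the next step
    if prev_output is not None:
        process_output(prev_output)      # the final step is processed synchronously
</code></pre></div></div>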
<p>This improves the time-per-output-token of running Llama 70B models on 4xH100 by 8.7%.</p>
<h4 id="miscellaneous-optimization">Miscellaneous optimization</h4>
<p>To further reduce the CPU overhead, we carefully examined the whole codebase and performed the following optimizations:</p>
<ul>
<li>As requests come and finish, Python will allocate new objects and deallocate them again and again. To alleviate this overhead, we create an object cache (<a href="https://github.com/vllm-project/vllm/pull/7162">#7162</a>) to hold these objects, which significantly improves the end-to-end throughput by 24%.</li>
<li>When sending data from CPU to GPU, we use non-blocking operations (<a href="https://github.com/vllm-project/vllm/pull/7172">#7172</a>) as much as possible. The CPU can launch many copy operations while the GPU is copying the data (see the short example after this list).</li>
<li>vLLM supports diverse attention backends and sampling algorithms. For commonly used workloads with simple sampling requests (<a href="https://github.com/vllm-project/vllm/pull/7117">#7117</a>), we introduce a fast code path that skips the complex steps.</li>
</ul>
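<p>As a generic illustration of the non-blocking copy mentioned above (not vLLM’s exact code; the tensor shape is arbitrary), PyTorch lets the CPU continue while a host-to-device copy is in flight, provided the source is in pinned memory:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# Generic illustration of overlapping a host-to-device copy with other CPU work.
# Pinned (page-locked) host memory is required for the copy to be truly async.
host_tensor = torch.empty(4096, 1024, pin_memory=True)
device_tensor = host_tensor.to("cuda", non_blocking=True)
# The CPU can keep preparing the next batch here while the copy is in flight;
# the CUDA stream orders the copy before any kernel that consumes device_tensor.
</code></pre></div></div>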
<p>Over the last month, the vLLM community has devoted significant effort to such optimizations, and we will continue to optimize the codebase to improve efficiency.</p>
<h3 id="performance-benchmarks">Performance Benchmarks</h3>
<p>With the above efforts, we are happy to share that vLLM’s performance has improved significantly compared with last month’s release, reaching state-of-the-art performance according to our benchmarks.</p>
<p><strong>Serving engines.</strong> We benchmark vLLM v0.6.0 against TensorRT-LLM r24.07, SGLang v0.3.0, and lmdeploy v0.6.0a0. For the other engines, we use their default settings. For vLLM, we have turned on multistep scheduling by setting <code class="language-plaintext highlighter-rouge">--num-scheduler-steps 10</code>. We are actively working on turning it on by default.</p>
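<p>For example, an illustrative serving command for the 70B benchmark with multistep scheduling turned on might look like the following (only <code class="language-plaintext highlighter-rouge">--num-scheduler-steps</code> is the setting discussed above; the model and tensor parallel size shown are just one of the benchmarked configurations):</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --num-scheduler-steps 10
</code></pre></div></div>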
<p><strong>Dataset.</strong> We benchmark different serving engines using the following three datasets:</p>
<ul>
<li><strong>ShareGPT</strong>: 500 prompts randomly sampled from ShareGPT dataset with fixed random seed.
<ul>
<li>Average input tokens: 202, average output tokens: 179</li>
</ul>
</li>
<li><strong>Prefill-heavy dataset</strong>: 500 prompts synthetically generated from the sonnet dataset with roughly 462 input tokens and 16 output tokens on average.</li>
<li><strong>Decode-heavy dataset</strong>: 500 prompts synthetically generated from the sonnet dataset with roughly the same 462 input tokens but 256 output tokens on average.</li>
</ul>
<p><strong>Models.</strong> We benchmark on two models: Llama 3 8B and 70B. We did not use the latest Llama 3.1 models as TensorRT-LLM r24.07 with TensorRT LLM backend v0.11 does not support it (<a href="https://github.com/NVIDIA/TensorRT-LLM/issues/2105">issue link</a>).</p>
<p><strong>Hardware.</strong> We use A100 and H100 for benchmarking. They are the two major high-end GPUs used for inference.</p>
<p><strong>Metrics.</strong> We evaluate the following metrics:</p>
<ul>
<li>Time-to-first-token (TTFT, measured in ms). We show the mean and standard error of the mean in the plots.</li>
<li>Time-per-output-token (TPOT, measured in ms). We show the mean and standard error of the mean in the plots.</li>
<li>Throughput (measured in request per second).
<ul>
<li>Throughput is measured under QPS inf (meaning that all requests come at once).</li>
</ul>
</li>
</ul>
<h4 id="benchmarking-results">Benchmarking results</h4>
<p>On the ShareGPT and decode-heavy datasets, vLLM achieves the <strong>highest throughput on H100</strong> when serving Llama 3 models.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/overall_throughput.png" width="90%" />
</picture>
<br />
Across different workloads, vLLM achieves high throughput compared to other frameworks, for Llama 8B and 70B on H100.
</p>
<p>For the rest of performance benchmarks, as well as captured detailed metrics for time-to-first-token (TTFT) and time-per-output-token (TPOT), please refer to the <a href="#appendix">appendix</a> for more data and analysis. You can follow <a href="https://github.com/vllm-project/vllm/issues/8176">this github issue</a> to reproduce our benchmark.</p>
<p><strong>Limitation of current optimizations.</strong> Although our current optimizations give a significant throughput gain, they also come with performance trade-offs, especially from multi-step scheduling:</p>
<ul>
<li><em>Bumpy inter-token latency:</em> In our current implementation of multi-step scheduling, we return the output tokens for multiple steps in a batch. From an end user’s perspective, tokens arrive in bursts rather than one at a time. We are fixing this by streaming the intermediate tokens back to the engine.</li>
<li><em>Higher TTFT at low request rate:</em> A new request can only start execution after the current multi-step execution finishes. Therefore, higher <code class="language-plaintext highlighter-rouge">--num-scheduler-steps</code> will lead to higher TTFT at low request rates. Our experiments focus on the queueing delay at high QPS so this effect is not significant in the results in the appendix.</li>
</ul>
<h3 id="conclusion--future-work">Conclusion & Future Work</h3>
<p>In this post, we discussed the performance enhancements in vLLM that lead to a 1.8-2.7x throughput increase, matching the performance of other inference engines. We remain committed to steadily improving performance, while continuously broadening our model coverage, hardware support, and feature set. For the features discussed in this post, we will continue to harden them for production readiness.</p>
<p>Importantly, we will also focus on improving the core of vLLM to reduce its complexity, lowering the barrier for contribution and unlocking even more performance enhancements.</p>
<h3 id="get-involved">Get Involved</h3>
<p>If you haven’t, we highly recommend you update the vLLM version (see instructions <a href="https://docs.vllm.ai/en/latest/getting_started/installation.html">here</a>) and try it out for yourself! We always love to learn more about your use cases and how we can make vLLM better for you. The vLLM team can be reached via <a href="mailto:[email protected]">[email protected]</a>. vLLM is also a community project; if you are interested in participating and contributing, we welcome you to check out our <a href="https://roadmap.vllm.ai/">roadmap</a> and see <a href="https://github.com/vllm-project/vllm/issues?q=is:open+is:issue+label:%22good+first+issue%22">good first issues</a> to tackle. Stay tuned for more updates by <a href="https://x.com/vllm_project">following us on X</a>.</p>
<p>If you are in the Bay Area, you can meet the vLLM team at the following events: <a href="https://lu.ma/87q3nvnh">vLLM’s sixth meetup with NVIDIA (09/09)</a>, <a href="https://pytorch2024.sched.com/event/1fHmx/vllm-easy-fast-and-cheap-llm-serving-for-everyone-woosuk-kwon-uc-berkeley-xiaoxuan-liu-ucb">PyTorch Conference (09/19)</a>, <a href="https://events.accel.com/cudamode">CUDA MODE IRL meetup (09/21)</a>, and <a href="https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?search.sessiontracks=1719251906298001uzJ2">the first ever vLLM track at Ray Summit (10/01-02)</a>.</p>
<p>Regardless of where you are, don’t forget to sign up for the online <a href="https://neuralmagic.com/community-office-hours/">biweekly vLLM office hours</a>! There are always new topics discussed every two weeks. The next one will be a deep dive into the performance enhancements.</p>
<h3 id="acknowledgment">Acknowledgment</h3>
<p>The blogpost is drafted by the vLLM team at Berkeley. The performance boost comes from collective efforts in the vLLM community: <a href="https://github.com/robertgshaw2-neuralmagic">Robert Shaw</a> from Neural Magic and <a href="https://github.com/njhill">Nick Hill</a>, <a href="https://github.com/joerunde">Joe Runde</a> from IBM led the API server refactoring, <a href="https://github.com/SolitaryThinker">Will Lin</a> from UCSD and <a href="https://github.com/Yard1">Antoni Baum</a>, <a href="https://github.com/comaniac">Cody Yu</a> from Anyscale led the multi-step scheduling effort, <a href="https://github.com/megha95">Megha Agarwal</a> from Databricks and <a href="https://github.com/alexm-neuralmagic">Alexander Matveev</a> from Neural Magic led the async output processing, and many contributors from the vLLM community contributed various optimizations. Together, all these efforts brought a huge performance boost.</p>
<h2 id="appendix">Appendix</h2>
<p>We include the detailed experiment results in this section.</p>
<h4 id="llama-3-8b-on-1xa100">Llama 3 8B on 1xA100</h4>
<p>On Llama 3 8B, vLLM achieves comparable TTFT and TPOT on ShareGPT and decode-heavy dataset as TensorRT-LLM and SGLang. LMDeploy has lower TPOT compared to other engines but has higher TTFT in general. Throughput-wise, TensorRT-LLM has the highest throughput among all engines, and vLLM has the second highest throughput on ShareGPT and decode-heavy dataset.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/A100_8B.png" width="90%" />
</picture>
</p>
<h4 id="llama-3-70b-on-4xa100">Llama 3 70B on 4xA100</h4>
<p>On Llama 3 70B, vLLM, SGLang and TensorRT-LLM have similar TTFT and TPOT (LMDeploy has lower TPOT but higher TTFT). Throughput-wise, vLLM achieves the highest throughput on the ShareGPT dataset and throughput comparable to the other engines on the other datasets.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/A100_70B.png" width="90%" />
</picture>
</p>
<h4 id="llama-3-8b-on-1xh100">Llama 3 8B on 1xH100</h4>
<p>vLLM achieves state-of-the-art throughput on the ShareGPT and decode-heavy datasets, though it has lower throughput on the prefill-heavy dataset.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/H100_8B.png" width="90%" />
</picture>
</p>
<h4 id="llama-3-70b-on-4xh100">Llama 3 70B on 4xH100</h4>
<p>vLLM has the highest throughput on the ShareGPT and decode-heavy datasets (though the throughput is only marginally higher than TensorRT-LLM), but vLLM’s throughput is lower on the prefill-heavy dataset.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf-v060/H100_70B.png" width="90%" />
</picture>
</p>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.]]></summary></entry><entry><title type="html">vLLM’s Open Governance and Performance Roadmap</title><link href="/2024/07/25/lfai-perf.html" rel="alternate" type="text/html" title="vLLM’s Open Governance and Performance Roadmap" /><published>2024-07-25T00:00:00-07:00</published><updated>2024-07-25T00:00:00-07:00</updated><id>/2024/07/25/lfai-perf</id><content type="html" xml:base="/2024/07/25/lfai-perf.html"><![CDATA[<p>We would like to share two updates to the vLLM community.</p>
<h3 id="future-of-vllm-is-open">Future of vLLM is Open</h3>
<p align="center">
<picture>
<img src="/assets/figures/lfai/vllm-lfai-light.png" width="60%" />
</picture>
</p>
<p>We are excited to see vLLM is becoming the standard for LLM inference and serving. In the recent <a href="https://ai.meta.com/blog/meta-llama-3-1/">Meta Llama 3.1 announcement</a>, 8 out of 10 official partners for real time inference run vLLM as the serving engine for the Llama 3.1 models. We have also heard anecdotally that vLLM is being used in many of the AI features in our daily life.</p>
<p>We believe vLLM’s success comes from the power of the strong open source community. vLLM is actively maintained by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, Databricks, IBM, Neural Magic, Roblox, Snowflake, and others. To this extent, we want to ensure the ownership and governance of the project is open and transparent as well.</p>
<p>We are excited to announce that vLLM has <a href="https://lfaidata.foundation/blog/2024/07/17/lf-ai-data-foundation-mid-year-review-significant-growth-in-the-first-half-of-2024/?hss_channel=tw-976478457881247745">started the incubation process into LF AI & Data Foundation</a>. This means no one party will have exclusive control over the future of vLLM. The license and trademark will be irrevocably open. You can trust vLLM is here to stay and will be actively maintained and improved going forward.</p>
<h3 id="performance-is-top-priority">Performance is top priority</h3>
<p>The vLLM contributors are doubling down to ensure vLLM is the fastest and easiest-to-use LLM inference and serving engine.</p>
<p>To recall our roadmap, we focus vLLM on six objectives: wide model coverage, broad hardware support, top performance, production-ready, thriving open source community, and extensible architecture.</p>
<p>In our objective for performance optimization, we have made the following progress to date:</p>
<ul>
<li>Publication of benchmarks
<ul>
<li>Published per-commit performance tracker at <a href="https://perf.vllm.ai">perf.vllm.ai</a> on our public benchmarks. The goal of this is to track performance enhancement and regressions.</li>
<li>Published reproducible benchmark (<a href="https://docs.vllm.ai/en/latest/performance_benchmark/benchmarks.html">docs</a>) of vLLM compared to LMDeploy, TGI, and TensorRT-LLM. The goal is to identify gaps in performance and close them.</li>
</ul>
</li>
<li>Development and integration of highly optimized kernels
<ul>
<li>Integrated FlashAttention2 with PagedAttention, and <a href="https://github.com/flashinfer-ai/flashinfer">FlashInfer</a>. We plan to integrate <a href="https://github.com/vllm-project/vllm/issues/6348">FlashAttention3</a>.</li>
<li>Integrating <a href="https://arxiv.org/abs/2406.06858v1">Flux</a> which overlaps computation and collective communication.</li>
<li>Developed state of the art kernels for quantized inference, including INT8 and FP8 activation quantization (via cutlass) and INT4, INT8, and FP8 weight-only quantization for GPTQ and AWQ (via marlin).</li>
</ul>
</li>
<li>Started several work streams to lower critical overhead
<ul>
<li>We identified that vLLM’s synchronous and blocking scheduler is a key bottleneck for models running on fast GPUs (H100s). We are working on making the scheduler asynchronous and planning steps ahead of time.</li>
<li>We identified vLLM’s OpenAI-compatible API frontend has higher than desired overhead. <a href="https://github.com/vllm-project/vllm/issues/6797">We are working on isolating it from the critical path of scheduler and model inference. </a></li>
<li>We identified that vLLM’s input preparation and output processing scale suboptimally with the data size. Many of the operations can be vectorized and enhanced by moving them off the critical path.</li>
</ul>
</li>
</ul>
<p>We will continue to update the community in vLLM’s progress in closing the performance gap. You can track our overall progress <a href="https://github.com/vllm-project/vllm/issues/6801">here</a>. Please continue to suggest new ideas and contribute with your improvements!</p>
<h3 id="more-resources">More Resources</h3>
<p>We would like to highlight the following RFCs being actively developed:</p>
<ul>
<li><a href="https://github.com/vllm-project/vllm/issues/6556">Single Program Multiple Data (SPMD) Worker Control Plane</a> reduces complexity and enhances tensor parallel performance.</li>
<li><a href="https://github.com/vllm-project/vllm/issues/6378">A Graph Optimization System in vLLM using torch.compile</a> brings in PyTorch native compilation workflow for kernel fusion and compilation.</li>
<li><a href="https://github.com/vllm-project/vllm/issues/5557">Implement disaggregated prefilling via KV cache transfer</a> is critical for workload with long input and lowers variance in inter-token latency.</li>
</ul>
<p>There is a thriving research community building their research projects on top of vLLM. We are deeply humbled by the impressive works and would love to collaborate and integrate. The list of papers includes but is not limited to:</p>
<ul>
<li><a href="https://www.usenix.org/conference/osdi24/presentation/agrawal">Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve</a></li>
<li><a href="https://arxiv.org/abs/2407.00079">Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving</a></li>
<li><a href="https://arxiv.org/abs/2406.03243">Llumnix: Dynamic Scheduling for Large Language Model Serving</a></li>
<li><a href="https://arxiv.org/abs/2310.07240">CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving</a></li>
<li><a href="https://arxiv.org/abs/2405.04437">vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention</a></li>
<li><a href="https://arxiv.org/abs/2404.16283">Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services</a></li>
<li><a href="https://arxiv.org/abs/2312.07104">SGLang: Efficient Execution of Structured Language Model Programs</a></li>
</ul>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[We would like to share two updates to the vLLM community.]]></summary></entry><entry><title type="html">Announcing Llama 3.1 Support in vLLM</title><link href="/2024/07/23/llama31.html" rel="alternate" type="text/html" title="Announcing Llama 3.1 Support in vLLM" /><published>2024-07-23T00:00:00-07:00</published><updated>2024-07-23T00:00:00-07:00</updated><id>/2024/07/23/llama31</id><content type="html" xml:base="/2024/07/23/llama31.html"><![CDATA[<p>Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K tokens), larger model size (up to 405B parameters), and more advanced model capabilities. The vLLM community has added many enhancements to make sure the longer, larger Llamas run smoothly on vLLM, which includes chunked prefill, FP8 quantization, and pipeline parallelism. We will introduce these new enhancements in this blogpost.</p>
<h3 id="introduction">Introduction</h3>
<p>vLLM is a fast, easy-to-use, open-source serving engine for large language models. vLLM has support for more than 40 types of open-source LLMs, a diverse set of hardware platforms (Nvidia GPU, AMD GPU, AWS Inferentia, Google TPU, Intel CPU, GPU, Gaudi, …) and all kinds of inference optimizations. Learn more about vLLM <a href="https://docs.vllm.ai/">here</a>.</p>
<p>For the new Llama 3.1 series, vLLM can run the models with a full 128K context window. In order to support a large context window, vLLM automatically enables <a href="https://www.linkedin.com/posts/joinanyscale_recently-weve-contributed-chunked-prefill-activity-7201277641490849792-lGqZ">chunked prefill</a>. Chunked prefill not only keeps memory usage under control, but also reduces the interruption from long prompt processing for ongoing requests. You can install vLLM by running the following command or using our official docker image (<code class="language-plaintext highlighter-rouge">vllm/vllm-openai</code>):</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-U</span> vllm
</code></pre></div></div>
<p>For the large Llama 405B model, vLLM supports it in several methods:</p>
<ul>
<li><strong>FP8:</strong> vLLM runs the official FP8 quantized model natively on 8xA100 or 8xH100.</li>
<li><strong>Pipeline Parallelism:</strong> vLLM runs the official BF16 version on multiple nodes by placing different layers of the model on different nodes.</li>
<li><strong>Tensor Parallelism:</strong> vLLM can also run by sharding the model across multiple nodes, and multiple GPUs within the nodes.</li>
<li><strong>AMD MI300x or NVIDIA H200:</strong> vLLM can run the model on a single 8xMI300x or 8xH200 machine, where each GPU has 192 GB and 141 GB of memory, respectively.</li>
<li><strong>CPU Offloading:</strong> as the last resort, vLLM can offload some of the weights to CPU while performing the forward pass, allowing you to run the large model at full precision on limited GPU memory.</li>
</ul>
<p>Please note that while vLLM supports all these methods, the performance is still preliminary. The vLLM community is actively working on optimizations and we welcome everyone’s contribution. For example, we are actively exploring more approaches to quantize the model, and to increase the throughput of pipeline parallelism. The performance numbers posted later in the blog are meant as early reference points; we expect the performance to improve significantly over the next few weeks.</p>
<p>Out of all the methods, we recommend FP8 for a single node, and pipeline parallelism for multiple nodes. Let’s discuss them in more detail.</p>
<h3 id="fp8">FP8</h3>
<p>FP8 represents floating point numbers in 8 bits. The current generation of GPUs (H100, MI300x) provides native support for FP8 via specialized tensor cores. Currently, vLLM can run FP8 quantized models for the KV cache, attention, and MLP layers. This reduces memory footprint, increases throughput, lowers latency, and comes with minimal accuracy drops.</p>
<p>Currently, vLLM supports the official Meta Llama 3.1 405B FP8 model quantized via FBGEMM by leveraging per-channel quantization in the MLP layer. In particular, each channel of the up/gate/down projections is quantized and multiplied by a static scaling factor. Combined with skipping quantization for the first and the last layer, and a static upper bound, this approach has minimal impact on the model’s accuracy. You can run the model with the latest vLLM on a single 8xH100 or 8xA100 with the following command:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 <span class="nt">--tensor-parallel-size</span> 8
</code></pre></div></div>
<p>When serving requests with an average input length of 1024 tokens and an average output length of 128 tokens using the FP8 quantized model, the server can sustain 2.82 requests per second. The corresponding serving throughput is 2884.86 input tokens per second and 291.53 output tokens per second, respectively.</p>
<p>We also independently confirmed the accuracy drop of the FP8 checkpoints is minimal. For example, running the GSM8K benchmark using lm-eval-harness with 8 shots and chain-of-thought, we observed the exact match score of 95.38% (+- 0.56% stddev), which is a minimal drop compared to the BF16 official score of 96.8%.</p>
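<p>To make the per-channel scheme described above concrete, here is a minimal sketch of quantizing a weight matrix with one static scale per output channel. This is a simplified illustration, not the FBGEMM kernels vLLM actually uses:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# Simplified per-channel FP8 weight quantization: one static scale per output
# channel, derived from that channel's max magnitude. Illustrative only.
def quantize_per_channel_fp8(weight):
    # weight: [out_channels, in_channels] in bf16/fp32
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale  # the static per-channel scale is reused at inference time

def dequantize(qweight, scale):
    return qweight.to(torch.float32) * scale
</code></pre></div></div>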
<h3 id="pipeline-parallelism">Pipeline Parallelism</h3>
<p>What if you want to run the Llama 3.1 405B model without quantization? You can do it with 16xH100 or 16xA100 GPUs using vLLM’s pipeline parallelism!</p>
<p>Pipeline parallelism splits a model into smaller sets of layers, executing them in parallel on two or more nodes in a pipelined fashion. Unlike tensor parallelism, which requires expensive all-reduce operations, pipeline parallelism partitions the model across layer boundaries, needing only inexpensive point-to-point communication. This is particularly useful when you have multiple nodes that are not necessarily connected via fast interconnects like Infiniband.</p>
<p>vLLM supports combining pipeline and tensor parallelism. For example, with 16 GPUs across 2 nodes, you can use 2-way pipeline parallelism and 8-way tensor parallelism to optimize hardware usage. This configuration maps half the model to each node, partitioning each layer across 8 GPUs using NVLink for all-reduce operations. You can run the Llama 3.1 405B model with the following command:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct <span class="nt">--tensor-parallel-size</span> 8 <span class="nt">--pipeline-parallel-size</span> 2
</code></pre></div></div>
<p>If you have fast interconnects like Infiniband, you can use 16-way tensor parallelism:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct <span class="nt">--tensor-parallel-size</span> 16
</code></pre></div></div>
<p align="center">
<picture>
<img src="/assets/figures/llama31/perf_llama3.png" width="50%" />
</picture>
<br />Serving throughput on 16xH100 GPUs with a synthetic dataset (avg. input len 1024, avg. output len 128).
</p>
<p>We have observed that pipeline parallelism is essential when the nodes are not connected via Infiniband. Compared to 16-way tensor parallelism, combining 2-way pipeline parallelism with 8-way tensor parallelism leads to 6.6x performance improvements. On the other hand, with Infiniband, the performance of both configurations is similar.</p>
<p>To learn more about distributed inference using vLLM, please refer to <a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html">this doc</a>. For CPU offloading, please refer to <a href="https://docs.vllm.ai/en/latest/getting_started/examples/cpu_offload.html">this example</a>.</p>
<p><br /></p>
<hr />
<h3 id="acknowledgements">Acknowledgements</h3>
<p>We would like to thank Meta for the pre-release partnership and letting us test the model. Independently from the release, we thank the following vLLM contributors for the features mentioned in this blogpost: <a href="https://neuralmagic.com/">Neural Magic</a> for FP8 quantization; <a href="https://centml.ai/">CentML</a> and <a href="https://www.snowflake.com/blog/authors/snowflake-ai-research/">Snowflake AI Research</a> for pipeline parallelism; <a href="https://www.anyscale.com/">Anyscale</a> for the chunked prefill feature. The evaluation runs on <a href="https://lambdalabs.com/service/gpu-cloud/1-click-clusters">Lambda’s 1-Click Clusters</a> with InfiniBand, and we thank Lambda for the resource and the smooth cluster setup experience.</p>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K tokens), larger model size (up to 405B parameters), and more advanced model capabilities. The vLLM community has added many enhancements to make sure the longer, larger Llamas run smoothly on vLLM, which includes chunked prefill, FP8 quantization, and pipeline parallelism. We will introduce these new enhancements in this blogpost.]]></summary></entry><entry><title type="html">Notes on vLLM v.s. DeepSpeed-FastGen</title><link href="/2023/11/14/notes-vllm-vs-deepspeed.html" rel="alternate" type="text/html" title="Notes on vLLM v.s. DeepSpeed-FastGen" /><published>2023-11-14T00:00:00-08:00</published><updated>2023-11-14T00:00:00-08:00</updated><id>/2023/11/14/notes-vllm-vs-deepspeed</id><content type="html" xml:base="/2023/11/14/notes-vllm-vs-deepspeed.html"><![CDATA[<hr />
<p><strong>TL;DR:</strong></p>
<ul>
<li>vLLM matches DeepSpeed-FastGen’s speed in common scenarios and surpasses it when handling longer outputs.</li>
<li>DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.</li>
<li>vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.</li>
</ul>
<hr />
<p>The DeepSpeed team recently published <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen">a blog post</a> claiming a 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technology advancements from the open-source community.
In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.</p>
<h3 id="performance-benchmark">Performance Benchmark</h3>
<p>We’ve identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:</p>
<ol>
<li><strong>DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme</strong>, which wastes memory when output lengths are large.</li>
<li>DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives <strong>speedup only when prompt lengths are much greater than output lengths</strong>.</li>
</ol>
<p>As a result, DeepSpeed-FastGen outperforms vLLM only when the workload consistently has long prompts and short outputs.
In other scenarios, vLLM shows superior performance.</p>
<p>We benchmarked the two systems on an NVIDIA A100-80GB GPU with the LLaMA-7B model in the following scenarios:</p>
<h4 id="scenario-1-long-prompt-length-short-output">Scenario 1: Long Prompt Length, Short Output</h4>
<p>Here, DeepSpeed-FastGen’s Dynamic SplitFuse scheduling is expected to shine.
However, the performance gain we observe isn’t as significant as 2x.</p>
<p align="center">
<picture>
<img src="/assets/figures/notes-vllm-vs-deepspeed/s1.png" width="50%" />
</picture>
</p>
<h4 id="scenario-2-other-cases">Scenario 2: Other cases</h4>
<p>In these cases, vLLM is up to <strong>1.8x</strong> faster than DeepSpeed-FastGen.</p>
<p align="center">
<picture>
<img src="/assets/figures/notes-vllm-vs-deepspeed/s2.png" width="50%" />
</picture>
</p>
<h3 id="vllms-future-a-true-community-project">vLLM’s Future: A True Community Project</h3>
<p>We are committed to making vLLM the best open-source project incorporating the community’s best models, optimizations, and hardware. Coming out of UC Berkeley Sky Computing Lab, we are building vLLM truly in open source with the Apache 2.0 license.</p>
<p>The vLLM team prioritizes collaboration, and we strive to keep the codebase high quality and easy to contribute to. We are actively working on system performance, as well as new features like LoRA, speculative decoding, and better quantization support. Additionally, we are collaborating with hardware vendors like AMD, AWS Inferentia, and Intel Habana to bring LLMs to the broadest community.</p>
<p>Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper integration. If you have any questions and suggestions, please feel free to contact us on <a href="https://github.com/vllm-project/vllm">GitHub</a>. We also published the benchmark code <a href="https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py">here</a>.</p>
<h3 id="appendix-feature-comparison">Appendix: Feature Comparison</h3>
<p>DeepSpeed-FastGen currently offers basic functionality, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We do expect the DeepSpeed-FastGen team is eager to catch up, and we welcome creative innovation in the market!</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">vLLM</th>
<th style="text-align: center">DeepSpeed-FastGen</th>
</tr>
</thead>
<tbody>
<tr>
<td>Runtime</td>
<td style="text-align: center">Python/PyTorch</td>
<td style="text-align: center">Python/PyTorch</td>
</tr>
<tr>
<td>Model implementation</td>
<td style="text-align: center">HuggingFace Transformers</td>
<td style="text-align: center">Custom implementation + converter for HF models</td>
</tr>
<tr>
<td>Server frontend</td>
<td style="text-align: center">Simple FastAPI server for demo purposes</td>
<td style="text-align: center">Custom gRPC-based server</td>
</tr>
<tr>
<td>Scheduling</td>
<td style="text-align: center">Continuous batching</td>
<td style="text-align: center">Dynamic SplitFuse</td>
</tr>
<tr>
<td>Attention kernel</td>
<td style="text-align: center">PagedAttention & FlashAttention</td>
<td style="text-align: center">PagedAttention & FlashAttention</td>
</tr>
<tr>
<td>Custom kernels (for LLaMA)</td>
<td style="text-align: center">Attention, RoPE, RMS, SILU</td>
<td style="text-align: center">Attention, RoPE, RMS, SILU, Embedding</td>
</tr>
<tr>
<td>KV Cache allocation</td>
<td style="text-align: center">Near-optimal</td>
<td style="text-align: center">Suboptimal/conservative</td>
</tr>
<tr>
<td>Supported models</td>
<td style="text-align: center">16 different architectures</td>
<td style="text-align: center">LLaMA, Mistral, OPT</td>
</tr>
<tr>
<td>Sampling methods</td>
<td style="text-align: center">Random, parallel, beam search</td>
<td style="text-align: center">Random</td>
</tr>
<tr>
<td>Stop criterion</td>
<td style="text-align: center">Stop strings, stop tokens, EOS</td>
<td style="text-align: center">EOS</td>
</tr>
</tbody>
</table>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[TL;DR:]]></summary></entry><entry><title type="html">vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention</title><link href="/2023/06/20/vllm.html" rel="alternate" type="text/html" title="vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" /><published>2023-06-20T00:00:00-07:00</published><updated>2023-06-20T00:00:00-07:00</updated><id>/2023/06/20/vllm</id><content type="html" xml:base="/2023/06/20/vllm.html"><![CDATA[<p align="center" style="margin-top:-15px">
<a href="https://github.com/vllm-project/vllm"><b>GitHub</b></a> | <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://arxiv.org/pdf/2309.06180.pdf"><b>Paper</b></a>
</p>
<p>LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes <strong>PagedAttention</strong>, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.</p>
<p>vLLM has been developed at UC Berkeley and deployed at <a href="https://chat.lmsys.org">Chatbot Arena and Vicuna Demo</a> for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our <a href="https://github.com/vllm-project/vllm">GitHub repository</a>.</p>
<h3 id="beyond-state-of-the-art-performance">Beyond State-of-the-art Performance</h3>
<p>We compare the throughput of vLLM with <a href="https://huggingface.co/docs/transformers/main_classes/text_generation">HuggingFace Transformers (HF)</a>, the most popular LLM library and <a href="https://github.com/huggingface/text-generation-inference">HuggingFace Text Generation Inference (TGI)</a>, the previous state of the art. We evaluate in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). We sample the requests’ input/output lengths from the ShareGPT dataset. In our experiments, vLLM achieves up to <strong>24x</strong> higher throughput compared to HF and up to <strong>3.5x</strong> higher throughput than TGI.</p>
<p align="center">
<picture>
<img src="/assets/figures/perf_a100_n1_light.png" width="45%" />
</picture><picture>
<img src="/assets/figures/perf_a10g_n1_light.png" width="45%" />
</picture><br />
Serving throughput when each request asks for <em>one output completion</em>. vLLM achieves 14x - 24x higher throughput than HF and 2.2x - 2.5x higher throughput than TGI.
</p>
<p align="center">
<picture>
<img src="/assets/figures/perf_a100_n3_light.png" width="45%" />
</picture><picture>
<img src="/assets/figures/perf_a10g_n3_light.png" width="45%" />
</picture>
<br />Serving throughput when each request asks for <em>three parallel output completions</em>. vLLM achieves 8.5x - 15x higher throughput than HF and 3.3x - 3.5x higher throughput than TGI.
</p>
<h3 id="the-secret-sauce-pagedattention">The Secret Sauce: PagedAttention</h3>
<p>In vLLM, we identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as KV cache. The KV cache is</p>
<ul>
<li><em>Large:</em> Takes up to 1.7GB for a single sequence in LLaMA-13B.</li>
  <li><em>Dynamic:</em> Its size depends on the sequence length, which is highly variable and unpredictable.</li>
</ul>
<p>As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste <strong>60% – 80%</strong> of memory due to fragmentation and over-reservation.</p>
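<p>The “up to 1.7GB” figure follows from a quick back-of-the-envelope calculation using LLaMA-13B’s configuration (40 layers, hidden size 5120, 2-byte fp16 elements) and a full 2048-token sequence:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># KV cache size per token = 2 (key + value) * num_layers * hidden_size * bytes per element
num_layers, hidden_size, dtype_bytes, seq_len = 40, 5120, 2, 2048   # LLaMA-13B, fp16
bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes
print(f"{bytes_per_token / 1e6:.2f} MB per token")                                  # ~0.82 MB
print(f"{bytes_per_token * seq_len / 1e9:.2f} GB for a {seq_len}-token sequence")   # ~1.68 GB
</code></pre></div></div>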
<p>To address this problem, we introduce <strong>PagedAttention</strong>, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.</p>
<p align="center">
<picture>
<img src="/assets/figures/annimation0.gif" width="80%" />
</picture>
<br />
<em>PagedAttention:</em> KV Cache are partitioned into blocks. Blocks do not need to be contiguous in memory space.
</p>
<p>Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous <em>logical blocks</em> of a sequence are mapped to non-contiguous <em>physical blocks</em> via a block table. The physical blocks are allocated on demand as new tokens are generated.</p>
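<p>The following is a toy sketch of this bookkeeping (illustrative only, not vLLM’s block manager): a new physical block is taken from the free pool only when the sequence’s last block fills up, and the physical blocks backing one sequence need not be contiguous:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BLOCK_SIZE = 16  # tokens per KV cache block

class SequenceBlockTable:
    """Maps a sequence's logical blocks to physical blocks, allocated on demand."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.physical_blocks = []       # logical block index -> physical block id

    def on_new_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.free.pop())

seq = SequenceBlockTable(free_blocks=range(1000))
for t in range(40):                     # a 40-token sequence
    seq.on_new_token(t)
print(seq.physical_blocks)              # 3 physical blocks, not necessarily contiguous
</code></pre></div></div>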
<p align="center">
<picture>
<img src="/assets/figures/annimation1.gif" width="100%" />
</picture>
<br />
Example generation process for a request with PagedAttention.
</p>
<p>In PagedAttention, memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with under 4% waste. This boost in memory efficiency proves highly beneficial: it allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase throughput, as shown in the performance results above.</p>
<p>PagedAttention has another key advantage: efficient memory sharing. For example, in <em>parallel sampling</em>, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared between the output sequences.</p>
<p align="center">
<picture>
<img src="/assets/figures/annimation2.gif" width="80%" />
</picture>
<br />
Example of parallel sampling.
</p>
<p>PagedAttention naturally enables memory sharing through its block table. Similar to how processes share physical pages, different sequences in PagedAttention can share the blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention keeps track of the reference counts of the physical blocks and implements the <em>Copy-on-Write</em> mechanism.</p>
<p align="center">
<picture>
<img src="/assets/figures/annimation3.gif" width="100%" />
</picture>
<br />
Example generation process for a request that samples multiple outputs.
</p>
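<p>A minimal sketch of the reference counting and copy-on-write described above (again illustrative, not vLLM’s implementation): forked sequences bump the reference count of the shared blocks, and a block is physically copied only when a sequence writes to a block it does not own exclusively:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>free_blocks = list(range(100))   # pool of physical block ids
ref_count = {}                   # physical block id -> number of sequences referencing it

def share(block_id):
    """A forked sequence starts out referencing the prompt's existing blocks."""
    ref_count[block_id] = ref_count.get(block_id, 0) + 1
    return block_id

def writable(block_id):
    """Return the block a sequence may write to, copying it first if it is shared."""
    if ref_count[block_id] == 1:
        return block_id                  # sole owner: write in place
    ref_count[block_id] -= 1             # copy-on-write: detach from the shared block...
    new_block = free_blocks.pop()        # ...and copy its contents into a fresh block
    ref_count[new_block] = 1
    return new_block

block = share(share(0))                  # two sampled outputs share the prompt's block 0
print(writable(block))                   # first writer gets a private copy (block 99)
print(writable(block))                   # the remaining owner can now write block 0 in place
</code></pre></div></div>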
<p>PagedAttention’s memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to a 2.2x improvement in throughput, making such sampling methods practical in LLM services.</p>
<p>PagedAttention is the core technology behind vLLM, our LLM inference and serving engine that supports a variety of models with high performance and an easy-to-use interface. For more technical details about vLLM and PagedAttention, check out our <a href="https://github.com/vllm-project/vllm">GitHub repo</a> and stay tuned for our paper.</p>
<h3 id="the-silent-hero-behind-lmsys-vicuna-and-chatbot-arena">The Silent Hero Behind LMSYS Vicuna and Chatbot Arena</h3>
<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted a HF Transformers based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM team have worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> in order to support the growing demands (up to 5x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM serving backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>
<p>Since mid-April, the most popular models such as Vicuna, Koala, and LLaMA have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>
<p align="center">
<picture>
<img src="/assets/figures/lmsys_traffic.png" width="100%" />
</picture>
<br />
Requests served by the FastChat-vLLM integration in the Chatbot Arena between April and May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.
</p>
<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>
<h3 id="get-started-with-vllm">Get started with vLLM</h3>
<p>Install vLLM with the following command (check out our <a href="https://vllm.readthedocs.io/en/latest/getting_started/installation.html">installation guide</a> for more):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>vllm
</code></pre></div></div>
<p>vLLM can be used for both offline inference and online serving. To use vLLM for offline inference, you can import vLLM and use the <code class="language-plaintext highlighter-rouge">LLM</code> class in your Python scripts:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">vllm</span> <span class="kn">import</span> <span class="n">LLM</span>
<span class="n">prompts</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">Hello, my name is</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">The capital of France is</span><span class="sh">"</span><span class="p">]</span> <span class="c1"># Sample prompts.
</span><span class="n">llm</span> <span class="o">=</span> <span class="nc">LLM</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">lmsys/vicuna-7b-v1.3</span><span class="sh">"</span><span class="p">)</span> <span class="c1"># Create an LLM.
</span><span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="n">prompts</span><span class="p">)</span> <span class="c1"># Generate texts from the prompts.
</span></code></pre></div></div>
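<p>Each element of <code class="language-plaintext highlighter-rouge">outputs</code> is a <code class="language-plaintext highlighter-rouge">RequestOutput</code> carrying the prompt and its generated completions, so the results can be printed like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for output in outputs:
    # Each prompt gets one completion by default; print it next to its prompt.
    print(output.prompt, "->", output.outputs[0].text)
</code></pre></div></div>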
<p>To use vLLM for online serving, you can start an OpenAI API-compatible server via:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python <span class="nt">-m</span> vllm.entrypoints.openai.api_server <span class="nt">--model</span> lmsys/vicuna-7b-v1.3
</code></pre></div></div>
<p>You can query the server with the same format as OpenAI API:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl http://localhost:8000/v1/completions <span class="se">\</span>
<span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
<span class="nt">-d</span> <span class="s1">'{
"model": "lmsys/vicuna-7b-v1.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'</span>
</code></pre></div></div>
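<p>The same request can also be issued from Python, for example with the <code class="language-plaintext highlighter-rouge">requests</code> library (any HTTP or OpenAI-compatible client works, since the server speaks the OpenAI API):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["text"])
</code></pre></div></div>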
<p>For more ways to use vLLM, please check out the <a href="https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html">quickstart guide</a>.</p>
<p><br /></p>
<hr />
<p><em>Blog written by Woosuk Kwon and Zhuohan Li (UC Berkeley). Special thanks to Hao Zhang for the integration of vLLM and FastChat and for writing the corresponding section. We thank the entire team — Siyuan Zhuang, Ying Sheng, Lianmin Zheng (UC Berkeley), Cody Yu (Independent Researcher), Joey Gonzalez (UC Berkeley), Hao Zhang (UC Berkeley & UCSD), and Ion Stoica (UC Berkeley).</em></p>]]></content><author><name>Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution)</name></author><summary type="html"><![CDATA[GitHub | Documentation | Paper]]></summary></entry></feed>