Revert "Revert "edit docs " (#518)"
This reverts commit 6370571.
sufubao authored Sep 4, 2024
1 parent 32c646e commit 8d89c94
Showing 15 changed files with 254 additions and 152 deletions.
10 changes: 4 additions & 6 deletions docs/CN/source/index.rst
@@ -77,11 +77,11 @@ Lightllm integrates the strengths of many open-source solutions, including but not limited to FasterTran

.. toctree::
:maxdepth: 1
:caption: lightllm principles

lightllm introduction <lightllm_info/lightllm>

:caption: Lightllm

lightllm/lightllm_intro
lightllm/lightllm_impl

.. toctree::
:maxdepth: 1
:caption: Models
@@ -104,14 +104,12 @@

user/api_param
user/openapi_docs
user/param_class/index


.. toctree::
:maxdepth: 1
:caption: Developer documentation

dev/lightllm_impl
dev/token_attention
dev/router

@@ -1,4 +1,4 @@
lightllm Overview
Lightllm Framework
==========================

The core design of lightllm is multi-process collaboration: each process is responsible for one module, and the processes coordinate with each other through zmq and rpc.
37 changes: 2 additions & 35 deletions docs/CN/source/lightllm_info/lightllm.rst → docs/CN/source/lightllm/lightllm_intro.rst
100755 → 100644
@@ -1,6 +1,6 @@
.. _lightllm:

LightLLM Introduction
=====================

With the popularity of ChatGPT, large language models (LLMs) have received increasing attention. The emergence of such models has greatly improved people's work efficiency.
@@ -94,37 +94,4 @@ The core features of LightLLM are as follows:



Performance Evaluation
----------------------

We compared performance against the current mainstream inference frameworks TGI, NV Triton + FasterTransformer, and vLLM on the ShareGPT_Vicuna_unfiltered dataset. The results are shown in the figure below. It can be seen that LightLLM achieves higher throughput across different model sizes. TGI suffers from severe memory fragmentation and struggles to reach high throughput. vLLM introduces PagedAttention, but because its overall implementation details favor small-model inference, its concurrent performance on large models is not ideal (with the default configuration). In contrast, LightLLM maintains robust performance across model sizes and achieves roughly a 2-3x improvement over TGI and vLLM on large models (LLaMA-65B).

.. image:: ../assets/lightllm/Performance.png
:alt: Efficient_Router1
:align: center


TGI compatibility and ablation analysis: To further validate the effectiveness of TokenAttention and the Router, we also integrated these features into TGI to address its memory fragmentation problem, as shown in the figure below (left). It can be seen that after introducing TokenAttention and the Router, performance improves by more than 4x compared to the original TGI.

Improvement with mixed long and short requests: From the figure below (left), it can be seen that introducing the Router did not bring a more obvious performance gain, because the variation in question length in the ShareGPT_Vicuna_unfiltered dataset is not significant. For this reason, we constructed a set of requests with large differences in length and verified the performance of the Efficient Router. The results are shown below (right). It can be seen that our Efficient Router makes better use of GPU resources and brings nearly a 50% performance improvement for requests with large differences in question length.


.. image:: ../assets/lightllm/Performance2.png
:alt: Efficient_Router1
:align: center


The left figure shows the compatibility of LightLLM with TGI and the ablation analysis; the right figure shows the improvement our Efficient Router brings on mixed long and short requests.


Future Work
-----------

* Support for more models
* Enhanced router scheduling algorithms
* Support for high-performance int8 and int4 weight-only quantization, and kv cache quantization
* Support for fully quantized models
* Mixed-precision models
* Sparsification

LightLLM is committed to enabling more people to participate, so that various LLM deployment and inference solutions can be explored flexibly and efficiently. It also serves as a reference for hardware vendors to advance the field. We hope everyone will give it more stars, fork the project, and contribute. We believe that more technologies and solutions (such as TensorRT) will emerge in the future, continuously reducing deployment costs and making AGI more accessible to ordinary households.
4 changes: 0 additions & 4 deletions docs/CN/source/user/api_param.rst
@@ -65,10 +65,6 @@
$ "multimodal_params":{}
$ }'
.. tip::

    For the contents of parameters, see :ref:`parameters`; for the contents of multimodal_params, see :ref:`MultimodalParams`. A rough Python sketch of such a request follows.
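As a rough illustration (not part of this commit), the hypothetical Python sketch below sends the same kind of request; the endpoint path, port, and the specific sampling-parameter names are assumptions, so check the rendered api_param page for the authoritative fields.

.. code-block:: python

    # Hypothetical sketch of calling the lightllm HTTP server from Python.
    # The endpoint path, port, and sampling-parameter names are assumptions,
    # not taken from this diff; see the api_param documentation.
    import requests

    payload = {
        "inputs": "What is AI?",
        "parameters": {            # sampling parameters, see :ref:`parameters`
            "do_sample": False,
            "max_new_tokens": 128,
        },
        "multimodal_params": {},   # image inputs etc., see :ref:`MultimodalParams`
    }

    resp = requests.post("http://localhost:8080/generate", json=payload)
    resp.raise_for_status()
    print(resp.json())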

**Example output**:

9 changes: 0 additions & 9 deletions docs/CN/source/user/param_class/index.rst

This file was deleted.

12 changes: 0 additions & 12 deletions docs/CN/source/user/param_class/multimodal_params.rst

This file was deleted.

7 changes: 0 additions & 7 deletions docs/CN/source/user/param_class/sampling_params.rst

This file was deleted.

9 changes: 3 additions & 6 deletions docs/EN/source/index.rst
@@ -72,13 +72,12 @@ Docs List
getting_started/installation
getting_started/quickstart


.. toctree::
:maxdepth: 1
:caption: lightllm Overview

lightllm_info/lightllm
:caption: Lightllm

lightllm/lightllm_intro
lightllm/lightllm_impl

.. toctree::
:maxdepth: 1
@@ -102,14 +101,12 @@

user/api_param
user/openapi_docs
user/param_class/index


.. toctree::
:maxdepth: 1
:caption: development docs

dev/lightllm_impl
dev/token_attention
dev/router

@@ -1,4 +1,4 @@
lightllm Overview
Lightllm Architecture
==========================

The core design of lightllm is multi-process collaboration: each process is responsible for one module, and the processes coordinate with each other through zmq and rpc.
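As a rough, hypothetical sketch of this kind of multi-process messaging (the real lightllm module layout and message schema are not shown in this commit), a minimal pyzmq request/reply pair between two processes could look like this:

.. code-block:: python

    # Toy illustration of two processes cooperating over zmq.
    # Only the request/reply pattern is shown; process names and the
    # message format are made up for this sketch.
    import multiprocessing
    import zmq


    def worker_process(endpoint: str) -> None:
        """Toy worker: receive one JSON request and send back a reply."""
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.REP)
        sock.bind(endpoint)
        request = sock.recv_json()          # e.g. {"prompt": "hello"}
        sock.send_json({"echo": request})   # pretend we scheduled/ran inference
        sock.close()


    if __name__ == "__main__":
        endpoint = "tcp://127.0.0.1:5555"
        worker = multiprocessing.Process(target=worker_process, args=(endpoint,))
        worker.start()

        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.REQ)
        sock.connect(endpoint)
        sock.send_json({"prompt": "hello"})
        print(sock.recv_json())             # {'echo': {'prompt': 'hello'}}
        sock.close()
        worker.join()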
36 changes: 1 addition & 35 deletions docs/EN/source/lightllm_info/lightllm.rst → docs/EN/source/lightllm/lightllm_intro.rst
100755 → 100644
@@ -1,6 +1,6 @@
.. _lightllm:

LightLLM introduction
LightLLM Overview
===========================

With the popularity of ChatGPT, large language models (LLMs) have received increasing attention. The emergence of such models has greatly improved people's work efficiency. However, the key to further widespread adoption lies in how to deploy models with billions of parameters at low cost and high throughput. To improve the throughput of large model services and enable more interested researchers to quickly get involved, a lightweight LLM inference service framework called LightLLM has emerged. LightLLM introduces a more fine-grained kv cache management algorithm called TokenAttention and designs an Efficient Router scheduling implementation that works efficiently with TokenAttention. Through the interaction of TokenAttention and Efficient Router, LightLLM achieves higher throughput than vLLM and Text Generation Inference in most scenarios, with performance improvements of around 4 times in some cases. LightLLM is flexible, user-friendly, and efficient. If you are interested, click the link below to try it out.
@@ -61,10 +61,6 @@ Lightllm

Therefore, to address these issues, we have developed an LLM deployment framework called LightLLM, written in pure Python. It enables researchers to easily deploy and customize lightweight models locally, allowing for rapid expansion of different models and integration of various excellent open-source features. The core features of LightLLM are as follows:

* Tri-process asynchronous collaboration: tokenization, model inference, and detokenization run asynchronously, greatly improving GPU utilization.
* :ref:`TokenAttention`: a token-wise KV cache memory management mechanism that achieves zero memory waste during inference.
* :ref:`Efficient_Router`: works with Token Attention to carefully manage the GPU memory of each token, thereby optimizing system throughput.

* Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
* :ref:`TokenAttention`: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the sketch after this list).
* :ref:`Efficient_Router`: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
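To make the token-wise idea more concrete, the toy sketch below tracks per-token KV cache slots with a simple free list. It is only an illustration of the bookkeeping, not lightllm's actual TokenAttention implementation; the class and method names are made up.

.. code-block:: python

    # Hypothetical illustration of token-granular KV cache slot management.
    # The real TokenAttention manages GPU tensors; this toy version only
    # tracks which slot indices are free or owned by which request.
    class TokenSlotAllocator:
        def __init__(self, total_token_slots: int) -> None:
            self.free_slots = list(range(total_token_slots))
            self.owned: dict[str, list[int]] = {}   # request id -> slot indices

        def alloc_token(self, request_id: str) -> int:
            """Reserve one KV cache slot for the next generated token."""
            if not self.free_slots:
                raise RuntimeError("KV cache is full; the request must wait")
            slot = self.free_slots.pop()
            self.owned.setdefault(request_id, []).append(slot)
            return slot

        def free_request(self, request_id: str) -> None:
            """Return every slot of a finished request to the free pool."""
            self.free_slots.extend(self.owned.pop(request_id, []))


    allocator = TokenSlotAllocator(total_token_slots=8)
    first = allocator.alloc_token("req-0")    # one slot per decoded token
    second = allocator.alloc_token("req-0")
    allocator.free_request("req-0")           # slots become reusable immediately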
@@ -79,34 +75,4 @@ With the highly coordinated efficient kernels developed based on OpenAI Triton a



Performance
-----------

We conducted performance comparisons on the ShareGPT_Vicuna_unfiltered dataset using the current mainstream inference frameworks TGI, NV Triton + FasterTransformer, and vLLM. The results are shown in the graph below. It can be observed that LightLLM achieves higher throughput across different model sizes. TGI suffers from severe memory fragmentation, making it difficult to achieve high throughput. vLLM introduces PagedAttention, but because its overall implementation details are more favorable for small-model inference, its concurrent performance on large models is not ideal (using default configurations). In contrast, LightLLM maintains robust performance across various model sizes and achieves around a 2-3x improvement over TGI and vLLM on large models (LLaMA-65B).

.. image:: ../assets/lightllm/Performance.png
:alt: Efficient_Router1
:align: center

TGI Compatibility & Ablation Analysis: To further validate the effectiveness of TokenAttention and Router, we also integrated these features into TGI to address its memory fragmentation issue, as shown in the figure below (left). It can be observed that introducing TokenAttention and Router leads to more than a 4x performance improvement compared to the original TGI.

Improvement in the case of mixed long and short requests: From the figure below (left), it can be seen that introducing the Router did not bring a more significant performance improvement, because the variation in question length in the ShareGPT_Vicuna_unfiltered dataset is not significant. For this reason, we constructed a collection of requests with much larger differences in length and verified the performance of our Efficient Router. The results are shown below (right). It can be seen that our Efficient Router makes better use of GPU resources and brings nearly a 50% performance improvement for requests with large differences in question length.

.. image:: ../assets/lightllm/Performance2.png
:alt: Efficient_Router1
:align: center


The left figure shows the compatibility of LightLLM with TGI and the ablation analysis; the right figure shows the improvement our Efficient Router brings on mixed long and short requests.

Future Work
----------------

* Support for more models
* Router scheduling enhancements
* High-performance int8 and int4 weight-only support, and int8 kv cache
* Fully quantized models
* Mixed-precision models
* Sparsification

LightLLM is committed to enabling more people to participate, allowing flexible and efficient exploration of various LLM deployment and inference solutions. It also serves as a reference for hardware manufacturers to promote the development of the field. We hope that everyone can give it more stars, fork the project, and contribute. We believe that in the future, more technologies and solutions (such as TensorRT) will emerge, continuously reducing deployment costs and making AGI more accessible to ordinary households.