LLM-inference-optimization-paper

Summary of some awesome works for optimizing LLM inference

This summary includes three parts:

  1. some repositories that you can follow
  2. some representative person or labs that you can follow
  3. some important works in the different research interests

Repositories

For example, LLMSys-PaperList contains many excellent articles and is kept up to date (which I believe is the most important quality for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.

Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.

The log "Large Transformer Model Inference Optimization" helps me a lot at the beginning.

This log, OpenAI Keynote on Building Scalable AI Infrastructure, seems to be a leading guide.

Person/Lab

Follow others' research, and find your own ideas.

It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people. If you have a different opinion, please feel free to reach out through an issue.
In no particular order!!
Damn, I can't remember the names of foreigners.

Zhihao JIA: FlexFlow and other impressive work, important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive work, important role in machine learning systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML including sparsity and quantization; btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous in efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, et al.

SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh

IPADS: focuses more on pure systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU

Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important works in MLSys at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng Miao: SpotServe, SpecInfer, HET, et al.

Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database works well with MLSys, affiliated with HKUST
Lei CHEN: database works well with MLSys; many papers, so I recommend focusing on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: work in System and MLSys, affiliated with HKUST

Work

I hope to organize these impressive works based on their research direction.
But my summary may not be informative enough, and I am looking forward to your additions.

Perhaps someone should write a detailed survey.

Periodically checking the "cited by" of the papers marked with ⭐ will be helpful.
Sections marked with 💡 are not yet polished.

Survey/Evaluations/Benchmarks 💡

Making useful benchmarks or evaluations is helpful.

Interesting NEW Frameworks in Parallel Decoding

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf

prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding

Both frameworks use parallel decoding and deserve more detailed study.

Papers for Parallel Decoding

There are some interesting papers about parallel decoding.

Complex Inference

In fact, I'm not so familiar with this topic, but perhaps OpenAI o1 used this...
Spend more time on inference than on pre-training.

GPT-o1

This topic is about GPT-o1, aka the strawberry.

Speculative Decoding

Also known as speculative sampling, a form of model collaboration.
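
To make the idea concrete, here is a minimal sketch of the greedy draft-then-verify loop: a small draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is accepted. The `draft_next` / `target_next` callables are hypothetical stand-ins for real model calls, not any particular library's API.

```python
# Minimal sketch of greedy speculative decoding: a small draft model proposes
# k tokens, the target model verifies them (one batched forward pass in a real
# system), and the longest agreeing prefix is accepted.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=32):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Target model checks each drafted position; here we call it per
        #    position for clarity, a real system verifies all k at once.
        accepted = 0
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted += 1
            else:
                # 3. First mismatch: keep the target model's own token instead.
                tokens.extend(draft[:accepted] + [expected])
                break
        else:
            # All k draft tokens accepted; the target model adds one more.
            tokens.extend(draft)
            tokens.append(target_next(tokens))
    return tokens

# Toy usage: the "models" are simple deterministic functions over token ids.
draft = lambda ctx: (ctx[-1] + 1) % 50
target = lambda ctx: (ctx[-1] + 1) % 50   # agrees with the draft here
print(speculative_decode([0, 1, 2], draft, target, k=4, max_new_tokens=8))
```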

different model collaboration

Skeleton-of-Thought

3D Parallelism 💡

Some knowledge about data parallelism, tensor (model) parallelism, and pipeline parallelism will help in this track.
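
As a toy illustration of one of the three dimensions, the sketch below splits the columns of a weight matrix across two simulated "devices" (tensor parallelism) and checks that the concatenated partial results match the single-device matmul; all names are illustrative, not a real framework API.

```python
# Toy illustration of tensor (intra-layer) parallelism: the columns of a
# weight matrix are split across two simulated devices, each computes its
# partial output, and the results are concatenated (the "gather" step).

def matmul(x, w):
    # x: [n], w: [n][m] -> [m]
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    # Assumes the number of columns is divisible by `parts`.
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]                       # a 3x4 weight matrix

shards = split_columns(w, parts=2)          # each "device" holds a 3x2 shard
partial = [matmul(x, shard) for shard in shards]
y_parallel = partial[0] + partial[1]        # concatenate along columns

assert y_parallel == matmul(x, w)           # matches the single-device result
print(y_parallel)
```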

Communication Overlap

Prune & Sparsity 💡

An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computation.
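
As a concrete example of the semi-structured case, here is a minimal sketch of magnitude-based 2:4 pruning (keep the 2 largest-magnitude weights in every group of 4), which is the pattern sparse tensor cores can accelerate; the helper name is mine.

```python
# Minimal sketch of 2:4 semi-structured magnitude pruning: within every group
# of 4 consecutive weights, keep the 2 with the largest magnitude and zero the
# rest.

def prune_2_of_4(weights):
    pruned = []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        # Indices of the 2 largest-magnitude entries in this group.
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

w = [0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.0, 0.6]
print(prune_2_of_4(w))   # -> [0.0, -0.9, 0.4, 0.0, 0.7, 0.0, 0.0, 0.6]
```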

Quantization 💡

Low-precision for memory and computing efficiency.
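
For intuition, here is a minimal sketch of symmetric per-tensor int8 quantization; real methods (per-channel, group-wise, GPTQ-style) refine this basic recipe, and the function names are illustrative.

```python
# Minimal sketch of symmetric per-tensor int8 quantization: weights are scaled
# by max(|w|)/127, rounded to integers in [-127, 127], and dequantized back by
# multiplying with the scale.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0   # guard all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.02, -1.30, 0.75, 0.00, -0.41]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)                                            # int8 codes
print([round(a - b, 4) for a, b in zip(w, w_hat)])  # small quantization error
```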

Batch Processing

Perhaps the most important way to improve throughput in LLM inference.
This blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.

Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before. The name Dynamic Batching is more commonly associated with Triton.
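
A toy sketch of the continuous (iteration-level) batching idea: after every decode step, finished requests leave the running batch and waiting requests are admitted, instead of draining the whole batch first. All class and function names here are illustrative, not the APIs of vLLM, Orca, or Triton.

```python
# Minimal sketch of continuous (iteration-level) batching: after every decode
# step, finished requests leave the batch and waiting requests are admitted.
from collections import deque

class Request:
    def __init__(self, rid, target_len):
        self.rid, self.target_len, self.generated = rid, target_len, 0
    def done(self):
        return self.generated >= self.target_len

def serve(requests, max_batch=2):
    waiting, running, step = deque(requests), [], 0
    while waiting or running:
        # Admit waiting requests into free slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        for r in running:
            r.generated += 1
        finished = [r for r in running if r.done()]
        running = [r for r in running if not r.done()]
        step += 1
        for r in finished:
            print(f"step {step}: request {r.rid} finished after {r.generated} tokens")

serve([Request(0, 3), Request(1, 1), Request(2, 2)])
```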

Computing Optimization

This part includes some impressive works that optimize LLM computation by exploiting the underlying computing properties, such as FlashAttention.
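
The key computing property FlashAttention exploits is that softmax can be computed online, block by block, with a running maximum and normalizer, so the full attention matrix never has to be materialized. Below is a minimal single-query, pure-Python sketch of that recurrence (not the actual kernel).

```python
# Minimal sketch of the online-softmax recurrence underlying FlashAttention:
# scores are consumed block by block while maintaining a running maximum m,
# normalizer l, and weighted-sum accumulator acc.
import math

def online_softmax_attention(scores, values, block=2):
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * correction + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

# Reference: ordinary softmax-weighted sum over all scores at once.
scores = [0.1, 2.0, -1.0, 0.5]
values = [1.0, 2.0, 3.0, 4.0]
exps = [math.exp(s - max(scores)) for s in scores]
reference = sum(e * v for e, v in zip(exps, values)) / sum(exps)
assert abs(online_softmax_attention(scores, values) - reference) < 1e-9
print(online_softmax_attention(scores, values))
```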

FlashAttention Family

Optimization focus on Auto-regressive Decoding

Kernels Optimization

Memory Manage

This part is inspired by PagedAttention in vLLM. There are also many top-conference papers discussing memory management for DL computing on GPUs.
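
A minimal sketch of the block-table idea behind PagedAttention: the KV cache is carved into fixed-size blocks, each sequence keeps a table mapping its logical token positions to physical blocks, and blocks are allocated on demand and returned when the sequence finishes. Sizes and names below are illustrative only, not the vLLM implementation.

```python
# Minimal sketch of paged KV-cache management in the spirit of PagedAttention.

BLOCK_SIZE = 4          # tokens per physical block (illustrative)
NUM_BLOCKS = 8          # total physical blocks in the pool

class PagedKVCache:
    def __init__(self):
        self.free_blocks = list(range(NUM_BLOCKS))
        self.block_tables = {}          # seq_id -> list of physical block ids
        self.lengths = {}               # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one more token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:    # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # Physical location of this token: (block id, offset within the block).
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache()
for _ in range(6):
    print(cache.append_token(seq_id=0))   # 6 tokens -> occupies 2 blocks
cache.free(0)
print(len(cache.free_blocks))             # all 8 blocks are free again
```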

Inference on hardware: GPUs, CPUs or based on SSD

Underlying optimization for GPU

CPUs or based on SSD

Heterogeneous scenarios and single-PC deployment are becoming increasingly important.

Optimizing computation on CPUs or SSDs calls for different methods.

Inference on personal device

Inspired by the AI PC, this opens up a new area.

Heterogeneous or decentralized environments

Algorithm Optimization 💡

In this part, researchers provide algorithm-based methods to optimize LLM inference.

Industrial Inference Frameworks 💡

LLM Serving 💡

LLM serving providers will focus on this part. Engineering practice is just as important as algorithmic optimization.

Aligning Systems

Comm kernels

Dynamic resource

Request Scheduling

Shared Prefix Serving

Serving for LoRA


For LoRA but not serving

Combining fine-tuning/training with inference

Serving Long-Context

Long-context serving is a hot topic recently.

RAG with LLM

Combine MoE with LLM inference

Here are two repositories with some papers on MoE: Papers: MoE/Ensemble, and MOE papers to read

MoE training

Inference with multimodal

Training in Multimodal

Diffusion Models

Compound Inference Systems

What is this? Perhaps systems that compose multiple LLMs?

LLM Application

Fault Tolerance

Energy Optimization

It is usually related to CPU-GPU heterogeneity and GPU power consumption.

Some Interesting Idea

Wise men learn from others.

Dataflow

I'd like to create a separate area for data flows. It's just my preference.

How about data pre-processing overhead in training?

GNN

Just my preference.
