LLM-inference-optimization-paper

Summary of some awesome works for optimizing LLM inference

This summary includes three parts:

  1. some repositories that you can follow
  2. some representative person or labs that you can follow
  3. some important works in the different research interests

Repositories

For example, LLMSys-PaperList contains many excellent articles and is kept up to date (which I believe is the most important quality for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.

Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.

The log "Large Transformer Model Inference Optimization" helps me a lot at the beginning.

This log, OpenAI Keynote on Building Scalable AI Infrastructure, seems to be a leading guide.

Person/Lab

Follow others' research, and find your own ideas.

It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people. If you have a different opinion, please feel free to reach out through an issue.
In no particular order!!
Damn, I can't remember the names of foreigners.

Zhihao JIA: FlexFlow and other impressive work, important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive work, important role in machine learning systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML including sparsity and quantization; btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous in efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, et al.

SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh

IPADS: focuses more on pure systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU

Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important works in MLSys at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng Miao: SpotServe, SpecInfer, HET, et al.

Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database works well with MLSys, affiliated with HKUST
Lei CHEN: database works well with MLSys; many papers, so I recommend focusing on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: work in System and MLSys, affiliated with HKUST

Work

I hope to organize these impressive works based on their research direction.
But my summary may not be informative enough, and I am looking forward to your additions.

Perhaps someone should write a detailed survey.

Periodically checking the "cited by" of the papers marked with ⭐ will be helpful.
Sections marked with 💡 are not yet polished.

Survey/Evaluations/Benchmarks 💡

Making useful benchmarks or evaluations is helpful.

Interesting NEW Frameworks in Parallel Decoding

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf

prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding

Both frameworks use parallel decoding and deserve more detailed study.

Papers for Parallel Decoding

There are some interesting papers about parallel decoding.

Complex Inference

In fact, I'm not so familiar with this topic, but perhaps OpenAI o1 used this...
Spend more time on inference than on pre-training.

GPT-o1

This topic is about GPT-o1, aka the strawberry.

Speculative Decoding

Also known as speculative sampling, a form of model collaboration.
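
To make the idea concrete, here is a minimal sketch of the greedy draft-then-verify loop: a small draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is accepted. The `draft_next` / `target_next` callables are hypothetical stand-ins for real model calls, not any particular library's API.

```python
# Minimal sketch of greedy speculative decoding: a small draft model proposes
# k tokens, the target model verifies them (one batched forward pass in a real
# system), and the longest agreeing prefix is accepted.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=32):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Target model checks each drafted position; here we call it per
        #    position for clarity, a real system verifies all k at once.
        accepted = 0
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted += 1
            else:
                # 3. First mismatch: keep the target model's own token instead.
                tokens.extend(draft[:accepted] + [expected])
                break
        else:
            # All k draft tokens accepted; the target model adds one more.
            tokens.extend(draft)
            tokens.append(target_next(tokens))
    return tokens

# Toy usage: the "models" are simple deterministic functions over token ids.
draft = lambda ctx: (ctx[-1] + 1) % 50
target = lambda ctx: (ctx[-1] + 1) % 50   # agrees with the draft here
print(speculative_decode([0, 1, 2], draft, target, k=4, max_new_tokens=8))
```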

different model collaboration

Skeleton-of-Thought

3D Parallelism 💡

Some knowledge about data parallelism, tensor (model) parallelism, and pipeline parallelism will help in this track.
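
As a toy illustration of one of the three dimensions, the sketch below splits the columns of a weight matrix across two simulated "devices" (tensor parallelism) and checks that the concatenated partial results match the single-device matmul; all names are illustrative, not a real framework API.

```python
# Toy illustration of tensor (intra-layer) parallelism: the columns of a
# weight matrix are split across two simulated devices, each computes its
# partial output, and the results are concatenated (the "gather" step).

def matmul(x, w):
    # x: [n], w: [n][m] -> [m]
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    # Assumes the number of columns is divisible by `parts`.
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]                       # a 3x4 weight matrix

shards = split_columns(w, parts=2)          # each "device" holds a 3x2 shard
partial = [matmul(x, shard) for shard in shards]
y_parallel = partial[0] + partial[1]        # concatenate along columns

assert y_parallel == matmul(x, w)           # matches the single-device result
print(y_parallel)
```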

Communication Overlap

Prune & Sparsity 💡

An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computation.
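
As a concrete example of the semi-structured case, here is a minimal sketch of magnitude-based 2:4 pruning (keep the 2 largest-magnitude weights in every group of 4), which is the pattern sparse tensor cores can accelerate; the helper name is mine.

```python
# Minimal sketch of 2:4 semi-structured magnitude pruning: within every group
# of 4 consecutive weights, keep the 2 with the largest magnitude and zero the
# rest.

def prune_2_of_4(weights):
    pruned = []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        # Indices of the 2 largest-magnitude entries in this group.
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

w = [0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.0, 0.6]
print(prune_2_of_4(w))   # -> [0.0, -0.9, 0.4, 0.0, 0.7, 0.0, 0.0, 0.6]
```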

Quantization 💡

Low-precision for memory and computing efficiency.
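
For intuition, here is a minimal sketch of symmetric per-tensor int8 quantization; real methods (per-channel, group-wise, GPTQ-style) refine this basic recipe, and the function names are illustrative.

```python
# Minimal sketch of symmetric per-tensor int8 quantization: weights are scaled
# by max(|w|)/127, rounded to integers in [-127, 127], and dequantized back by
# multiplying with the scale.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0   # guard all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.02, -1.30, 0.75, 0.00, -0.41]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)                                            # int8 codes
print([round(a - b, 4) for a, b in zip(w, w_hat)])  # small quantization error
```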

Batch Processing

Perhaps the most important way to improve throughput in LLM inference.
This blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.

Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before. The name Dynamic Batching is more commonly associated with Triton.
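
A toy sketch of the continuous (iteration-level) batching idea: after every decode step, finished requests leave the running batch and waiting requests are admitted, instead of draining the whole batch first. All class and function names here are illustrative, not the APIs of vLLM, Orca, or Triton.

```python
# Minimal sketch of continuous (iteration-level) batching: after every decode
# step, finished requests leave the batch and waiting requests are admitted.
from collections import deque

class Request:
    def __init__(self, rid, target_len):
        self.rid, self.target_len, self.generated = rid, target_len, 0
    def done(self):
        return self.generated >= self.target_len

def serve(requests, max_batch=2):
    waiting, running, step = deque(requests), [], 0
    while waiting or running:
        # Admit waiting requests into free slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        for r in running:
            r.generated += 1
        finished = [r for r in running if r.done()]
        running = [r for r in running if not r.done()]
        step += 1
        for r in finished:
            print(f"step {step}: request {r.rid} finished after {r.generated} tokens")

serve([Request(0, 3), Request(1, 1), Request(2, 2)])
```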

Computing Optimization

This part includes some impressive works that optimize LLM computation by exploiting the underlying computing properties, such as FlashAttention.
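
The key computing property FlashAttention exploits is that softmax can be computed online, block by block, with a running maximum and normalizer, so the full attention matrix never has to be materialized. Below is a minimal single-query, pure-Python sketch of that recurrence (not the actual kernel).

```python
# Minimal sketch of the online-softmax recurrence underlying FlashAttention:
# scores are consumed block by block while maintaining a running maximum m,
# normalizer l, and weighted-sum accumulator acc.
import math

def online_softmax_attention(scores, values, block=2):
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * correction + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

# Reference: ordinary softmax-weighted sum over all scores at once.
scores = [0.1, 2.0, -1.0, 0.5]
values = [1.0, 2.0, 3.0, 4.0]
exps = [math.exp(s - max(scores)) for s in scores]
reference = sum(e * v for e, v in zip(exps, values)) / sum(exps)
assert abs(online_softmax_attention(scores, values) - reference) < 1e-9
print(online_softmax_attention(scores, values))
```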

FlashAttention Family

Optimization focus on Auto-regressive Decoding

Kernels Optimization

Memory Manage

This part is inspired by PagedAttention in vLLM. There are also many top-conference papers discussing memory management for DL computing on GPUs.
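
A minimal sketch of the block-table idea behind PagedAttention: the KV cache is carved into fixed-size blocks, each sequence keeps a table mapping its logical token positions to physical blocks, and blocks are allocated on demand and returned when the sequence finishes. Sizes and names below are illustrative only, not the vLLM implementation.

```python
# Minimal sketch of paged KV-cache management in the spirit of PagedAttention.

BLOCK_SIZE = 4          # tokens per physical block (illustrative)
NUM_BLOCKS = 8          # total physical blocks in the pool

class PagedKVCache:
    def __init__(self):
        self.free_blocks = list(range(NUM_BLOCKS))
        self.block_tables = {}          # seq_id -> list of physical block ids
        self.lengths = {}               # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one more token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:    # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # Physical location of this token: (block id, offset within the block).
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache()
for _ in range(6):
    print(cache.append_token(seq_id=0))   # 6 tokens -> occupies 2 blocks
cache.free(0)
print(len(cache.free_blocks))             # all 8 blocks are free again
```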

Inference on hardware: GPUs, CPUs or based on SSD

Underlying optimization for GPU

CPUs or based on SSD

Heterogeneous scenarios and single-PC deployment are becoming increasingly important.

Optimizing computation on CPUs or SSDs calls for different methods.

Inference on personal device

Inspired by the AI PC, this opens up a new area.

Heterogeneous or decentralized environments

Algorithm Optimization 💡

In this part, researchers provide algorithm-based methods to optimize LLM inference.

Industrial Inference Frameworks 💡

LLM Serving 💡

LLM serving providers will focus on this part. Engineering practice is just as important as algorithmic optimization.

Aligning Systems

Comm kernels

Dynamic resource

Request Scheduling

Shared Prefix Serving

Serving for LoRA


For LoRA but not serving

Combining fine-tuning/training with inference

Serving Long-Context

Long-context serving is a hot topic recently.

RAG with LLM

Combine MoE with LLM inference

Here are two repositories with some papers on MoE: Papers: MoE/Ensemble, and MOE papers to read

MoE training

Inference with multimodal

Training in Multimodal

Diffusion Models

Compound Inference Systems

What is this? Perhaps systems that compose multiple LLMs?

LLM Application

Fault Tolerance

Energy Optimization

It is usually related to CPU-GPU heterogeneity and GPU power consumption.

Some Interesting Idea

Wise men learn from others.

Dataflow

I'd like to create a separate area for data flows. It's just my preference.

How about data pre-processing overhead in training?

GNN

Just my preference.
