SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Introduction

SWIFT is an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. This method does not require auxiliary models or additional training, making it a plug-and-play and cost-effective solution for accelerating LLM inference.
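The draft-then-verify idea behind self-speculative decoding can be illustrated with a minimal sketch. It assumes greedy decoding and uses a toy integer "model" in place of an LLM; `forward`, `vanilla_decode`, `self_speculative_decode`, `NUM_LAYERS`, and `VOCAB` are hypothetical names invented for this illustration and are not part of the SWIFT codebase.

```python
from typing import FrozenSet, List

NUM_LAYERS = 8   # toy depth, not a real model size
VOCAB = 13       # toy vocabulary size


def forward(token: int, skipped: FrozenSet[int] = frozenset()) -> int:
    """Toy next-token predictor: runs every layer whose index is not skipped."""
    state = token
    for i in range(NUM_LAYERS):
        if i in skipped:
            continue  # layer skipping: this layer's computation is bypassed
        state = (state * (i % 3 + 1) + i) % VOCAB
    return state


def vanilla_decode(prompt: int, num_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding with the full toy model."""
    out, last = [], prompt
    for _ in range(num_tokens):
        last = forward(last)
        out.append(last)
    return out


def self_speculative_decode(prompt: int, num_tokens: int,
                            skipped: FrozenSet[int], draft_len: int = 4) -> List[int]:
    """Draft with the layer-skipping model, then verify with the full model."""
    out = [prompt]
    while len(out) - 1 < num_tokens:
        # Draft: the same model with some layers skipped proposes draft_len tokens.
        draft, last = [], out[-1]
        for _ in range(draft_len):
            last = forward(last, skipped)
            draft.append(last)
        # Verify: the full model checks each drafted position. A real LLM would
        # do this in one parallel forward pass; the toy model loops instead.
        last = out[-1]
        for tok in draft:
            target = forward(last)
            out.append(target)      # the full-model token is always the one kept
            if tok != target:
                break               # first mismatch: discard the rest of the draft
            last = target
    return out[1:num_tokens + 1]
```

Because every kept token comes from the full model, the output matches vanilla greedy decoding for any choice of skipped layers; the skipped layer set only affects how many drafted tokens are accepted per verification.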

SWIFT divides LLM inference into two distinct phases:

Optimization phase: Identify the optimal skipped layer set given the input data stream.
Acceleration phase: Employ the determined configuration to accelerate LLM inference.

During the optimization phase, SWIFT performs an optimization step before each LLM decoding step to adjust the skipped layer set. This step involves: a) Efficient layer set optimization: SWIFT combines random search with interval Bayesian optimization to propose layer set candidates efficiently. b) Parallel candidate evaluation: SWIFT uses LLM-generated tokens as ground truth, enabling simultaneous validation of the proposed candidates. The best-performing layer set is selected to accelerate the current decoding step.
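A simplified sketch of this optimization step follows. It covers only the random-search proposal and the scoring of candidates against tokens produced by the full model; the interval Bayesian optimization is omitted, and candidates are scored one after another rather than in a batched pass. The toy model and all names (`SHIFTS`, `forward`, `matching_rate`, `random_search`) are assumptions made for illustration, not the repository's implementation.

```python
import random
from typing import FrozenSet, List, Tuple

SHIFTS = [3, 0, 5, 0, 0, 2, 0, 1]  # toy layers; a shift of 0 marks a redundant layer
VOCAB = 13                          # toy vocabulary size


def forward(token: int, skipped: FrozenSet[int] = frozenset()) -> int:
    """Toy next-token predictor that applies every non-skipped layer."""
    state = token
    for i, shift in enumerate(SHIFTS):
        if i not in skipped:
            state = (state + shift) % VOCAB
    return state


def matching_rate(skipped: FrozenSet[int], contexts: List[int],
                  targets: List[int]) -> float:
    """Fraction of positions where the layer-skipping draft agrees with the
    tokens the full model already generated (used as ground truth)."""
    hits = sum(forward(c, skipped) == t for c, t in zip(contexts, targets))
    return hits / len(targets)


def random_search(contexts: List[int], targets: List[int], num_skip: int,
                  trials: int = 30, seed: int = 0) -> Tuple[FrozenSet[int], float]:
    """Propose random skipped-layer sets and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [frozenset(rng.sample(range(len(SHIFTS)), num_skip))
                  for _ in range(trials)]
    # All candidates are validated against the same ground-truth tokens,
    # so a real implementation could evaluate them together.
    scored = [(matching_rate(c, contexts, targets), c) for c in candidates]
    best_score, best_set = max(scored, key=lambda pair: pair[0])
    return best_set, best_score


contexts = list(range(VOCAB))
targets = [forward(c) for c in contexts]   # full-model tokens as ground truth
best_set, best_score = random_search(contexts, targets, num_skip=3)
```

In this toy setup a candidate that skips only redundant layers reproduces every ground-truth token, which is the kind of layer set the search is meant to find.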

Todo

Support both greedy and sampling inference (maintaining output distribution).
Support cached layer configuration.

Installation

conda create -n swift python=3.9
conda activate swift
cd SWIFT
pip install -r requirements.txt

Inference

Run the commands in eval_llama.sh. The results will be stored in outputs/.../model_answer/.

./eval_llama.sh

For a quick start with a cached layer configuration, uncomment --cache-hit in eval_llama.sh.

Speedup Report

Compute the speedup of SWIFT relative to vanilla autoregressive decoding:

python evaluation_llama/speed.py --file-path /your_own_path/swift.jsonl --base-path /your_own_path/llama_vanilla.jsonl
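For intuition, one common way to define speedup is the ratio of decoding throughput (generated tokens per second) between the accelerated run and the vanilla baseline. The sketch below assumes that definition; the record fields `new_tokens` and `wall_time` are hypothetical and may not match the files written by eval_llama.sh or the logic in speed.py, and the sample numbers are illustrative, not measured results.

```python
from typing import Dict, List

Record = Dict[str, float]  # hypothetical per-question record


def tokens_per_second(records: List[Record]) -> float:
    """Total generated tokens divided by total wall-clock decoding time."""
    total_tokens = sum(r["new_tokens"] for r in records)
    total_time = sum(r["wall_time"] for r in records)
    return total_tokens / total_time


def speedup(method_records: List[Record], baseline_records: List[Record]) -> float:
    """Throughput of the accelerated method relative to the baseline."""
    return tokens_per_second(method_records) / tokens_per_second(baseline_records)


# Illustrative numbers only.
swift_runs = [{"new_tokens": 120, "wall_time": 2.0}, {"new_tokens": 80, "wall_time": 2.0}]
vanilla_runs = [{"new_tokens": 120, "wall_time": 3.0}, {"new_tokens": 80, "wall_time": 2.0}]
ratio = speedup(swift_runs, vanilla_runs)
```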

Acknowledgments

This codebase is built on Self-SD and EAGLE. The logo was designed by GPT-4.

Citation

If you find the resources in this repository useful, please cite our paper:

@misc{xia2024swiftontheflyselfspeculativedecoding,
      title={SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration}, 
      author={Heming Xia and Yongqi Li and Jun Zhang and Cunxiao Du and Wenjie Li},
      year={2024},
      eprint={2410.06916},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06916}, 
}
