With Speculative Sampling (SSp), a large language model can generate tokens quite faster using a smaller model as help.
This repo shows how it's done and measures timing improvements using Llama models. Here's the idea:
Above, you see a 30B llama model generating tokens (on an 8-GPU A100 machine), then you see the same model going ~50% to 100% faster (i.e. in 33% to 50% less time) using speculative sampling -- with the same completion quality. Go to Try it yourself to try it yourself :)
This repo implements an algorithm published in this paper whose authors are warmly thanked for their work. For non-academic folks, we attempt a shortened, didactic explanation of how it works in our our blog post
In the following, the large model we try to speed up is called the target model
, and the smaller model that helps sampling is called the draft model
. We compare regular sampling (RSp) of the target model with speculative sampling (SSp) of the target model with the draft model.
Side-Note: this repo is an example of the kind of work we do at Dust ✨
- Almost identical memory footprint
- Same completion quality
- 1.5 to 3 times faster token generation
- Relatively simple code
- useful mostly for live token generation, less useful in batch settings (and irrelevant to model training)
- works best in settings where there are often "easily guessable" tokens. This is very often the case, especially for written language, e.g.:
- there is often a dot at the end of a sentence, a capitalized letter next, determinants or common words like "the, a, it..."
- given the way tokenization is done too; Eg token "amaz" => when the beginning is "It was amaz", even tiny models will guess the next token is "ing" (and the following one probably ! or :))
- draft model & target model must have the same vocabulary size
- draft model should be at least 2-3 faster than target model
They return different outputs because speculative sampling has inherent randomness. However, it is demonstrated in the paper on which this repo is based that using speculative sampling provides the same output probability distributions than regular sampling.
The paper also performs experiments to show this on typical benchmarks. In this repo, experiments on completing multiplications provide further empirical evidence that SSp completion and RSp completion are of similar quality.
Get access to a machine with 4 or more GPUs, then:
ssh [fat_machine_with_gpus]
git clone https://github.com/philipperolet/llama-ssp.git
cd llama-ssp
chmod a+x machine-install.sh
sudo ./machine-install.sh # global setup
. setup.sh # project setup(virtual env, requirements)
python llamassp.py compare 30B 7B # compare regular & ssp as in the gif
Get the cli details with
python llamassp.py -h
Note: the above was tested on an Ubuntu 22.04 OS.
To run experiments measuring average model latency in ms/token:
python3 llamassp.py latency TARGET_MODEL_NAME [--draft DRAFT_MODEL_NAME]
Results of such measurements are below.
TARGET_MODEL_NAME correspond to various flavors of Llama models (7B to 30B), with or without quantization. Possible values are 7B, 13B, 30B, 7B_8bit, 13B_8bit, 30B, 30B_8bit, 65B, 65B_8bt
. These are the models published on HuggingFace by decapoda-research
.
Using 65B versions, however, requires providing the weights yourself. Do so as explained below in "Use of custom model"
The full model configs are defined as model_params
in llamassp.py
and can be completed/changed as required -- make sure that there is enough memory for the model to run. Models run in parallel in all the GPUs available.
If DRAFT_MODEL_NAME is not specified, the target model is run with regular sampling on a few examples, and model latency is measured.
If DRAFT_MODEL_NAME is specified, speculative sampling latency is measured.
To use a custom model, add it in model_params
. You can provide a path to HF weights as model_name
. Find weights on the internet, then convert them to HF weights following these instructions.
This step is required if you want to use a 65B model. This is because the 65B model provided by decapoda-research performed very bad completions through the code of this repo at the time of the writing on the settings. Using 65B weights provided by another source then converting them to HF weights was tested and works fine.
!! Any huggingface model can be tested (not only Llama), however speculative sampling requires the draft and the target models to have the same tokenizer.
To run these experiments, the sum of available memory should be about 4 times the model size for a regular model, and 2 times the model size for a quantized model
E.g. to test the 30B model => you need 120GB GPU memory. To test the 7B_8bit model => you can do it with 14GB so a single GPU might be enough. To test speculative sampling of a 30B model with a 7B draft, you need 148GB GPU memory (4*37).
You may stumble on this kind of error: Error 802: system not yet initialized
. On some machines--e.g. p4d.24xlarge that was used for these experiments--additional setup might be required:
sudo apt-get install cuda-drivers-fabricmanager
sudo systemctl start nvidia-fabricmanager
To try exactly what's in the gif with various target models and draft models
python3 llamassp.py compare TARGET_MODEL_NAME DRAFT_MODEL_NAME
Evals measure how well models (regular or speculative sampling) perform on a simple multiplication benchmark:
python llamassp.py eval [--seed SEED] [--nb-prompts NB_PROMPTS] TARGET_MODEL DRAFT_MODEL
The command will assess the model by making it perform NB_PROMPTS multiplications and return the ratio of success. The multiplications are randomly drawn with numbers between 1 and 100, with a seed for repeatability.
The main result is the Llama 30B and 65B speed improvements using SSp.
Using speculative sampling with a 7B draft model provides a ~80% speed improvement with same quality of completions over the regular sampling of the 30B model, and a ~125% speed improvement for the 65B model.
On a p4d.24xlarge (8 A100 40GB gpus):
Model_type | Ms/token |
---|---|
7B_8bit | 215ms |
7B | 70ms |
7B_4GPUs | 70ms |
13B | 125ms |
30B | 330ms |
65B | 610ms |
Comparison SSp / RSp (regular sampling):
Model_type | Ms/token | Speed Improvement |
---|---|---|
13B | 125ms | - |
SSp 13B/7B | 114ms | 1.1x |
30B | 330ms | - |
SSp 30B/7B | 180ms | 1.8x |
65B | 610ms | - |
SSp 65B/7B | 270ms | 2.25x |
The timings perform completions on 15 examples (+ 1 warmup not shown), and finally output the model generation latency in ms/token (so lower is better). The measurements are on relatively small prompts (~ 1 sentence) and small completions (64 tokens); longer prompts / completion would of course decrease the speed.
The per-token timings may seem slow. There are two reasons for this: 1/ we are not batching, many models can generate a lot faster in batch settings 2/ we did not optimize for per-model speed. This was not needed since the general point to make the case for speculative sampling is that if you have a fast model and a slow one, you can get the slow one to be a lot faster for (almost) free.
Experiments with quantized models on a g5.12xlarge AWS instance:
Model_type | Ms/token |
---|---|
7B_4GPUs | 86ms |
7B_8bit | 210ms |
7B_8bit_4GPUs | 216ms |
13B_8bit | 270ms |
30B_8bit | 405ms |
65B_8bit | 530ms |
Comparison SSp / RSp:
Model_type | Ms/token | Speed Improvement |
---|---|---|
13B_8bit | 270ms | - |
SSp 13B/7B_8bit | 330ms | - |
30B_8bit | 405ms | |
SSp 30B/7B_8bit | 370ms |
In addition to measuring the speed improvement, it is necessary to check speculative sampling samples with the same distribution than regular sampling, in other words that the SSp generations are as good as the regular ones.
There is inherent randomness in the acceptation of tokens from the draft model; therefore the completions from SSp and RSp naturally differ and we cannot expect the outputs to be exactly the same.
The paper cited at the beginning provides theoretical proof that the output token distributions are the same, as well as evaluations on common benchmarks. With this repo, an intuitive check can be done by looking at the completions : those of the regularly sampled 30B model seem to be of the same quality level than those of SSp 30B/7B.
In order to further show that SSp provides same quality results as the target model with the code of this repo, we perform evaluations by prompting models for random multiplications of numbers between 1 and 99. Each model is assessed by its percentage of successes, in regular sampling (RSp) and SSp mode with a 7B draft model. Results are displayed below with 95% confidence bounds (Wald).
Model_type | RSp perf | SSp perf |
---|---|---|
7B | 28.4% (+- 1.4%) | 28.1% (+- 1.4%) |
13B | 42.9% (+-1.5%) | 43.1 (+- 1.5%) |
30B | 50.1% (+- 1.5%) | 49.6% (+- 1.5%) |
65B | TBD | TBD |
The results show that SSp and RSp perform the same with high confidence. The script for these experiments can be found here.