minimal C implementation of speculative decoding based on llama2.c

mscheong01/speculative_decoding.c

speculative_decoding.c

A minimal C implementation of speculative decoding for Llama 2 models.

Speculative decoding is a technique used to speed up autoregressive inference with the help of a lightweight draft model. This project demonstrates this approach with simple pure C code.
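
As a rough illustration of the idea, here is a minimal sketch of the token acceptance rule described in the paper listed under References: a draft token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is sampled from the normalized residual max(0, p - q). The function names `accept_draft_token` and `residual_distribution` are hypothetical and not taken from this repository, which may use a different (for example greedy) acceptance scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from this repository.
 * p: base (target) model probabilities over the vocabulary
 * q: draft model probabilities over the vocabulary
 * x: token proposed by the draft model (sampled from q, so q[x] > 0)
 * u: uniform random number in [0, 1)
 * Returns 1 if the draft token is accepted, 0 otherwise. */
int accept_draft_token(const float *p, const float *q, int x, float u) {
    assert(q[x] > 0.0f);
    float ratio = p[x] / q[x];
    if (ratio > 1.0f) ratio = 1.0f;   /* min(1, p(x) / q(x)) */
    return u < ratio;
}

/* Hypothetical helper: fills out[] with the residual distribution
 * max(0, p - q), normalized to sum to 1 when it has any mass.
 * Returns the residual mass before normalization. */
float residual_distribution(const float *p, const float *q,
                            float *out, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = p[i] - q[i];
        out[i] = d > 0.0f ? d : 0.0f;
        sum += out[i];
    }
    if (sum > 0.0f) {
        for (size_t i = 0; i < n; i++) out[i] /= sum;
    }
    return sum;
}
```

With this rule the accepted and resampled tokens together follow the base model's distribution, which is why the draft model can speed things up without changing what gets sampled.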

[image: specdec llama]

In short, I modified llama2.c's run.c to support forwarding multiple tokens, then implemented speculative_decoding.c on top of it.

Special thanks to:

@karpathy for providing llama2.c as a starting point and inspiration for this project

  • llama2.c/run.c was copied along with license notations to this project.

@ggerganov for writing llama.cpp, where I first got the opportunity to study and write speculative-decoding code

How to use

  1. download base/draft models into ./models
wget -P ./models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget -P ./models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
  2. build and run
make && ./speculative_decoding -m ./models/stories42M.bin -d ./models/stories15M.bin -n 256 -i "Once upon a time"

example output (image), with tokens colored as follows:

  • orange text: accepted draft model tokens
  • black text: base model tokens
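
The two colors correspond to the two sources a token can come from in one speculative step. A minimal sketch of such a step, assuming greedy (argmax) decoding and assuming the base model has already been forwarded over all k draft tokens at once, could look like the following. The function `verify_greedy` and its parameters are hypothetical and are not this repository's API.

```c
#include <assert.h>

/* Hypothetical greedy verification step (assumption: argmax decoding).
 * draft:      k tokens proposed by the draft model
 * target_top: k + 1 argmax tokens from the base model; target_top[i] is
 *             the base model's choice for the position draft[i] occupies,
 *             and target_top[k] is its choice after all k draft tokens
 * out:        receives the tokens to emit (capacity k + 1)
 * from_draft: parallel to out; 1 = accepted draft token, 0 = base token
 * Returns the number of tokens written to out (between 1 and k + 1). */
int verify_greedy(const int *draft, const int *target_top, int k,
                  int *out, int *from_draft) {
    assert(k >= 0);
    int n = 0;
    /* Accept draft tokens while they match the base model's choice. */
    while (n < k && draft[n] == target_top[n]) {
        out[n] = draft[n];
        from_draft[n] = 1;
        n++;
    }
    /* The first mismatch (or the bonus position) comes from the base model. */
    out[n] = target_top[n];
    from_draft[n] = 0;
    return n + 1;
}
```

Every step emits at least one base model token, so output never stalls even when all draft tokens are rejected.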

Meta llama2 models:

To use Meta's Llama 2 models, follow the instructions in the llama2.c repository.

References

@inproceedings{leviathan2023fast,
  title={Fast inference from transformers via speculative decoding},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  booktitle={International Conference on Machine Learning},
  pages={19274--19286},
  year={2023},
  organization={PMLR}
}

Known issues

  • Generation length is limited by the draft model's maximum sequence length, so long generations are not possible when the draft model has a short maximum sequence length.
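
To make the limit concrete, here is a small sketch of how the usable number of generation steps could be computed, assuming each model exposes a maximum sequence length (llama2.c's Config has a `seq_len` field for this). The helper `max_generation_steps` is hypothetical, and treating a non-positive request as "use the limit" is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical helper: the number of positions that can be generated is
 * bounded by the requested count and by both models' context lengths.
 * Assumption: requested <= 0 means "generate up to the limit". */
int max_generation_steps(int requested, int base_seq_len, int draft_seq_len) {
    assert(base_seq_len > 0 && draft_seq_len > 0);
    int limit = base_seq_len < draft_seq_len ? base_seq_len : draft_seq_len;
    if (requested <= 0 || requested > limit) return limit;
    return requested;
}
```

For example, a request for 512 steps with a base model limit of 1024 and a draft model limit of 256 would be clamped to 256.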

License

MIT

I added the original copyright notice to the copied run.c file. Please let me know if I made any mistakes with the licensing.

ETC

Any sort of feedback is very welcome :)

More speculative-decoding related C implementations are to come!

I'm thinking of https://github.com/SafeAILab/EAGLE next.
