flash-attention-cuda

Final Project for CPSC 524 Parallel Programming. Danqi Liao.

This is my CUDA C implementation of the Flash Attention paper. Specifically, I focus on the forward pass of the attention mechanism without multi-head attention. This is a work in progress, and more features will be added in the future.

For now, I have implemented the following:

  • CPU implementation of the attention mechanism (a reference sketch follows this list)
  • GPU naive implementation of the attention mechanism
  • Forward pass of Flash Attention without multi-head attention
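
For reference, here is a minimal CPU sketch of what the attention forward pass computes, O = softmax(Q K^T / sqrt(d)) V. The function and variable names are illustrative only and are not taken from this repo's code:

#include <math.h>
#include <stdlib.h>

/* O = softmax(Q K^T / sqrt(d)) V
   Q, K, V, O are N x d row-major matrices. */
void attention_forward_cpu(const float *Q, const float *K, const float *V,
                           float *O, int N, int d) {
    float *scores = malloc(N * sizeof(float));
    float scale = 1.0f / sqrtf((float)d);
    for (int i = 0; i < N; i++) {
        /* scores[j] = (Q[i] . K[j]) / sqrt(d), tracking the row max
           for a numerically stable softmax */
        float max_s = -INFINITY;
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k];
            scores[j] = s * scale;
            if (scores[j] > max_s) max_s = scores[j];
        }
        /* exponentiate shifted scores and accumulate the normalizer */
        float sum = 0.0f;
        for (int j = 0; j < N; j++) {
            scores[j] = expf(scores[j] - max_s);
            sum += scores[j];
        }
        /* O[i] = sum_j softmax(scores)[j] * V[j]; dividing by sum
           applies the softmax normalization */
        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j < N; j++)
                acc += scores[j] * V[j * d + k];
            O[i * d + k] = acc / sum;
        }
    }
    free(scores);
}

The naive GPU implementation parallelizes essentially this computation. The Flash Attention kernel instead tiles Q, K, and V into blocks that fit in shared memory and maintains a running (online) softmax, so the full N x N score matrix is never materialized in global memory.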

To do (outside of the scope of this project):

  • Backward pass of Flash Attention without multi-head attention
  • Multi-head attention
  • Options for masking, dropout, etc.
  • Integration with PyTorch

Run scripts

(Each GPU attention implementation is checked against the CPU implementation for errors; you can comment out the CPU code if you don't want to run it.)

sbatch run-standard.sh # naive GPU implementation
sbatch run-flash.sh # Flash Attention forward pass
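
The run scripts are SLURM batch scripts. As a rough sketch of what such a script might contain (the job name, module name, and file names below are placeholders, not the repo's actual contents):

#!/bin/bash
#SBATCH --job-name=flash-attn
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

module load CUDA                         # module name is cluster-specific
nvcc -O3 -o flash flash_attention.cu     # source file name is a placeholder
./flash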
