Final Project for CPSC 524 Parallel Programming. Danqi Liao.
This is my CUDA C implementation of the Flash Attention paper. Specifically, I focus on the forward pass of the attention mechanism without multi-head attention. This is a work in progress, and more features will be added over time.
- CPU implementation of the attention mechanism (a reference sketch follows this list)
- GPU naive implementation of the attention mechanism
- Forward pass of Flash Attention without multi-head attention (see the tiling sketch at the end of this section)
- Backward pass of Flash Attention without multi-head attention
- Multi-head attention
- Options for masking, dropout, etc.
- Integration with PyTorch
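As a point of reference for the list above, here is a minimal sketch of what a plain single-head attention forward pass looks like on the CPU, i.e. O = softmax(QK^T / sqrt(d)) V. The function name, the row-major N x d layout, and the single-head assumption are illustrative only; this is not the exact code in this repo.

```c
/* Hypothetical CPU reference for single-head attention: O = softmax(Q K^T / sqrt(d)) V.
 * Q, K, V, O are row-major N x d matrices; names and layout are assumptions. */
#include <math.h>
#include <stdlib.h>

void attention_forward_cpu(const float *Q, const float *K, const float *V,
                           float *O, int N, int d) {
    float *scores = malloc((size_t)N * N * sizeof(float));
    float scale = 1.0f / sqrtf((float)d);

    /* scores = Q K^T / sqrt(d) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k];
            scores[i * N + j] = s * scale;
        }

    /* row-wise softmax, then O = softmax(scores) V */
    for (int i = 0; i < N; i++) {
        float row_max = scores[i * N + 0];
        for (int j = 1; j < N; j++)
            if (scores[i * N + j] > row_max) row_max = scores[i * N + j];

        float row_sum = 0.0f;
        for (int j = 0; j < N; j++) {
            scores[i * N + j] = expf(scores[i * N + j] - row_max);
            row_sum += scores[i * N + j];
        }

        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j < N; j++)
                acc += scores[i * N + j] * V[j * d + k];
            O[i * d + k] = acc / row_sum;
        }
    }
    free(scores);
}
```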
(Each GPU attention implementation is checked against the CPU implementation for correctness; you can comment out the CPU code if you don't want to run it.)
sbatch run-standard.sh # naive GPU implementation
sbatch run-flash.sh # forward flash attention
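For context on what the flash forward pass does differently from the naive version, below is a minimal single-threaded sketch in plain C (not the CUDA kernel in this repo) of the online-softmax tiling idea from the Flash Attention paper: keys and values are visited in tiles, and each query row keeps a running max, a running softmax denominator, and a rescaled output accumulator, so the full N x N score matrix is never materialized. The tile size `Bc`, the function name, and the row-major layout are illustrative assumptions.

```c
/* Illustrative single-threaded sketch of the FlashAttention forward-pass idea
 * (online softmax over key/value tiles). Not the repo's CUDA kernel. */
#include <math.h>
#include <stdlib.h>
#include <string.h>

void flash_attention_forward_sketch(const float *Q, const float *K, const float *V,
                                    float *O, int N, int d, int Bc) {
    float scale = 1.0f / sqrtf((float)d);
    float *acc = malloc((size_t)d * sizeof(float));  /* running output for one query row */

    for (int i = 0; i < N; i++) {                    /* one query row at a time */
        float m = -INFINITY;                         /* running row max */
        float l = 0.0f;                              /* running softmax denominator */
        memset(acc, 0, (size_t)d * sizeof(float));

        for (int j0 = 0; j0 < N; j0 += Bc) {         /* one key/value tile per iteration */
            int j1 = (j0 + Bc < N) ? j0 + Bc : N;

            /* pass 1 over the tile: update the running max */
            float m_new = m;
            for (int j = j0; j < j1; j++) {
                float s = 0.0f;
                for (int k = 0; k < d; k++)
                    s += Q[i * d + k] * K[j * d + k];
                s *= scale;
                if (s > m_new) m_new = s;
            }

            /* rescale everything accumulated so far to the new max */
            float correction = expf(m - m_new);
            l *= correction;
            for (int k = 0; k < d; k++)
                acc[k] *= correction;

            /* pass 2 over the tile: accumulate exp(score - m_new) and its V-weighted sum */
            for (int j = j0; j < j1; j++) {
                float s = 0.0f;
                for (int k = 0; k < d; k++)
                    s += Q[i * d + k] * K[j * d + k];
                float p = expf(s * scale - m_new);
                l += p;
                for (int k = 0; k < d; k++)
                    acc[k] += p * V[j * d + k];
            }
            m = m_new;
        }

        /* final normalization: O_i = acc / l */
        for (int k = 0; k < d; k++)
            O[i * d + k] = acc[k] / l;
    }
    free(acc);
}
```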