Final Project for CPSC 524 Parallel Programming. Danqi Liao.
This is my CUDA C implementation of the Flash Attention paper. Specifically, I focus on the forward pass of the attention mechanism without multi-head attention. This is a work in progress, and more features will be added over time.
- CPU implementation of the attention mechanism (a reference sketch follows this list)
- GPU naive implementation of the attention mechanism
- Forward pass of Flash Attention without multi-head attention (see the tiling sketch at the end of this section)
- Backward pass of Flash Attention without multi-head attention
- Multi-head attention
- Options for masking, dropout, etc.
- Integration with PyTorch
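As a point of reference for the list above, here is a minimal sketch of what a plain single-head attention forward pass looks like on the CPU, i.e. O = softmax(QK^T / sqrt(d)) V. The function name, the row-major N x d layout, and the single-head assumption are illustrative only; this is not the exact code in this repo.

```c
/* Hypothetical CPU reference for single-head attention: O = softmax(Q K^T / sqrt(d)) V.
 * Q, K, V, O are row-major N x d matrices; names and layout are assumptions. */
#include <math.h>
#include <stdlib.h>

void attention_forward_cpu(const float *Q, const float *K, const float *V,
                           float *O, int N, int d) {
    float *scores = malloc((size_t)N * N * sizeof(float));
    float scale = 1.0f / sqrtf((float)d);

    /* scores = Q K^T / sqrt(d) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k];
            scores[i * N + j] = s * scale;
        }

    /* row-wise softmax, then O = softmax(scores) V */
    for (int i = 0; i < N; i++) {
        float row_max = scores[i * N + 0];
        for (int j = 1; j < N; j++)
            if (scores[i * N + j] > row_max) row_max = scores[i * N + j];

        float row_sum = 0.0f;
        for (int j = 0; j < N; j++) {
            scores[i * N + j] = expf(scores[i * N + j] - row_max);
            row_sum += scores[i * N + j];
        }

        for (int k = 0; k < d; k++) {
            float acc = 0.0f;
            for (int j = 0; j < N; j++)
                acc += scores[i * N + j] * V[j * d + k];
            O[i * d + k] = acc / row_sum;
        }
    }
    free(scores);
}
```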
(Each GPU attention implementation is checked against the CPU implementation for correctness; you can comment out the CPU code if you don't want to run it.)
sbatch run-standard.sh # naive GPU implementation
sbatch run-flash.sh # forward flash attention
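For context on what the flash forward pass does differently from the naive version, below is a minimal single-threaded sketch in plain C (not the CUDA kernel in this repo) of the online-softmax tiling idea from the Flash Attention paper: keys and values are visited in tiles, and each query row keeps a running max, a running softmax denominator, and a rescaled output accumulator, so the full N x N score matrix is never materialized. The tile size `Bc`, the function name, and the row-major layout are illustrative assumptions.

```c
/* Illustrative single-threaded sketch of the FlashAttention forward-pass idea
 * (online softmax over key/value tiles). Not the repo's CUDA kernel. */
#include <math.h>
#include <stdlib.h>
#include <string.h>

void flash_attention_forward_sketch(const float *Q, const float *K, const float *V,
                                    float *O, int N, int d, int Bc) {
    float scale = 1.0f / sqrtf((float)d);
    float *acc = malloc((size_t)d * sizeof(float));  /* running output for one query row */

    for (int i = 0; i < N; i++) {                    /* one query row at a time */
        float m = -INFINITY;                         /* running row max */
        float l = 0.0f;                              /* running softmax denominator */
        memset(acc, 0, (size_t)d * sizeof(float));

        for (int j0 = 0; j0 < N; j0 += Bc) {         /* one key/value tile per iteration */
            int j1 = (j0 + Bc < N) ? j0 + Bc : N;

            /* pass 1 over the tile: update the running max */
            float m_new = m;
            for (int j = j0; j < j1; j++) {
                float s = 0.0f;
                for (int k = 0; k < d; k++)
                    s += Q[i * d + k] * K[j * d + k];
                s *= scale;
                if (s > m_new) m_new = s;
            }

            /* rescale everything accumulated so far to the new max */
            float correction = expf(m - m_new);
            l *= correction;
            for (int k = 0; k < d; k++)
                acc[k] *= correction;

            /* pass 2 over the tile: accumulate exp(score - m_new) and its V-weighted sum */
            for (int j = j0; j < j1; j++) {
                float s = 0.0f;
                for (int k = 0; k < d; k++)
                    s += Q[i * d + k] * K[j * d + k];
                float p = expf(s * scale - m_new);
                l += p;
                for (int k = 0; k < d; k++)
                    acc[k] += p * V[j * d + k];
            }
            m = m_new;
        }

        /* final normalization: O_i = acc / l */
        for (int k = 0; k < d; k++)
            O[i * d + k] = acc[k] / l;
    }
    free(acc);
}
```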