- This is not supposed to prove that you can derive gradients and do backprop from scratch.
- The main objectives of this project are:
  - Get used to CUDA with C++.
  - Use as much of the NVIDIA ecosystem as possible.
First we implement a minimal forward pass, then build on it:

- Forward pass, with correctness checks (sketches after this list):
  - Tiled matrix multiplication
  - Softmax implementation
  - Cross-entropy loss (plus the reduction pattern)
- Backpropagation
- Add one intermediate layer.
- Optimize, Optimize, Optimize.
- MLP to ...?
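
For reference, a minimal sketch of the tiled multiply: shared-memory tiles of A and B, one output element per thread. TILE, the row-major layout, and every name here are illustrative assumptions rather than code from this repo.

```cuda
// Tiled matmul sketch: C = A * B, row-major floats. TILE and all names are
// illustrative assumptions.
#define TILE 16

__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    // Shared-memory tiles reused by the whole thread block.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread owns
    float acc = 0.0f;

    // March over the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                         // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         // done reading this tile
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

A natural launch is one TILE x TILE thread block per TILE x TILE patch of C, e.g. `dim3 block(TILE, TILE); dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);`.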
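
A sketch of a row-wise softmax, again with illustrative names: one block per row, and two block-wide tree reductions (max for numerical stability, then the sum of exponentials). This is also the reduction pattern the cross-entropy step reuses.

```cuda
#include <cfloat>

// Row-wise softmax sketch over a [rows x cols] logits matrix, one block per row.
// Assumes blockDim.x is a power of two; names and layout are illustrative.
__global__ void softmax_rows(const float* logits, float* probs, int cols) {
    extern __shared__ float shm[];          // one float per thread (sized at launch)
    int row = blockIdx.x;
    int tid = threadIdx.x;
    const float* in = logits + row * cols;
    float* out      = probs  + row * cols;

    // 1) Block-reduce the row maximum (numerical stability).
    float m = -FLT_MAX;
    for (int c = tid; c < cols; c += blockDim.x)
        m = fmaxf(m, in[c]);
    shm[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shm[tid] = fmaxf(shm[tid], shm[tid + s]);
        __syncthreads();
    }
    float row_max = shm[0];
    __syncthreads();                        // shm is reused below

    // 2) Block-reduce the sum of exp(x - max).
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x)
        sum += expf(in[c] - row_max);
    shm[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shm[tid] += shm[tid + s];
        __syncthreads();
    }
    float row_sum = shm[0];

    // 3) Normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        out[c] = expf(in[c] - row_max) / row_sum;
}
```

Launch with one block per row and the shared buffer sized to the block, e.g. `softmax_rows<<<rows, 256, 256 * sizeof(float)>>>(d_logits, d_probs, cols);`.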
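
And a sketch of the cross-entropy reduction, with the same caveat that names and layout are assumptions: each thread computes one sample's negative log-likelihood, the block tree-reduces the partial sums, and one atomicAdd per block accumulates them into a single scalar.

```cuda
// Cross-entropy sketch: per-sample -log(probs[label]) plus a block reduction.
// Assumes blockDim.x is a power of two; probs is [batch x classes], row-major.
__global__ void cross_entropy_sum(const float* probs, const int* labels,
                                  float* loss, int batch, int classes) {
    extern __shared__ float shm[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-sample negative log-likelihood, 0 for out-of-range threads.
    float l = 0.0f;
    if (i < batch)
        l = -logf(fmaxf(probs[i * classes + labels[i]], 1e-12f));
    shm[tid] = l;
    __syncthreads();

    // Standard tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shm[tid] += shm[tid + s];
        __syncthreads();
    }

    // One atomic per block folds the partial sum into the global loss.
    if (tid == 0) atomicAdd(loss, shm[0]);
}
```

Launch with `(batch + 255) / 256` blocks of 256 threads and `256 * sizeof(float)` of dynamic shared memory; divide the accumulated sum by `batch` on the host to get the mean loss.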
A few small side trips:
- Write a mini blog post on softmax and compare it against cuDNN.
- Profile the code with NVIDIA Nsight (Nsight Systems / Nsight Compute).

The ultimate goal is to code something interesting, e.g., flash attention. If not code it, then at least appreciate the intricacies of such high-level implementations.
- When adding the bias, there is no need for shared memory (see the sketch below).
- But it's cool: you've now seen a case where `extern __shared__ ...` is used.
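
A minimal sketch of both points, assuming a row-major [rows x cols] activation and a per-column bias (all names here are illustrative): the plain version needs no shared memory, and the second kernel exists only to show what `extern __shared__` with a launch-time size looks like.

```cuda
// Bias add without shared memory: one thread per element; the bias vector is
// small and well cached, so plain global loads are enough.
__global__ void add_bias(float* act, const float* bias, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // flat element index
    if (i < rows * cols)
        act[i] += bias[i % cols];                    // column index selects the bias entry
}

// Same op with dynamic shared memory, just to show the syntax. The array size
// is not known at compile time; it comes from the third launch parameter:
//   add_bias_shared<<<grid, block, cols * sizeof(float)>>>(act, bias, rows, cols);
__global__ void add_bias_shared(float* act, const float* bias, int rows, int cols) {
    extern __shared__ float s_bias[];                // sized at launch time
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        s_bias[c] = bias[c];                         // stage the bias once per block
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols)
        act[i] += s_bias[i % cols];
}
```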