[Proposal] Optionally use flash attention. #378
Comments
Seems reasonable to me; I'd be happy for someone to add this.
Seems very useful for sparse autoencoder training. Docs are at https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#conclusion in case anyone wants to take this (I'll pick it up at some point if no one does).
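For anyone picking this up, here is a minimal sketch of the fused path the linked tutorial covers, assuming a PyTorch >= 2.0 build with the flash kernel available for the device/dtype. The tensor names and shapes are illustrative, the backend-selection context manager is the 2.0/2.1-era API, and none of this is TransformerLens code:

```python
# Sketch only: PyTorch's fused scaled_dot_product_attention, restricted to the
# flash backend as described in the linked tutorial. Assumes a CUDA device and
# a dtype (fp16/bf16) supported by the flash kernel.
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, d_head = 2, 12, 128, 64
q = torch.randn(batch, n_heads, seq_len, d_head, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force the flash kernel (this raises if it can't be used); newer PyTorch
# releases expose an equivalent selector under torch.nn.attention.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 12, 128, 64])
```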
I'd be quite keen to make a start on this soon. @alan-cooney, have you made a start already?
I haven't yet, so please feel free to!
It would be nice to have a flag to enable flash attention in models where that would make sense. This helps performance and memory usage in larger models: in my case, working with Pythia 12B, I get ~50% better performance and ~4x larger batch sizes with flash attention. I also find numerical stability in float16 is better with flash attention, probably because the model was trained with flash attention.
The downside of using flash attention in TransformerLens is that we would not have access to intermediate quantities in the attention calculation, such as the attention matrix itself. This is why I would suggest a default-off flag, so that users can choose whether they need those intermediate values to be available. In addition, when only a small subset of attention intermediates is needed, it is much faster to cache the input to the attention layer (or the residual stream) and recompute those intermediates when needed.
Thanks!
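A rough sketch of how the proposed default-off flag could gate between the explicit attention computation (which keeps the attention pattern around for hooks and caching) and the fused kernel. The flag name, module layout, and shapes here are hypothetical, not TransformerLens's actual API:

```python
# Hypothetical illustration of the proposal; not TransformerLens code.
import torch
import torch.nn.functional as F

class SketchAttention(torch.nn.Module):
    def __init__(self, d_head: int, use_flash_attention: bool = False):
        super().__init__()
        self.d_head = d_head
        # Proposed flag, off by default so intermediates stay hookable.
        self.use_flash_attention = use_flash_attention

    def forward(self, q, k, v):  # each [batch, n_heads, seq, d_head]
        if self.use_flash_attention:
            # Fused kernel: faster and far more memory-efficient, but the
            # attention scores/pattern are never materialised, so they can't
            # be cached or hooked.
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Explicit path: materialises the attention matrix so the usual
        # intermediate quantities remain available.
        seq = q.size(-2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        causal = torch.tril(torch.ones(seq, seq, device=q.device)).bool()
        scores = scores.masked_fill(~causal, torch.finfo(scores.dtype).min)
        pattern = scores.softmax(dim=-1)
        return pattern @ v
```

When the flag is on, intermediates like the attention pattern could still be recovered on demand by caching the layer input (or the residual stream) and re-running only the explicit path where they are actually needed, as suggested above.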