Hey -
first of all, thank you for your inspiring research.
There's a lot of work on making self-attention efficient, especially as the sequence length grows.
It seems to me that under a kWTA assumption you could skip the vast majority of the calculations, because of the inherent extreme sparsity.
The best part is that it would be complementary to many of the linear-complexity attention methods that are coming out. There's a rough sketch of the idea below.
Are you experimenting with something like that?
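A minimal sketch of what I mean (names and shapes are my own, assuming standard scaled dot-product attention; the dense score matrix is still materialised here for clarity, so the real savings would need a sparse kernel that never computes the dropped entries):

```python
import torch
import torch.nn.functional as F

def kwta_attention(q, k, v, sparsity=0.9):
    """q, k, v: (batch, heads, seq_len, head_dim).
    Keep only the top (1 - sparsity) fraction of scores per query row
    and give everything else zero attention weight."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (B, H, L, L)
    keep = max(1, int(scores.size(-1) * (1.0 - sparsity)))   # winners per row
    topk_vals, topk_idx = scores.topk(keep, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)                 # losers -> -inf
    attn = F.softmax(masked, dim=-1)                         # losers get weight 0
    return attn @ v
```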
Regards,
Dan
FYI, I did a quick PoC on CIFAR-10 with a small ViT, trained with and without kWTA (90% sparsity), and the kWTA actually behaved a bit like a regularizer (slightly higher max validation accuracy, but slower convergence).
So it looks like this definitely has potential. My team and I may look further into it if you want to collaborate on a paper or something.
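For reference, by the kWTA op at 90% sparsity I mean something along these lines (a simplified sketch that keeps the top 10% of activations per vector; names are mine, not the exact PoC code):

```python
import torch
import torch.nn as nn

class KWTA(nn.Module):
    """k-winners-take-all: keep the k largest activations per feature
    vector and zero out the rest, with k = (1 - sparsity) * num_features."""

    def __init__(self, sparsity: float = 0.9):
        super().__init__()
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., features)
        k = max(1, int(x.size(-1) * (1.0 - self.sparsity)))
        topk_vals, topk_idx = x.topk(k, dim=-1)
        out = torch.zeros_like(x)
        return out.scatter_(-1, topk_idx, topk_vals)
```

Where exactly this sits in the block (e.g. after the MLP activation vs. on the attention scores themselves) is obviously one of the things worth exploring.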