In the only layers that use self-attention blocks, i.e. the last two layers, you have set the window size equal to the spatial size. Doesn't that mean you are not really computing self-attention, and that there is only a single token?
Please correct me if I am wrong, as this seems perplexing.
- Feature map size is 14 × 14
- Window size is also 14
- Since the window size equals the feature map size, attention is computed globally across the entire feature map

Therefore, the number of tokens in stage 3 is 14 × 14 = 196: each position in the 14 × 14 feature map becomes a token. This way of computing attention is similar to ViT-style global attention without local windows (see the sketch below).
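Here is a minimal sketch (not the repo's actual code; the `window_partition` helper and the stage-3 channel count are assumptions for illustration) showing that a window size equal to the 14 × 14 feature map yields a single window of 196 tokens, so self-attention runs over all 196 positions rather than a single token:

```python
# Minimal sketch (hypothetical helper, not the repo's code): when window_size
# equals the feature map size, window attention reduces to global attention.
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

B, H, W, C = 1, 14, 14, 96            # stage-3 feature map; channel count is assumed
x = torch.randn(B, H, W, C)

windows = window_partition(x, window_size=14)
print(windows.shape)                  # torch.Size([1, 196, 96]) -> one window, 196 tokens

# Attention inside that single window spans all 196 positions,
# i.e. it is global (ViT-style) attention, not attention over one token.
attn = torch.softmax(windows @ windows.transpose(-2, -1) / C ** 0.5, dim=-1)
print(attn.shape)                     # torch.Size([1, 196, 196])
```

The 196 × 196 attention map confirms that every spatial position attends to every other position; a single token would instead give a 1 × 1 map.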