In the only layers that use self-attention blocks, i.e. the last two layers, you have set the window size equal to the spatial size. Doesn't that mean you are not really computing self-attention, and that there is only a single token?
Please correct me if I am wrong, as this seems perplexing.
- Feature map size is 14 × 14
- Window size is also 14
- Since the window size equals the feature map size, attention is computed globally across the entire feature map

Therefore, the number of tokens in stage 3 is 14 × 14 = 196: each position in the 14 × 14 feature map becomes a token. This way of computing attention is similar to ViT-style global attention without local windows (see the sketch below).
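Here is a minimal sketch (not the repo's actual code; the `window_partition` helper and the stage-3 channel count are assumptions for illustration) showing that a window size equal to the 14 × 14 feature map yields a single window of 196 tokens, so self-attention runs over all 196 positions rather than a single token:

```python
# Minimal sketch (hypothetical helper, not the repo's code): when window_size
# equals the feature map size, window attention reduces to global attention.
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

B, H, W, C = 1, 14, 14, 96            # stage-3 feature map; channel count is assumed
x = torch.randn(B, H, W, C)

windows = window_partition(x, window_size=14)
print(windows.shape)                  # torch.Size([1, 196, 96]) -> one window, 196 tokens

# Attention inside that single window spans all 196 positions,
# i.e. it is global (ViT-style) attention, not attention over one token.
attn = torch.softmax(windows @ windows.transpose(-2, -1) / C ** 0.5, dim=-1)
print(attn.shape)                     # torch.Size([1, 196, 196])
```

The 196 × 196 attention map confirms that every spatial position attends to every other position; a single token would instead give a 1 × 1 map.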