Hi everyone,
Has anyone conducted a performance comparison of pre-trained Vision Transformer (ViT) models when applied to images of different resolutions? Or does anyone have experience to share in this regard?
Specifically, I'm curious about the effectiveness of two approaches:
1. **Applying the model with interpolation** – for instance, taking a ViT trained on 224x224 images with 14x14 patches and applying it to 256x256 images with 16x16 patches by setting `image_size=256` and `patch_size=16` when creating the model (a rough sketch of both approaches follows below).
2. **Resizing input images to the original training resolution** – downscaling the input to 224x224 before feeding it into the model, and upsampling the predictions if needed (e.g., in tasks like semantic segmentation where the ViT acts as the encoder in an encoder-decoder network).
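For concreteness, here is a minimal PyTorch sketch of what I mean by each approach. The state-dict key names (`patch_embed.proj.weight`, `pos_embed`) and the `predict_at_224` helper are just assumptions for illustration; the exact layout and kwargs depend on the ViT implementation you use.

```python
import torch
import torch.nn.functional as F

# Approach 1 (sketch): adapt a 224/patch-14 checkpoint to 256/patch-16.
# Assumes a standard ViT state dict with "patch_embed.proj.weight"
# of shape (D, 3, 14, 14) and "pos_embed" of shape (1, 1 + N, D);
# key names differ between libraries, so adjust accordingly.
def adapt_checkpoint(state_dict, new_patch=16, new_grid=16):
    sd = dict(state_dict)

    # Resample the patch-embedding kernel from 14x14 to 16x16.
    w = sd["patch_embed.proj.weight"]
    sd["patch_embed.proj.weight"] = F.interpolate(
        w, size=(new_patch, new_patch), mode="bicubic", align_corners=False
    )

    # Resample position embeddings only if the token grid changes
    # (here 224/14 == 256/16 == 16, so the grid stays the same).
    pos = sd["pos_embed"]                       # (1, 1 + old_grid**2, D)
    cls_tok, grid = pos[:, :1], pos[:, 1:]
    old_grid = int(grid.shape[1] ** 0.5)
    if old_grid != new_grid:
        grid = grid.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    sd["pos_embed"] = torch.cat([cls_tok, grid], dim=1)
    return sd

# Approach 2 (sketch): keep the model at its native 224x224 resolution
# and resize around it; `model` is any segmentation network whose ViT
# encoder expects 224x224 inputs.
def predict_at_224(model, images_256):
    x = F.interpolate(images_256, size=(224, 224),
                      mode="bilinear", align_corners=False)
    logits = model(x)                           # predictions at 224 scale
    return F.interpolate(logits, size=images_256.shape[-2:],
                         mode="bilinear", align_corners=False)
```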
I am using approach 1 at the moment, but I'm wondering how the two compare in terms of performance.
Thanks in advance for any insights or experiences you can share!