Hi everyone,
Has anyone conducted a performance comparison of pre-trained Vision Transformer (ViT) models when applied to images of different resolutions? Or does anyone have experience to share in this regard?
Specifically, I'm curious about the effectiveness of two approaches:
1. **Applying the model with interpolation** – for instance, taking a ViT trained on 224x224 images with 14x14 patches and applying it to 256x256 images with 16x16 patches by setting `image_size=256` and `patch_size=16` when creating the model (a rough sketch of both approaches follows below).
2. **Resizing input images to the original training resolution** – downscaling the input to 224x224 before feeding it into the model, and upsampling the predictions if needed (e.g., in tasks like semantic segmentation where the ViT acts as the encoder in an encoder-decoder network).
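For concreteness, here is a minimal PyTorch sketch of what I mean by each approach. The state-dict key names (`patch_embed.proj.weight`, `pos_embed`) and the `predict_at_224` helper are just assumptions for illustration; the exact layout and kwargs depend on the ViT implementation you use.

```python
import torch
import torch.nn.functional as F

# Approach 1 (sketch): adapt a 224/patch-14 checkpoint to 256/patch-16.
# Assumes a standard ViT state dict with "patch_embed.proj.weight"
# of shape (D, 3, 14, 14) and "pos_embed" of shape (1, 1 + N, D);
# key names differ between libraries, so adjust accordingly.
def adapt_checkpoint(state_dict, new_patch=16, new_grid=16):
    sd = dict(state_dict)

    # Resample the patch-embedding kernel from 14x14 to 16x16.
    w = sd["patch_embed.proj.weight"]
    sd["patch_embed.proj.weight"] = F.interpolate(
        w, size=(new_patch, new_patch), mode="bicubic", align_corners=False
    )

    # Resample position embeddings only if the token grid changes
    # (here 224/14 == 256/16 == 16, so the grid stays the same).
    pos = sd["pos_embed"]                       # (1, 1 + old_grid**2, D)
    cls_tok, grid = pos[:, :1], pos[:, 1:]
    old_grid = int(grid.shape[1] ** 0.5)
    if old_grid != new_grid:
        grid = grid.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    sd["pos_embed"] = torch.cat([cls_tok, grid], dim=1)
    return sd

# Approach 2 (sketch): keep the model at its native 224x224 resolution
# and resize around it; `model` is any segmentation network whose ViT
# encoder expects 224x224 inputs.
def predict_at_224(model, images_256):
    x = F.interpolate(images_256, size=(224, 224),
                      mode="bilinear", align_corners=False)
    logits = model(x)                           # predictions at 224 scale
    return F.interpolate(logits, size=images_256.shape[-2:],
                         mode="bilinear", align_corners=False)
```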
I am using approach 1 at the moment, but I'm wondering how the two compare in terms of performance.
Thanks in advance for any insights or experiences you can share!