Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment
This is the official implementation of "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment" (IJCAI-ECAI 2022, Short).
The paper is available at IJCAI-ECAI 2022 (main text only) and arXiv (main text and appendix).
- GPU: 1 x A100 (40 GB memory)
- Disk space: about 300 GB
- Python==3.7.13
- torch==1.9.0
- torchvision==0.10.0
timm library
- For ViT
pip install timm==0.4.9
- For the other models (ViT-AugReg, DeiT, MLP-Mixer, BeiT)
pip install git+https://github.com/rwightman/pytorch-image-models@more_datasets # 0.5.0
pip install -r requirements.txt
Download each dataset and unzip it under the following directories.
- ImageNet-2012 (as Source)
./datasets/imagenet2012/train
./datasets/imagenet2012/val
- ImageNet-C (as Target)
./datasets/imagenet2012/val_c
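Before running anything, it can help to confirm that the directory layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and it only checks that the three directories exist, not that their contents are complete.

```python
from pathlib import Path

# Directory layout copied from the list above, relative to the repository root.
EXPECTED_DIRS = [
    "datasets/imagenet2012/train",
    "datasets/imagenet2012/val",
    "datasets/imagenet2012/val_c",
]


def missing_dataset_dirs(root="."):
    """Return the expected dataset directories that do not exist under root.

    Hypothetical helper for illustration; it does not inspect the contents.
    """
    return [d for d in EXPECTED_DIRS if not (Path(root) / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All dataset directories found.")
```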
model={'ViT-B_16', 'ViT-L_16', 'ViT_AugReg-B_16', 'ViT_AugReg-L_16', 'resnet50', 'resnet101', 'mlpmixer_B16', 'mlpmixer_L16', 'DeiT-B', 'DeiT-S', 'Beit-B16_224', 'Beit-L16_224'}
method={'cfa', 't3a', 'shot-im', 'tent', 'pl', 'source'}
Our method does not alter the training phase, i.e., it does not require retraining models from scratch. Therefore, if a fine-tuned model is available, the fine-tuning phase can be skipped. This implementation uses models that are already fine-tuned on the ImageNet-2012 dataset.
python main.py --calc_statistics_flag --model=${model} --method=${method}
python main.py --tta_flag --model=${model} --method=${method}
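The two commands above can be assembled programmatically when sweeping over several models or methods. The sketch below is an illustration only: `build_commands` is a hypothetical helper, it uses just the flags and the model/method names listed in this README, and it assumes both commands are run in the order shown for every method. It builds the argument lists without executing them.

```python
# Model and method names copied from the lists in this README.
MODELS = [
    "ViT-B_16", "ViT-L_16", "ViT_AugReg-B_16", "ViT_AugReg-L_16",
    "resnet50", "resnet101", "mlpmixer_B16", "mlpmixer_L16",
    "DeiT-B", "DeiT-S", "Beit-B16_224", "Beit-L16_224",
]
METHODS = ["cfa", "t3a", "shot-im", "tent", "pl", "source"]


def build_commands(model, method):
    """Return the two main.py invocations as argument lists.

    Hypothetical helper: the first command calculates statistics, the second
    runs test-time adaptation, mirroring the order used in this README.
    """
    if model not in MODELS:
        raise ValueError("unknown model: " + model)
    if method not in METHODS:
        raise ValueError("unknown method: " + method)
    common = ["--model=" + model, "--method=" + method]
    return [
        ["python", "main.py", "--calc_statistics_flag"] + common,
        ["python", "main.py", "--tta_flag"] + common,
    ]


if __name__ == "__main__":
    # Print the commands for one configuration instead of running them.
    for cmd in build_commands("ViT-B_16", "cfa"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root if the commands should actually be executed.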
Top-1 error rate (%) on ImageNet-C at severity level 5, with ViT-B_16 as the backbone network.
method | mean | gauss_noise | shot_noise | impulse_noise | defocus_blur | glass_blur | motion_blur | zoom_blur | snow | frost | fog | brightness | contrast | elastic_trans | pixelate | jpeg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
source | 61.9 | 77.7 | 75.1 | 77.0 | 66.9 | 69.1 | 58.5 | 62.8 | 60.9 | 57.6 | 62.9 | 31.6 | 88.9 | 51.9 | 45.3 | 42.9 |
CFA | 43.9 | 56.3 | 54.3 | 55.4 | 48.5 | 47.1 | 44.3 | 44.4 | 44.8 | 44.8 | 41.1 | 25.7 | 54.2 | 33.3 | 30.5 | 33.5 |
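The `mean` column is the average over the 15 corruption types. As a quick sanity check, the sketch below recomputes it from the per-corruption values in the table and finds the corruption with the largest gap between `source` and `CFA`. The function names are hypothetical and the numbers are copied from the table as shown.

```python
CORRUPTIONS = [
    "gauss_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness",
    "contrast", "elastic_trans", "pixelate", "jpeg",
]

# Per-corruption top-1 error rates (%) copied from the table above.
SOURCE = [77.7, 75.1, 77.0, 66.9, 69.1, 58.5, 62.8, 60.9,
          57.6, 62.9, 31.6, 88.9, 51.9, 45.3, 42.9]
CFA = [56.3, 54.3, 55.4, 48.5, 47.1, 44.3, 44.4, 44.8,
       44.8, 41.1, 25.7, 54.2, 33.3, 30.5, 33.5]


def mean_error(errors):
    """Average error rate over all corruption types."""
    return sum(errors) / len(errors)


def largest_improvement(before, after):
    """Return (corruption, gain) for the largest error reduction."""
    gains = [b - a for b, a in zip(before, after)]
    best = max(range(len(gains)), key=lambda i: gains[i])
    return CORRUPTIONS[best], gains[best]


if __name__ == "__main__":
    print("source mean: %.1f" % mean_error(SOURCE))
    print("CFA mean:    %.1f" % mean_error(CFA))
    name, gain = largest_improvement(SOURCE, CFA)
    print("largest improvement: %s (%.1f points)" % (name, gain))
```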
@inproceedings{kojima2022robustvit,
title = {Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment},
author = {Kojima, Takeshi and Matsuo, Yutaka and Iwasawa, Yusuke},
booktitle = {Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, {IJCAI-22}},
pages = {1009--1016},
year = {2022},
month = {7},
url = {https://doi.org/10.24963/ijcai.2022/141},
}