This folder contains the implementation of InternViT-6B for image classification, corresponding to Section 4.2.1 of our InternVL 1.0 paper. The codebase for this part is derived from InternImage, with some code referenced from EVA and DINOv2. Thanks for their great work.
In this part, we validate the visual perception capabilities of InternViT-6B, the core component of InternVL 1.0. We evaluate the quality of the visual representations produced by InternViT-6B on the ImageNet-1K dataset. Following common practice, we adopt linear probing, i.e., training a linear classifier while keeping the backbone frozen. In addition to the ImageNet-1K validation set, we also report metrics on several ImageNet variants to benchmark domain generalization.
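For a concrete picture of what linear probing involves, here is a minimal, self-contained PyTorch sketch. The stand-in backbone and the feature dimension are placeholders for illustration, not this repo's actual model or training code:

```python
import torch
import torch.nn as nn

# Minimal linear-probing sketch (illustration only, not this repo's code).
# A stand-in module plays the role of the pretrained backbone; the feature
# dimension is a placeholder.
feat_dim, num_classes = 3200, 1000
backbone = nn.Sequential(
    nn.Conv2d(3, feat_dim, kernel_size=16, stride=16),  # patchify-style stem
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

for p in backbone.parameters():
    p.requires_grad = False              # freeze the pretrained representation
backbone.eval()

head = nn.Linear(feat_dim, num_classes)  # the only trainable parameters
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)     # dummy batch
labels = torch.randint(0, num_classes, (4,))

with torch.no_grad():                    # no gradients through the frozen backbone
    feats = backbone(images)
loss = criterion(head(feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```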
InternViT-6B follows the structure of a vanilla ViT; its detailed hyperparameters are reported in the InternVL 1.0 paper.
Follow the installation guide to set up your environment.
Please prepare the dataset according to your needs.
- `ImageNet-1K`: We use the standard ImageNet dataset; you can download it from http://image-net.org/.
- `ImageNet-A`: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar.
- `ImageNet-R`: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar.
- `ImageNetV2`: Download it from https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz.
- `ImageNet-Sketch`: Download it using `gdown`:

  ```bash
  # gdown is needed to download the dataset.
  # Please install it via `pip install gdown`.
  gdown --id 1Mj0i5HBthqH1p_yeXzsg22gZduvgoNeA
  ```
First, please prepare the `ImageNet-1K`, `ImageNet-A`, `ImageNet-R`, `ImageNetV2`, and `ImageNet-Sketch` datasets following the directory structure outlined below.
```
$ tree data
data
├── imagenet-1k
│   ├── train
│   │   ├── n01498041
│   │   └── ...
│   └── val
│       ├── ILSVRC2012_val_00000001.JPEG
│       └── ...
├── imagenet-a
│   ├── n01498041
│   └── ...
├── imagenet-r
│   ├── n01443537
│   └── ...
├── imagenet-sketch
│   ├── n01440764
│   └── ...
└── imagenetv2
    └── ImageNetV2-matched-frequency
```
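Before launching anything, it can help to sanity-check this layout. The snippet below is a small convenience script (not part of this repo) that verifies the folders above exist:

```python
import os

# Quick sanity check that the expected dataset folders exist.
# The paths mirror the directory tree above; adjust `root` if needed.
root = "data"
expected = [
    "imagenet-1k/train",
    "imagenet-1k/val",
    "imagenet-a",
    "imagenet-r",
    "imagenet-sketch",
    "imagenetv2/ImageNetV2-matched-frequency",
]
for rel in expected:
    path = os.path.join(root, rel)
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:8s} {path}")
```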
Then, unzip the `train.txt.zip` and `val.txt.zip` in `meta_data/`:
```bash
cd meta_data/
unzip train.txt.zip
unzip val.txt.zip
```
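The exact format of `train.txt` and `val.txt` is defined by this repo's data loader; if you just want to confirm the unzip succeeded, a quick peek at the first few lines works (a convenience sketch, run from the folder containing `meta_data/`):

```python
# Print the first few lines of each meta file to confirm the unzip worked.
for name in ("meta_data/train.txt", "meta_data/val.txt"):
    print(f"--- {name} ---")
    with open(name) as f:
        for _ in range(3):
            print(f.readline().rstrip())
```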
| model name                     | type    | download                                                                                          | size    |
| ------------------------------ | ------- | ------------------------------------------------------------------------------------------------- | ------- |
| `intern_vit_6b_224px.pth`      | pytorch | [🤗 HF link](https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth)      | 12 GB   |
| `intern_vit_6b_224px_head.pth` | pytorch | [🤗 HF link](https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth) | 25.7 MB |
Please download the above model weights and place them in the `pretrained/` folder.
```bash
cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth
```
The directory structure is:

```
pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
```
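If you want to verify the downloads are intact, you can inspect both files with plain PyTorch. This is a convenience sketch; the nesting of weights inside each checkpoint is an assumption and may differ:

```python
import torch

# Inspect each downloaded checkpoint: count tensors and total parameters.
# Some checkpoints nest weights under a key such as "model"; the unwrap
# below is an assumption, the actual layout may differ.
for name in ("pretrained/intern_vit_6b_224px.pth",
             "pretrained/intern_vit_6b_224px_head.pth"):
    state = torch.load(name, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]
    tensors = [v for v in state.values() if torch.is_tensor(v)]
    n_params = sum(v.numel() for v in tensors)
    print(f"{name}: {len(tensors)} tensors, {n_params / 1e9:.2f}B params")
```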
Warning: Please install `apex` before training (see the installation guide for details).
To train a linear classifier for InternViT-6B on ImageNet-1K with 8 GPUs, run:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm
```
Note: it is normal for the following message to appear during training; it can be safely ignored:
```
_IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'])
```
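This message is PyTorch's standard `_IncompatibleKeys` report from `load_state_dict(..., strict=False)`: the pretrained checkpoint contains CLIP-projector weights that the classification model has no parameters for, so they are listed as `unexpected_keys` and skipped. The toy example below (not this repo's classes) reproduces the same mechanism:

```python
import torch.nn as nn

# The checkpoint has an extra submodule that the target model lacks, so
# non-strict loading reports it under `unexpected_keys` and carries on.
source = nn.ModuleDict({"backbone": nn.Linear(4, 4),
                        "clip_projector": nn.Linear(4, 4)})
target = nn.ModuleDict({"backbone": nn.Linear(4, 4)})

result = target.load_state_dict(source.state_dict(), strict=False)
print(result)
# _IncompatibleKeys(missing_keys=[],
#                   unexpected_keys=['clip_projector.weight', 'clip_projector.bias'])
```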
Warning: Please install `apex` before evaluation (see the installation guide for details).
| model name                  | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download    |
| --------------------------- | :---: | :-----: | :---: | :--: | :--: | :-------: | ----------- |
| `intern_vit_6b_1k_224.yaml` | 88.2  | 90.4    | 79.9  | 77.5 | 89.8 | 69.1      | ckpt \| log |
Evaluate InternViT-6B on ImageNet-1K val with 8 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
```
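Here `Acc@1`/`Acc@5` are the usual top-k accuracies: a sample counts as correct if the true class appears among the k highest-scoring predictions. A minimal sketch of the computation:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose true label is among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices            # (B, k) best-scoring classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

logits = torch.randn(8, 1000)                       # dummy batch of class scores
labels = torch.randint(0, 1000, (8,))
print(f"Acc@1 {topk_accuracy(logits, labels, 1):.3f}  "
      f"Acc@5 {topk_accuracy(logits, labels, 5):.3f}")
```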
Evaluate InternViT-6B on ImageNet-ReaL with 1 GPU:
Note: ImageNet-ReaL currently supports single-GPU testing only.
```bash
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%
```
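ImageNet-ReaL replaces each validation image's single label with a "reassessed" set of plausible labels, and a prediction counts as correct if it falls inside that set. A toy sketch of the metric (the repo's evaluation code supplies the real label sets):

```python
import torch

def real_accuracy(logits: torch.Tensor, label_sets: list[set[int]]) -> float:
    """Top-1 ReaL accuracy: correct if the argmax class is in the label set.
    Samples with an empty label set are excluded, per common practice."""
    preds = logits.argmax(dim=1).tolist()
    scored = [(p in s) for p, s in zip(preds, label_sets) if s]
    return sum(scored) / len(scored)

logits = torch.randn(4, 1000)
label_sets = [{1, 2}, {3}, set(), {0, 999}]   # toy reassessed label sets
print(f"ReaL Acc@1 {real_accuracy(logits, label_sets):.3f}")
```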
Evaluate InternViT-6B on ImageNetV2 with 8 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%
```
Evaluate InternViT-6B on ImageNet-A with 8 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%
```
Evaluate InternViT-6B on ImageNet-R with 8 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%
```
Evaluate InternViT-6B on ImageNet-Sketch with 8 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
  --cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
  --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
```
Expected results:
```
* Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%
```