InternViT-6B for Image Classification

This folder contains the implementation of InternViT-6B for image classification, corresponding to Section 4.2.1 of our InternVL 1.0 paper. The codebase for this part is derived from InternImage, with some code references to EVA and DINOv2. Thanks for their great work.

In this part, we validate the visual perception capability of InternViT-6B, the core component of InternVL 1.0. We evaluate the quality of the visual representations produced by InternViT-6B on the ImageNet-1K dataset. Following common practice, we adopt the linear probing evaluation, i.e., training a linear classifier while keeping the backbone frozen. In addition to the ImageNet-1K validation set, we also report performance on several ImageNet variants to benchmark domain generalization.
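For reference, the linear probing setup can be sketched in a few lines of PyTorch. This is an illustrative example only, not the code used in this repository; backbone, feature_dim, and train_loader are placeholders.

import torch
import torch.nn as nn

# Linear probing sketch: freeze the backbone, train only a linear classifier.
# `backbone`, `feature_dim`, and `train_loader` are illustrative placeholders.
def linear_probe(backbone: nn.Module, feature_dim: int, train_loader,
                 num_classes: int = 1000, epochs: int = 10) -> nn.Linear:
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                      # keep the backbone frozen

    head = nn.Linear(feature_dim, num_classes)       # the only trainable module
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)             # frozen features
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head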

InternViT-6B follows the structure of a vanilla ViT, and its hyperparameters are listed in the table below.

[Image: table of InternViT-6B architecture hyperparameters]
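As a reminder, "vanilla ViT" here means the standard pre-norm Transformer encoder block: self-attention followed by an MLP, each with a residual connection. The sketch below uses placeholder width and head counts; the actual InternViT-6B values are those in the table above.

import torch.nn as nn

# Generic pre-norm ViT encoder block; dim / num_heads / mlp_ratio are placeholders,
# not the actual InternViT-6B hyperparameters.
class ViTBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]    # residual self-attention
        x = x + self.mlp(self.norm2(x))  # residual MLP
        return x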

🛠️ Installation

Follow the installation guide to set up the environment.

📦 Data Preparation

Please prepare only the datasets you need.

First, prepare the ImageNet-1K, ImageNet-A, ImageNet-R, ImageNetV2, and ImageNet-Sketch datasets following the directory structure outlined below (a short sanity-check script follows the tree).

$ tree data
data
├── imagenet-1k
│   ├── train
│   │   ├── n01498041
│   │   └── ...
│   └── val
│       ├── ILSVRC2012_val_00000001.JPEG
│       └── ...
├── imagenet-a
│   ├── n01498041
│   └── ...
├── imagenet-r
│   ├── n01443537
│   └── ...
├── imagenet-sketch
│   ├── n01440764
│   └── ...
└── imagenetv2
    └── ImageNetV2-matched-frequency
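Before launching any jobs, a quick check like the following (an illustrative helper, not part of the codebase) can confirm that the directories above are in place:

import os

# Illustrative check that the dataset layout above exists.
expected = [
    "data/imagenet-1k/train",
    "data/imagenet-1k/val",
    "data/imagenet-a",
    "data/imagenet-r",
    "data/imagenet-sketch",
    "data/imagenetv2/ImageNetV2-matched-frequency",
]
for path in expected:
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")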

Then, unzip train.txt.zip and val.txt.zip inside meta_data/:

cd meta_data/
unzip train.txt.zip
unzip val.txt.zip

📦 Model Preparation

| model name | type | download | size |
| ---------- | ---- | -------- | ---- |
| intern_vit_6b_224px.pth | pytorch | 🤗 HF link | 12 GB |
| intern_vit_6b_224px_head.pth | pytorch | 🤗 HF link | 25.7 MB |

Please download the above model weights and place them in the pretrained/ folder.

cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth

The directory structure is:

pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
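To verify that the downloads are intact, the checkpoints can be opened directly with torch.load and inspected. The snippet below is a sketch; it assumes the files are ordinary PyTorch checkpoints, possibly with the weights nested under a "model" key.

import torch

# Illustrative integrity check: load both checkpoints on CPU and report their sizes.
for name in ["pretrained/intern_vit_6b_224px.pth",
             "pretrained/intern_vit_6b_224px_head.pth"]:
    state = torch.load(name, map_location="cpu")
    if isinstance(state, dict) and "model" in state:   # unwrap if nested (assumption)
        state = state["model"]
    n_params = sum(v.numel() for v in state.values() if hasattr(v, "numel"))
    print(f"{name}: {len(state)} tensors, {n_params / 1e9:.2f}B parameters")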

🔍 Linear Probing on ImageNet-1K

Warning: Please install apex before training (see installation guide for details).

To train a linear classifier for InternViT-6B on ImageNet with 8 GPUs, run:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm

Note: it is normal for the following message to appear during training; it can be safely ignored:

_IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'])
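This message is the return value of PyTorch's load_state_dict when strict matching is disabled: the checkpoint contains clip_projector.* weights that the classification model does not define, so they are reported as unexpected_keys and simply dropped. A toy illustration (with placeholder modules, not the real model classes):

import torch.nn as nn

# Toy reproduction of the _IncompatibleKeys message: the checkpoint carries
# weights ("extra.*", standing in for clip_projector.*) that the target model
# does not define, so they show up as unexpected_keys and are ignored.
class WithExtra(nn.Module):
    def __init__(self):
        super().__init__()
        self.core = nn.Linear(4, 4)
        self.extra = nn.Linear(4, 4)

class CoreOnly(nn.Module):
    def __init__(self):
        super().__init__()
        self.core = nn.Linear(4, 4)

ckpt = WithExtra().state_dict()
result = CoreOnly().load_state_dict(ckpt, strict=False)
print(result)  # _IncompatibleKeys(missing_keys=[], unexpected_keys=['extra.weight', 'extra.bias'])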

📊 Evaluation

Warning: Please install apex before evaluation (see installation guide for details).

| model name | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download |
| ---------- | ----- | ------- | ----- | ---- | ---- | --------- | -------- |
| intern_vit_6b_1k_224.yaml | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | ckpt \| log |

Evaluate InternViT-6B on ImageNet-1K val with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
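The Acc@1 / Acc@5 values in these logs are standard top-k accuracies. A minimal version of the computation (illustrative only, not the repository's evaluation code):

import torch

# Top-k accuracy: a sample counts as correct if the label is among the
# k highest-scoring classes. Acc@1 and Acc@5 in the logs follow this definition.
def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    topk = logits.topk(k, dim=1).indices               # (N, k) predicted classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item() * 100.0

logits = torch.randn(8, 1000)                          # fake scores for 8 images
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, 1), topk_accuracy(logits, labels, 5))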

Evaluate InternViT-6B on ImageNet-ReaL with 1 GPU:

Note: ImageNet-ReaL currently supports single-GPU testing only.

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%

Evaluate InternViT-6B on ImageNetV2 with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%

Evaluate InternViT-6B on ImageNet-A with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%

Evaluate InternViT-6B on ImageNet-R with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%

Evaluate InternViT-6B on ImageNet-Sketch with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%