GroupViT is a framework for learning semantic segmentation purely from text captions, without any mask supervision. It learns to perform bottom-up hierarchical spatial grouping of semantically related visual regions. This repository is the official implementation of GroupViT, introduced in the paper:
GroupViT: Semantic Segmentation Emerges from Text Supervision, Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang, CVPR 2022.
- Jiarui Xu's Project Page (with additional visual results)
- arXiv Page
If you find our work useful in your research, please cite:
@article{xu2022groupvit,
author = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
title = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
journal = {arXiv preprint arXiv:2202.11094},
year = {2022},
}
- Python 3.7
- PyTorch 1.8
- webdataset 0.1.103
- mmsegmentation 0.18.0
- timm 0.4.12
Instructions:
conda create -n groupvit python=3.7 -y
conda activate groupvit
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
pip install mmsegmentation==0.18.0
pip install webdataset==0.1.103
pip install timm==0.4.12
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf
pip install nltk ftfy regex tqdm
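After installing, it can help to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch, not part of this repository: the `PINNED` table is copied from the requirements list, the helper names are hypothetical, and the Python 3.7 fallback assumes the `importlib_metadata` backport is installed.

```python
"""Compare installed package versions with the versions pinned in this README."""
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Pins copied from the requirements list (PyTorch is distributed as "torch").
PINNED = {
    "torch": "1.8",
    "webdataset": "0.1.103",
    "mmsegmentation": "0.18.0",
    "timm": "0.4.12",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, installed):
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    A pin matches when the installed version equals it or extends it with a
    further component or a local build tag, so "1.8" accepts "1.8.0+cu111"
    but not "1.80".
    """
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        ok = found is not None and (
            found == wanted or found.startswith((wanted + ".", wanted + "+"))
        )
        if not ok:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    installed = {name: installed_version(name) for name in PINNED}
    for name, (wanted, found) in find_mismatches(PINNED, installed).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script prints one line per mismatched package and nothing when every pin is satisfied.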
Integrated into Hugging Face Spaces 🤗 using Gradio, where a web demo is available.
To run the demo from the command line:
python demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output
The output is saved in demo/output/.
config | ImageNet (zero-shot classification) | Pascal VOC (zero-shot segmentation) | Pascal Context (zero-shot segmentation) | COCO (zero-shot segmentation) |
---|---|---|---|---|
GCC + YFCC (cfg) | 43.7 | 52.3 | 22.4 | 24.3 |
GCC + RedCaps (cfg) | 51.6 | 50.8 | 23.7 | 27.5 |
Pre-trained weights group_vit_gcc_yfcc_30e-879422e0.pth and group_vit_gcc_redcap_30e-3dd09a76.pth for these models are provided by Jiarui Xu here.
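The 8-character suffix in each checkpoint name looks like a content hash. PyTorch's hub tooling conventionally embeds the leading digits of a file's SHA-256 in that position, but whether these two files follow that convention is an assumption, not something this README states. Under that assumption, a downloaded file could be checked with a sketch like the following (all function names are hypothetical):

```python
import hashlib
import re
from pathlib import Path


def hash_prefix_from_name(filename):
    """Extract the hex suffix from names like 'model-879422e0.pth', else None."""
    match = re.search(r"-([0-9a-f]{8,})\.pth$", Path(filename).name)
    return match.group(1) if match else None


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_embedded_hash(path):
    """True if the file's SHA-256 starts with the suffix in its name.

    Returns None when the name carries no suffix to compare against.
    ASSUMPTION: the suffix is a SHA-256 prefix (torch.hub naming convention).
    """
    prefix = hash_prefix_from_name(path)
    if prefix is None:
        return None
    return sha256_of_file(path).startswith(prefix)
```

A `False` result would mean either a corrupted download or that the suffix is not a SHA-256 prefix after all, so treat it as a prompt to re-download rather than as proof of corruption.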
During training, we use webdataset for scalable data loading. To convert image-text pairs into the webdataset format, we use the img2dataset tool to download and preprocess each dataset.
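For readers new to the format: a webdataset shard is an ordinary tar archive in which files that share a name up to the first dot form one sample, for example `000000000.jpg` next to `000000000.txt`. The sketch below illustrates that layout using only the standard library. It is not how webdataset or img2dataset are implemented, the helper names are hypothetical, and it assumes keys contain neither dots nor directory components.

```python
import io
import tarfile


def write_shard(path, samples):
    """Write (key, image_bytes, caption) triples as <key>.jpg and <key>.txt."""
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            members = (("jpg", image_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in members:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Group tar members by key, returning {key: {extension: bytes}}."""
    samples = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: everything before it is the sample key.
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples
```

Because members of one sample sit next to each other in the archive, a loader can stream a shard sequentially without an index, which is what makes the format suitable for large caption datasets.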
For inference, we use mmsegmentation for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.
The overall file structure is as follows:
GroupViT
├── local_data
│ ├── gcc3m_shards
│ │ ├── gcc-train-000000.tar
│ │ ├── ...
│ │ ├── gcc-train-000436.tar
│ ├── gcc12m_shards
│ │ ├── gcc-conceptual-12m-000000.tar
│ │ ├── ...
│ │ ├── gcc-conceptual-12m-001943.tar
│ ├── yfcc14m_shards
│ │ ├── yfcc14m-000000.tar
│ │ ├── ...
│ │ ├── yfcc14m-001888.tar
│ ├── redcap12m_shards
│ │ ├── redcap12m-000000.tar
│ │ ├── ...
│ │ ├── redcap12m-001211.tar
│ ├── imagenet_shards
│ │ ├── imagenet-val-000000.tar
│ │ ├── ...
│ │ ├── imagenet-val-000049.tar
│ ├── VOCdevkit
│ │ ├── VOC2012
│ │ │ ├── JPEGImages
│ │ │ ├── SegmentationClass
│ │ │ ├── ImageSets
│ │ │ │ ├── Segmentation
│ │ ├── VOC2010
│ │ │ ├── JPEGImages
│ │ │ ├── SegmentationClassContext
│ │ │ ├── ImageSets
│ │ │ │ ├── SegmentationContext
│ │ │ │ │ ├── train.txt
│ │ │ │ │ ├── val.txt
│ │ │ ├── trainval_merged.json
│ │ ├── VOCaug
│ │ │ ├── dataset
│ │ │ │ ├── cls
│ ├── coco
│ │ ├── images
│ │ │ ├── train2017
│ │ │ ├── val2017
│ │ ├── annotations
│ │ │ ├── train2017
│ │ │ ├── val2017
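Before launching training or evaluation, the layout above can be sanity-checked. This is a minimal sketch with a hypothetical helper; the directory list is a subset copied from the tree, and the paths are taken relative to the repository root.

```python
from pathlib import Path

# A subset of the directories shown in the tree above.
EXPECTED_DIRS = [
    "local_data/gcc3m_shards",
    "local_data/gcc12m_shards",
    "local_data/yfcc14m_shards",
    "local_data/redcap12m_shards",
    "local_data/imagenet_shards",
    "local_data/VOCdevkit/VOC2012/JPEGImages",
    "local_data/VOCdevkit/VOC2012/SegmentationClass",
    "local_data/VOCdevkit/VOC2010/SegmentationClassContext",
    "local_data/coco/images/val2017",
    "local_data/coco/annotations/val2017",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    for rel in missing_dirs("."):
        print(f"missing: {rel}")
```

Only the datasets you actually train or evaluate on need to be present, so a reported missing directory is not necessarily a problem.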
The instructions for preparing each dataset are as follows.
Please download the training split annotation file from Conceptual Captions (the 3M release) and name it gcc3m.tsv. Then run img2dataset to download the image-text pairs and save them in the webdataset format.
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder local_data/gcc3m_shards \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
Please refer to img2dataset CC3M tutorial for more details.
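The `sed` call in the commands above prepends a `caption<TAB>url` header row so that img2dataset can look up columns by name. On systems where `sed -i` behaves differently (for example BSD/macOS), the same step could be done in Python. This is a sketch with a hypothetical helper name; unlike the `sed` one-liner, it skips the insert when the header is already present.

```python
import os
import shutil
import tempfile


def prepend_header(tsv_path, header="caption\turl\n"):
    """Insert `header` as the first line of `tsv_path`.

    Returns False without touching the file if the header is already there.
    """
    with open(tsv_path, "r", encoding="utf-8", newline="") as src:
        if src.readline() == header:
            return False
    directory = os.path.dirname(os.path.abspath(tsv_path))
    # Stream into a temp file next to the original so os.replace stays on
    # one filesystem and the multi-million-line file never sits in memory.
    with open(tsv_path, "r", encoding="utf-8", newline="") as src, \
            tempfile.NamedTemporaryFile(
                "w", encoding="utf-8", newline="", dir=directory, delete=False
            ) as tmp:
        tmp.write(header)
        shutil.copyfileobj(src, tmp)
    os.replace(tmp.name, tsv_path)
    return True
```

Calling `prepend_header("gcc3m.tsv")` or `prepend_header("gcc12m.tsv")` would stand in for the corresponding `sed` line.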
Please download the annotation file from Conceptual Caption 12M and name it gcc12m.tsv. Then run img2dataset to download the image-text pairs and save them in the webdataset format.
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder local_data/gcc12m_shards \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*
Please refer to img2dataset CC12M tutorial for more details.
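The `rename -d` invocations in these instructions rely on the Perl-based `rename`, whose `-d` option rewrites only the file-name component of each path. Some distributions ship a different `rename` with incompatible options. A portable alternative is sketched below with a hypothetical helper; note one deliberate difference, which is that it only touches `*.tar` files by default, whereas the shell glob `*` matches every file in the folder.

```python
from pathlib import Path


def add_prefix(shard_dir, prefix, pattern="*.tar"):
    """Rename every file matching `pattern` in `shard_dir` to prefix + name.

    Returns the new names. Files that already carry the prefix are skipped,
    so the function is safe to run twice.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(shard_dir).glob(pattern)):
        if path.name.startswith(prefix):
            continue
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

For example, `add_prefix("local_data/gcc12m_shards", "gcc-conceptual-12m-")` would correspond to the CC12M `rename` line.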
Please follow the CLIP Data Preparation instructions to download the YFCC14M subset.
wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2
Then run the preprocessing script to create the subset sql db and annotation tsv files. This may take a while.
python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv
This script will create two files: an SQLite db called yfcc100m_dataset.sql and an annotation tsv file called yfcc14m_dataset.tsv.
Then follow the YFCC100M Download Instruction to download the dataset and its metadata file.
pip install git+https://gitlab.com/jfolz/yfcc100m.git
mkdir -p yfcc100m_meta
python -m yfcc100m.convert_metadata . -o yfcc100m_meta --skip_verification
mkdir -p yfcc100m_zip
python -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip
Finally convert the dataset into the webdataset format.
python convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards
Please download the annotation file from RedCaps.
wget -O redcaps_v1.0_annotations.zip "https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1"
unzip redcaps_v1.0_annotations.zip
Then run the preprocessing script and img2dataset to download the image-text pairs and save them in the webdataset format.
python convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16
img2dataset --url_list ~/data/redcaps12m/ --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder local_data/redcap12m_shards \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/redcap12m-/' local_data/redcap12m_shards/*
Please follow the webdataset ImageNet Example to convert ImageNet into the webdataset format.
Please follow the MMSegmentation Pascal VOC Preparation instructions to download and set up the Pascal VOC dataset.
Please refer to the MMSegmentation Pascal Context Preparation instructions to download and set up the Pascal Context dataset.
The COCO dataset is an object detection dataset with instance segmentation annotations. To evaluate GroupViT, we merge all instance masks of each category into a semantic segmentation map. To generate these maps, please follow MMSegmentation's documentation to download the COCO-Stuff-164k dataset first, then run:
python convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/coco/
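To make the idea of combining instance masks concrete, here is a small illustration. It is not the code in `convert_coco.py`: the function name is hypothetical, masks are plain nested lists rather than arrays, and the rule that a later instance overwrites an earlier one where they overlap is an assumption made for this sketch.

```python
def merge_instance_masks(height, width, instances, background=0):
    """Build a semantic map from (category_id, binary_mask) pairs.

    Each mask is a height x width nested list of 0/1 values. Pixels covered
    by no instance keep `background`; where instances overlap, the later
    one in the list wins (an assumption of this sketch).
    """
    semantic = [[background] * width for _ in range(height)]
    for category_id, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    semantic[y][x] = category_id
    return semantic
```

Two instances of the same category simply paint the same id, which is how separate objects collapse into one semantic class.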
Train on a single node:
(node0)$ ./tools/dist_launch.sh main_group_vit.py /path/to/config $GPUS_PER_NODE
For example, to train on a node with 8 GPUs, run:
(node0)$ ./tools/dist_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 8
Train on multiple nodes:
(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
For example, to train on two nodes with 8 GPUs each, run:
(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 0 2 8 tcp://node0
(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0
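The arguments passed on each node determine the overall process layout. The sketch below shows the conventional arithmetic used in distributed PyTorch jobs, with one process per GPU. It describes the usual convention rather than the contents of `dist_mn_launch.sh`, and the function names are hypothetical.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of training processes, one per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node):
    """Rank of a process across all nodes, given its GPU index on its node."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank
```

With two nodes of 8 GPUs each, `world_size(2, 8)` is 16, node 0 holds ranks 0 to 7, and node 1 holds ranks 8 to 15.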
We used 16 NVIDIA V100 GPUs for pre-training (in 2 days) in our paper.
Evaluate zero-shot classification on ImageNet:
./tools/dist_launch.sh main_group_vit.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --eval
Evaluate zero-shot segmentation on Pascal VOC:
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint
Evaluate zero-shot segmentation on Pascal Context:
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/pascal_context.py
Evaluate zero-shot segmentation on COCO:
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/coco.py
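Segmentation results from mmsegmentation are usually reported as mean intersection-over-union (mIoU). For orientation, a minimal pure-Python version of that metric is sketched below. It is not the evaluation code used by `main_seg.py`: the function name is hypothetical, and treating label 255 as "ignore" is an assumption borrowed from common mmsegmentation dataset configs.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in `pred` or `target`.

    `pred` and `target` are flat sequences of integer class labels of equal
    length. Pixels whose target equals `ignore_index` are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

Passing the concatenated pixels of a whole validation set, rather than averaging per-image scores, accumulates intersections and unions over the dataset before dividing.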