NLLB-CLIP with SigLIP + small tokenizer fix #741

Merged: 13 commits, Nov 22, 2023
61 changes: 29 additions & 32 deletions docs/PRETRAINED.md
@@ -24,7 +24,7 @@ We replicate OpenAI's results on ViT-B/32, reaching a top-1 ImageNet-1k zero-sho

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot.png" width="700">

**Zero-shot comparison (courtesy of Andreas Fürst)**
<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_openai_compare_b32.jpg" width="700">

ViT-B/32 was trained with 128 A100 (40 GB) GPUs for ~36 hours, 4600 GPU-hours. The per-GPU batch size was 256, for a global batch size of 32768; 256 is lower than it could have been (~320-384) because the run was sized before the move to 'local' contrastive loss.
@@ -44,9 +44,10 @@ ViT-B/16 was trained with 176 A100 (40 GB) GPUs for ~61 hours, 10700 GPU-hours.
The B/16+ 240x240 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21.

This model is the same depth as the B/16, but increases the

- vision width from 768 -> 896
- text width from 512 -> 640
- the resolution 224x224 -> 240x240 (196 -> 225 tokens)

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion_clip_zeroshot_b16_plus_240.png" width="700">
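
The 196 -> 225 token change above follows directly from the ViT patch grid; a quick sanity check, assuming the patch size of 16 implied by the B/16 naming:

```python
# Sequence length of a ViT is (image_size // patch_size) ** 2 image tokens.
patch_size = 16
for image_size in (224, 240):
    tokens = (image_size // patch_size) ** 2
    print(f"{image_size}x{image_size} -> {tokens} tokens")
# 224x224 -> 196 tokens
# 240x240 -> 225 tokens
```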

@@ -67,6 +68,7 @@ ViT-L/14 was trained with 400 A100 (40 GB) GPUs for ~127 hours, 50800 GPU-hours.
A ~2B sample subset of LAION-5B with English captions (https://huggingface.co/datasets/laion/laion2B-en)

#### ViT-B/32 224x224

A ViT-B/32 trained on LAION-2B, reaching a top-1 ImageNet-1k zero-shot accuracy of 65.62%.

<img src="https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/laion2b_clip_zeroshot_b32.png" width="700">
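
To reproduce this kind of zero-shot number, the checkpoint can be loaded through the open_clip API. This is a minimal sketch, assuming the LAION-2B ViT-B/32 is published under the `laion2b_s34b_b79k` tag and that `example.jpg` exists locally; check `open_clip.list_pretrained()` for the exact tags.

```python
import torch
from PIL import Image
import open_clip

# Load a LAION-2B ViT-B/32 checkpoint (tag assumed; verify with open_clip.list_pretrained()).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
classes = ["a dog", "a cat", "a diagram"]
text = tokenizer([f"a photo of {c}" for c in classes])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```

Scaling this to full ImageNet zero-shot evaluation follows the same pattern, with one prompt-averaged text embedding per class.
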
@@ -91,7 +93,6 @@ A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booste

This model was trained with a shorter schedule than other LAION-2B models, with 12B samples seen instead of 32+B. It matches LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well on some OOD zero-shot and retrieval tasks.

#### ViT-B/32 roberta base

A ViT-B/32 with a RoBERTa-base text encoder, reaching a 61.7% top-1 ImageNet-1k zero-shot, was trained on Stability AI compute. See model details at https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
@@ -113,22 +114,20 @@ See full english [metrics](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm

On zero-shot ImageNet classification with translated prompts, this model reaches:

- 56% in Italian (vs 21% for https://github.com/clip-italian/clip-italian)
- 53% in Japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip)
- 55.7% in Chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP)

#### YFCC-15M

Below are checkpoints of models trained on YFCC-15M, along with their zero-shot top-1 accuracies on ImageNet and ImageNetV2. These models were trained using 8 GPUs and the same hyperparameters described in the "Sample running code" section, with the exception of `lr=5e-4` and `epochs=32`.

- [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-yfcc15m-455df137.pt) (32.7% / 27.9%)
- [ResNet-101](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn101-quickgelu-yfcc15m-3e04b30e.pt) (34.8% / 30.0%)
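
If you download one of these checkpoints manually, `pretrained` can also point at the local file rather than a hub tag. A small sketch; the path is a placeholder, and the `yfcc15m` tag name is an assumption to verify with `open_clip.list_pretrained()`:

```python
import open_clip

# Load from a locally downloaded checkpoint file (placeholder path).
model, _, preprocess = open_clip.create_model_and_transforms(
    "RN50-quickgelu",
    pretrained="/path/to/rn50-quickgelu-yfcc15m-455df137.pt",
)

# Or fetch the same weights by pretrained tag (assumed name; check list_pretrained()).
model, _, preprocess = open_clip.create_model_and_transforms(
    "RN50-quickgelu", pretrained="yfcc15m"
)
```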

#### CC12M - https://github.com/google-research-datasets/conceptual-12m

- [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-cc12m-f000538c.pt) (36.45%)

### CommonPool and DataComp models

@@ -138,14 +137,13 @@ The best performing models are specified below for the xlarge scale, see our pap

Additional models and more information can be found at [/docs/datacomp_models.md](/docs/datacomp_models.md).

- `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B samples seen at batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
- `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B samples seen at batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K.
- `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B samples seen at batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K.
- `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B samples seen at batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K.
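
The strings above are the `pretrained` tags understood by open_clip, so the xlarge models can be loaded by name. A minimal sketch (the `ViT-L-14` architecture name comes from the model descriptions above):

```python
import open_clip

# All of the DataComp / CommonPool xlarge checkpoints above are ViT-L/14 models.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Enumerate the shipped (architecture, tag) pairs to find the other variants.
for arch, tag in open_clip.list_pretrained():
    if "datacomp_xl" in tag or "commonpool_xl" in tag:
        print(arch, tag)
```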

If you use models trained on DataComp-1B or CommonPool variations, please consider citing the following:

@@ -158,15 +156,13 @@ If you use models trained on DataComp-1B or CommonPool variations, please consid
}
```


### MetaCLIP

MetaCLIP models are described in the paper [Demystifying CLIP Data](https://arxiv.org/abs/2309.16671).
These models were developed by Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer from Meta, New York University and the University of Washington.

Models are licensed under CC-BY-NC.
More details are available at https://github.com/facebookresearch/MetaCLIP.

If you use MetaCLIP models, please cite the following:

@@ -179,7 +175,6 @@ If you use MetaCLIP models, please cite the following:
}
```


### EVA-CLIP

EVA-CLIP models are described in the paper [EVA-CLIP: Improved Training Techniques for CLIP at Scale](https://arxiv.org/abs/2303.15389).
@@ -188,7 +183,6 @@ These models were developed by Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang and
Models are licensed under the MIT License.
More details are available at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.


If you use EVA models, please cite the following:

```bibtex
}
```

### NLLB-CLIP

NLLB-CLIP models are described in the paper [NLLB-CLIP - train performant multilingual image retrieval model on a budget](https://arxiv.org/abs/2309.01859) by Alexander Visheratin.

The model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen, the text tower was initialized from the [NLLB](https://arxiv.org/abs/2207.04672) encoder and unfrozen.

The model was trained on the [LAION-COCO-NLLB](https://huggingface.co/datasets/visheratin/laion-coco-nllb) dataset.

The first version of the model (`nllb-clip`) described in the paper was trained using the OpenAI CLIP image encoder.

The second version of the model (`nllb-clip-siglip`) was trained using the [SigLIP](https://arxiv.org/abs/2303.15343) image encoder.

Models are licensed under CC-BY-NC.
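
A minimal multilingual usage sketch. The architecture and pretrained names are taken from the tables added in this PR (`nllb-clip-base-siglip` with tag `v1`); the captions are only illustrative, and the text tower requires the `transformers` package.

```python
import torch
import open_clip

# NLLB-CLIP with the SigLIP image tower; names as listed in docs/model_profile.csv.
model, _, preprocess = open_clip.create_model_and_transforms(
    "nllb-clip-base-siglip", pretrained="v1"
)
tokenizer = open_clip.get_tokenizer("nllb-clip-base-siglip")
model.eval()

# The NLLB text encoder covers 200+ languages, so captions need not be English.
captions = ["a photo of a cat", "una foto de un gato", "фото кота"]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(captions))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

print(text_features.shape)  # (3, embedding_dim)
```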

If you use NLLB-CLIP models, please cite the following:

```bibtex
@article{visheratin2023nllb,
}
```


### CLIPA

CLIPA models are described in the following papers by Xianhang Li, Zeyu Wang, Cihang Xie from UC Santa Cruz:
@@ -230,12 +229,11 @@ CLIPA models are described in the following papers by Xianhang Li, Zeyu Wang, Ci
Models are licensed under Apache 2.0.
More details are available at https://github.com/UCSC-VLAA/CLIPA and [here](clipa.md).


If you use CLIPA models, please cite the following:

```bibtex
@inproceedings{li2023clipa,
title={An Inverse Scaling Law for CLIP Training},
author={Xianhang Li and Zeyu Wang and Cihang Xie},
booktitle={NeurIPS},
year={2023},
}
```

@@ -244,7 +242,7 @@ If you use CLIPA models, please cite the following:

```bibtex
@article{li2023clipav2,
title={CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy},
author={Xianhang Li and Zeyu Wang and Cihang Xie},
journal={arXiv preprint arXiv:2306.15658},
year={2023},
}
```

@@ -259,7 +257,6 @@ These models were developed by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov
Models are licensed under the Apache 2 license.
More details are available at https://github.com/google-research/big_vision.


If you use SigLIP models, please cite the following:

```bibtex
```
2 changes: 2 additions & 0 deletions docs/model_profile.csv
@@ -65,6 +65,7 @@ EVA02-L-14-336,336,768,768,768,428.08,304.43,123.65,395.16,381.86,13.3
ViT-L-14-336,336,1024,768,768,427.94,304.29,123.65,395.22,381.92,13.3
ViT-L-16-SigLIP-384,384,768,1024,1024,652.48,316.28,336.19,422.91,383.85,39.06
convnext_xxlarge,256,768,1024,1024,1200.58,846.54,354.03,443.03,395.94,47.09
nllb-clip-base-siglip,384,768,512,768,507.47,93.18,414.3,472.91,112.13,360.78
mt5-xl-ViT-H-14,224,1280,512,1024,2306.75,632.08,1674.68,514.04,334.59,179.45
EVA01-g-14,224,768,768,1024,1136.44,1012.59,123.85,547.36,534.06,13.3
RN50x64,448,128,1024,1024,623.26,420.38,202.88,552.65,529.11,23.55
@@ -78,6 +79,7 @@ ViT-bigG-14-CLIPA,224,1664,1280,1280,2517.22,1844.9,672.32,1007.93,967.5,40.44
ViT-H-14-378-quickgelu,378,1280,1024,1024,986.71,632.68,354.03,1054.05,1006.96,47.09
ViT-bigG-14,224,1664,1280,1280,2539.57,1844.91,694.66,1065.36,967.5,97.86
nllb-clip-large,224,1280,512,1024,1399.22,632.08,767.14,1468.46,334.59,1133.87
nllb-clip-large-siglip,384,768,512,1152,1195.5,428.23,767.27,1804.22,670.35,1133.87
ViT-e-14,224,1792,1280,1280,4581.09,3807.72,773.37,2091.45,1981.35,110.1
ViT-bigG-14-CLIPA-336,336,1664,1280,1280,2517.76,1845.44,672.32,2271.58,2231.15,40.44
EVA02-E-14,224,768,1024,1024,4704.59,4350.56,354.03,2311.42,2264.33,47.09
2 changes: 2 additions & 0 deletions docs/openclip_classification_results.csv
@@ -84,6 +84,7 @@ ViT-B-32-quickgelu,laion400m_e31,151.28,14.78,0.5273,0.6294,0.9121,0.9060,0.7021
ViT-B-32,openai,151.28,14.78,0.5265,0.6332,0.8758,0.8983,0.6423,0.2320,0.2335,0.1720,0.4436,0.5044,0.1953,0.8400,0.3258,0.4229,0.5592,0.3155,0.4775,0.6933,0.2743,0.4839,0.4431,0.6670,0.8700,0.7640,0.6224,0.5865,0.5362,0.5963,0.9713,0.6248,0.3159,0.0732,0.6061,0.1676,0.5386,0.8217
ViT-B-32-quickgelu,openai,151.28,14.78,0.5265,0.6332,0.8758,0.8983,0.6423,0.2320,0.2335,0.1720,0.4436,0.5044,0.1953,0.8400,0.3258,0.4229,0.5592,0.3155,0.4775,0.6933,0.2743,0.4839,0.4431,0.6670,0.8700,0.7640,0.6224,0.5865,0.5362,0.5963,0.9713,0.6248,0.3159,0.0732,0.6061,0.1676,0.5386,0.8217
RN50x4,openai,178.3,51.82,0.5191,0.6627,0.8661,0.7943,0.4514,0.2045,0.0905,0.2039,0.4862,0.3354,0.2102,0.8640,0.3622,0.4468,0.5944,0.4145,0.4955,0.7274,0.2335,0.4903,0.5141,0.6766,0.8829,0.6814,0.5675,0.6716,0.5338,0.6673,0.9658,0.6089,0.3190,0.0870,0.5435,0.1130,0.5654,0.8376
nllb-clip-large-siglip,v1,1195.5,1804.22,0.5148,0.5175,0.8392,0.9651,0.7626,0.1737,0.2211,0.1549,0.4394,0.4941,0.0451,0.6312,0.4700,0.5050,0.4631,0.5611,0.1825,0.8325,0.4290,0.6203,0.6492,0.2846,0.4082,0.7823,0.5004,0.5601,0.5656,0.6451,0.9939,0.6355,0.4258,0.0950,0.5000,0.1415,0.6390,0.8855
ViT-B-32,laion400m_e31,151.28,14.78,0.5070,0.6022,0.8916,0.8825,0.6781,0.1549,0.2261,0.1356,0.5218,0.4694,0.1437,0.7814,0.4082,0.4648,0.5234,0.1957,0.5085,0.7079,0.1224,0.4108,0.4281,0.6319,0.8541,0.7312,0.5495,0.5162,0.5108,0.7436,0.9494,0.6508,0.2891,0.0745,0.4975,0.1076,0.5491,0.8328
ViT-B-32,laion400m_e32,151.28,14.78,0.5067,0.6024,0.8918,0.8840,0.6773,0.1536,0.2261,0.1349,0.5229,0.4754,0.1467,0.7817,0.4070,0.4646,0.5237,0.1953,0.5080,0.7084,0.1181,0.4000,0.4292,0.6323,0.8513,0.7328,0.5490,0.5206,0.5094,0.7454,0.9498,0.6509,0.2759,0.0741,0.5084,0.1068,0.5444,0.8326
RN101,openai,119.69,25.5,0.5036,0.6228,0.8527,0.8078,0.4764,0.2437,0.0923,0.1693,0.4335,0.3131,0.1853,0.8367,0.3753,0.4106,0.5612,0.2944,0.5085,0.6817,0.2644,0.5254,0.4515,0.6532,0.8652,0.6512,0.5819,0.6403,0.5476,0.6100,0.9680,0.5803,0.3185,0.0888,0.4723,0.1615,0.5631,0.8164
@@ -95,6 +96,7 @@ ViT-B-16,commonpool_l_image_s1b_b8k,149.62,41.09,0.4812,0.5719,0.8856,0.9321,0.6
ViT-B-16,commonpool_l_text_s1b_b8k,149.62,41.09,0.4758,0.5605,0.8720,0.9391,0.7054,0.1843,0.2373,0.0995,0.3941,0.3830,0.0451,0.7724,0.2317,0.4437,0.4835,0.2220,0.4770,0.6708,0.2686,0.2593,0.4911,0.5164,0.7049,0.7669,0.4857,0.4931,0.4663,0.6525,0.9523,0.6088,0.2122,0.0623,0.5697,0.0000,0.5643,0.8564
ViT-B-16,commonpool_l_basic_s1b_b8k,149.62,41.09,0.4566,0.5155,0.8444,0.8289,0.5251,0.2061,0.2277,0.1173,0.4133,0.3820,0.0481,0.7461,0.2021,0.3932,0.4325,0.1913,0.4600,0.6087,0.3333,0.2809,0.4493,0.4357,0.6956,0.7151,0.5899,0.5387,0.4313,0.7216,0.9373,0.5974,0.1173,0.0436,0.5712,0.0000,0.5421,0.8384
ViT-B-16,commonpool_l_s1b_b8k,149.62,41.09,0.4386,0.4593,0.8089,0.9133,0.6421,0.1594,0.2203,0.1177,0.3383,0.3348,0.0316,0.6735,0.2766,0.3448,0.3914,0.1592,0.4335,0.5265,0.2686,0.3603,0.4126,0.3681,0.5587,0.7093,0.5516,0.5118,0.4154,0.6060,0.9339,0.5713,0.3047,0.0399,0.5102,0.0000,0.5654,0.8305
nllb-clip-base-siglip,v1,507.47,472.91,0.4377,0.3909,0.7507,0.9043,0.5939,0.1453,0.2254,0.0583,0.3617,0.3744,0.0090,0.4961,0.3429,0.3886,0.3439,0.3165,0.1695,0.6846,0.1927,0.5007,0.5001,0.1567,0.1868,0.7599,0.6692,0.5859,0.5049,0.4703,0.9818,0.5640,0.4033,0.0694,0.6500,0.0956,0.6320,0.8392
nllb-clip-large,v1,1399.22,1468.46,0.4163,0.3672,0.7234,0.9634,0.6797,0.2389,0.2254,0.0691,0.3447,0.5454,0.0216,0.4447,0.2462,0.3316,0.3233,0.2632,0.1725,0.5624,0.3727,0.2716,0.5268,0.0978,0.1283,0.7551,0.5417,0.5585,0.4983,0.3865,0.9811,0.5512,0.1725,0.0403,0.5181,0.1419,0.6752,0.8305
ViT-B-32,datacomp_m_s128m_b4k,151.28,14.78,0.3364,0.2972,0.7159,0.8252,0.5476,0.1365,0.2249,0.0453,0.2133,0.3393,0.0304,0.4168,0.1366,0.1930,0.2440,0.0493,0.4085,0.3402,0.2110,0.1147,0.1971,0.2965,0.4311,0.5459,0.5862,0.5316,0.2778,0.2803,0.8365,0.3637,0.1500,0.0142,0.6669,0.0000,0.4498,0.6559
ViT-B-32,commonpool_m_clip_s128m_b4k,151.28,14.78,0.3344,0.2725,0.6678,0.8405,0.5549,0.1402,0.2238,0.0458,0.2176,0.2589,0.0215,0.3999,0.1586,0.1844,0.2247,0.0420,0.3925,0.3297,0.3235,0.1778,0.2093,0.2551,0.3828,0.6074,0.5210,0.5014,0.2641,0.4123,0.8370,0.3875,0.1931,0.0154,0.5369,0.0000,0.4451,0.6610