diff --git a/docs/PRETRAINED.md b/docs/PRETRAINED.md
index 2f899863e..670d71dcd 100644
--- a/docs/PRETRAINED.md
+++ b/docs/PRETRAINED.md
@@ -24,7 +24,7 @@ We replicate OpenAI's results on ViT-B/32, reaching a top-1 ImageNet-1k zero-sho
-__Zero-shot comparison (courtesy of Andreas Fürst)__
+**Zero-shot comparison (courtesy of Andreas Fürst)**
ViT-B/32 was trained with 128 A100 (40 GB) GPUs for ~36 hours, 4600 GPU-hours. The per-GPU batch size was 256, for a global batch size of 32768. 256 is lower than it could have been (~320-384) because the per-GPU batch size was fixed before the switch to a 'local' contrastive loss.
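+
+These weights are published as open_clip pretrained tags. A minimal loading sketch, assuming the `ViT-B-32-quickgelu` / `laion400m_e32` tag pair for this run (`open_clip.list_pretrained()` lists the exact names available):
+
+```python
+import torch
+import open_clip
+
+# Assumed tag pair for the LAION-400M B/32 run described above.
+model, _, preprocess = open_clip.create_model_and_transforms(
+    'ViT-B-32-quickgelu', pretrained='laion400m_e32')
+tokenizer = open_clip.get_tokenizer('ViT-B-32-quickgelu')
+
+text = tokenizer(["a photo of a cat", "a photo of a dog"])
+with torch.no_grad():
+    text_features = model.encode_text(text)  # (2, 512) text embeddings for B/32
+```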
@@ -44,9 +44,10 @@ ViT-B/16 was trained with 176 A100 (40 GB) GPUs for ~61 hours, 10700 GPU-hours.
The B/16+ 240x240 LAION-400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21%.
This model is the same depth as the B/16, but increases the
- * vision width from 768 -> 896
- * text width from 512 -> 640
- * the resolution 224x224 -> 240x240 (196 -> 225 tokens)
+
+- vision width from 768 -> 896
+- text width from 512 -> 640
+- the resolution 224x224 -> 240x240 (196 -> 225 tokens, as the sketch below shows)
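+
+The token counts follow directly from the 16x16 patch grid; a quick check, purely illustrative:
+
+```python
+# Image tokens for a ViT with 16x16 patches (class token not counted).
+def num_patches(image_size: int, patch_size: int = 16) -> int:
+    return (image_size // patch_size) ** 2
+
+assert num_patches(224) == 196  # B/16 at 224x224
+assert num_patches(240) == 225  # B/16+ at 240x240
+```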
@@ -67,6 +68,7 @@ ViT-L/14 was trained with 400 A100 (40 GB) GPUs for ~127 hours, 50800 GPU-hours.
A ~2B-sample subset of LAION-5B with English captions (https://huggingface.co/datasets/laion/laion2B-en)
#### ViT-B/32 224x224
+
A ViT-B/32 trained on LAION-2B, reaching a top-1 ImageNet-1k zero-shot accuracy of 65.62%.
@@ -91,7 +93,6 @@ A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booste
This model was trained with a shorter schedule than the other LAION-2B models, seeing 12B samples instead of 32B+, which matches the LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well on some OOD zero-shot and retrieval tasks.
-
#### ViT-B/32 roberta base
A ViT-B/32 with a RoBERTa-base text encoder, reaching a 61.7% top-1 ImageNet-1k zero-shot, was trained on the Stability AI cluster. See model details at https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
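+
+Checkpoints published on the Hugging Face Hub can be loaded through open_clip's `hf-hub:` prefix; a minimal sketch for the checkpoint linked above:
+
+```python
+import open_clip
+
+# Pull the model, preprocessing, and matching RoBERTa tokenizer from the Hub.
+repo = 'hf-hub:laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k'
+model, _, preprocess = open_clip.create_model_and_transforms(repo)
+tokenizer = open_clip.get_tokenizer(repo)
+```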
@@ -113,22 +114,20 @@ See full english [metrics](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm
On zero-shot ImageNet classification with translated prompts, this model reaches the scores below (an evaluation sketch follows the list):
-* 56% in italian (vs 21% for https://github.com/clip-italian/clip-italian)
-* 53% in japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip)
-* 55.7% in chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP)
-
+- 56% in Italian (vs 21% for https://github.com/clip-italian/clip-italian)
+- 53% in Japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip)
+- 55.7% in Chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP)
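+
+A hedged evaluation sketch for the translated-prompt setting, loading the model from its Hugging Face repo (the repo id is assumed from the metrics link above; prompts and image path are only illustrative):
+
+```python
+import torch
+import open_clip
+from PIL import Image
+
+repo = 'hf-hub:laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k'
+model, _, preprocess = open_clip.create_model_and_transforms(repo)
+tokenizer = open_clip.get_tokenizer(repo)
+model.eval()
+
+# Toy two-class problem with Italian prompts ("a photo of a cat/dog").
+texts = tokenizer(["una foto di un gatto", "una foto di un cane"])
+image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
+
+with torch.no_grad():
+    img_f = model.encode_image(image)
+    txt_f = model.encode_text(texts)
+    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
+    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
+    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
+print(probs)  # per-class probabilities under the translated prompts
+```
+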
#### YFCC-15M
Below are checkpoints of models trained on YFCC-15M, along with their zero-shot top-1 accuracies on ImageNet and ImageNetV2. These models were trained using 8 GPUs and the same hyperparameters described in the "Sample running code" section, with the exception of `lr=5e-4` and `epochs=32`.
-* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-yfcc15m-455df137.pt) (32.7% / 27.9%)
-* [ResNet-101](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn101-quickgelu-yfcc15m-3e04b30e.pt) (34.8% / 30.0%)
+- [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-yfcc15m-455df137.pt) (32.7% / 27.9%)
+- [ResNet-101](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn101-quickgelu-yfcc15m-3e04b30e.pt) (34.8% / 30.0%)
#### CC12M - https://github.com/google-research-datasets/conceptual-12m
-* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-cc12m-f000538c.pt) (36.45%)
-
+- [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-cc12m-f000538c.pt) (36.45%)
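+
+The checkpoints above can also be loaded by pretrained tag rather than URL; a sketch assuming the `yfcc15m` and `cc12m` tags on the quickgelu ResNet configs:
+
+```python
+import open_clip
+
+# YFCC-15M ResNet-50 (32.7% ImageNet top-1 above).
+rn50_yfcc, _, preprocess = open_clip.create_model_and_transforms('RN50-quickgelu', pretrained='yfcc15m')
+# CC12M ResNet-50 (36.45% above).
+rn50_cc12m, _, _ = open_clip.create_model_and_transforms('RN50-quickgelu', pretrained='cc12m')
+```
+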
### CommonPool and DataComp models
@@ -138,14 +137,13 @@ The best performing models are specified below for the xlarge scale, see our pap
Additional models and more information can be found at [/docs/datacomp_models.md](/docs/datacomp_models.md).
-* `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B steps and batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
+- `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B samples seen at batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
-
-* `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B steps and batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K.
+- `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B samples seen at batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K.
-* `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B steps and batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K.
+- `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B samples seen at batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K.
-* `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B steps and batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K.
+- `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B samples seen at batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K.
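+
+The backticked names above are open_clip pretrained tags for the ViT-L/14 architecture; listing the registry confirms the exact pairings before loading one:
+
+```python
+import open_clip
+
+# Every (architecture, pretrained_tag) pair open_clip ships; the DataComp/CommonPool
+# tags above appear under 'ViT-L-14'.
+print([tag for name, tag in open_clip.list_pretrained() if name == 'ViT-L-14'])
+
+model, _, preprocess = open_clip.create_model_and_transforms(
+    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k')
+```
+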
If you use models trained on DataComp-1B or CommonPool variations, please consider citing the following:
@@ -158,15 +156,13 @@ If you use models trained on DataComp-1B or CommonPool variations, please consid
}
```
-
### MetaCLIP
MetaCLIP models are described in the paper [Demystifying CLIP Data](https://arxiv.org/abs/2309.16671).
These models were developed by Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer from Meta, New York University and the University of Washington.
Models are licensed under CC-BY-NC.
-More details are available at https://github.com/facebookresearch/MetaCLIP.
-
+More details are available at https://github.com/facebookresearch/MetaCLIP.
If you use MetaCLIP models, please cite the following:
@@ -179,7 +175,6 @@ If you use MetaCLIP models, please cite the following:
}
```
-
### EVA-CLIP
EVA-CLIP models are described in the paper [EVA-CLIP: Improved Training Techniques for CLIP at Scale](https://arxiv.org/abs/2303.15389).
@@ -188,7 +183,6 @@ These models were developed by Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang and
Models are licensed under the MIT License.
More details are available at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
-
If you use EVA models, please cite the following:
```bibtex
@@ -200,15 +194,21 @@ If you use EVA models, please cite the following:
}
```
+### NLLB-CLIP
+
+NLLB-CLIP models are described in the paper [NLLB-CLIP - train performant multilingual image retrieval model on a budget](https://arxiv.org/abs/2309.01859) by Alexander Visheratin.
+
+The model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen, the text tower was initialized from the [NLLB](https://arxiv.org/abs/2207.04672) encoder and unfrozen.
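+
+A minimal sketch of that LiT-style setup (frozen image tower, trainable text tower); this is illustrative only, not the exact NLLB-CLIP training code:
+
+```python
+import open_clip
+
+# open_clip models expose the image tower as `model.visual`; freezing it and
+# training only the text side is the LiT recipe in its simplest form.
+model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
+for p in model.visual.parameters():
+    p.requires_grad = False  # image tower stays frozen
+
+trainable = [p for p in model.parameters() if p.requires_grad]  # text tower + logit scale
+```
+
+The open_clip training script also exposes lock flags (`--lock-image`, `--lock-text`) for this purpose.
+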
-### NLLB
+The model was trained on the [LAION-COCO-NLLB](https://huggingface.co/datasets/visheratin/laion-coco-nllb) dataset.
-NLLB models are described in the paper [NLLB-CLIP -- train performant multilingual image retrieval model on a budget
-](https://arxiv.org/abs/2309.01859) by Alexander Visheratin.
+The first version of the model (`nllb-clip`) described in the paper was trained using the OpenAI CLIP image encoder.
+
+The second version of the model (`nllb-clip-siglip`) was trained using the [SigLIP](https://arxiv.org/abs/2303.15343) image encoder.
Models are licensed under CC-BY-NC.
-If you use NLLB models, please cite the following:
+If you use NLLB-CLIP models, please cite the following:
```bibtex
@article{visheratin2023nllb,
@@ -219,7 +219,6 @@ If you use NLLB models, please cite the following:
}
```
-
### CLIPA
CLIPA models are described in the following papers by Xianhang Li, Zeyu Wang, Cihang Xie from UC Santa Cruz:
@@ -230,12 +229,11 @@ CLIPA models are described in the following papers by Xianhang Li, Zeyu Wang, Ci
Models are licensed under Apache 2.0.
More details are available at https://github.com/UCSC-VLAA/CLIPA and [here](clipa.md).
-
If you use CLIPA models, please cite the following:
```bibtex
@inproceedings{li2023clipa,
- title={An Inverse Scaling Law for CLIP Training},
+ title={An Inverse Scaling Law for CLIP Training},
author={Xianhang Li and Zeyu Wang and Cihang Xie},
booktitle={NeurIPS},
year={2023},
@@ -244,7 +242,7 @@ If you use CLIPA models, please cite the following:
```bibtex
@article{li2023clipav2,
- title={CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy},
+ title={CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy},
author={Xianhang Li and Zeyu Wang and Cihang Xie},
journal={arXiv preprint arXiv:2306.15658},
year={2023},
@@ -259,7 +257,6 @@ These models were developed by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov
Models are licensed under the Apache 2 license.
More details are available at https://github.com/google-research/big_vision.
-
If you use SigLIP models, please cite the following:
```bibtex