## Text supporting Vision Models {#c02-04-text-support-img}
*Author: Max Schneider*
*Supervisor: Jann Goschenhofer*
### Introduction
> "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
> [...] Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available.
> Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
> [...] One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great."
>
> --- @sutton2019bitterlesson
This insight seems to directly inspire most model choices presented in this chapter.
Each network can be seen as an attempt by its creators to leverage their vast available resources at scale, with a particular focus on dataset size.
This mostly becomes feasible through the adaptation of recent findings in natural language processing (NLP; see chapter \@ref(c01-01-sota-nlp)) to computer vision (CV).
On the one hand, architectural concepts first popularized in NLP are translated to CV [e.g., self-supervised learning or the Vision Transformer\; @ImageT] (see chapter \@ref(c01-02-sota-cv)).
On the other hand, these powerful new NLP models, mostly Transformers [@vaswani2017attention], support bigger models from the inside as text encoding building blocks; hence the name of this chapter.
Throughout this chapter, we will introduce recent relevant CV models CLIP [@radford2021learning], ALIGN [@jia2021scaling] and Florence [@yuan2021florence] and discuss their underlying core concepts.
Their strong performance confirms the potential, hinted at by the impressive GPT-3 [@brown2020language], of improving CV through NLP-assisted scaling.
### Concepts
#### Web-scale data {#webScaleData}
A core problem that troubles researchers is the lack of robustness of previous state-of-the-art CV models to distribution shifts.
That is, a model with good performance on its original dataset fails to generalize (transfer its knowledge) to new, more or less similar datasets.
E.g., @radford2021learning report that a ResNet101 which they trained on ImageNet to an accuracy of 76.2% maintains only an accuracy of 32.6% on ObjectNet.
This suggests that the model perhaps did not learn high quality latent representations, but instead overfit to the dataset-specific data-generating distribution.
A common way to tackle this would be to try out various changes on the architecture and the training algorithm of the network.
But this kind of adaptation, inscribing expert knowledge into the model, seems to repeat the mistake pointed out by @sutton2019bitterlesson; "micromanaging" a model is likely to thwart future scaling.
<!-- E.g., introducing long short term memory to recurrent neural networks (RNN) improved them in the short term, but made them more computationally expensive, while not solving their core problem of poor parallelization, which resulted in it being outcompeted by the heavily parallelizable transformer architecture. | Does this really make sense?-->
The researchers of CLIP, ALIGN and Florence follow a different approach, based on scale.
They try to increase sample size as much as possible and work with tremendous numbers of training observations:
* 400 million [CLIP\; @radford2021learning]
* 900 million [Florence\; @yuan2021florence]
* 1.8 billion [ALIGN\; @jia2021scaling]
These large-scale datasets are generated from the vast number of image-text pairs produced by and readily available on the internet.
Thus, error-prone, costly and labor-intensive (and therefore hard to scale) manual labeling is avoided.
Unfortunately, models trained on web data also inherit its downsides.
Because of the data's extremely noisy nature, some form of pre-processing is still needed, e.g., filtering for English language, excluding graphic content and, optionally, removing images with non-informative alt-texts.
This makes some degree of dataset curation, and therefore arbitrary choices, necessary.
Likewise, the social biases inherent to the internet are reproduced. Furthermore, while this approach improves data efficiency to some degree (see next subsection \@ref(contrObj)), the generally poor data efficiency of deep learning is not substantially enhanced; it is mainly compensated for with a highly scalable source of supervision [@radford2021learning].
#### Contrastive objective {#contrObj}
This source of supervision is the information contained in the co-occurrence of the image with its alt-text.
It is accessed through natural language supervision.
The architectures jointly train two sub-networks for image and text encoding, respectively.
During training, the vector encodings are aligned in the latent representation space by minimizing a variant of the contrastive loss function \@ref(eq:contrLoss) [@tian2020contrastive].
One half of the loss for the first image-text pair is given by
\begin{equation}
\ell_1^{V_\text{img}, V_\text{txt}} = - \underset{\{v_\text{img}^1, v_\text{txt}^1, \ldots, v_\text{txt}^N\}}{\mathbb{E}} \left( \log \frac{h_\theta(\{v_\text{img}^1,v_\text{txt}^1\})}{h_\theta(\{v_\text{img}^1,v_\text{txt}^1\}) + \sum_{k=2}^N h_\theta(\{v_\text{img}^1, v_\text{txt}^k\})} \right),
(\#eq:contrLoss)
\end{equation}
where $v_\text{img}^1$ and $v_\text{txt}^1$ are the vector encodings (latent representations) of image 1 and text 1, and $h_\theta(\cdot)$ is a similarity measure.
In order to guarantee symmetry, the total loss is formed by the sum of $\ell_1^{V_\text{img}, V_\text{txt}}$ and $\ell_1^{V_\text{txt}, V_\text{img}}$, where the pairwise similarities of one text and every image are calculated instead of the other way around.
<!-- Question: Is the second part really necessary? If all columns are optimized, all elements are considered. Maybe only if the similarity measure really is asymmetrical?-->
Figure \@ref(fig:contr-viz) visualizes this.
Initially all images and texts in the training data are encoded by the responsible sub-network.
Using the resulting encodings, a similarity matrix with elements $h_\theta(\{v_\text{img}^i,v_\text{txt}^j\})$ can be calculated.
Loosely speaking, the contrastive objective is to maximize elements on the diagonal and minimize the others.
```{r contr-viz, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:contr-viz)'}
knitr::include_graphics('figures/02-04-text-support-img/contrastive-pre-training.png')
```
(ref:contr-viz) Visualization of a contrastive objective [@radford2021learning]. After encoding the data, a similarity matrix for the images and texts is computed. The aim is that the N true image-text pairs score high in terms of similarity, while the $\text{N}^2 - \text{N}$ other possible combinations score low.
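To make the objective concrete, the following NumPy sketch computes such a similarity matrix and the corresponding symmetric contrastive loss for a toy batch. The random encodings, the batch size and the plain (exponentiated) cosine similarity are assumptions made purely for illustration and do not correspond to any of the actual models.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so that the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy stand-ins for the outputs of the image and text encoders (N pairs, d dimensions).
N, d = 4, 8
rng = np.random.default_rng(seed=0)
v_img = l2_normalize(rng.normal(size=(N, d)))   # v_img^1, ..., v_img^N
v_txt = l2_normalize(rng.normal(size=(N, d)))   # v_txt^1, ..., v_txt^N

# Exponentiated similarity matrix with elements h(v_img^i, v_txt^j); here: cosine similarity.
sim = np.exp(v_img @ v_txt.T)                   # shape (N, N)

# Contrastive objective: the true pairs sit on the diagonal and should dominate
# their row (one image vs. all texts) and their column (one text vs. all images).
loss_img2txt = -np.log(np.diag(sim) / sim.sum(axis=1)).mean()
loss_txt2img = -np.log(np.diag(sim) / sim.sum(axis=0)).mean()
print(loss_img2txt + loss_txt2img)
```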
Contrastive learning can be contrasted with classical predictive learning.
Figure \@ref(fig:contr-vs-pred-learn) gives an interesting insight into the choice of the space in which goodness of fit is measured.
The example task is to color an image given its black-and-white (B/W) version.
Approach (a) first encodes the B/W image and then decodes the interim latent representation to fitting colors.
The goodness of this fit is measured in the output space, meaning the estimated colors are compared to the true colors.
Conversely, approach (b) measures the loss in the representation space.^[Note that contrastive learning easily works with other combinations of modalities than text and image; here B/W and colors.]
A reason for the good performance of contrastive learning could be that, while common prediction losses (e.g., the $\mathcal{L}_2$ loss) penalize each prediction output dimension independently, approach (b) implies measurement in the intertwined representation space [@tian2020contrastive].
```{r contr-vs-pred-learn, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:contr-vs-pred-learn)'}
knitr::include_graphics('figures/02-04-text-support-img/tian-predictive-vs-contrastive.png')
```
(ref:contr-vs-pred-learn) Predictive vs. contrastive learning: predictive losses are measured in the output space, while contrastive losses are measured in the representation space, as indicated by the red dotted boxes [@tian2020contrastive].
But in the end, rather than theoretical considerations, the driving factor for using this objective is data efficiency.
As can be seen in figure \@ref(fig:data-efficiency), @radford2021learning start their search for an adequate pre-trained model (more on this in subsection \@ref(foundMod)) by experimenting with a Transformer-based language model predicting the exact captions of an image.
It turns out that this approach trains three times slower, in terms of data efficiency, than a simpler baseline of predicting a bag-of-words text encoding.
Switching to the contrastive objective of CLIP improves data efficiency by a further factor of four.
```{r data-efficiency, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:data-efficiency)'}
knitr::include_graphics('figures/02-04-text-support-img/data-efficiency.png')
```
(ref:data-efficiency) Data efficiency of the contrastive objective. Development of zero-shot accuracy (see next subsection \@ref(foundMod)) on ImageNet as the number of training instances processed by the models increases. The contrastive objective reaches accuracy scores similar to the generative approach with only a seventh of the amount of data [@radford2021learning].
Nonetheless, the switch to contrastive learning leads to some limitations.
Its rigidity demands certain extra steps and forfeits the high flexibility of generative models.
In particular, this means that contrastive models similar to CLIP are limited to choosing from available options and cannot freely generate texts or images.
To extend the capabilities of these models, additional network building blocks are necessary.
#### Foundation models and zero-shooting {#foundMod}
The first models which are considered foundation models today began to appear in NLP.
The term, later coined by @bommasani2021opportunities, refers to models that are noteworthy due to their large scale and ability to adapt to a wide variety of downstream tasks.
An early example is BERT [@Devlin2018].
Often, foundation models have an unfinished touch to them, and the true scope of their capabilities cannot be sketched out clearly.
This is generally the case because the desired abilities of neural networks are not explicitly designed for, but rather emerge during their implementation and usage on downstream tasks.
@bommasani2021opportunities cite GPT-3's ability to perform certain types of new tasks solely by confronting it with the right natural language prompt.
E.g., it is possible to get GPT-3 to summarize a paragraph by appending "TL;DR" (too long, didn't read) to the prompt, a common pattern on the internet to signal a following summary.
This is referred to as "in-context learning" [@brown2020language].
It is apparent that one can come up with plenty of unexpected ways to employ these models, and further uses no one has thought of yet may well exist.
This means possibly saving computational and data-collection costs down the line, which unfortunately holds for malicious use cases, e.g., surveillance, as well.
Foundation models build on the concept of transfer-learning, i.e., pre-training a model on a feasible source task and applying it to the desired downstream task.
In the context of this chapter this means pre-training on web-scale data (see subsection \@ref(webScaleData)) and evaluating performance on various common classification datasets.
E.g., @radford2021learning name the SVHN dataset as a proxy for the task "street number transcription" with the caveat "on the distribution of Google Street View photos", but they remark that a lot of datasets have no obvious, specific task associated with them, e.g., CIFAR-10.
They use these kinds of datasets to measure the "robustness to distribution shift and domain generalization" of their model, which remains a topic of great interest, as mentioned in subsection \@ref(webScaleData).
When there is no further fine-tuning on the downstream task, i.e., no resuming of training on the new target dataset, this is referred to as zero-shooting.
Zero-shooting has the clear advantage of evaluating performance in a less biased way, since effects like overfitting to the data-generating distribution cannot distort the results.
Figure \@ref(fig:zero-shooting) shows how contrastive models perform zero-shot transfer.
In the case of image classification all available classes are encoded by the language model.
Afterwards, the CV sub-network computes the encoding of the image to be classified and all pair-wise similarity scores are returned.
The label belonging to the highest-scoring pair is returned as the prediction.
Image retrieval works the other way around:
After an initial encoding of all images, the ones most similar to the encoded natural language text prompt in the representation space can be returned.
```{r zero-shooting, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:zero-shooting)'}
knitr::include_graphics('figures/02-04-text-support-img/zero-shooting.png')
```
(ref:zero-shooting) Visualization of zero-shooting [@radford2021learning].
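As a usage example, the following Python sketch performs this kind of zero-shot classification with the openly released CLIP package (see the Resources section at the end of this chapter). The image path and the label set are hypothetical placeholders chosen only for illustration.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "car"]                            # hypothetical label set
text_tokens = clip.tokenize(labels).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    image_features = model.encode_image(image)            # encoding of the image to classify
    text_features = model.encode_text(text_tokens)        # encodings of all candidate labels

# Normalize so that the dot product is the cosine similarity, then pick the best pair.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)
print("Predicted label:", labels[similarity.argmax().item()])
```

For image retrieval the roles are simply reversed: all gallery images are encoded once, and the encoding of a text query is compared against them.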
### Architectures
#### CLIP
The first of the large-scale contrastive CV models to be published was CLIP, short for Contrastive Language-Image Pre-training [@radford2021learning].
The components of its name are explained in previous subsections \@ref(contrObj), \@ref(webScaleData) and \@ref(foundMod) and are the crucial concepts of ALIGN and Florence as well.
CLIP is a product of OpenAI, but its code is freely available and the different versions can be accessed as [Python modules](https://github.com/openai/CLIP).
The dataset used for training, however, has not been released.
A lot of preliminary work stems from @zhang2020contrastive, who introduced contrastive representation learning using image-text pairs.
Their implementation of the contrastive loss function \@ref(eq:contrLoss) follows
\begin{equation}
\ell_1^{V_\text{img}, V_\text{txt}} = - \log \frac{\exp(\langle v_\text{img}^1, v_\text{txt}^1 \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle v_\text{img}^1, v_\text{txt}^k \rangle / \tau)},
(\#eq:contrLossCLIP)
\end{equation}
where $\langle v_\text{img}^1, v_\text{txt}^1 \rangle$ represents the cosine similarity, i.e., $v_\text{img}^{1 \top} v_\text{txt}^1 / (\|v_\text{img}^1\| \|v_\text{txt}^1\|)$, and $\tau \in \mathbb{R}^+$ is a temperature parameter, which is directly learned during training [@zhang2020contrastive].
CLIP adopts this.
$\ell_1^{V_\text{txt}, V_\text{img}}$, the counterpart to $\ell_1^{V_\text{img}, V_\text{txt}}$ for the total loss, is function \@ref(eq:contrLossCLIP) with switched arguments.
This can be viewed as a symmetric cross entropy loss over the cosine similarity of the embeddings [@radford2021learning].
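A compact PyTorch sketch of this symmetric loss is given below. It is not the authors' original implementation; the toy embeddings at the end merely show the call, and the temperature initialization of 0.07 follows the value reported by @radford2021learning.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(v_img, v_txt, logit_scale):
    """Symmetric cross-entropy over cosine similarities of N matching image-text pairs.

    v_img, v_txt: (N, d) encoder outputs; logit_scale: learnable scalar holding log(1 / tau).
    """
    v_img = F.normalize(v_img, dim=-1)               # cosine similarity = dot product of
    v_txt = F.normalize(v_txt, dim=-1)               # L2-normalized embeddings
    logits = logit_scale.exp() * v_img @ v_txt.T     # (N, N) similarity matrix scaled by 1/tau

    # The matching pairs sit on the diagonal, so the target "class" of row/column i is i.
    targets = torch.arange(v_img.size(0), device=v_img.device)
    loss_img2txt = F.cross_entropy(logits, targets)      # one image against all texts
    loss_txt2img = F.cross_entropy(logits.T, targets)    # one text against all images
    return (loss_img2txt + loss_txt2img) / 2

# Toy usage with random embeddings and tau initialized to 0.07.
v_img, v_txt = torch.randn(8, 512), torch.randn(8, 512)
logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1 / 0.07)))
print(symmetric_contrastive_loss(v_img, v_txt, logit_scale))
```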
**Architecture**
The text encoder for CLIP (see figure \@ref(fig:contr-viz)) is a modified Transformer [@vaswani2017attention], which was also used for GPT-2 [@radford2019language].
For the image encoder multiple sub-networks are evaluated:
* ResNets: ResNet-50, ResNet-101
* ResNets which follow EfficientNet-style model scaling: RN50x4, RN50x16, RN50x64
* Vision Transformers: ViT-B/32, ViT-B/16, ViT-L/14
The best performing sub-network was the ViT-L/14.
They then trained it for one additional epoch on higher-resolution images (336px) and denote this version ViT-L/14@336px.
Unless indicated otherwise, the reported performances of CLIP refer to this version.
The EfficientNet-style ResNets use 4x, 16x and 64x the compute of a ResNet-50; the largest model (RN50x64) trained for 18 days on 592 V100 GPUs, while the ViT-L/14 only took 12 days on 256 GPUs.
The high parallelization capabilities of Transformers seem to pay off.
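The released Python package lists which of these pre-trained weights are available and loads them together with the matching pre-processing pipeline; the exact set of returned names depends on the installed package version.

```python
import clip

# Identifiers of the released pre-trained CLIP models, e.g. "RN50x64" or "ViT-L/14@336px".
print(clip.available_models())

# Load one of them together with its matching image pre-processing pipeline.
model, preprocess = clip.load("ViT-L/14@336px")
```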
When explaining zero-shooting initially (see subsection \@ref(foundMod)), a text processing step was skipped.
As can be seen in figure \@ref(fig:zero-shooting), there is an additional operation before the labels are fed into the text encoder.
In order to help the model understand the context of the words, the class labels are embedded in a sentence, e.g., "A photo of a {label}.".
This increases the model's zero-shot accuracy on ImageNet by 1.3 percentage points (pp).
When ensembling 80 different context prompts^[Prompts like: "A photo of a big {label}.", "A photo of a small {label}." [@radford2021learning]], @radford2021learning improve ImageNet accuracy by an additional 3.5pp, which adds up to a total of nearly 5pp.
The average performance gain across 36 datasets is reported to be 5pp.
It is similarly possible to directly communicate visual concepts like "picture", "macro", "drawing" or even "dog" to the model.
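A possible implementation of such prompt ensembling with the CLIP package is sketched below; the three templates and labels are made-up stand-ins for the 80 prompts actually used by @radford2021learning.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]
labels = ["dog", "cat", "car"]

with torch.no_grad():
    class_embeddings = []
    for label in labels:
        tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
        emb = F.normalize(model.encode_text(tokens), dim=-1)  # one embedding per prompt
        emb = F.normalize(emb.mean(dim=0), dim=0)             # ensemble: average, then renormalize
        class_embeddings.append(emb)
    # (num_classes, d) matrix that acts as the zero-shot "classifier" for image embeddings.
    class_embeddings = torch.stack(class_embeddings)
```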
**Robustness**
Figure \@ref(fig:performance-clip) illustrates the performance of CLIP and a ResNet101, whose training on ImageNet was stopped at the point it reached the same accuracy as zero-shot CLIP.
It can be deduced that the methods studied in the paper of @radford2021learning constitute an important step towards closing the robustness gap mentioned earlier (see subsection \@ref(webScaleData)).
While the performance of the ResNet101 deteriorates with datasets generated from more and more different data distributions compared to ImageNet, CLIP remains fairly accurate.
Note that these findings have to be taken with a grain of salt.
Because OpenAI does not grant public access to their training data, independent parties cannot investigate these claims on their own.
E.g., one has to rely on the conclusions of their overlap analysis to rule out that CLIP saw biasing amounts of the test data during training.
```{r performance-clip, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:performance-clip)'}
knitr::include_graphics('figures/02-04-text-support-img/performance-clip.png')
```
(ref:performance-clip) Robustness of zero-shot CLIP to distribution shifts [@radford2021learning].
**CLIP as a building block**
@shen2021much study how the performance of Vision-and-Language (V&L) models improves when the visual encoder is switched to CLIP's strong image encoder.
They discover that in this field of CV the ViT-B scores significantly worse than the ResNets.
E.g., tests on image captioning reveal that the V&L model using ViT-B often performs only half as well as the version using the RN50x4 (the largest network used in this study).
This is possibly due to the pooling strategies of ViT-B, which result in a lack of visual localization abilities.
@shen2021much test their hypothesis and generate, e.g., figure \@ref(fig:attention-ViT), which depicts Grad-CAM visualizations for a V&L model with a ViT-B backbone and one with a ResNet-50 backbone, given the question "What color is the woman's shirt on the left?".
The red area indicates relevant pixels and appears much more focused for CLIP-Res50 than for CLIP-ViT-B.
```{r attention-ViT, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:attention-ViT)'}
knitr::include_graphics('figures/02-04-text-support-img/attention-of-ViT.png')
```
(ref:attention-ViT) Grad-CAM visualizations for the prompt "What color is the woman's shirt on the left?" [@shen2021much].
#### ALIGN
The approach of @jia2021scaling is largely similar to CLIP.
They reiterate the necessity of large-scale vision datasets, but assert that even CLIP's data collection process still involves a non-trivial amount of data curation.
They propose that the additional observations obtained by minimizing the amount of filtering make up for the increased noise.
Following this rationale, they create a training dataset with 1.8 billion image-text pairs.
The corresponding model is named ALIGN, short for "A Large-scale ImaGe and Noisy-text embedding"; the acronym also hints at the contrastive loss, which aligns vector encodings in the representation space (see subsection \@ref(contrObj)).
**Architecture**
ALIGN follows the dual-encoder architecture employed by @zhang2020contrastive and @radford2021learning, but uses a part of BERT-Large as the text encoder and EfficientNet-L2 as the image encoder, which are jointly trained from scratch.
The model has around 800 million parameters [@alford2021alignparams].
Subsection \@ref(performanceComp) goes into more detail about the performance of ALIGN and compares all three models discussed in this chapter.
**Connecting image and text representations**
The contrastive loss function aligns the latent representations of the different modalities.
In other words, the explicit objective is that similar vector encodings imply similar inputs.
This means that arithmetic operations like the ones mentioned in chapter \@ref(c01-01-sota-nlp) are not only meaningful on encodings belonging to the same modality, but also across modalities.
E.g., one can add the image encoding of a picture of the Eiffel tower to the text encoding of the word "snow" and retrieve pictures with high cosine similarity to the result; see figure \@ref(fig:img-txt-addition) for an illustration.
```{r img-txt-addition, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:img-txt-addition)'}
knitr::include_graphics('figures/02-04-text-support-img/align-word-and-image-addition.png')
```
(ref:img-txt-addition) Multimodal image retrieval via arithmetic operations on word and image embeddings.
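Since ALIGN's encoders are not publicly available, the following sketch illustrates the same idea with CLIP embeddings; the file names are hypothetical and serve only to demonstrate the arithmetic.

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

gallery_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]   # hypothetical image gallery

with torch.no_grad():
    gallery = torch.cat([preprocess(Image.open(p)).unsqueeze(0) for p in gallery_paths]).to(device)
    gallery_emb = F.normalize(model.encode_image(gallery), dim=-1)

    # Combined query: image embedding of an Eiffel tower picture plus text embedding of "snow".
    query_img = preprocess(Image.open("eiffel_tower.jpg")).unsqueeze(0).to(device)
    query_txt = clip.tokenize(["snow"]).to(device)
    query = F.normalize(model.encode_image(query_img), dim=-1) + \
            F.normalize(model.encode_text(query_txt), dim=-1)
    query = F.normalize(query, dim=-1)

    # Rank the gallery by cosine similarity to the combined query.
    ranking = (gallery_emb @ query.T).squeeze(1).argsort(descending=True)
    print([gallery_paths[i] for i in ranking.tolist()])
```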
#### Florence
While in principle the approach of @yuan2021florence does not differ much from the others, the focus of this paper lies more on creating a true foundation model.
In order to achieve this, they propose a map of possible vision applications, which they try to cover by extending the core model with modules.
As figure \@ref(fig:florence-dimensions) depicts, they want to advance into the dimensions of fine-grained object detection, dynamic action recognition and true multimodal tasks.
Due to their big ambitions, they name their model Florence after "the birthplace of Renaissance" [@yuan2021florence].
```{r florence-dimensions, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:florence-dimensions)'}
knitr::include_graphics('figures/02-04-text-support-img/florence-dimensions.png')
```
(ref:florence-dimensions) Florence's approach to foundation models: a general-purpose vision system for all tasks.
**Architecture**
As the two encoders for the pre-trained core they use a hierarchical Vision Transformer (CoSwin Transformer) for images and a Transformer similar to CLIP's for text.
Their 893 million parameters are also jointly trained from scratch on 900 million image-text pairs.
The alignment happens in the so-called image-label-description space, which is encoded through a special version of the contrastive loss function that regards all image-text pairs with the same label as positive instances.
Figure \@ref(fig:florence-architecture) depicts their version of figure \@ref(fig:contr-viz) and schematically shows how modules are flexibly added to the pre-trained core in order to adapt to various downstream tasks.
```{r florence-architecture, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:florence-architecture)'}
knitr::include_graphics('figures/02-04-text-support-img/florence-architecture.png')
```
(ref:florence-architecture) Modular architecture of Florence.
### Performance comparison {#performanceComp}
From the papers of @radford2021learning, @jia2021scaling and @yuan2021florence, we were able to collect three tables with reported performance measures to compare these approaches.
Table \@ref(fig:table1) summarizes the zero-shot accuracies on four different ImageNet variants.
Unfortunately, @yuan2021florence only state their performance on the original ImageNet, where they beat CLIP and ALIGN by a margin of 7.3pp.
The results on the other three ImageNet variants are mixed, and there is no clear winner between CLIP and ALIGN.
```{r table1, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:table1)'}
knitr::include_graphics('figures/02-04-text-support-img/table-imagenet.png')
```
(ref:table1) Top-1 Accuracy of zero-shot transfer of models to image classification on ImageNet and its variants.
Table \@ref(fig:table2) concerns zero-shot image retrieval on the Flickr30K and the MSCOCO dataset (see chapter \@ref(c01-03-benchmarks)).
Even though there are not many major score differences, there is a clear ranking, with CLIP in third, ALIGN in second and Florence in first place.
```{r table2, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:table2)'}
knitr::include_graphics('figures/02-04-text-support-img/table-img-txt-retrieval.png')
```
(ref:table2) Zero-shot image and text retrieval [@yuan2021florence].
The most comprehensive comparison is shown in table \@ref(fig:table3).
It depicts the accuracy of zero-shot CLIP and Florence on various datasets as well as the scores of all three models fine-tuned to the respective datasets.
Florence beats CLIP in nearly all evaluations, in the zero-shot setting as well as after fine-tuning.
@jia2021scaling only report results on four of these twelve datasets, where they win half of the time.
```{r table3, echo=FALSE, out.width='100%', fig.align='center', fig.cap='(ref:table3)'}
knitr::include_graphics('figures/02-04-text-support-img/table-misc-datasets.png')
```
(ref:table3) Top-1 Accuracy of CLIP, Florence and ALIGN on various datasets.
Summing up, ALIGN achieves its goal of replicating CLIP's impressive performance while dramatically reducing the required data-curation effort, and Florence shows the best overall performance.
This could be attributed to its custom loss, to @yuan2021florence striking the best balance between sample size and data curation, or to Florence having the best sub-networks; or to a combination of all three.
Once again note that none of the training datasets were made publicly available.
It cannot be guaranteed that all benchmarks were evaluated on unseen datasets.
### Resources
The pre-trained CLIP models can be accessed on [GitHub](https://github.com/openai/CLIP) and have even found their way into simple command-line tools already.
For example, there is a CLI named [rclip](https://github.com/yurijmikhalevich/rclip), which wraps the _ViT-B/32_ CLIP architecture and can be used for personal image retrieval.
On a regular mid-range laptop, we were able to find seemingly good matches for the search terms we tried in a folder containing about 100 different pictures.
After an initial caching step, one request took about ten seconds.
Furthermore, CLIP continues to be used inside new models, e.g., DALL$\cdot$E 2, where it is used for the image embedding [@ramesh2022hierarchical].
Also, there is a crowd-sourced effort to replicate CLIP's training dataset, called LAION-400M [@schuhmann2022laion].
To validate the image-text pairs collected for it, their cosine similarity is computed using CLIP and instances with too low a value are discarded.
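A minimal sketch of such a filter, again using the CLIP package, could look as follows. The similarity threshold of 0.3 corresponds to the value reported for LAION-400M, but the code should be read as an illustration rather than the project's actual pipeline.

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def keep_pair(image_path, alt_text, threshold=0.3):
    """Keep an image-text pair only if its CLIP cosine similarity exceeds the threshold."""
    with torch.no_grad():
        img = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        txt = clip.tokenize([alt_text], truncate=True).to(device)  # truncate overly long alt-texts
        img_emb = F.normalize(model.encode_image(img), dim=-1)
        txt_emb = F.normalize(model.encode_text(txt), dim=-1)
        return (img_emb @ txt_emb.T).item() > threshold
```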
To our knowledge, no resources were open-sourced as part of the other two papers, ALIGN and Florence.