## State-of-the-art in Computer Vision {#c01-02-sota-cv}
*Author: Vladana Djakovic*
*Supervisor: Daniel Schalk*
### History
The first research on visual perception came from neurophysiological studies performed on cats in the 1950s and 1960s. The researchers used cats as a model to understand how human vision works. They concluded that human vision is hierarchical: neurons detect simple features like edges, followed by more complex features like shapes, and eventually even more complex visual representations. Inspired by this knowledge, computer scientists focused on recreating human neurological structures.
At around the same time, as computers became more advanced, computer scientists worked on imitating the behavior of human neurons and simulating a hypothetical neural network. In his book "The Organization of Behavior" (1949), Donald Hebb stated that neural pathways strengthen over each successive use, especially between neurons that tend to fire at the same time, thus beginning the long journey towards quantifying the complex processes of the brain. The first Hebbian network, inspired by this neurological research, was successfully implemented at MIT in 1954 [@history1].
These new findings led to the establishment of the field of artificial intelligence in 1956 at Dartmouth College. Scientists began to develop ideas and investigate techniques that would imitate the human eye.
In 1959, early research on developing neural networks was performed at Stanford University, where models called "ADALINE" and "MADALINE" (Multiple ADAptive LINear Elements) were developed. Those models aimed to recognize binary patterns and could predict the next bit [@history2].
The initial optimism about Computer Vision and neural networks faded after 1969, when Marvin Minsky, founder of the MIT AI Lab, published the book "Perceptrons" and argued that the single-perceptron approach could not be translated effectively into multi-layered neural networks. The period that followed became known as the AI Winter and lasted until around 2010, when powerful computers and the internet had become widely available. In 2012 a breakthrough in Computer Vision happened at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). A team from the University of Toronto introduced a deep neural network called AlexNet [@alexnet] that changed the fields of artificial intelligence and Computer Vision (CV). AlexNet achieved an error rate of 16.4%.
From then until today, Computer Vision has been one of the fastest-developing fields. Researchers compete to develop models that come ever closer to the capabilities of the human eye and help humans in their everyday life. In this chapter the author describes only a few recent state-of-the-art models.
### Supervised and unsupervised learning
As part of artificial intelligence (AI) and machine learning (ML), there are two basic approaches:
* supervised learning;
* unsupervised learning.
Supervised learning [@supervised] is used to train algorithms that classify data or predict outcomes accurately, using labeled datasets. With labeled data, the model can measure its accuracy and learn over time. Among others, we can distinguish between two common supervised learning problems:
* classification,
* regression.
In unsupervised learning [@unsupervised], unlabeled datasets are analyzed and clustered using machine learning algorithms. These algorithms aim to discover hidden patterns or data groupings without previous human intervention. The ability to find similarities and differences in information is used mainly for three tasks:
* clustering,
* association,
* dimensionality reduction.
Solving problems where the dataset is partly labeled and partly unlabeled requires a semi-supervised approach that lies between supervised and unsupervised learning. It is useful for extracting relevant features from complex and high-volume data, e.g., medical images.
More recently, a new research topic has appeared in the machine learning community: self-supervised learning. Self-supervised learning is a process where the model trains itself to learn one part of the input from another [@selfsup]. As a subset of unsupervised learning, it involves machines labeling, categorizing, and analyzing information independently and drawing conclusions based on connections and correlations. It can also be considered an autonomous form of supervised learning since it does not require human input to label data. Unlike unsupervised learning, self-supervised learning does not focus on clustering or grouping [@selfsup2]. One branch of self-supervised learning is contrastive learning, which learns the general features of an unlabeled dataset by identifying similar and dissimilar data points. It trains the model to learn about the data without any annotations or labels [@contrastive].
### Scaling networks
Ever since the introduction of AlexNet in 2012, scaling convolutional neural networks (ConvNets) has been a topic of active research. A ConvNet can be scaled in three dimensions: depth, width, or image resolution. One of the first studies, in 2015, showed that network depth is crucial for image classification. The question of whether stacking more layers enables the network to learn better led to deep residual networks called ResNet [@ResNet], which will be described in this work. Later on, scaling networks by their depth became the most popular way to improve their performance.
The second solution was to scale ConvNets by their width. Wider networks tend to be able to capture more fine-grained features and are easier to train [@width].
Lastly, scaling the image resolution can improve the network's performance. With higher-resolution input images, ConvNets can capture more fine-grained patterns. GPipe [@gpipe] is one of the most famous networks created with this technique.
The question of whether it is possible to scale along all three dimensions at once was answered by @EfficientNet in the work presenting EfficientNet. This network was built by scaling up ConvNets in all three dimensions and will also be described here.
### Deep residual networks
Deep residual networks, called ResNets [@ResNet], were presented as an answer to the question of whether stacking more layers would enable a network to learn better. Until then, one obstacle to simply stacking layers was the problem of vanishing/exploding gradients. It has largely been addressed by normalized initialization and intermediate normalization layers, which enabled networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.
Another obstacle was the degradation problem: as the network depth increases, accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, which indicates that not all systems are similarly easy to optimize.
To illustrate this, consider a shallower architecture and its deeper counterpart that adds more layers. A constructed solution for the deeper model exists: the added layers are identity mappings and the other layers are copied from the shallower model. This constructed deeper model should produce no higher training error than its shallower counterpart. In practice, however, solvers fail to find solutions that are comparably good or better. The solution to this degradation problem proposed by the authors is a deep residual learning framework.
#### Deep Residual Learning
##### Residual Learning
The idea of residual learning is to replace the approximation of the underlying mapping $H\left( x\right)$, which is approximated by a few stacked layers (not necessarily the entire net), with an approximation of the residual function $F(x):= H\left( x \right) - x$. Here $x$ denotes the input to the first of these layers, and it is assumed that inputs and outputs have the same dimensions. The original mapping is then recast as $F\left( x \right)+x$.
A counter-intuitive phenomenon about degradation motivated this reformulation. The deeper model should not have a higher training error than its shallower counterpart when the added layers are constructed as identity mappings. However, due to the degradation problem, solvers may have difficulty approximating identity mappings with multiple non-linear layers. With the residual learning reformulation, the solver can simply drive the weights of the non-linear layers toward zero to approach identity mappings, if they are optimal.
In general, identity mappings are not optimal, but the reformulation may help to pre-condition the problem. When the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find perturbations with reference to an identity mapping than to learn the function from scratch.
##### Identity Mapping by Shortcuts
Residual learning is applied to every few stacked layers, where a building block is defined as:
\begin{equation}
(\#eq:ch01-02-01)
y = F \left( x,\left\{ W_i\right\} \right) + x
\end{equation}
$x$ and $y$ denote the input and output vectors of the layers. Figure \@ref(fig:ch01-figure01) visualizes the building block.
```{r ch01-figure01, echo=FALSE, out.width="35%", fig.cap="(ref:ch01-figure01)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/resnetBlock.png")
```
(ref:ch01-figure01) Building block of residual learning [@ResNet].
The function $F \left( x,\left\{ W_i\right\} \right)$ represents the residual mapping that is to be learned. For the example with two layers from Figure \@ref(fig:ch01-figure01), $F = W_2\sigma\left( W_1x\right)$, in which $\sigma$ denotes the ReLU activation function. Biases are left out to simplify the notation. The operation $F + x$ is conducted with a shortcut connection and element-wise addition. Afterward, a second non-linearity (i.e., $\sigma \left( y \right)$) is applied.
The shortcut connections in Equation \@ref(eq:ch01-02-01) neither add extra parameters nor increase computational complexity. This enables a fair comparison between plain and residual networks that have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
The dimensions of $x$ and $F$ in Equation \@ref(eq:ch01-02-01) must be equal. Alternatively, to match the dimensions, a linear projection $W_s$ can be applied by the shortcut connection:
\begin{equation}
(\#eq:ch01-02-02)
y = F \left( x,\left\{ W_i\right\} \right)+ W_sx.
\end{equation}
A square matrix $W_s$ could also be used in Equation \@ref(eq:ch01-02-02). However, experiments showed that the identity mapping is sufficient to solve the degradation problem, so $W_s$ is only used to match dimensions. Although more layers are possible, the experiments used a function $F$ with two or three layers without fixing its exact form. If $F$ had only a single layer, Equation \@ref(eq:ch01-02-01) would be comparable to a linear layer, $y = W_1 x + x$, for which no advantage is expected. The notation above refers to fully-connected layers, but convolutional layers were used as well; in that case $F \left( x,\left\{ W_i\right\} \right)$ represents multiple convolutional layers and the two feature maps are added element-wise, channel by channel.
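As a minimal illustration of the building block, the following base R sketch implements Equation \@ref(eq:ch01-02-01) with a two-layer residual function $F = W_2\sigma\left(W_1 x\right)$. The dimensions, random weights, and function names are illustrative choices and not taken from the ResNet implementation.

```{r, eval=FALSE}
relu <- function(v) pmax(v, 0)

# Residual building block: y = F(x, {W_i}) + x with F = W2 * relu(W1 * x).
# Biases are omitted, as in the text above.
residual_block <- function(x, W1, W2) {
  Fx <- W2 %*% relu(W1 %*% x)  # residual mapping F(x)
  relu(Fx + x)                 # identity shortcut plus second non-linearity
}

set.seed(1)
d  <- 4                              # input/output dimension (identical, so the identity shortcut applies)
x  <- matrix(rnorm(d), ncol = 1)
W1 <- matrix(rnorm(d * d), nrow = d)
W2 <- matrix(rnorm(d * d), nrow = d)
y  <- residual_block(x, W1, W2)

# If the dimensions of F(x) and x differed, a projection Ws %*% x would replace x
# in the addition, as in the projection shortcut variant.
```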
#### Network Architectures
Various plain/residual networks were tested to construct an efficient residual network. The networks were trained on benchmark datasets, e.g. the ImageNet dataset, that are used for comparing network architectures. Figure \@ref(fig:ch01-figure02) shows that every residual network needs a plain baseline network, inspired by the VGG [@vgg] network, to which identity mapping by shortcuts is applied.
*Plain Network:* The plain baselines are mainly inspired by the philosophy of VGG nets. The convolutional layers mostly have $3\times 3$ filters and follow two rules:
* layers producing feature maps of the same size have the same number of filters;
* reducing the size of a feature map by half doubles the number of filters per layer to maintain time complexity per layer.
Convolutional layers with a stride of 2 perform downsampling directly. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The number of weighted layers sums up to 34 (Figure \@ref(fig:ch01-figure02), middle). Compared to VGG nets (Figure \@ref(fig:ch01-figure02), left), this model has fewer filters and lower complexity.
*Residual Network:* Based on the above plain network, additional shortcut connections (Figure \@ref(fig:ch01-figure02), right) turn the network into its residual counterpart. The identity shortcuts (Equation \@ref(eq:ch01-02-01)) can be used directly when input and output have the same dimensions (solid line shortcuts in Figure \@ref(fig:ch01-figure02)). When the dimensions increase (dotted line shortcuts in Figure \@ref(fig:ch01-figure02)), two options are considered:
* The shortcut still performs identity mapping, but with extra zero entries padded to cope with the increasing dimensions, without adding new parameters;
* The projection shortcut in Equation \@ref(eq:ch01-02-02) is used to match dimensions (done by $1\times 1$ convolutions).
In both cases, the shortcuts are performed with a stride of 2 when they go across feature maps of two different sizes.
```{r ch01-figure02, echo=FALSE, out.width="100%", fig.cap="(ref:ch01-figure02)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/ResNet_architecture.png")
```
(ref:ch01-figure02) Architecture of ResNet [@ResNet].
### EfficientNet
Until @efficient introduced EfficientNet, it was common to scale only one of the three dimensions: depth, width, or image size. Their empirical study shows that it is critical to balance all network dimensions, which can be achieved by simply scaling each of them with a constant ratio. Based on this observation, a simple yet effective compound scaling method was proposed, which uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if $2^N$ times more computational resources are available, the network depth can be increased by $\alpha^N$, the width by $\beta^N$, and the image size by $\gamma^N$, where $\alpha,\beta,\gamma$ are constant coefficients determined by a small grid search on the original miniature model. Figure \@ref(fig:ch01-figure03) illustrates the difference between this scaling method and conventional methods. Intuitively, compound scaling makes sense because a bigger input image requires more layers to increase the receptive field and more channels to capture the finer-grained patterns of the bigger image. Theoretically and empirically, a special relationship between network width and depth has been established [@depthwidth]. Existing MobileNets [@mobilenet] and ResNets are used to demonstrate the new scaling method.
```{r ch01-figure03, echo=FALSE, out.width="100%", fig.cap="(ref:ch01-figure03)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/Model_scaling.png")
```
(ref:ch01-figure03) Model scaling [@efficient].
#### Compound Model Scaling
##### Problem Formulation
A function $Y_i = \mathcal{F}_i \left( X_i \right)$ with operator $\mathcal{F}_i$, output tensor $Y_i$, and input tensor $X_i$ of shape $\left( H_i, W_i, C_i \right)$, where $H_i$ and $W_i$ are the spatial dimensions and $C_i$ is the channel dimension, is called ConvNet layer $i$. A ConvNet $\mathcal{N}$ can then be written as a composition of layers:
$$
\mathcal{N}=\mathcal{F}_{k}\odot \cdots \odot\mathcal{F}_{2}\odot\mathcal{F}_{1}\left( X_1 \right)=\bigodot_{j=1\cdots k}\mathcal{F}_{j}\left( X_1 \right)
$$
Effectively, these layers are often partitioned into multiple stages and all layers in each stage share the same architecture. For example, ResNet has five stages with all layers in every stage being the same convolutional type except for the first layer that performs down-sampling. Therefore, a ConvNet can be defined as:
$$
\mathcal{N}=\bigodot_{i=1\cdots s}\mathcal{F_i}^{L_i}\left( X_{\left( H_i, W_i, C_i \right)} \right)
$$
where $\mathcal{F_i}^{L_i}$ denotes layer $\mathcal{F_i}$ which is repeated $L_i$ times in stage $i$, and $\left( H_i, W_i, C_i \right)$ is the shape of input tensor $X$ of layer $i$.
In contrast to regular ConvNet design, which focuses on finding the best layer architecture $\mathcal{F_i}$, model scaling expands the network length $\left( L_i\right)$, width $\left( C_i \right)$, and/or resolution $\left( H_i, W_i\right)$ without changing the $\mathcal{F_i}$ predefined in the baseline network. Although fixing $\mathcal{F_i}$ simplifies the design problem for new resource constraints, a large design space $\left( L_i, H_i, W_i, C_i \right)$ for each layer still remains to be explored. To further reduce the design space, all layers are restricted to being scaled uniformly with a constant ratio. The goal is then to maximize the model's accuracy for any given resource constraint, which can be formulated as an optimization problem:
\begin{align*}
\max_{d,w,r} \quad & \text{Accuracy} \left( \mathcal{N}\left( d,w,r \right) \right) \\
\text{s.t.} \quad \mathcal{N}\left( d,w,r \right) &=\bigodot_{i=1\cdots s}\hat{\mathcal{F}}_{i}^{d\cdot \hat{L}_{i}}\left( X_{\left\langle r\cdot \hat{H}_{i},\, r\cdot \hat{W}_{i},\, w\cdot \hat{C}_{i}\right\rangle} \right) \\
\text{Memory}\left( \mathcal{N} \right) &\leq \text{target memory} \\
\text{FLOPS}\left( \mathcal{N} \right) &\leq \text{target FLOPS}
\end{align*}
where $w,d,r$ are coefficients for scaling the network width, depth, and resolution, and $\left(\hat{\mathcal{F}}_i, \hat{L}_i, \hat{H}_i, \hat{W}_i, \hat{C}_i \right)$ are the predefined parameters of the baseline network.
##### Scaling Dimensions
The main difficulty of this optimization problem is that the optimal $d, w, r$ depend on each other and change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in only one of these dimensions:
**Depth ($d$):** One of the most significant networks previously described is ResNet. As described above, the problem with ResNets is that the accuracy gain of a very deep network diminishes: ResNet-1000 has accuracy similar to ResNet-101 even though it contains many more layers.
**Width ($w$):** Scaling network width is commonly used for small-sized models. However, wide but shallow networks tend to have difficulty grasping higher-level features.
**Resolution ($r$):** Starting from $224\times 224$ in early ConvNets, modern ConvNets tend to use $299\times 299$ or $331\times 331$ for better accuracy. GPipe [@gpipe] recently achieved state-of-the-art ImageNet accuracy with $480\times 480$ resolution. Higher resolutions, such as $600\times 600$, are also widely used in ConvNets for object detection.
The above analyses lead to the first observation:
**Observation 1:** Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.
##### Compound Scaling
Firstly, it was observed that the different scaling dimensions are not independent, because higher-resolution images also require increasing the network depth, such that the larger receptive fields can capture similar features that span more pixels in bigger images. Similarly, the network width should be increased when the resolution is higher in order to capture more fine-grained patterns. This intuition suggests that the different scaling dimensions should be coordinated and balanced rather than scaled in a single dimension, as done conventionally.
To confirm this intuition, networks in which only the width $w$ was scaled while depth ($d$=1.0) and resolution ($r$=1.0) were kept fixed were compared with networks using a deeper ($d$=2.0) and higher-resolution ($r$=2.0) baseline. Width scaling on the deeper, higher-resolution baseline achieved much better accuracy under the same FLOPS cost. These results lead to the second observation:
**Observation 2:** To achieve better accuracy and efficiency, it is critical to balance the network width, depth, and resolution during ConvNet scaling. Earlier work had already tried to balance network width and depth arbitrarily, but it required tedious manual tuning.
A new **compound scaling method** was proposed, which uses a compound coefficient $\varphi$ to uniformly scale network width, depth, and resolution in a principled way:
\begin{align}
\begin{split}
\text{depth: } & d=\alpha^{\varphi} \\
\text{width: } & w=\beta^{\varphi}\\
\text{resolution: } & r=\gamma^{\varphi}\\
\text{s.t. } & \alpha\cdot \beta^{2}\cdot \gamma^{2}\approx 2\\
& \alpha \ge 1, \beta \ge 1, \gamma \ge 1
\end{split}
(\#eq:01-02-06)
\end{align}
where $\alpha, \beta, \gamma$ are constants that can be determined by a small grid search and $\varphi$ is a user-specified coefficient that controls how many more resources are available for model scaling, while $\alpha, \beta, \gamma$ specify how to assign these extra resources to the network depth, width, and resolution, respectively. Notably, the FLOPS of a regular convolution operation are proportional to $d$, $w^{2}$, and $r^{2}$, i.e., doubling the network depth doubles the FLOPS, but doubling the network width or resolution increases the FLOPS by a factor of four. Scaling a ConvNet following Equation \@ref(eq:01-02-06) therefore increases the total number of FLOPS by approximately $\left( \alpha\cdot \beta^{2}\cdot \gamma^{2} \right)^{\varphi}$. In this chapter, $\alpha\cdot \beta^{2}\cdot \gamma^{2}\approx 2$ is imposed so that for any new $\varphi$ the total number of FLOPS increases by approximately $2^{\varphi}$.
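As a small numeric illustration of Equation \@ref(eq:01-02-06), the base R snippet below computes the scaling factors and the approximate FLOPS multiplier for a given $\varphi$, using the $\alpha, \beta, \gamma$ values reported for EfficientNet-B0 as defaults; the function name and the output format are our own.

```{r, eval=FALSE}
# Compound scaling: depth d = alpha^phi, width w = beta^phi, resolution r = gamma^phi.
# With alpha * beta^2 * gamma^2 ~ 2, the FLOPS grow roughly by 2^phi.
compound_scaling <- function(phi, alpha = 1.2, beta = 1.1, gamma = 1.15) {
  c(depth      = alpha^phi,
    width      = beta^phi,
    resolution = gamma^phi,
    flops_mult = (alpha * beta^2 * gamma^2)^phi)
}

compound_scaling(phi = 1)  # first scaling step beyond the B0 baseline
compound_scaling(phi = 3)  # FLOPS increase by roughly 2^3 = 8 times
```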
#### EfficientNet Architecture
A good baseline network is essential because model scaling does not change its layer operators $\hat{\mathcal{F}}_i$. The scaling method was therefore also evaluated on existing ConvNets such as MobileNets and ResNets.
A new mobile-sized baseline called EfficientNet was developed to show the effectiveness of the new scaling method. The metrics used to evaluate its efficacy are accuracy and FLOPS.
The baseline efficient network that was created is named EfficientNet-B0. Afterwards, this compound scaling method is applied in two steps:
* **STEP 1**: Fixing $\varphi = 1$ and assuming twice as many resources to be available, a small grid search of $\alpha, \beta, \gamma$ based on Equation \@ref(eq:01-02-06) showed that the best values for EfficientNet-B0 are $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ under the constraint $\alpha\cdot\beta^2\cdot\gamma^2 \approx 2$.
* **STEP 2**: Afterwards, $\alpha,\beta,\gamma$ are fixed as constants and the baseline network is scaled up with different $\varphi$ using Equation \@ref(eq:01-02-06) to construct EfficientNet-B1 to B7.
| Name | Number of parameters|
|:---------------:|:--------:|
| EfficientNet-B0 | 5.3M parameters |
| EfficientNet-B1 | 7.8M parameters |
| EfficientNet-B2 | 9.2M parameters |
| EfficientNet-B3 | 12M parameters |
| EfficientNet-B4 | 19M parameters |
| EfficientNet-B5 | 30M parameters |
| EfficientNet-B6 | 43M parameters |
| EfficientNet-B7 | 66M parameters |
Even better performance is achievable by searching for $\alpha,\beta,\gamma$ directly around a large model, but the search cost becomes prohibitively expensive for larger models. This method searches only once on the small baseline network and then uses the same scaling coefficients for all other models.
#### Results and comparison of the networks
To demonstrate the performance of both network families, ResNets and EfficientNets were trained and evaluated on the ImageNet 2012 classification dataset consisting of 1000 classes. Since deeper scaling should provide better results in the case of ResNet, it was trained with increasing depth. The first meaningful results were obtained with ResNet-34, which performed 3.5% better than the plain-34 baseline in terms of top-1 accuracy. Three versions of ResNet were also compared: (A) zero-padding shortcuts for increasing dimensions, with all shortcuts parameter-free, (B) projection shortcuts for increasing dimensions, with all other shortcuts being identities, and (C) all shortcuts as projections. Each version improved both the top-1 and top-5 accuracy. Afterward, the depth of the network was increased further, creating ResNet-50, ResNet-101, and ResNet-152. Each increase in depth led to higher accuracy. For even deeper models, the gain in accuracy no longer justifies the additional depth. All results are shown in the following table:
| Model | top-1 acc.|top-5 acc.|
|:----------:|:-----------:|:---------:|
| VGG-16 | 71.93 | 90.67|
| GoogLeNet | - | 90.85|
| plain-34 | 71.46 |89.98|
| ResNet-34 A | 74.97 |92.24|
| ResNet-34 B | 75.48 |92.54|
| ResNet-34 C | 75.81 |92.6|
| ResNet-50 | 77.15 |93.29|
| ResNet-101 | 78.25 |93.95|
| ResNet-152 | **78.57** |**94.29**|
In the case of EfficientNets, the aim was to improve on the results achieved by previous state-of-the-art networks on the same ImageNet dataset. Among all state-of-the-art networks, EfficientNets were compared with ResNet-50 and ResNet-152. The variants obtained by changing the scaling parameters, EfficientNet-B0 to EfficientNet-B7, were compared, and each network achieved better results than the previous one. It was also shown that EfficientNet-B0 outperforms ResNet-50 and that EfficientNet-B1 outperforms ResNet-152. This means that scaling in all three dimensions can provide better results than scaling in just one dimension. The drawback of this approach is the required computational power, which makes it less popular than the previous methods. Again, all results are shown in the following table:
| Model | top-1 acc.|top-5 acc.|
|:----------:|:-----------:|:---------:|
| EfficientNet-B0 / ResNet-50 | 77.1 / 76| 93.3 / 93 |
| EfficientNet-B1 / ResNet-152 | 79.1 / 77.8 | 94.4 / 93.8 |
| EfficientNet-B2 | 80.1 |94.9|
| EfficientNet-B3 / ResNeXt-101| 81.6 / 80.9 |95.7 / 95.6 |
| EfficientNet-B4 | 82.9 |96.4|
| EfficientNet-B5 | 83.6 |96.7|
| EfficientNet-B6 | 84 |96.8|
| EfficientNet-B7 / GPipe| **84.3** / 84.3 |**97** / 97 |
### Contrastive learning
In recent years, the problem of classifying unlabeled datasets has become more widespread. More and more unlabeled datasets that would require human labeling are created in fields like medicine, the automotive industry, the military, etc. Since labeling is expensive and time-consuming, researchers sought to automate it with contrastive learning frameworks. One of the first and best-known contrastive learning frameworks is SimCLR [@SimCLR]. The advantage of this framework is its simplicity, yet it achieves high accuracy on classification tasks. The main idea is to create two augmented views of each image, pass them through the network, and compare the resulting representations. The problem with this framework is that it doubles the size of the dataset and compares each view against all others, which can be computationally infeasible for large datasets. Bootstrap Your Own Latent [@BYOL] was introduced to avoid building such doubled datasets; the idea was to bootstrap image representations and avoid unnecessary image comparisons. These two frameworks will be described in this chapter.
Further improvements in how the two views are created and compared were presented in frameworks such as Nearest-Neighbor Contrastive Learning (NNCLR) [@NNCLR], Open World Object Detection (ORE) [@ORE], Swapping Assignments between multiple Views (SwAV) [@SwAV], and many more.
This field is an area of constant research, and new, improved frameworks are proposed regularly to help researchers solve different tasks that require labeled datasets.
#### A Simple Framework for Contrastive Learning of Visual Representations
@SimCLR intended to analyze and describe a better approach to learning visual representations without human supervision. They introduced a simple framework for contrastive learning of visual representations called SimCLR. As they claim, SimCLR outperforms previous work, is more straightforward, and does not require a memory bank.
To understand what enables good contrastive representation learning, the major components of the framework were studied, with the following findings:
* A contrastive prediction task requires combining multiple data augmentation operations to yield effective representations. Unsupervised contrastive learning benefits from stronger data augmentation.
* The quality of the learned representations can be substantially improved by introducing a learn-able non-linear transformation between the representation and the contrastive loss.
* Representation learning with contrastive cross-entropy loss can be improved by normalizing embeddings and adjusting the temperature parameter appropriately.
* Unlike its supervised counterpart, contrastive learning benefits from larger batch sizes and extended training periods. Contrastive learning also benefits from deeper and broader networks, just as supervised learning does.
#### The Contrastive Learning Framework
In SimCLR, a contrastive loss is used to learn representations by maximizing the agreement between differently augmented views of the same data example. The framework contains four major components, which are shown in Figure \@ref(fig:ch01-figure04):
1. A stochastic *data augmentation* module
2. A neural network *base encoder*
3. A small neural network *projection head*
4. A *contrastive loss function*
```{r ch01-figure04, echo=FALSE, out.width="30%", fig.cap="(ref:ch01-figure04)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/SimCLR.png")
```
(ref:ch01-figure04) A simple framework for contrastive learning of visual representations [@SimCLR].
##### Stochastic data augmentation module
First, a minibatch of $N$ examples is sampled randomly, and the contrastive prediction task is defined on the pairs of augmented examples derived from it, resulting in $2N$ data points. A memory bank is not used to train the model; instead, the training batch size varies from 256 to 8192. Each data example randomly yields two correlated views of the same example, denoted $\tilde{x}_{i}$ and $\tilde{x}_{j}$, which are known as a **positive pair**. **Negative pairs** are all of the other $2(N-1)$ pairs. The views are created by applying data augmentation techniques. Data augmentation is widely embraced in supervised and unsupervised representation learning. However, it had not been used systematically to define the contrastive prediction task, which previous approaches mainly defined by changing the architecture. It was shown that choosing suitable data augmentation techniques can reduce the complexity of previous contrastive learning frameworks. Among the many data augmentation operations, the focus was on the most common ones:
* **Spatial geometric transformation**: cropping and resizing (with horizontal flipping), rotation and cutout,
* **Appearance transformation**: color distortion (including color dropping), brightness, contrast, saturation, Gaussian blur, and Sobel filtering.
```{r ch01-figure05, echo=FALSE, out.width="80%", fig.cap="(ref:ch01-figure05)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/augmentation.png")
```
(ref:ch01-figure05) Augmentation techniques [@SimCLR].
Due to the varying image sizes in the ImageNet dataset, all images were always randomly cropped and resized to the same resolution first. Other targeted data augmentation transformations were then applied to only one branch, leaving the other branch as the original, i.e. $t\left( x_{i}\right)= x_i$.
Applying just an individual transformation is insufficient for the model to learn good representations. The model's performance improves after composing augmentations, although the contrastive prediction task becomes more complex. The composition of augmentations that stood out was random cropping combined with random color distortion.
It was also observed that stronger color augmentation substantially improves the linear evaluation of unsupervised learned models, whereas it does not enhance the performance of supervised models trained with the same augmentations. Based on these experiments, unsupervised contrastive learning benefits from stronger color data augmentation than supervised learning.
##### Neural network base encoder
The neural network base encoder $f\left( \cdot \right)$ extracts representation vectors from the augmented data examples. The framework does not restrict the choice of network architecture; for simplicity, the commonly used ResNet was picked, giving $h_i=f\left( \tilde{x}_{i} \right)=ResNet\left(\tilde{x}_{i}\right)$, where $\textbf{h}_i\in \mathbb{R}^{d}$ is the output after the average pooling layer. Although increasing depth and width improves performance, ResNet-50 was chosen as the default. Furthermore, when the model size increases, the gap between supervised and unsupervised learning shrinks, suggesting that bigger models benefit more from unsupervised learning.
##### Small neural network projection head
A small neural network projection head $g\left( \cdot \right)$ maps the representation to the space where the contrastive loss is applied. The importance of including a projection head, i.e., $g\left( h \right)$, was evaluated by considering three different architectures for the head:
1. identity mapping,
2. linear projection,
3. the default non-linear projection with one additional hidden layer and ReLU activation function.
The results showed that a non-linear projection head is better than a linear projection and much better than no projection at all. It improves the quality of the representation in the layer before it. An MLP with one hidden layer was used to obtain $z_i = g\left( \textbf{h}_i \right) = W^{\left( 2\right)}\sigma \left( W^{\left( 1\right)} \textbf{h}_i\right)$, where $\sigma$ is the ReLU non-linearity.
The reason is that the contrastive loss is defined on the $z_i$ rather than on the $\textbf{h}_i$, so any information discarded because of the contrastive objective is removed from $z_i$ rather than from $\textbf{h}_i$. In particular, $z=g\left( h \right)$ is trained to be invariant to data transformations, so $g$ can remove information that may be useful for a downstream task, such as object color or orientation. Thanks to the non-linear transformation $g\left( \cdot \right)$, more information can be formed and maintained in $h$.
##### Contrastive loss function
Given a set $\left\{ \tilde{x}_{k} \right\}$ including a positive pair of examples $\tilde{x}_{i}$ and $\tilde{x}_{j}$, the contrastive prediction task aims to identify $\tilde{x}_{j}$ in $\left\{ \tilde{x}_{k} \right\}_{k\neq i}$ for a given $\tilde{x}_{i}$. For a positive pair of examples $\left( i,j \right)$, the loss function is defined as
$$
\ell_{i,j} = -\log\frac{\exp\left( sim(z_i,z_j)/\tau \right)}{\sum_{k=1}^{2N}\mathbb{I}_{\left[ k\neq i \right]}\exp\left( sim(z_i,z_k)/\tau \right)}
$$
where $\mathbb{I}_{\left[ k\neq i \right]}\in\left\{ 0,1 \right\}$ is an indicator function, $\tau$ denotes a temperature parameter, and $sim\left(\textbf{u},\textbf{v} \right)= \frac{\textbf{u}^T\textbf{v}}{\left\| \textbf{u}\right\|\left\| \textbf{v} \right\|}$ is the cosine similarity, i.e., the dot product between the $\ell_2$-normalized vectors $\textbf{u}$ and $\textbf{v}$.
The final loss is calculated across all positive pairs, both $\left( i,j \right)$ and $\left( j,i \right)$, in a mini-batch. It was named **NT-Xent**, the normalized temperature-scaled cross-entropy loss.
The NT-Xent loss was compared against other commonly used contrastive loss functions, such as the logistic loss and the margin loss. Gradient analysis shows that $\ell_2$ normalization, cosine similarity, and temperature together effectively weight different examples, and a suitable temperature can help the model learn from hard negatives. The advantage of NT-Xent is that it weights the negatives by their relative hardness. Without $\ell_2$ normalization and proper temperature scaling, performance is significantly worse: the contrastive task accuracy is higher, but the resulting representation is worse under linear evaluation.
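To make the NT-Xent loss concrete, the following base R sketch evaluates $\ell_{i,j}$ for a toy batch of projections; the variable names and the random toy data are ours and not part of the SimCLR code.

```{r, eval=FALSE}
# NT-Xent loss for one positive pair (i, j) among 2N embeddings (rows of Z).
cosine_sim <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

nt_xent <- function(Z, i, j, tau = 0.5) {
  sims  <- apply(Z, 1, function(z_k) cosine_sim(Z[i, ], z_k))  # sim(z_i, z_k) for all k
  num   <- exp(sims[j] / tau)
  denom <- sum(exp(sims[-i] / tau))                            # indicator k != i
  -log(num / denom)
}

set.seed(1)
N <- 4                                        # 4 original images -> 8 augmented views
Z <- matrix(rnorm(2 * N * 16), nrow = 2 * N)  # toy 16-dimensional projections z_k
# The final loss averages l(i, j) over all positive pairs in both orders, e.g.:
nt_xent(Z, i = 1, j = 2) + nt_xent(Z, i = 2, j = 1)
```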
#### Bootstrap Your Own Latent
The fundamental idea of contrastive learning is to create pairs of images on which the framework is trained. Creating negative pairs relies on large batch sizes, memory banks, or customized mining strategies, which can be challenging for larger datasets. @BYOL wanted to create a new approach that would achieve better performance than other contrastive methods without using negative pairs. The solution they introduced is a method called Bootstrap Your Own Latent (BYOL). The idea is to bootstrap representations of images. As a result, BYOL is more robust to the choice of image augmentations.
Furthermore, BYOL uses two neural networks, called the online and the target network, which interact and learn from each other. Using an augmented view of an image, BYOL trains its online network to predict the target network's representation of another augmented view of the same image. This approach achieved state-of-the-art results when trained on the ImageNet dataset under the linear evaluation protocol. Additionally, compared to SimCLR, a strong contrastive baseline, BYOL suffers a much smaller performance drop when only random crops are used to augment images.
##### Description of the method
BYOL aims to learn a representation $y_\theta$. To achieve this, it uses two neural networks: the *online* network and the *target* network. The *online network* is determined by a set of weights $\theta$ and consists of:
* an encoder $f_\theta$,
* a projector $g_\theta$,
* a predictor $q_\theta$.
```{r ch01-figure06, echo=FALSE, out.width="80%", fig.cap="(ref:ch01-figure06)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/BYOL.png")
```
(ref:ch01-figure06) Bootstrap Your Own Latent [@BYOL].
The *target network* has the same architecture as the online network but uses different weights $\xi$. It provides the regression targets to train the online network, and its parameters $\xi$ are an exponential moving average of the online parameters $\theta$. Precisely, given a target decay rate $\tau \in[0,1]$, after each training step, the following update
$$
\xi \leftarrow \tau \xi+(1-\tau) \theta
$$
is performed.
Given a set of images $\mathcal{D}$ and two distributions of image augmentations $\mathcal{T}$ and $\mathcal{T}^{\prime}$, an image $x$ is first sampled uniformly from $\mathcal{D}$. BYOL then applies two image augmentations $t \sim \mathcal{T}$ and $t^{\prime} \sim \mathcal{T}^{\prime}$, creating two augmented views $v \triangleq t(x)$ and $v^{\prime} \triangleq t^{\prime}(x)$. The first augmented view $v$ is passed through the online network, resulting in the representation $y_{\theta} \triangleq f_{\theta}(v)$ and the projection $z_{\theta} \triangleq g_{\theta}(y_{\theta})$. Similarly, from the second augmented view $v^{\prime}$ the target network outputs the representation $y_{\xi}^{\prime} \triangleq f_{\xi}(v^{\prime})$ and the target projection $z_{\xi}^{\prime} \triangleq g_{\xi}(y^{\prime}_{\xi})$. The online network then outputs a prediction $q_{\theta}\left(z_{\theta}\right)$ of $z_{\xi}^{\prime}$, and both $q_{\theta}\left(z_{\theta}\right)$ and $z_{\xi}^{\prime}$ are $\ell_{2}$-normalized to
$$
\overline{q_{\theta}}\left(z_{\theta}\right) \triangleq q_{\theta}\left(z_{\theta}\right) /\left\|q_{\theta}\left(z_{\theta}\right)\right\|_{2} \quad \textrm{and} \quad
\bar{z}_{\xi}^{\prime} \triangleq z_{\xi}^{\prime} /\left\|z_{\xi}^{\prime}\right\|_{2}.
$$
The predictor is only applied to the online pipeline, making the architecture asymmetric between the online and target pipeline. Lastly, the following mean squared error between the normalized predictions and target projections is defined:
$$
\mathcal{L}_{\theta, \xi} \triangleq\left\|\overline{q_{\theta}}\left(z_{\theta}\right)-\bar{z}_{\xi}^{\prime}\right\|_{2}^{2}=2-2 \cdot \frac{\left\langle q_{\theta}\left(z_{\theta}\right), z_{\xi}^{\prime}\right\rangle}{\left\|q_{\theta}\left(z_{\theta}\right)\right\|_{2} \cdot\left\|z_{\xi}^{\prime}\right\|_{2}}
$$
The loss $\mathcal{L}_{\theta, \xi}$ is symmetrized by separately feeding $v^{\prime}$ to the online network and $v$ to the target network to compute $\widetilde{\mathcal{L}}_{\theta, \xi}$. At each training step, a stochastic optimization step is applied to minimize $\mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}=\mathcal{L}_{\theta, \xi}+\widetilde{\mathcal{L}}_{\theta, \xi}$ with respect to $\theta$ only, but not $\xi$. BYOL's dynamics are summarized as
$$
\theta \leftarrow \operatorname{optimizer}\left(\theta, \nabla_{\theta} \mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}, \eta\right).
$$
where $\eta$ is a learning rate. At the end of training, only the encoder $f_{\theta}$ is kept.
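The two computational ingredients of BYOL, the normalized prediction loss and the exponential moving average update of the target weights, can be sketched in base R as follows; the vectors and the decay rate are toy values chosen purely for illustration.

```{r, eval=FALSE}
# BYOL loss for one pair of views: squared distance between the l2-normalized
# online prediction q_theta(z_theta) and target projection z'_xi,
# which equals 2 - 2 * cosine similarity.
byol_loss <- function(q, z_target) {
  q_bar <- q / sqrt(sum(q^2))
  z_bar <- z_target / sqrt(sum(z_target^2))
  sum((q_bar - z_bar)^2)
}

# Exponential moving average update of the target parameters xi.
ema_update <- function(xi, theta, tau = 0.99) tau * xi + (1 - tau) * theta

set.seed(1)
q        <- rnorm(8)    # toy online prediction
z_target <- rnorm(8)    # toy target projection
byol_loss(q, z_target)

theta <- rnorm(100)     # toy online weights
xi    <- rnorm(100)     # toy target weights
xi    <- ema_update(xi, theta)
```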
#### Comparison of contrastive learning frameworks
Of all frameworks, SimCLR is the most popular due to its simplicity. ResNet-50 in three different hidden layer widths (width multipliers of $1\times$, $2\times$, and $4\times$) was used and trained for 1000 epochs. The accuracy of these frameworks on the ImageNet dataset improved as the width of ResNet-50 increased. For SimCLR with ResNet-50, top-1 accuracy is 69.3 and top-5 accuracy is 89.0, while for ResNet-50 (4x) top-1 accuracy is 76.5 and top-5 accuracy is 93.2. These results are comparable with supervised methods.
The BYOL framework was built to improve on the results of SimCLR. For the baseline ResNet-50, it achieves 74.3 top-1 accuracy and 91.6 top-5 accuracy. When using ResNet-50 (4x), accuracy increases to 78.6 and 94.2 for top-1 and top-5, respectively. More information about the performance can be found in the following table:
| Model | Architecture|Param (M)| top-1 acc.|top-5 acc.|
|:----------:|:-----------:|:-----------:|:-----------:|:---------:|
| SimCLR |ResNet-50|24| 69.3 | 89.0|
| SimCLR |ResNet-50 (2x) |94| 74.2 | 93.0|
| SimCLR |ResNet-50 (4x) |375| 76.5 |93.2|
| BYOL |ResNet-50|24| 74.3 |91.6|
| BYOL |ResNet-50 (2x) |94| 77.4 |93.6|
| BYOL |ResNet-50 (4x)|375| 78.6 |94.2|
| BYOL |ResNet-200 (2x)|250| 79.6 |94.8|
### Transformers in Computer Vision
Since the first appearance of the Transformer architecture in 2017 [@TRANSFORMERS_NLP], it has become an irreplaceable part of all natural language processing (NLP) models. The main advantage of Transformers is that they can be trained on a large text corpus and then fine-tuned on a smaller task-specific dataset. This made it possible to train models of unprecedented size, with more than 100B parameters.
However, computer vision still relied on convolutional architectures. With datasets constantly growing and computer vision tasks being applied in ever more fields, researchers wanted to bring the Transformer architecture into the CV field. Some works aimed at combining CNN-like architectures with self-attention [@wang]; others attempted to replace convolutions entirely, e.g. @selfa. Due to their specialized attention patterns, these approaches had not yet been scaled effectively on modern hardware accelerators. Therefore, in large-scale image recognition, classic ResNet-like architectures were still state-of-the-art.
In 2021, the Google Research Brain Team published the paper "An image is worth $16\times 16$ words", in which they introduced a new Transformer-based architecture for CV called the Vision Transformer (ViT) [@vit]. Building on the success of Transformer scaling in NLP, they aimed to apply a standard Transformer directly to images with as few changes as possible to the existing architecture. The image is split into patches, and linear embeddings of these patches are provided as input to the Transformer.
These patches are treated the same way as tokens (e.g. words) in NLP. The model is trained for image classification in a supervised fashion.
#### Vision Transformers
The Brain Team wanted to create a simple but universally scalable architecture that follows the original Transformer architecture as closely as possible.
```{r ch01-figure7, echo=FALSE, out.width="90%", fig.cap="(ref:ch01-figure7)", fig.align="center"}
knitr::include_graphics("./figures/01-chapter1/ViT.png")
```
(ref:ch01-figure7) Vision Transformer [@vit].
##### Method
Compared to NLP, where the Transformer input is a 1-dimensional sequence of token embeddings, images are 2-dimensional objects. Images therefore first need to be represented differently in order to imitate the original architecture as closely as possible. For that reason, the image $x\in \mathbb{R}^{ H \times W \times C}$ is reshaped into a sequence of flattened 2-dimensional patches $x_p\in \mathbb{R}^{ N \times \left( P^2 \cdot C \right)}$, where $\left(H,W\right)$ is the resolution of the original image, $C$ is the number of channels, $\left(P,P\right)$ is the resolution of each image patch, and $N =HW/P^2$ is the resulting number of patches, which is also the Transformer's effective input sequence length. The Transformer uses a constant latent vector size $D$ through all of its layers. Hence, the first step is to flatten the patches, usually of size $16\times 16$, and map them to $D$ dimensions with a trainable linear projection to create the patch embeddings.
$$
\mathbf{z}_{0} =\left[\mathbf{x}_{\text {class }} ; \mathbf{x}_{p}^{1} \mathbf{E} ; \mathbf{x}_{p}^{2} \mathbf{E} ; \cdots ; \mathbf{x}_{p}^{N} \mathbf{E}\right]+\mathbf{E}_{p o s}, \mathbf{E} \in \mathbb{R}^{\left(P^{2} \cdot C\right) \times D}, \mathbf{E}_{p o s} \in \mathbb{R}^{(N+1) \times D}
$$
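The patch embedding step can be illustrated in base R: the snippet below cuts an $H \times W \times C$ array into $N = HW/P^2$ flattened patches and maps them to $D$ dimensions with a single projection matrix. The class token and position embeddings from the equation above are omitted, and all sizes and names are toy choices rather than the actual ViT implementation.

```{r, eval=FALSE}
# Split an H x W x C image into N = (H/P) * (W/P) flattened patches of length P^2 * C.
extract_patches <- function(img, P) {
  H <- dim(img)[1]; W <- dim(img)[2]
  patches <- NULL
  for (i in seq(1, H, by = P)) {
    for (j in seq(1, W, by = P)) {
      patch   <- img[i:(i + P - 1), j:(j + P - 1), , drop = FALSE]
      patches <- rbind(patches, as.vector(patch))  # one flattened patch per row
    }
  }
  patches  # N x (P^2 * C)
}

set.seed(1)
H <- 32; W <- 32; C <- 3; P <- 16; D <- 64
img <- array(rnorm(H * W * C), dim = c(H, W, C))

x_p <- extract_patches(img, P)                     # here: 4 x 768
E   <- matrix(rnorm(P^2 * C * D), nrow = P^2 * C)  # trainable in the real model, random here
z   <- x_p %*% E                                   # N x D patch embeddings
```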
To this sequence of "patch embeddings", a learnable [class] token, like in BERT, is usually prepended. This token, $\mathbf{z}_{0}^{0} = \mathbf{x}_{class}$, serves as the representation used for classification and increases the sequence length by one. The state of this token at the output of the Transformer encoder, $\left(\mathbf{z}_{L}^{0}\right)$, to which layer normalization is applied, serves as the image representation $y$.
$$
\mathbf{y} =\operatorname{LN}\left(\mathbf{z}_{L}^{0}\right)
$$
Furthermore, it is the only token to which the classification head is attached during pre-training and fine-tuning. The classification head is implemented as an MLP with one hidden layer during pre-training and as a single linear layer at fine-tuning time. Standard learnable 1-dimensional position embeddings are added to the patch embeddings, and the resulting sequence serves as input to the encoder. The standard Transformer encoder consists of alternating layers of multiheaded self-attention (MSA) and MLP blocks, with a residual connection applied after each block.
$$
\mathbf{z}_{\ell}^{\prime} =\operatorname{MSA}\left(\operatorname{LN}\left(\mathbf{z}_{\ell-1}\right)\right)+\mathbf{z}_{\ell-1}, \ell=1 \ldots L
$$
$$
\mathbf{z}_{\ell} =\operatorname{MLP}\left(\mathrm{LN}\left(\mathbf{z}_{\ell}^{\prime}\right)\right)+\mathbf{z}_{\ell}^{\prime}, \ell=1 \ldots L
$$
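To show how the two encoder equations compose, here is a minimal base R sketch of one encoder block with single-head self-attention and a one-layer ReLU MLP. The real ViT uses multi-headed attention, a two-layer MLP, and trained LayerNorm parameters, so this is a simplified illustration under those stated simplifications.

```{r, eval=FALSE}
layer_norm   <- function(Z) t(apply(Z, 1, function(z) (z - mean(z)) / sd(z)))
softmax_rows <- function(M) t(apply(M, 1, function(m) exp(m - max(m)) / sum(exp(m - max(m)))))

# Single-head self-attention over a sequence of N token embeddings (rows of Z).
self_attention <- function(Z, Wq, Wk, Wv) {
  Q <- Z %*% Wq; K <- Z %*% Wk; V <- Z %*% Wv
  A <- softmax_rows(Q %*% t(K) / sqrt(ncol(K)))  # attention weights
  A %*% V
}

# One encoder block: z' = MSA(LN(z)) + z, then z_out = MLP(LN(z')) + z'.
encoder_block <- function(Z, Wq, Wk, Wv, W_mlp) {
  Zp <- self_attention(layer_norm(Z), Wq, Wk, Wv) + Z
  pmax(layer_norm(Zp) %*% W_mlp, 0) + Zp
}

set.seed(1)
N <- 5; D <- 8                             # e.g. 4 patches + class token, toy width
Z  <- matrix(rnorm(N * D), nrow = N)
Wq <- matrix(rnorm(D * D), nrow = D); Wk <- matrix(rnorm(D * D), nrow = D)
Wv <- matrix(rnorm(D * D), nrow = D); W_mlp <- matrix(rnorm(D * D), nrow = D)
Z1 <- encoder_block(Z, Wq, Wk, Wv, W_mlp)  # same shape as Z: N x D
```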
The Vision Transformer has a significantly lower image-specific inductive bias than CNNs. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. The 2-dimensional neighborhood structure is used sparingly: the image is cut into patches at the beginning, and the position embeddings are resized as needed at fine-tuning time. Alternatively, the input sequence can consist of a CNN's feature map on which the patch embedding projection is applied.
Vision Transformers are pre-trained on large datasets and fine-tuned to (smaller) downstream tasks. For fine-tuning, the pre-trained prediction head is removed and a zero-initialized $D \times K$ feedforward layer is attached, with $K$ being the number of downstream classes. It is often beneficial to fine-tune at a higher resolution than in pre-training. ViT can handle arbitrary sequence lengths, but the pre-trained position embeddings may then no longer be meaningful and have to be adapted. It is worth pointing out that this resolution adjustment and the patch extraction are the only points at which an inductive bias about the 2-dimensional structure of the images is manually injected into the Vision Transformer.
##### Experiments
Similarly to BERT, multiple versions of the model at various scales were created: a Base ("B"), a Large ("L"), and a Huge ("H") version of ViT, with 12, 24, and 32 layers and 86M, 307M, and 632M parameters, respectively.
To explore model scalability, the previously mentioned ImageNet dataset was used. In addition, ViT was compared against a slightly modified ResNet called "ResNet (BiT)", in which Batch Normalization layers are replaced with Group Normalization and standardized convolutions are used. Another network used for comparison was Noisy Student [@noisy], a large EfficientNet. Experiments showed that ViT-Huge with a $14\times 14$ input patch size outperformed both CNN-based networks with an accuracy of 88.5%, whereas ResNet (BiT) achieved 87.54% and Noisy Student 88.4%. It is worth mentioning that ViT-Large with a $16\times 16$ input patch size achieved 87.76% accuracy on the same dataset.
Another point worth noting is that ViT outperforms CNN-based architectures on all larger datasets, yet performs slightly worse than CNN networks on smaller datasets.
### Conclusion
In this chapter, the author presented some of the current state-of-the-art approaches in Computer Vision. Nowadays, with technology advancing every day, creating networks that imitate the human brain remains challenging. Still, the networks presented in this chapter are highly accurate, and creating networks which can outperform them is difficult. Furthermore, it is noticeable that the applications of CV are dictating the development of networks and frameworks which help humans with their everyday tasks.