diff --git a/notebooks/index.yml b/notebooks/index.yml
index 26cd98d..9f3b461 100644
--- a/notebooks/index.yml
+++ b/notebooks/index.yml
@@ -5,3 +5,4 @@
- notebook: notebooks/decision_trees
- notebook: notebooks/robust_and_trustworthy_machine_learning
- notebook: notebooks/deep_reinforcement_learning
+- notebook: notebooks/transformer
diff --git a/notebooks/transformer/images/masked_multi-head_attention.png b/notebooks/transformer/images/masked_multi-head_attention.png
new file mode 100644
index 0000000..8488bb9
Binary files /dev/null and b/notebooks/transformer/images/masked_multi-head_attention.png differ
diff --git a/notebooks/transformer/images/multi-head_attention.png b/notebooks/transformer/images/multi-head_attention.png
new file mode 100644
index 0000000..be5cdb9
Binary files /dev/null and b/notebooks/transformer/images/multi-head_attention.png differ
diff --git a/notebooks/transformer/images/positional_encoding.jpg b/notebooks/transformer/images/positional_encoding.jpg
new file mode 100644
index 0000000..a8f4849
Binary files /dev/null and b/notebooks/transformer/images/positional_encoding.jpg differ
diff --git a/notebooks/transformer/images/transformer_decoder.jpg b/notebooks/transformer/images/transformer_decoder.jpg
new file mode 100644
index 0000000..71b7fdd
Binary files /dev/null and b/notebooks/transformer/images/transformer_decoder.jpg differ
diff --git a/notebooks/transformer/images/transformer_encoder.jpg b/notebooks/transformer/images/transformer_encoder.jpg
new file mode 100644
index 0000000..1e8dd4d
Binary files /dev/null and b/notebooks/transformer/images/transformer_encoder.jpg differ
diff --git a/notebooks/transformer/images/transformer_model.jpg b/notebooks/transformer/images/transformer_model.jpg
new file mode 100644
index 0000000..f6116ff
Binary files /dev/null and b/notebooks/transformer/images/transformer_model.jpg differ
diff --git a/notebooks/transformer/index.ipynb b/notebooks/transformer/index.ipynb
new file mode 100644
index 0000000..e528f5f
--- /dev/null
+++ b/notebooks/transformer/index.ipynb
@@ -0,0 +1,320 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xPtZUk9Hii0M"
+ },
+ "source": [
+ "## Table of Contents\n",
+ "* [Introduction](#intro)\n",
+ "* [Transformer Architecture](#transformer_architecture)\n",
+ "  * [Transformer Encoder](#transformer_encoder)\n",
+ "  * [Transformer Decoder](#transformer_decoder)\n",
+ "  * [Positional Encoding](#positional_encoding)\n",
+ "* [Supplementary Material](#supplementary_material)\n",
+ "  * [Attention Mechanism](#attention_mechanism)\n",
+ "  * [Multi-Head Attention Mechanism](#multi-head_attention_mechanism)\n",
+ "  * [Masked Multi-Head Attention Mechanism](#masked_multi-head_attention_mechanism)\n",
+ "* [References](#references)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VXdHmheRj1lL"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "Before transformers, state-of-the-art approaches to sequence modeling were mostly based on recurrent neural networks. Recurrent networks process sequence elements one at a time, which limits parallelization during training. This becomes more problematic for long sequences, where memory constraints limit batching across examples.\n",
+ "The transformer, on the other hand, is a model architecture based entirely on the attention mechanism. Attention captures long-range dependencies within a sequence and between the input and output sequences, and it allows notably more parallelization.\n",
+ "In the following sections we first explain the transformer architecture and then give an example of transformer models.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EfjsbCSvkO_Q"
+ },
+ "source": [
+ "\n",
+ "## Transformer Architecture\n",
+ "\n",
+ "The original transformer model, introduced in the paper Attention Is All You Need, consists of two main parts: an encoder and a decoder.\n",
+ "The encoder and the decoder each consist of a stack of N = 6 identical layers."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "h3Tdo5D1sSeV"
+ },
+ "source": [
+ "![Transformer Model](images/transformer_model.jpg)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Transformer Encoder\n",
+ "\n",
+ "The architecture of each encoder layer is shown on the left side of the figure below. Each encoder layer is made of two sub-layers.\n",
+ "The first sub-layer is a multi-head self-attention block and the second sub-layer is a fully connected feed-forward network. As you can see in the image below, each sub-layer is wrapped in a residual connection followed by layer normalization."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "![Transformer Encoder](images/transformer_encoder.jpg)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Transformer Decoder\n",
+ "\n",
+ "The architecture of each decoder layer is shown on the right side of the figure below. Each decoder layer consists of three sub-layers: a masked multi-head self-attention, a multi-head attention over the encoder output, and a fully connected feed-forward network. As in the encoder, each of the three sub-layers is wrapped in a residual connection followed by layer normalization. As you can see in the image below, two of the inputs of the second sub-layer (the keys and values) come from the output of the encoder stack, while the queries come from the previous decoder sub-layer. The first sub-layer, called Masked Multi-Head Attention, is a modification of the self-attention module which ensures that the prediction for position i attends only to the known outputs, that is, the outputs at positions less than i. For example, if you use a transformer for machine translation, the decoder input during training is the translated sentence. The decoder should therefore attend only to the tokens that have been translated up to the current step, not to the whole translated sequence."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "![Transformer Decoder](images/transformer_decoder.jpg)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Positional Encoding\n",
+ "\n",
+ "Since the attention mechanism in the transformer model does not take the order of the tokens in the input sequence into account, some information about the position of each token must be injected. To do so, the transformer uses a positional embedding that has the same dimension as the input embeddings, so the two can be added together. There are multiple options for an embedding that encodes position in a sequence. The transformer model uses sine and cosine functions for this purpose."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "$$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) $$\n",
+ "$$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) $$\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the formulas above, pos is the position of the token in the sequence and i indexes the sine/cosine pairs, so 2i and 2i+1 are the even and odd dimensions of the embedding. For example, for the token at position 2 (counting from zero) with a positional embedding dimension of 100, the element at index 5 of that embedding is an odd dimension with i = 2, so the formula becomes:\n",
+ "\n",
+ "$$PE_{(2, 5)} = cos(2/10000^{4/100})$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a visualization of positional embeddings for better understanding. The plot shows a 100-dimensional positional embedding for sequences with a maximum length of 30. As you can see, tokens that are near each other in the sequence have similar positional embeddings, and the farther apart two tokens are, the more their positional embeddings differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "d_model = 100\n",
+ "max_len = 30\n",
+ "positional_encodings = np.zeros((max_len, d_model))\n",
+ "for pos in range(max_len):\n",
+ "    for i in range(d_model):\n",
+ "        if i % 2 == 0:\n",
+ "            positional_encodings[pos, i] = np.sin(pos/10000**(i/d_model))  # even index 2i: sine\n",
+ "        else:\n",
+ "            positional_encodings[pos, i] = np.cos(pos/10000**((i-1)/d_model))  # odd index 2i+1: cosine"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "