diff --git a/notebooks/index.yml b/notebooks/index.yml index 26cd98d..9f3b461 100644 --- a/notebooks/index.yml +++ b/notebooks/index.yml @@ -5,3 +5,4 @@ - notebook: notebooks/decision_trees - notebook: notebooks/robust_and_trustworthy_machine_learning - notebook: notebooks/deep_reinforcement_learning +- notebook: notebooks/transformer diff --git a/notebooks/transformer/images/masked_multi-head_attention.png b/notebooks/transformer/images/masked_multi-head_attention.png new file mode 100644 index 0000000..8488bb9 Binary files /dev/null and b/notebooks/transformer/images/masked_multi-head_attention.png differ diff --git a/notebooks/transformer/images/multi-head_attention.png b/notebooks/transformer/images/multi-head_attention.png new file mode 100644 index 0000000..be5cdb9 Binary files /dev/null and b/notebooks/transformer/images/multi-head_attention.png differ diff --git a/notebooks/transformer/images/positional_encoding.jpg b/notebooks/transformer/images/positional_encoding.jpg new file mode 100644 index 0000000..a8f4849 Binary files /dev/null and b/notebooks/transformer/images/positional_encoding.jpg differ diff --git a/notebooks/transformer/images/transformer_decoder.jpg b/notebooks/transformer/images/transformer_decoder.jpg new file mode 100644 index 0000000..71b7fdd Binary files /dev/null and b/notebooks/transformer/images/transformer_decoder.jpg differ diff --git a/notebooks/transformer/images/transformer_encoder.jpg b/notebooks/transformer/images/transformer_encoder.jpg new file mode 100644 index 0000000..1e8dd4d Binary files /dev/null and b/notebooks/transformer/images/transformer_encoder.jpg differ diff --git a/notebooks/transformer/images/transformer_model.jpg b/notebooks/transformer/images/transformer_model.jpg new file mode 100644 index 0000000..f6116ff Binary files /dev/null and b/notebooks/transformer/images/transformer_model.jpg differ diff --git a/notebooks/transformer/index.ipynb b/notebooks/transformer/index.ipynb new file mode 100644 index 
0000000..e528f5f --- /dev/null +++ b/notebooks/transformer/index.ipynb @@ -0,0 +1,320 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "xPtZUk9Hii0M" + }, + "source": [ + "## Table of Contents\n", + "* [Introduction](#intro)\n", + "* [Transformer Architecture](#transformer_architecture)\n", + " * [Transformer Encoder](#transformer_encoder)\n", + " * [Transformer Decoder](#transformer_decoder)\n", + " * [Positional Encoding](#positional_encoding)\n", + "* [Supplementary Material](#supplementary_material)\n", + " * [Attention Mechanism](#attention_mechanism)\n", + " * [Multi-Head Attention Mechanism](#multi-head_attention_mechanism)\n", + " * [Masked Multi-Head Attention Mechanism](#masked_multi-head_attention_mechanism)\n", + "* [References](#references)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VXdHmheRj1lL" + }, + "source": [ + "\n", + "\n", + "## Introduction\n", + "\n", + "Before transformers, state-of-the-art approaches to sequence modeling were mostly based on recurrent neural networks (RNNs). The problem with RNNs is that they process the sequence elements one at a time, which limits parallelization during training. This becomes more problematic for long sequences, where memory constraints limit batching across examples.\n", + "The transformer, on the other hand, is a model architecture based entirely on the attention mechanism, which it uses to capture long-term dependencies within a sequence and between the input and output sequences, and it allows notably more parallelization. 
\n", + "In the following sections we first explain the transformer architecture and then cover the attention mechanisms it relies on in more detail.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EfjsbCSvkO_Q" + }, + "source": [ + "\n", + "## Transformer Architecture\n", + "\n", + "The original transformer model introduced in the Attention Is All You Need paper consists of two main parts: an encoder and a decoder.\n", + "The encoder and the decoder each consist of N = 6 similar layers." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h3Tdo5D1sSeV" + }, + "source": [ + "![Transformer Model](images/transformer_model.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Transformer Encoder\n", + "\n", + "The architecture of each encoder layer is shown on the left side of the figure below. Each encoder layer is made of two sub-layers.\n", + "The first sub-layer is a multi-head self-attention block and the second sub-layer is a fully connected feed-forward network. As you can see in the image below, both sub-layers are wrapped by a residual connection followed by layer normalization." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Transformer Encoder](images/transformer_encoder.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Transformer Decoder\n", + "\n", + "The architecture of each decoder layer is shown on the right side of the figure below. Each decoder layer consists of three sub-layers: a masked multi-head self-attention block, a multi-head attention block over the encoder output, and a fully connected feed-forward network. As in the encoder, each of the three sub-layers is wrapped by a residual connection followed by layer normalization. As you can see in the image below, two of the inputs to the multi-head attention block over the encoder output (the keys and the values) come from the output of the encoder stack. 
The Masked Multi-Head Attention sub-layer is a modification of the self-attention module which ensures that the prediction for position i attends only to the known outputs, that is, the outputs at positions less than i. For example, if you are using the transformer for machine translation, the decoder input during training is the translated sentence. Hence the transformer should attend only to the tokens of the sentence that have been translated up to the current step, not to the whole translated sequence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Transformer Decoder](images/transformer_decoder.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Positional Encoding\n", + "\n", + "Since the attention mechanism in the transformer model does not consider the order of the tokens in the input sequence, we need to inject some information about the position of each token in the sequence. To do so, the transformer uses a positional embedding, which has the same dimension as the input embeddings so that the two can be added together. There are multiple options for an embedding that encodes the position in a sequence. The transformer model uses sine and cosine functions for this purpose." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "$$PE_{(pos, 2i)} = \\sin(pos/10000^{2i/d_{model}}) $$\n", + "$$PE_{(pos, 2i+1)} = \\cos(pos/10000^{2i/d_{model}}) $$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above formula, pos is the position of the token in the sequence and i indexes the sine/cosine pairs, so that 2i and 2i+1 are the even and odd dimensions of the embedding. 
For example, if the token is at position 2 in the sentence and the positional embedding dimension is 100, the element at index 5 of that embedding (an odd index, so 2i+1 = 5 and 2i = 4) is computed as: \n", + "\n", + "$$PE_{(2, 5)} = \\cos(2/10000^{4/100})$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is a visualization of positional embeddings for better understanding. The plot shows a 100-dimensional positional embedding for a sequence with a maximum length of 30. As you can see, tokens that are near each other in the sequence have similar positional embeddings, and the embeddings become more different as the distance between two tokens grows." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "d_model = 100  # embedding dimension\n", + "max_len = 30  # maximum sequence length\n", + "positional_encodings = np.zeros((max_len, d_model))\n", + "for pos in range(max_len):\n", + "    for i in range(d_model):\n", + "        if i % 2 == 0:\n", + "            positional_encodings[pos, i] = np.sin(pos/10000**(i/d_model))  # even index: sine\n", + "        else:\n", + "            positional_encodings[pos, i] = np.cos(pos/10000**((i-1)/d_model))  # odd index: cosine" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.figure(figsize=(10, 5))\n", + "plt.imshow(positional_encodings, interpolation='nearest')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Supplementary Material" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following sections we get into more details in the attention mechanisms of the transformer model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Attention Mechanism\n", + "\n", + "In this section we explain what attention mechanism is and how it works.\n", + "Attention mechanism first was introduced by [Dzmitry Bahdanau](https://arxiv.org/pdf/1409.0473.pdf) to improve the performance of encoder-decoder architectures in neural machine translation. It mentions that using a fixed-length vector is a bottleneck in the performance of encoder-decoder architectures and suggests that we use an architecture that allows the model to predict the target word by automatically attending to the parts of the input sentence that are relevant to the target word, regardless of how far the relevant parts are in that sentence in contrast to RNNs which as we go further in the sentence, we start forgetting about the past information in the sentence which means that the ability of RNNs in encoding long-term dependencies is limited which is fixed in attention mechanism." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of the Attention Mechanism. \n", + "Assume that we have a sequence of N tokens and each token in the sequence have a $d_k$-dimensional initial embedding, but these embeddings are computed independently, hence the context of the sentence is not encoded in those embeddings. 
The attention mechanism computes a new embedding for each token that, in addition to the information of that token, also takes the other tokens in the sentence into account. To do so, it computes the new embedding of a token as a weighted sum over the tokens in the sentence. More precisely, it is not a weighted sum of the initial embeddings themselves, but a weighted sum of a transformation of the initial embeddings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If $h_j$ is the initial embedding of the $j^{th}$ token of the input sequence, $c_i$ is the $i^{th}$ new contextualized embedding, and $fc^q$, $fc^k$, $fc^v$ are fully connected layers, we have:\n", + "$$\n", + "\\begin{aligned}\n", + "q_j = fc^q(h_j) \\\\\n", + "k_j = fc^k(h_j) \\\\\n", + "v_j = fc^v(h_j) \\\\\n", + "K = [k_1, ..., k_N] \\\\\n", + "\\alpha_{i} = \\mathrm{softmax}\\left(\\frac{q_i^\\top K}{\\sqrt{d_k}}\\right) \\\\\n", + "c_i = \\sum_{j = 1}^{N} \\alpha_{ij} \\times v_j \\\\\n", + "\\end{aligned}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What we explained above is self-attention, because the queries, keys and values are all computed from the same input embeddings. If the queries are computed from different input embeddings than the keys and values, we simply call it an attention mechanism."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Multi-Head Attention Mechanism" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lHgHJSUmqvno" + }, + "source": [ + "The multi-head attention mechanism is the same as the attention mechanism, except that instead of using a single attention mechanism, it splits the query, key and value embeddings into several parts and passes each part through a separate attention mechanism, called an attention head. At the end, it merges the outputs of the attention heads into one embedding by concatenation, which is the output of the multi-head attention mechanism.\n", + "This enables the attention mechanism to encode different kinds of relations between the tokens of the input sequence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Multi-Head Attention Mechanism](images/multi-head_attention.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Masked Multi-Head Attention Mechanism" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Masked multi-head attention works almost the same as multi-head attention, except that it masks out the padding and the future words in the target sequence, so that each position can attend only to earlier positions."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Masked Multi-Head Attention Mechanism](images/masked_multi-head_attention.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHon2lfejN71" + }, + "source": [ + "\n", + "## References\n", + "* [Attention Is All You Need](https://arxiv.org/abs/1706.03762)\n", + "* [What is Transformer](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)\n", + "* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)\n", + "* [The Attention Mechanism From Scratch](https://machinelearningmastery.com/the-attention-mechanism-from-scratch/#:~:text=The%20idea%20behind%20the%20attention,being%20attributed%20the%20highest%20weights.)\n", + "* [Transformers Explained Visually (Part 3): Multi-head Attention, deep dive](https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853)" + ] + } + ], + "metadata": { + "colab": { + "name": "transformer.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/transformer/metadata.yml b/notebooks/transformer/metadata.yml new file mode 100644 index 0000000..1a13c1b --- /dev/null +++ b/notebooks/transformer/metadata.yml @@ -0,0 +1,25 @@ +title: Transformer + +header: + title: Transformer + description: Transformer Tutorial + +authors: + label: + position: top + content: + - name: Zahra TehraniNasab + role: Author + contact: + - link: https://github.com/realmarv + icon: fab fa-github + - link: 
https://www.linkedin.com/in/zahra-tehraninasab + icon: fab fa-linkedin + - link: mailto://zahratehraninasab@gmail.com + icon: fas fa-envelope + + +comments: + # enable comments for your post + label: false + kind: comments \ No newline at end of file