
Commit

Correction of column embedding in chapter TabTransformer 🤖 (#115)
Addresses #10
KarelZe authored Jan 15, 2023
1 parent 0508302 commit 3ca6b27
Showing 8 changed files with 110 additions and 138 deletions.
53 changes: 33 additions & 20 deletions references/obsidian/.obsidian/workspace.json
@@ -6,7 +6,7 @@
{
"id": "ac0c9ffcaad5ed9e",
"type": "tabs",
"dimension": 59.76505139500734,
"dimension": 64.2504118616145,
"children": [
{
"id": "160a03ac0eb0e817",
@@ -23,13 +23,13 @@
}
},
{
"id": "0f92442fd597a142",
"id": "5fc6bddb531253ce",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "chapters/🤖FTTransformer.md",
"mode": "source",
"file": "chapters/🤖TabTransformer.md",
"mode": "preview",
"source": false
}
}
@@ -52,7 +52,7 @@
{
"id": "675af78723ee45d1",
"type": "tabs",
"dimension": 40.23494860499266,
"dimension": 35.749588138385505,
"children": [
{
"id": "0abd77888f93b540",
@@ -78,7 +78,7 @@
"state": {
"type": "markdown",
"state": {
"file": "chapters/🤖TabTransformer.md",
"file": "🧠Deep Learning Methods/Transformer/@huangTabTransformerTabularData2020.md",
"mode": "source",
"source": false
}
}
},
{
"id": "27942e948feec381",
"type": "leaf",
"state": {
"type": "markdown",
"state": {
"file": "chapters/🧵Positional encoding.md",
"mode": "source",
"source": false
}
@@ -102,13 +114,14 @@
"state": {
"type": "markdown",
"state": {
"file": "🧠Deep Learning Methods/Transformer/@huangTabTransformerTabularData2020.md",
"mode": "source",
"file": "chapters/🤖FTTransformer.md",
"mode": "preview",
"source": false
}
}
}
]
],
"currentTab": 4
}
],
"direction": "vertical"
@@ -166,7 +179,7 @@
"state": {
"type": "backlink",
"state": {
"file": "chapters/🤖FTTransformer.md",
"file": "chapters/🤖TabTransformer.md",
"collapseAll": false,
"extraContext": false,
"sortOrder": "alphabetical",
@@ -194,7 +207,7 @@
"state": {
"type": "outline",
"state": {
"file": "chapters/🤖FTTransformer.md"
"file": "chapters/🤖TabTransformer.md"
}
}
}
@@ -215,17 +228,17 @@
"markdown-importer:Open format converter": false
}
},
"active": "0f92442fd597a142",
"active": "5fc6bddb531253ce",
"lastOpenFiles": [
"🧠Deep Learning Methods/@gorishniyRevisitingDeepLearning2021.md",
"chapters/🤖FTTransformer.md",
"chapters/🤖Extensions to TabTransformer.md",
"chapters/🤖Pretraining FTTransformer.md",
"🧠Deep Learning Methods/Transformer/@gorishniyEmbeddingsNumericalFeatures2022.md",
"TOC.md",
"Semi-supervised Learning/@devlinBERTPretrainingDeep2019.md",
"chapters/🤖TabTransformer.md",
"🧠Deep Learning Methods/Transformer/@huangTabTransformerTabularData2020.md",
"chapters/🧵Positional encoding.md",
"chapters/🤖Training of the Transformer.md",
"chapters/🛌Token Embedding.md",
"chapters/🧭Attention map.md"
"chapters/💡Training and tuning.md",
"chapters/🤖Transformer.md",
"🎨transformer.canvas",
"chapters/🤖Pretraining FTTransformer.md",
"🧠Deep Learning Methods/@gorishniyRevisitingDeepLearning2021.md"
]
}
2 changes: 1 addition & 1 deletion references/obsidian/chapters/💡Training and tuning.md
@@ -1,6 +1,6 @@
- training of the Transformer has been found to be non-trivial [[@liuUnderstandingDifficultyTraining2020]]
- Do less alchemy and more understanding [Ali Rahimi's talk at NIPS(NIPS 2017 Test-of-time award presentation) - YouTube](https://www.youtube.com/watch?v=Qi1Yry33TQE)
- Keep algorithms / ideas simple. Add complexity only where needed!
- Do rigorous testing.
- Don't chase the benchmark, but aim for explainability of the results.
- compare against https://github.com/jktis/Trade-Classification-Algorithms
- Classical rules could be implemented using https://github.com/jktis/Trade-Classification-Algorithms
2 changes: 1 addition & 1 deletion references/obsidian/chapters/🤖FTTransformer.md
@@ -5,7 +5,7 @@ The FTTransformer of [[@gorishniyRevisitingDeepLearning2021]] is an adaption of

The *feature tokenizer* transforms all features of $x$ to their embeddings. If the $j$-th feature, $x_j$, is **numerical**, it is projected to its embedding $e_j \in \mathbb{R}^{e_d}$ by element-wise multiplication with a learned vector $W_j \in \mathbb{R}^{e_d}$ and the addition of a feature-dependent bias term $b_j \in \mathbb{R}$, as in Equation (1).

For **categorical** inputs, the embedding is implemented as a lookup table, similar to the techniques from Chapter [[🛌Token Embedding]] and [[🤖TabTransformer]]. We denote the cardinality of the $j$-th feature with $N_{C_j}$. The specific embeddings $e_j$ are queried with a unique integer key $c_j \in C_j \cong\left[N_{\mathrm{C_j}}\right]$ from the learned embedding matrix $W_j \in \mathbb{R}^{e_d \times N_{C_j}}$. Finally a feature-specific bias term $b_j$ is added <mark style="background: #FFB86CA6;">(TODO: lookup if bias is a scalar or vector?).</mark> Similar to the [[🛌Token Embedding]], a previous label encoding (or a similar technique) must be employed, to map the categories to their unique integer keys. Overall:
For **categorical** inputs, the embedding is implemented as a lookup table, similar to the techniques from Chapters [[🛌Token Embedding]] and [[🤖TabTransformer]]. We denote the cardinality of the $j$-th feature with $N_{C_j}$. The specific embeddings $e_j$ are queried with a unique integer key $c_j \in C_j \cong\left[N_{\mathrm{C_j}}\right]$ from the learned embedding matrix $W_j \in \mathbb{R}^{e_d \times N_{C_j}}$. Finally, a feature-specific bias term $b_j$ is added <mark style="background: #FFB86CA6;">(TODO: look up if the bias is a scalar or a vector?)</mark>. Overall for $x_j$:
%%
For example, the option type could be encoded as $\text{P}\mapsto 1$; $\text{C}\mapsto 2$, which would result in the selection of the second column of the embedding matrix whenever a put is traded.
%%
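Below is a minimal PyTorch sketch of such a feature tokenizer, not a reference implementation: the module and parameter names are illustrative, categorical inputs are assumed to be label-encoded integer keys, and the bias is taken as a scalar per feature (cf. the open TODO above); a vector-valued bias would only change the parameter shapes.

```python
import torch
import torch.nn as nn


class FeatureTokenizer(nn.Module):
    """Sketch: embeds numerical and label-encoded categorical features into R^{e_d}."""

    def __init__(self, n_num: int, cat_cardinalities: list, e_d: int):
        super().__init__()
        # numerical features: one learned vector W_j and one scalar bias b_j per feature (Equation (1))
        self.W_num = nn.Parameter(torch.randn(n_num, e_d))
        self.b_num = nn.Parameter(torch.zeros(n_num))
        # categorical features: one lookup table per column, as the cardinalities N_{C_j} differ
        self.cat_embeddings = nn.ModuleList(
            [nn.Embedding(card, e_d) for card in cat_cardinalities]
        )
        self.b_cat = nn.Parameter(torch.zeros(len(cat_cardinalities)))

    def forward(self, x_num: torch.Tensor, x_cat: torch.Tensor) -> torch.Tensor:
        # x_num: (batch, n_num) floats; x_cat: (batch, m) integer keys from a prior label encoding
        e_num = self.W_num[None] * x_num[..., None] + self.b_num[None, :, None]
        e_cat = torch.stack(
            [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_embeddings)], dim=1
        ) + self.b_cat[None, :, None]
        # stacked per-feature embeddings of shape (batch, n_num + m, e_d)
        return torch.cat([e_num, e_cat], dim=1)
```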
33 changes: 15 additions & 18 deletions references/obsidian/chapters/🤖TabTransformer.md
@@ -4,10 +4,22 @@
![[tab_transformer.png]]
(own drawing. Inspired by [[@huangTabTransformerTabularData2020]]. Top layers a little bit different. They write MLP. I take the FFN with two hidden layers and an output layer. <mark style="background: #FFB8EBA6;">Better change label to MLP</mark>; Also they call the <mark style="background: #FFB8EBA6;">input embedding a column embedding, use L instead of N)</mark> ^87bba0

Motivated by the success of (cp. [[@devlinBERTPretrainingDeep2019]]; [[@liuRoBERTaRobustlyOptimized2019]]) of contextual embeddings in natural language processing, [[@huangTabTransformerTabularData2020]] propose with *TabTransformer* an adaption of the classical Transformer for the tabular domain. *TabTransformer* is *encoder-only* and features a stack of Transformer layers (see chapter [[🤖Transformer]] or [[@vaswaniAttentionAllYou2017]]) to learn contextualized embeddings of categorical features from their parametric embeddings, as shown in Figure ([[#^87bba0]]]). The transformer layers, are identical to those found in [[@vaswaniAttentionAllYou2017]] featuring multi-headed self-attention and a norm-last layer arrangement. Continuous inputs are normalized using layer norm ([[@baLayerNormalization2016]]) , concatenated with the contextual embeddings, and input into a multi-layer peceptron. More specifically, [[@huangTabTransformerTabularData2020]] (p. 4; 12) use a feed-forward network with two hidden layers, whilst other architectures and even non-deep models, such as [[🐈gradient-boosting]], are applicable.<mark style="background: #FFB8EBA6;"> (downstream network?)</mark> Thus, for strictly continuous inputs, the network collapses to a multi-layer perceptron with layer normalization.
Motivated by the success of contextual embeddings in natural language processing (cp. [[@devlinBERTPretrainingDeep2019]]; [[@liuRoBERTaRobustlyOptimized2019]]), [[@huangTabTransformerTabularData2020]] propose the *TabTransformer*, an adaptation of the classical Transformer for the tabular domain. The *TabTransformer* is *encoder-only* and features a stack of Transformer layers (see chapter [[🤖Transformer]] or [[@vaswaniAttentionAllYou2017]]) to learn contextualized embeddings of categorical features from their parametric embeddings, as shown in Figure ([[#^87bba0]]). The Transformer layers are identical to those of [[@vaswaniAttentionAllYou2017]], featuring multi-headed self-attention and a norm-last layer arrangement. Continuous inputs are normalized using layer norm ([[@baLayerNormalization2016]]), concatenated with the contextual embeddings, and fed into a multi-layer perceptron. More specifically, [[@huangTabTransformerTabularData2020]] (p. 4; 12) use a feed-forward network with two hidden layers, whilst other architectures and even non-deep models, such as [[🐈gradient-boosting]], are applicable. Thus, for strictly continuous inputs, the network collapses to a multi-layer perceptron with layer normalization.

Due to the tabular nature of the data, with features arranged in a row-column fashion, the token embedding (see chapter [[🛌Token Embedding]]) is replaced for a *column embedding*. Also the notation needs to be adapted to the tabular domain. We denote the data set with $D:=\left\{\left(\mathbf{x}_k, y_k\right) \right\}_{k=1,\cdots N}$ identified with $\left[N_{\mathrm{D}}\right]:=\left\{1, \ldots, N_{\mathrm{D}}\right\}$. Each tuple $(\boldsymbol{x}, y)$ represents a row in the data set, and consist of the binary classification target $y_k \in \mathbb{R}$ and the vector of features $\boldsymbol{x} = \left\{\boldsymbol{x}_{\text{cat}}, \boldsymbol{x}_{\text{cont}}\right\}$, where $x_{\text{cont}} \in \mathbb{R}^c$ denotes all $c$ continuous features and $\boldsymbol{x}_{\text{cat}}\in \mathbb{R}^{m}$ all $m$ categorical features.
Due to the tabular nature of the data, with features arranged in a row-column fashion, the token embedding (see chapter [[🛌Token Embedding]]) is replaced by a *column embedding*. The notation also needs to be adapted to the tabular domain. We denote the data set with $D:=\left\{\left(\mathbf{x}_k, y_k\right) \right\}_{k=1,\cdots,N_{\mathrm{D}}}$, identified with $\left[N_{\mathrm{D}}\right]:=\left\{1, \ldots, N_{\mathrm{D}}\right\}$. Each tuple $(\boldsymbol{x}, y)$ represents a row in the data set and consists of the binary classification target $y \in \mathbb{R}$ and the vector of features
$\boldsymbol{x} = \left\{\boldsymbol{x}_{\text{cat}}, \boldsymbol{x}_{\text{cont}}\right\}$, where $\boldsymbol{x}_{\text{cont}} \in \mathbb{R}^c$ denotes all $c$ continuous features and $\boldsymbol{x}_{\text{cat}}\in \mathbb{R}^{m}$ all $m$ categorical features. We denote the cardinality of the $j$-th categorical feature, $j \in \{1, \cdots, m\}$, by $N_{C_j}$.

In chapter [[🛌Token Embedding]], one lookup table suffices for storing the embeddings of all tokens within the sequence. Due to the heterogeneous nature of tabular data, every categorical column is independent of the $m-1$ other categorical columns. Thus, each column requires its own learned embedding matrix.
The *feature-specific embeddings* are queried with a unique integer key $c_j \in C_j \cong\left[N_{\mathrm{C_j}}\right]$ from the learned embedding matrix $W_j \in \mathbb{R}^{e_d \times N_{C_j}}$ of the categorical column. Similar to the [[🛌Token Embedding]], a prior label encoding must be employed to map the categories to their unique integer keys.
%%
They use a +1 class for NaN. This should already be addressed in pre-processing or imputed, i.e., missing values become their own category. Thus, I think it is ok not to dwell on this here, as it is already part of $N_C$.
%%
Additionally, a *shared embedding* is learned. This embedding is equal for all categories of one feature and is added or concatenated to the feature-specific embeddings to enable the model to distinguish classes in one column from those in other columns ([[@huangTabTransformerTabularData2020]] p. 10). For the variant where the shared embedding is added element-wise, the embedding matrix $W_S$ is of dimension $\mathbb{R}^{e_d \times m}$.

Overall, the joint *column embedding* of $x_j$ is given by:
$$
e_j = W_j[:, c_j] + W_S[:, j].
$$
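A minimal PyTorch sketch of the additive variant could look as follows; the module name and the use of `nn.Embedding` as the per-column lookup table $W_j$ are illustrative choices, and label-encoded integer keys are assumed:

```python
import torch
import torch.nn as nn


class ColumnEmbedding(nn.Module):
    """Sketch of the additive column embedding: e_j = W_j[:, c_j] + W_S[:, j]."""

    def __init__(self, cat_cardinalities: list, e_d: int):
        super().__init__()
        m = len(cat_cardinalities)
        # one feature-specific lookup table W_j per categorical column
        self.per_column = nn.ModuleList(
            [nn.Embedding(card, e_d) for card in cat_cardinalities]
        )
        # shared embedding W_S with one entry per column
        self.shared = nn.Parameter(torch.randn(m, e_d))

    def forward(self, x_cat: torch.Tensor) -> torch.Tensor:
        # x_cat: (batch, m) integer keys from a prior label encoding
        e = torch.stack(
            [emb(x_cat[:, j]) for j, emb in enumerate(self.per_column)], dim=1
        )                                # (batch, m, e_d) feature-specific embeddings
        return e + self.shared[None]     # broadcast the shared embedding over the batch
```

For the concatenation variant, a slice of the $e_d$ embedding dimensions would instead be reserved for the shared part.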
%%
Notation adapted from [[@prokhorenkovaCatBoostUnbiasedBoosting2018]], [[@huangTabTransformerTabularData2020]]) and [[@phuongFormalAlgorithmsTransformers2022]]
Classification (ETransformer). Given a vocabulary $V$ and a set of classes $\left[N_{\mathrm{C}}\right]$, let $\left(x_n, c_n\right) \in$ $V^* \times\left[N_{\mathrm{C}}\right]$ for $n \in\left[N_{\text {data }}\right]$ be an i.i.d. dataset of sequence-class pairs sampled from $P(x, c)$. The goal in classification is to learn an estimate of the conditional distribution $P(c \mid x)$.
@@ -21,22 +33,7 @@ Assume we observe a dataset of examples $\mathcal{D}=\left\{\left(\mathbf{x}_k,
Let $(\boldsymbol{x}, y)$ denote a feature-target pair, where $\boldsymbol{x} \equiv$ $\left\{\boldsymbol{x}_{\text {cat }}, \boldsymbol{x}_{\text {cont }}\right\}$. The $\boldsymbol{x}_{\text {cat }}$ denotes all the categorical features and $x_{\text {cont }} \in \mathbb{R}^c$ denotes all of the $c$ continuous features. Let $\boldsymbol{x}_{\text {cat }} \equiv\left\{x_1, x_2, \cdots, x_m\right\}$ with each $x_i$ being a categorical feature, for $i \in\{1, \cdots, m\}$. (from [[@huangTabTransformerTabularData2020]] )
%%

In chapter [[🛌Token Embedding]], one lookup table suffices for storing the embeddings of all tokens within the sequence. Due to the heterogeneous (?) nature of tabular data, every categorical column is independent of all $m-1$ other categorical columns. Thus, every column requires learning their own embedding matrix. As such, each column is embedded separately using a *column embedding*. For every $i$-th categorical column with $i \in {1,\cdots m}$ the

<mark style="background: #FF5582A6;">TODO:</mark> Think about the projection / look up in code.

%%

![[column-embeddings.png]]

The embedding matrix is now dependent on the on the ca table to retrieve the embedding vector $e \in \mathbb{R}^{d_{\mathrm{e}}}$ from a learned, embedding matrix $W_e \in \mathbb{R}^{d_{\mathrm{e}} \times N_{\mathrm{V}}}$ with a token-id $v \in {1,\cdots m}$ as shown :
$$
\tag{1}
e=W_e[:, v].
$$

%%
Note that categorical columns may be arranged in an arbitrary order and that the Transformer blocks are (... equivariant?), Thus, no [[🧵Positional encoding]] is required to inject the order. Analogous to chapter [[🤖Transformer]], the column embedding of each row is subsequently passed through several transformer layers, ultimately resulting in contextualized embeddings $\tilde{V} \in \mathbb{R}^{d_{\text {out}} \times m}$. At the end of the encoder, the contextual embeddings are flattened and concatenated with the continuous inputs into a ($d_\text{dim} \times m + c$)-dimensional vector, which serves as input to the multi-layer perceptron ([[@huangTabTransformerTabularData2020]] (p. 3)). Like before, a linear layer and softmax activation <mark style="background: #FFB8EBA6;">(actually it's just a sigmoid due to the binary case, which is less computationally demanding for the binary case)</mark> are used to retrieve the class probabilities.
Note that categorical columns may be arranged in an arbitrary order and that the Transformer blocks are permutation-equivariant. Thus, no [[🧵Positional encoding]] is required to inject the order. Analogous to chapter [[🤖Transformer]], the embeddings of each row, $X = [e_1, \cdots, e_m]$, are subsequently passed through several Transformer layers, ultimately resulting in contextualized embeddings. At the end of the encoder, the contextual embeddings are flattened and concatenated with the continuous inputs into a ($e_{d} \cdot m + c$)-dimensional vector, which serves as input to the multi-layer perceptron ([[@huangTabTransformerTabularData2020]] p. 3). Like before, a linear layer and softmax activation <mark style="background: #FFB8EBA6;">(actually a sigmoid in the binary case, which is computationally cheaper)</mark> are used to retrieve the class probabilities.
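The overall forward pass could be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it reuses the `ColumnEmbedding` module from the sketch above, `nn.TransformerEncoder` merely stands in for the stack of norm-last Transformer layers (its feed-forward width and activation may differ from the paper), and the hidden sizes of the MLP are placeholders.

```python
import torch
import torch.nn as nn


class TabTransformerSketch(nn.Module):
    """Sketch of the TabTransformer forward pass (illustrative hyperparameters)."""

    def __init__(self, cat_cardinalities: list, n_cont: int, e_d: int = 32,
                 n_layers: int = 6, n_heads: int = 8, n_classes: int = 2):
        super().__init__()
        m = len(cat_cardinalities)
        self.column_embedding = ColumnEmbedding(cat_cardinalities, e_d)  # from the sketch above
        # stand-in for the stack of norm-last Transformer layers
        layer = nn.TransformerEncoderLayer(d_model=e_d, nhead=n_heads,
                                           batch_first=True, norm_first=False)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cont_norm = nn.LayerNorm(n_cont)  # layer norm on the continuous inputs
        # MLP with two hidden layers on the flattened contextual embeddings + continuous features
        self.mlp = nn.Sequential(
            nn.Linear(e_d * m + n_cont, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x_cat: torch.Tensor, x_cont: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.column_embedding(x_cat))   # (batch, m, e_d) contextual embeddings
        h = torch.cat([h.flatten(1), self.cont_norm(x_cont)], dim=1)
        return self.mlp(h)                               # logits; sigmoid/softmax applied in the loss
```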

In large-scale experiments, [[@huangTabTransformerTabularData2020]] (p. 5 f.) show that the use of contextual embeddings improves both the robustness to noise and to missing data. For various binary classification tasks, the TabTransformer outperforms other deep learning models, e.g., vanilla multi-layer perceptrons, in terms of *area under the curve* (AUC) and can compete with [[🐈gradient-boosting]].

@@ -1,5 +1,7 @@
#lr-warmup #lr-scheduling


- training of the Transformer has been found to be non-trivial [[@liuUnderstandingDifficultyTraining2020]]
- introduce notion of effective batch size (batch size when training is split across multiple gpus; see [[🧠Deep Learning Methods/Transformer/@popelTrainingTipsTransformer2018]] p. 46)
- report or store training times?
- In case of diverged training, try gradient clipping and/or more warmup steps. (found in [[🧠Deep Learning Methods/Transformer/@popelTrainingTipsTransformer2018]])
@@ -8,6 +10,9 @@
- One might have to adjust the lr when scaling across multiple GPUs; [[@poppeSensitivityVPINChoice2016]] contains a nice discussion.
- Use weight decay of 0.1 for a small amount of regularization [[@loshchilovDecoupledWeightDecay2019]].

- On activation functions, see [[@shazeerGLUVariantsImprove2020]]


- log gradients and loss using `wandb.watch` as shown here https://www.youtube.com/watch?v=k6p-gqxJfP4 with `wandb.log({"epoch":epoch, "loss":loss}, step)` (nested in `if ((batch_ct +1) % 25) == 0:`) and `wandb.watch(model, criterion, log="all", log_freq=10)`; see the sketch after this list
- watch out for exploding and vanishing gradients
- distillation, learning rate warmup, learning rate decay (not used but could improve training times and maybe accuracy) ([[@gorishniyRevisitingDeepLearning2021]])
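A compressed sketch of how several of these notes (AdamW with weight decay 0.1, gradient clipping against diverged training, and the `wandb.watch`/`wandb.log` calls) could fit together; `model`, `train_loader`, `criterion`, the learning rate, and a prior call to `wandb.init()` are assumed, and a warmup scheduler would be added on top:

```python
import torch
import wandb


def train(model, train_loader, criterion, epochs: int = 10, device: str = "cuda"):
    # weight decay of 0.1 for a small amount of regularization (decoupled in AdamW)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
    wandb.watch(model, criterion, log="all", log_freq=10)  # log gradients and parameters

    example_ct, batch_ct = 0, 0
    model.train()
    for epoch in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            # gradient clipping as a remedy against diverged training
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            example_ct += x.size(0)
            batch_ct += 1
            if (batch_ct + 1) % 25 == 0:
                wandb.log({"epoch": epoch, "loss": loss.item()}, step=example_ct)
```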
8 changes: 5 additions & 3 deletions references/obsidian/chapters/🤖transformer.md
@@ -2,6 +2,8 @@
![[classical_transformer_architecture.png]]
(own drawing after [[@daiTransformerXLAttentiveLanguage2019]], use L instead of N)

![[Pasted image 20230115060830.png]]

In the subsequent sections, we introduce the classical Transformer of [[@vaswaniAttentionAllYou2017]]. Our focus is on the central building blocks, such as self-attention and multi-headed attention. We then transfer the concepts to the tabular domain by covering the [[🤖TabTransformer]] and the [[🤖FTTransformer]]. Throughout this work, we adhere to the notation suggested in [[@phuongFormalAlgorithmsTransformers2022]].
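As a preview of these building blocks, a minimal sketch of single-head scaled dot-product attention, following the formulation of [[@vaswaniAttentionAllYou2017]]; the function name and shapes are illustrative:

```python
import math
import torch


def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(q k^T / sqrt(d_k)) v for inputs of shape (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len) query-key similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to one over the keys
    return weights @ v                                 # weighted average of the value vectors
```

Multi-headed attention applies this mechanism several times in parallel on learned projections of the inputs and concatenates the results.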

- encoder/ decoder models $\approx$ sequence-to-sequence model
@@ -35,7 +37,7 @@ Open:
- [ ] Residual connections
- [ ] Layer Norm, Pre-Norm, and Post-Norm
- [x] TabTransformer
- [ ] FTTransformer
- [x] FTTransformer
- [ ] Pre-Training
- [ ] Embeddings of categorical / continuous data
- [ ] Selection of supervised approaches
@@ -106,8 +108,8 @@ feature importance evaluation is a non-trivial problem due to missing ground tru
- intuition behind multi-head and self-attention e. g. cosine similarity, key and querying mechanism: https://www.youtube.com/watch?v=mMa2PmYJlCo&list=PL86uXYUJ7999zE8u2-97i4KG_2Zpufkfb




- Our analysis starts from the observation: the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant (Baevski and Auli, 2019; Xiong et al., 2019; Nguyen and Salazar, 2019). (from [[@liuUnderstandingDifficultyTraining2020]])
- motivation to switch (see the pre-LN vs. post-LN sketch after this list)

- General Introduction: [[@vaswaniAttentionAllYou2017]]
- What is Attention?
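A minimal sketch contrasting the two residual/normalization layouts (illustrative module names; `sublayer` stands for either the attention or the feed-forward block):

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Norm-last (Post-LN), as in the original Transformer: x <- LayerNorm(x + sublayer(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Norm-first (Pre-LN): x <- x + sublayer(LayerNorm(x)); the residual path stays un-normalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))
```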
