From 83db270e27b552c34f0824f4eb93b003d3a59b1a Mon Sep 17 00:00:00 2001
From: Thomas van Dongen <thomas123@live.nl>
Date: Mon, 14 Oct 2024 19:40:16 +0200
Subject: [PATCH] docs: Move results and add blogpost (#82)

* Moved results

* Added blog link

* Moved section

* Moved section

* Updates

* Updates
---
 .gitignore        |   1 -
 README.md         | 107 +++++++++++++---------------------------------
 results/README.md |  88 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+), 79 deletions(-)
 create mode 100644 results/README.md
diff --git a/.gitignore b/.gitignore
index 6503faf..0f614e1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -168,7 +168,6 @@ models
 checkpoints/*
 features/*
 model2vec_models
-results/*
 counts/*
 results_old/*
 local/*
diff --git a/README.md b/README.md
index a634689..f416a01 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,8 @@
   <h2>
     <a href="https://huggingface.co/minishlab"><strong>🤗 Models</strong></a> |
     <a href="https://github.com/MinishLab/model2vec/tree/main/tutorials"><strong>📚 Tutorials</strong></a> |
-    <a href="https://github.com/MinishLab"><strong>📖 Website </strong></a>
+    <a href="https://github.com/MinishLab"><strong>💻 Website </strong></a> |
+    <a href="https://huggingface.co/blog/Pringled/model2vec"><strong>📖 Blog</strong></a>
   </h2>
 </div>
 
@@ -42,8 +43,8 @@ Model2Vec is a technique to turn any sentence transformer into a really small fa
 
 ## Table of Contents
 - [Quickstart](#quickstart)
-- [What is Model2Vec?](#what-is-model2vec)
 - [Main Features](#main-features)
+- [What is Model2Vec?](#what-is-model2vec)
 - [Usage](#usage)
     - [Distilling a Model2Vec model](#distilling-a-model2vec-model)
     - [Inferencing a Model2Vec model](#inference-with-a-model2vec-model)
@@ -115,23 +116,12 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever
 ```
 For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding).
 
-## What is Model2Vec?
-
-Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Best of all, you don't need _any_ data to distill a model using Model2Vec.
-
-It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
-
-Model2vec has 3 modes:
-- **Output**: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocab. This is really quick to create (30 seconds on a CPU), very small (30 MB in float32), but might be less performant on some tasks.
-- **Vocab (word level)**: creates a word-level tokenizer and only encodes words that are in the vocabulary. This is a bit slower to create and creates a larger model, but might be more performant on some tasks. Note that this model can go out-of-vocabulary, which might be beneficial if your domain is very noisy
-- **Vocab (subword)**: a combination of the two methods above. In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations for words not in the passed vocabulary.
-
 ## Main Features
 
 Model2Vec is:
 
-- **Small**: reduces the size of a Sentence Transformer model by a factor of 15, from 120M params, down to 7.5M (30 MB on disk!).
-- **Static, but better**: smaller than GLoVe, but much more performant, even with the same vocabulary.
+- **Small**: reduces the size of a Sentence Transformer model by a factor of 15, from 120M params, down to 7.5M (30 MB on disk, making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!).
+- **Static, but better**: smaller than GloVe and BPEmb, but [much more performant](results/README.md), even with the same vocabulary.
 - **Fast distillation**: make your own model in 30 seconds.
 - **Fast inference**: up to 500 times faster on CPU than the original model. Go green or go home.
 - **No data needed**: Distillation happens directly on the token level, so no dataset is needed.
@@ -143,10 +133,25 @@ Model2Vec is:
 - **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own.
 - **Easy Evaluation**: evaluate your models on MTEB and some of our own tasks to measure the performance of the distilled model. Model2Vec models work out of the box on [MTEB](https://huggingface.co/spaces/mteb/leaderboard).
 
+## What is Model2Vec?
+
+Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Best of all, you don't need _any_ data to distill a model using Model2Vec.
+
+It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
+
+Model2Vec has three modes:
+- **Output**: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocab. This is really quick to create (30 seconds on a CPU) and very small (30 MB in float32), but might be less performant on some tasks.
+- **Vocab (word level)**: creates a word-level tokenizer and only encodes words that are in the vocabulary. This is a bit slower to create and creates a larger model, but might be more performant on some tasks. Note that this model can go out-of-vocabulary, which might be beneficial if your domain is very noisy.
+- **Vocab (subword)**: a combination of the two methods above. In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations for words not in the passed vocabulary.
+
+For a technical deep dive into Model2Vec, please refer to our [blog post](https://huggingface.co/blog/Pringled/model2vec).
+
+
+
 ## Usage
 
 
-### Distilling a Model2Vec model
+### Distillation
 
 <details>
 <summary>  Distilling from a Sentence Transformer </summary>
@@ -254,10 +259,10 @@ python3 -m model2vec.distill --model-name BAAI/bge-base-en-v1.5 --vocabulary-pat
 
 </details>
 
-### Inference with Model2Vec
+### Inference
 
 <details>
-<summary>  Inference a pretrained model </summary>
+<summary>  Inference using a pretrained model </summary>
 <br>
 
 Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub.
@@ -279,7 +284,7 @@ token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It'
 
 
 <details>
-<summary>  Inference with the Sentence Transformers library </summary>
+<summary>  Inference using the Sentence Transformers library </summary>
 <br>
 
 The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline.
@@ -297,7 +302,7 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever
 </details>
 
 
-### Evaluating a Model2Vec model
+### Evaluation
 
 
 <details>
@@ -362,64 +367,10 @@ We provide a number of models that can be used out of the box. These models are
 
 ## Results
 
-### Main Results
-
-Model2Vec is evaluated on MTEB, as well as two additional tasks: [PEARL](https://github.com/tigerchen52/PEARL) (a phrase representation task) and WordSim (a collection of _word_ similarity tasks). The results are shown in the table below.
-
-
-
-| Model                  | Avg (All) | Avg (MTEB) | Class  | Clust  | PairClass | Rank   | Ret    | STS    | Sum    | Pearl  | WordSim |
-|:-----------------------|:---------:|:----------:|:------:|:------:|:---------:|:------:|:------:|:------:|:------:|:------:|:-------:|
-| all-MiniLM-L6-v2        | 56.08     | 56.09      | 62.62  | 41.94  | 82.37     | 58.04  | 41.95  | 78.90  | 30.81  | 60.83  | 49.91   |
-| M2V_base_glove_subword  | 49.06     | 46.69      | 61.27  | 30.03  | 74.71     | 49.15  | 27.16  | 69.09  | 30.08  | 56.82  | 57.99   |
-| M2V_base_glove          | 48.58     | 47.60      | 61.35  | 30.52  | 75.34     | 48.50  | 29.26  | 70.31  | 31.50  | 50.28  | 54.29   |
-| M2V_base_output         | 46.79     | 45.34      | 61.25  | 25.58  | 74.90     | 47.63  | 26.14  | 68.58  | 29.20  | 54.02  | 49.18   |
-| GloVe_300d              | 42.84     | 42.36      | 57.31  | 27.66  | 72.48     | 43.30  | 22.78  | 61.90  | 28.81  | 45.65  | 43.05   |
-| BPEmb_50k_300d          | 39.34     | 37.78      | 55.76  | 23.35  | 57.86     | 43.21  | 17.50  | 55.10  | 29.74  | 47.56  | 41.28   |
-| WL256*                  | 48.88     | 49.36      | 58.98  | 33.34  | 74.00     | 52.03  | 33.12  | 73.34  | 29.05  | 48.81  | 45.16   |
-
-
-<details>
-  <summary>  Task Abbreviations </summary>
-
-For readability, the MTEB task names are abbreviated as follows:
-- Class: Classification
-- Clust: Clustering
-- PairClass: PairClassification
-- Rank: Reranking
-- Ret: Retrieval
-- STS: Semantic Textual Similarity
-- Sum: Summarization
-</details>
-
-\
-\* WL256, introduced in the [WordLlama](https://github.com/dleemiller/WordLlama/tree/main) package is included for comparison due to its similarities to Model2Vec. However, we believe it is heavily overfit to the MTEB dataset since it is trained on datasets used in MTEB itself. This can be seen by the fact that the WL256 model performs much worse on the non-MTEB tasks (PEARL and WordSim) than our models and GLoVe. The results shown in the [Classification and Speed Benchmarks](#classification-and-speed-benchmarks) further support this.
-
-### Classification and Speed Benchmarks
-
-In addition to the MTEB evaluation, we evaluate Model2Vec on a number of classification datasets. These are used as additional evidence to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
-
-
-| Model                  | Average | SST2   | IMDB  | TREC   | AG News |
-|:-----------------------|:-------:|:------:|:-----:|:------:|:-------:|
-| bge-base-en-v1.5        | 90.00   | 91.54  | 91.88 | 85.16  | 91.45   |
-| all-MiniLM-L6-v2        | 84.10   | 83.95  | 81.36 | 81.31  | 89.77   |
-| M2V_base_output         | 82.23   | 80.92  | 84.56 | 75.27  | 88.17   |
-| M2V_base_glove_subword  | 81.95   | 82.84  | 85.96 | 70.51  | 88.49   |
-| BPEmb_50k_300d          | 81.15   | 80.42  | 84.04 | 71.25  | 88.92   |
-| M2V_base_glove          | 80.76   | 83.07  | 85.24 | 66.12  | 88.61   |
-| WL256                   | 78.48   | 76.88  | 80.12 | 69.23  | 87.68   |
-| GloVe_300d              | 77.77   | 81.68  | 84.00 | 55.67  | 89.71   |
-
-
-As can be seen, Model2Vec models outperform the GloVe, BPEmb, and WL256 models on all classification tasks, and are competitive with the all-MiniLM-L6-v2 model, while being much faster.
-
-The figure below shows the relationship between the number of sentences per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
-This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
-
-| ![Description](assets/images/speed_vs_accuracy_v3.png) |
-|:--:|
-|*Figure: The average accuracy over all classification datasets plotted against sentence per second. The circle size indicates model size.*|
+We have performed extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the [results](results/README.md) folder and are presented in the following sections:
+- [MTEB Results](results/README.md#mteb-results)
+- [Classification and Speed Benchmarks](results/README.md#classification-and-speed-benchmarks)
+- [Ablations](results/README.md#ablations)
 
 ## Related work
 
diff --git a/results/README.md b/results/README.md
new file mode 100644
index 0000000..7c4c7b3
--- /dev/null
+++ b/results/README.md
@@ -0,0 +1,88 @@
+# Results
+
+This page contains the experiment results of the Model2Vec project. The results are presented in the following sections:
+- [MTEB Results](#mteb-results)
+- [Classification and Speed Benchmarks](#classification-and-speed-benchmarks)
+- [Ablations](#ablations)
+
+## MTEB Results
+
+Model2Vec is evaluated on MTEB, as well as two additional tasks: [PEARL](https://github.com/tigerchen52/PEARL) (a phrase representation task) and WordSim (a collection of _word_ similarity tasks). The results are shown in the table below.
+
+
+
+| Model                  | Avg (All) | Avg (MTEB) | Class  | Clust  | PairClass | Rank   | Ret    | STS    | Sum    | Pearl  | WordSim |
+|:-----------------------|:---------:|:----------:|:------:|:------:|:---------:|:------:|:------:|:------:|:------:|:------:|:-------:|
+| all-MiniLM-L6-v2        | 56.08     | 56.09      | 62.62  | 41.94  | 82.37     | 58.04  | 41.95  | 78.90  | 30.81  | 60.83  | 49.91   |
+| M2V_base_glove_subword  | 49.06     | 46.69      | 61.27  | 30.03  | 74.71     | 49.15  | 27.16  | 69.09  | 30.08  | 56.82  | 57.99   |
+| M2V_base_glove          | 48.58     | 47.60      | 61.35  | 30.52  | 75.34     | 48.50  | 29.26  | 70.31  | 31.50  | 50.28  | 54.29   |
+| M2V_base_output         | 46.79     | 45.34      | 61.25  | 25.58  | 74.90     | 47.63  | 26.14  | 68.58  | 29.20  | 54.02  | 49.18   |
+| GloVe_300d              | 42.84     | 42.36      | 57.31  | 27.66  | 72.48     | 43.30  | 22.78  | 61.90  | 28.81  | 45.65  | 43.05   |
+| BPEmb_50k_300d          | 39.34     | 37.78      | 55.76  | 23.35  | 57.86     | 43.21  | 17.50  | 55.10  | 29.74  | 47.56  | 41.28   |
+| WL256*                  | 48.88     | 49.36      | 58.98  | 33.34  | 74.00     | 52.03  | 33.12  | 73.34  | 29.05  | 48.81  | 45.16   |
+
+
+<details>
+  <summary>  Task Abbreviations </summary>
+
+For readability, the MTEB task names are abbreviated as follows:
+- Class: Classification
+- Clust: Clustering
+- PairClass: PairClassification
+- Rank: Reranking
+- Ret: Retrieval
+- STS: Semantic Textual Similarity
+- Sum: Summarization
+</details>
+
+\
+\* WL256, introduced in the [WordLlama](https://github.com/dleemiller/WordLlama/tree/main) package, is included for comparison due to its similarities to Model2Vec. However, we believe it is heavily overfit to the MTEB dataset since it is trained on datasets used in MTEB itself. This can be seen by the fact that the WL256 model performs worse on the non-MTEB tasks (PEARL and WordSim) than our models, despite scoring higher on MTEB. The results shown in the [Classification and Speed Benchmarks](#classification-and-speed-benchmarks) further support this.
+
+## Classification and Speed Benchmarks
+
+In addition to the MTEB evaluation, we evaluate Model2Vec on a number of classification datasets. These are used as additional evidence to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
+
+
+| Model                  | Average | SST2   | IMDB  | TREC   | AG News |
+|:-----------------------|:-------:|:------:|:-----:|:------:|:-------:|
+| bge-base-en-v1.5        | 90.00   | 91.54  | 91.88 | 85.16  | 91.45   |
+| all-MiniLM-L6-v2        | 84.10   | 83.95  | 81.36 | 81.31  | 89.77   |
+| M2V_base_output         | 82.23   | 80.92  | 84.56 | 75.27  | 88.17   |
+| M2V_base_glove_subword  | 81.95   | 82.84  | 85.96 | 70.51  | 88.49   |
+| BPEmb_50k_300d          | 81.15   | 80.42  | 84.04 | 71.25  | 88.92   |
+| M2V_base_glove          | 80.76   | 83.07  | 85.24 | 66.12  | 88.61   |
+| WL256                   | 78.48   | 76.88  | 80.12 | 69.23  | 87.68   |
+| GloVe_300d              | 77.77   | 81.68  | 84.00 | 55.67  | 89.71   |
+
+
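+As can be seen, the `M2V_base_output` and `M2V_base_glove_subword` models achieve a higher average score than the GloVe, BPEmb, and WL256 models, although not on every individual dataset (for example, GloVe and BPEmb score higher on AG News). They are competitive with the all-MiniLM-L6-v2 model, while being much faster.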
+
+The figure below shows the relationship between the number of sentences per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
+This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
+
+| ![Description](../assets/images/speed_vs_accuracy_v3.png) |
+|:--:|
+|*Figure: The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.*|
+
+
+## Ablations
+
+To better understand the factors contributing to the performance of Model2Vec, we conducted a comprehensive set of ablation studies, covering various aspects of the model's architecture and preprocessing methods. In these studies, we examined the impact of key elements such as PCA, Zipf weighting, and the use of Sentence Transformers versus regular transformer models. We also compared the performance of input embeddings versus output embeddings, since it would seem plausible that these should also work well. The results are shown in the table below.
+
+
+| Model                        |   Avg (All) |   Avg (MTEB) |   Class |   Clust |   PairClass |   Rank |   Ret |   STS |   Sum |   Pearl |   WordSim |
+|:-----------------------------|------------:|-------------:|--------:|--------:|------------:|-------:|------:|------:|------:|--------:|----------:|
+| M2V_base_output              |       46.79 |        45.34 |   61.25 |   25.58 |       74.9  |  47.63 | 26.14 | 68.58 | 29.2  |   54.02 |     49.18 |
+| M2V_base_output_nopca        |       44.04 |        42.31 |   61.42 |   20.15 |       68.21 |  44.67 | 25.25 | 61.87 | 29.85 |   51.02 |     48.96 |
+| M2V_base_output_nozipf       |       43.61 |        41.52 |   60.44 |   21.62 |       72.15 |  45.57 | 20.35 | 62.71 | 30.66 |   52.28 |     49.17 |
+| M2V_base_input_nozipf_nopca  |       40.97 |        39.55 |   54.16 |   18.62 |       68.3  |  43.65 | 23.63 | 59.38 | 32.04 |   50.19 |     40.52 |
+| M2V_base_output_nozipf_nopca |       40.8  |        38.44 |   59.78 |   19.31 |       62.39 |  42.26 | 19.01 | 55.16 | 30    |   49.09 |     48.97 |
+| M2V_base_input               |       40.74 |        39.93 |   60.35 |   22.66 |       59.63 |  43.02 | 25.47 | 50.05 | 29.35 |   50.61 |     34.47 |
+| M2V_bert_output_nozipf_nopca |       35.54 |        34.82 |   55.69 |   15.42 |       58.68 |  39.87 | 12.92 | 55.24 | 30.15 |   46.9  |     26.72 |
+
+
+There are four main findings in these results:
+1. Non-Sentence Transformers do not work well. This can be seen by comparing `M2V_bert_output_nozipf_nopca` (which uses [BERT](https://huggingface.co/google-bert/bert-base-uncased), a non-Sentence Transformer) and `M2V_base_output_nozipf_nopca` (which uses [BGE-base](https://huggingface.co/BAAI/bge-base-en-v1.5), a Sentence Transformer). Using a Sentence Transformer raises Avg (All) from 35.54 to 40.80, a gain of ~5.3 points.
+2. PCA is crucial for performance. This can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nozipf`: adding PCA raises Avg (All) from 40.80 to 43.61, a gain of ~2.8 points. Furthermore, in this comparison PCA improves performance on _all_ tasks.
+3. Zipf weighting is crucial for performance. This can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nopca`: adding Zipf weighting raises Avg (All) from 40.80 to 44.04, a gain of ~3.2 points.
+4. Output embeddings outperform input embeddings. This can be seen by comparing `M2V_base_input` and `M2V_base_output`: using output embeddings raises Avg (All) from 40.74 to 46.79, a gain of ~6.1 points. Note that input embeddings do work well for some tasks. We hypothesize that this is because input embeddings are inherently normalized.