Added train test bleed example

MinishLab · Oct 11, 2024 · 7119d28
1 parent 206cb0e
Showing 2 changed files with 12 additions and 10 deletions.
diff --git a/tutorials/README.md b/tutorials/README.md
@@ -12,4 +12,4 @@ This is a list of all our tutorials. They are all self-contained ipython noteboo
 |                    | what?                                                                                                                                                                      | Link |
 |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
 | **Recipe search**   | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb)     |
-| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test bleed. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
+| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
diff --git a/tutorials/semantic_deduplication.ipynb b/tutorials/semantic_deduplication.ipynb
@@ -79,9 +79,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates.\n",
+    "As can be seen, we find no duplicate instances using exact string matching. \n",
     "\n",
-    "**Deduplication using Model2Vec**"
+    "**Deduplication using Model2Vec**\n",
+    "\n",
+    "Let's use Model2Vec to embed our documents and identify duplicates."
    ]
   },
   {
@@ -256,9 +258,9 @@
    "source": [
     "The found texts do indeed seem to be duplicates, nice! In a normal workflow where we use Model2Vec to embed our documents, deduplicating our training corpus is essentially free. This gives us an easy-to-use, easy-to-integrate, fast way to deduplicate.\n",
     "\n",
-    "For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.\n",
+    "**Deduplication using WordLlama**\n",
     "\n",
-    "**Deduplication using WordLlama**"
+    "For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data."
    ]
   },
   {
@@ -290,9 +292,9 @@
    "source": [
     "This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds). It also finds fewer duplicates with the same threshold.\n",
     "\n",
-    "As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.\n",
+    "**Deduplication using MinHash**\n",
     "\n",
-    "**Deduplication using MinHash**"
+    "As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates."
    ]
   },
   {
@@ -352,9 +354,9 @@
    "source": [
     "Model2Vec is again much faster, with 27 seconds vs 56 seconds for MinHash. The number of found duplicates is roughly the same using the default settings for MinHash.\n",
     "\n",
-    "Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n",
+    "**Train-test leakage detection using Model2Vec**\n",
     "\n",
-    "**Train test leagage detection using Model2Vec**"
+    "Now, as a last experiment, let's also embed the test set and see if there are any duplicates between the training and test sets. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n"
    ]
   },
   {
@@ -501,7 +503,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "These again look like duplicates. We can very efficiently find train/test bleed examples using Model2Vec, ensuring that our test set is clean and does not contain any duplicates from the training set."
+    "These again look like duplicates. We can very efficiently find train/test leakage examples using Model2Vec, ensuring that our test set is clean and does not contain any duplicates from the training set."
    ]
   }
  ],