diff --git a/tutorials/semantic_deduplication.ipynb b/tutorials/semantic_deduplication.ipynb
index 98a566a..c595edf 100644
--- a/tutorials/semantic_deduplication.ipynb
+++ b/tutorials/semantic_deduplication.ipynb
@@ -6,7 +6,7 @@
   "source": [
    "**Semantic Deduplication with Model2Vec**\n",
    "\n",
-   "In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicates—documents that may differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. With Model2Vec’s speed and efficiency, we can very efficiently perform deduplication on large datasets, ensuring cleaner, more robust data for downstream tasks."
+   "In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicates—documents that may differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. With Model2Vec’s speed, we can efficiently perform deduplication on large datasets, ensuring cleaner, more robust data for downstream tasks. We can also use Model2Vec to detect train/test overlap, ensuring that our evaluation scores are not inflated by test instances that also appear in the training data."
  ]
 },
 {
@@ -16,15 +16,17 @@
   "outputs": [],
   "source": [
    "!pip install datasets model2vec reach numpy wordllama tqdm datasketch\n",
+   "\n",
+   "from difflib import ndiff\n",
+   "from time import perf_counter\n",
+   "\n",
    "from datasets import load_dataset\n",
+   "from datasketch import MinHash, MinHashLSH\n",
+   "import numpy as np\n",
    "from model2vec import StaticModel\n",
    "from reach import Reach\n",
-   "import numpy as np\n",
    "from tqdm import tqdm\n",
-   "from difflib import ndiff\n",
-   "from wordllama import WordLlama\n",
-   "from time import perf_counter\n",
-   "from datasketch import MinHash, MinHashLSH"
+   "from wordllama import WordLlama"
  ]
 },
 {
@@ -43,7 +45,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "We will first try to find exact matches in the dataset as a baseline. Then, we will use Model2Vec to identify semantic duplicates."
+   "**Exact overlap baseline**\n",
+   "\n",
+   "We will first try to find exact matches in the dataset as a baseline."
  ]
 },
 {
@@ -75,7 +79,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates."
+   "As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates.\n",
+   "\n",
+   "**Deduplication using Model2Vec**"
  ]
 },
 {
@@ -250,7 +256,9 @@
   "source": [
    "The found texts do indeed seem to be duplicates, nice! In a normal workflow where we use Model2Vec to embed our documents, deduplication our training corpus is essentially free. This gives us an easy to use, easy to integrate, fast way to deduplicate.\n",
    "\n",
-   "For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data."
+ "For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.\n", + "\n", + "**Deduplication using WordLlama**" ] }, { @@ -282,7 +290,9 @@ "source": [ "This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds). It also finds less duplicates with the same threshold.\n", "\n", - "As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates." + "As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.\n", + "\n", + "**Deduplication using MinHash**" ] }, { @@ -342,7 +352,9 @@ "source": [ "Model2Vec is again much faster, with 27 seconds vs 56 seconds for MinHash. The number of found duplicates is roughly the same using the default settings for MinHash.\n", "\n", - "Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set." + "Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n", + "\n", + "**Train test leagage detection using Model2Vec**" ] }, {