Added train test bleed example
Pringled committed Oct 11, 2024
1 parent 91e958b commit 206cb0e
Showing 1 changed file with 23 additions and 11 deletions.
34 changes: 23 additions & 11 deletions tutorials/semantic_deduplication.ipynb
@@ -6,7 +6,7 @@
"source": [
"**Semantic Deduplication with Model2Vec**\n",
"\n",
"In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicates—documents that may differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. With Model2Vec’s speed and efficiency, we can very efficiently perform deduplication on large datasets, ensuring cleaner, more robust data for downstream tasks."
"In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicates—documents that may differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. With Model2Vec’s speed and efficiency, we can very efficiently perform deduplication on large datasets, ensuring cleaner, more robust data for downstream tasks. We can also use Model2Vec to detect train-test overlap, ensuring that our models are not overfitting to the training data."
]
},
{
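The notebook’s code cells are collapsed in this diff, but the approach the intro describes is easy to sketch: embed two texts and compare them with cosine similarity. A minimal sketch, assuming the `minishlab/M2V_base_output` model (the model actually used in the notebook is not visible here):

```python
import numpy as np
from model2vec import StaticModel

# Illustrative model choice; any Model2Vec static model works the same way.
model = StaticModel.from_pretrained("minishlab/M2V_base_output")

a = "The weather is lovely today."
b = "Today the weather is really nice."
emb = model.encode([a, b])  # one vector per input text

# Cosine similarity: close to 1.0 for near-duplicates, near 0.0 for unrelated texts.
sim = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"cosine similarity: {sim:.3f}")
```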
@@ -16,15 +16,17 @@
"outputs": [],
"source": [
"!pip install datasets model2vec reach numpy wordllama tqdm datasketch\n",
"\n",
"from difflib import ndiff\n",
"from time import perf_counter\n",
"\n",
"from datasets import load_dataset\n",
"from datasketch import MinHash, MinHashLSH\n",
"import numpy as np\n",
"from model2vec import StaticModel\n",
"from reach import Reach\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"from difflib import ndiff\n",
"from wordllama import WordLlama\n",
"from time import perf_counter\n",
"from datasketch import MinHash, MinHashLSH"
"from wordllama import WordLlama"
]
},
{
Expand All @@ -43,7 +45,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will first try to find exact matches in the dataset as a baseline. Then, we will use Model2Vec to identify semantic duplicates."
"**Exact overlap baseline**\n",
"\n",
"We will first try to find exact matches in the dataset as a baseline."
]
},
{
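The baseline cell itself is collapsed in the diff; one straightforward way to implement it is a single pass with a hash map, which catches only verbatim matches (the helper name below is hypothetical):

```python
def exact_duplicates(texts):
    """Return (first_index, duplicate_index) pairs for verbatim matches."""
    seen = {}
    duplicates = []
    for i, text in enumerate(texts):
        key = text.strip()
        if key in seen:
            duplicates.append((seen[key], i))
        else:
            seen[key] = i
    return duplicates

print(exact_duplicates(["Hello world", "Goodbye world", "Hello world"]))  # [(0, 2)]
```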
@@ -75,7 +79,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates."
"As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates.\n",
"\n",
"**Deduplication using Model2Vec**"
]
},
{
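The notebook imports the reach library for fast nearest-neighbor search; the sketch below swaps in a plain quadratic numpy scan so the logic stays visible. The model name and the 0.9 threshold are illustrative assumptions, not the notebook’s verified settings.

```python
import numpy as np
from model2vec import StaticModel

def semantic_duplicates(texts, threshold=0.9):
    """Return (i, j, similarity) for pairs whose cosine similarity >= threshold."""
    model = StaticModel.from_pretrained("minishlab/M2V_base_output")
    emb = model.encode(texts)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    sims = emb @ emb.T  # full cosine-similarity matrix (O(n^2) memory)
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs
```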
@@ -250,7 +256,9 @@
"source": [
"The found texts do indeed seem to be duplicates, nice! In a normal workflow where we use Model2Vec to embed our documents, deduplication our training corpus is essentially free. This gives us an easy to use, easy to integrate, fast way to deduplicate.\n",
"\n",
"For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data."
"For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.\n",
"\n",
"**Deduplication using WordLlama**"
]
},
{
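For the WordLlama comparison, its documented `deduplicate` helper does the embedding and thresholding in one call; the 0.8 threshold below is illustrative. A sketch, assuming the default WordLlama model:

```python
from wordllama import WordLlama

wl = WordLlama.load()  # loads the default static embedding model
docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "Stocks fell sharply today.",
]
# Drops documents whose similarity to an earlier document exceeds the threshold.
deduplicated = wl.deduplicate(docs, threshold=0.8)
print(deduplicated)
```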
@@ -282,7 +290,9 @@
"source": [
"This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds). It also finds less duplicates with the same threshold.\n",
"\n",
"As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates."
"As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.\n",
"\n",
"**Deduplication using MinHash**"
]
},
{
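MinHash approximates Jaccard similarity over token sets rather than embedding semantics, which is why it behaves differently on paraphrases. A minimal datasketch sketch, with an illustrative 0.8 threshold:

```python
from datasketch import MinHash, MinHashLSH

def minhash_duplicates(texts, threshold=0.8, num_perm=128):
    """Return (earlier_key, index) pairs flagged as near-duplicates by LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    duplicates = []
    for i, text in enumerate(texts):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():  # word-level shingles
            m.update(token.encode("utf8"))
        for match in lsh.query(m):  # similar items inserted so far
            duplicates.append((match, i))
        lsh.insert(str(i), m)
    return duplicates
```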
@@ -342,7 +352,9 @@
"source": [
"Model2Vec is again much faster, with 27 seconds vs 56 seconds for MinHash. The number of found duplicates is roughly the same using the default settings for MinHash.\n",
"\n",
"Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set."
"Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n",
"\n",
"**Train test leagage detection using Model2Vec**"
]
},
{
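The leakage check follows the same pattern as deduplication, just across two splits: embed train and test, then flag test items whose most similar training item clears the threshold. Model name and threshold are again illustrative assumptions:

```python
import numpy as np
from model2vec import StaticModel

def find_leakage(train_texts, test_texts, threshold=0.9):
    """Return (test_idx, train_idx, similarity) for suspected leaked test items."""
    model = StaticModel.from_pretrained("minishlab/M2V_base_output")
    train = model.encode(train_texts)
    test = model.encode(test_texts)
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)
    sims = test @ train.T  # cosine similarity of every test item vs. every train item
    leaked = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))  # nearest training item for this test item
        if row[j] >= threshold:
            leaked.append((i, j, float(row[j])))
    return leaked
```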
