Added train test bleed example
Pringled committed Oct 11, 2024
1 parent 206cb0e commit 7119d28
Showing 2 changed files with 12 additions and 10 deletions.
2 changes: 1 addition & 1 deletion tutorials/README.md
@@ -12,4 +12,4 @@ This is a list of all our tutorials. They are all self-contained ipython noteboo
| | what? | Link |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| **Recipe search** | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) |
| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test bleed. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
| **Semantic deduplication** | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) |
20 changes: 11 additions & 9 deletions tutorials/semantic_deduplication.ipynb
@@ -79,9 +81,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates.\n",
"As can be seen, we find no duplicate instances using exact string matching. \n",
"\n",
"**Deduplication using Model2Vec**"
"**Deduplication using Model2Vec**\n",
"\n",
"Let's use Model2Vec to embed our documents and identify duplicates."
]
},
{
@@ -256,9 +258,9 @@
"source": [
"The found texts do indeed seem to be duplicates, nice! In a normal workflow where we use Model2Vec to embed our documents, deduplicating our training corpus is essentially free. This gives us an easy-to-use, easy-to-integrate, fast way to deduplicate.\n",
"\n",
"For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.\n",
"**Deduplication using WordLlama**\n",
"\n",
"**Deduplication using WordLlama**"
"For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data."
]
},
{
@@ -290,9 +292,9 @@
"source": [
"This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds). It also finds fewer duplicates with the same threshold.\n",
"\n",
"As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.\n",
"**Deduplication using MinHash**\n",
"\n",
"**Deduplication using MinHash**"
"As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates."
]
},
{
@@ -352,9 +354,9 @@
"source": [
"Model2Vec is again much faster, with 27 seconds vs 56 seconds for MinHash. The number of found duplicates is roughly the same using the default settings for MinHash.\n",
"\n",
"Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n",
"**Train-test leakage detection using Model2Vec**\n",
"\n",
"**Train-test leakage detection using Model2Vec**"
"Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n"
]
},
{
@@ -501,7 +503,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"These again look like duplicates. We can very efficiently find train/test bleed examples using Model2Vec, ensuring that our test set is clean and does not contain any duplicates from the training set."
"These again look like duplicates. We can very efficiently find train/test leakage examples using Model2Vec, ensuring that our test set is clean and does not contain any duplicates from the training set."
]
}
],
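
For readers of this diff who do not have the notebook open, the train-test leakage check described in the last markdown cells can be sketched roughly as follows. This is only an illustration under stated assumptions, not the notebook's code: `toy_embed` is a hypothetical character n-gram stand-in for the Model2Vec embeddings, `find_leakage` is a made-up helper name, and the 0.9 threshold is an arbitrary choice.

```python
import math
from collections import Counter


def toy_embed(text, n=3):
    """Hypothetical stand-in for a real embedding model: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_leakage(train, test, threshold=0.9):
    """Return (test_index, train_index, similarity) for each test text whose
    most similar training text reaches the (assumed) similarity threshold."""
    train_vecs = [toy_embed(t) for t in train]
    if not train_vecs:
        return []
    hits = []
    for i, text in enumerate(test):
        vec = toy_embed(text)
        best_j, best_sim = max(
            ((j, cosine(vec, tv)) for j, tv in enumerate(train_vecs)),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            hits.append((i, best_j, best_sim))
    return hits


train = ["Stocks rally as markets recover.", "New phone released this week."]
test = ["Stocks rally as markets recover!", "Local team wins the championship."]
leaks = find_leakage(train, test)
for test_idx, train_idx, sim in leaks:
    print(f"test[{test_idx}] ~ train[{train_idx}] (similarity {sim:.2f})")
```

With real embeddings the brute-force loop would normally be replaced by a vectorized or approximate nearest-neighbor search, but the idea is the same: flag any test item whose closest training item is above the chosen threshold.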