diff --git a/tutorials/semantic_deduplication.ipynb b/tutorials/semantic_deduplication.ipynb index f19938f..5e3490f 100644 --- a/tutorials/semantic_deduplication.ipynb +++ b/tutorials/semantic_deduplication.ipynb @@ -4,110 +4,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "asd" + "**Semantic Deduplication with Model2Vec**\n", + "\n", + "In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicates: documents that differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. Because Model2Vec is fast, we can deduplicate large datasets efficiently, ensuring cleaner, more robust data for downstream tasks." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: datasets in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (3.0.0)\n", - "Requirement already satisfied: model2vec in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (0.2.2)\n", - "Requirement already satisfied: reach in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (4.1.1)\n", - "Requirement already satisfied: numpy in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (1.26.4)\n", - "Requirement already satisfied: tqdm in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (4.66.5)\n", - "Requirement 
already satisfied: filelock in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (3.13.1)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (17.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (0.3.8)\n", - "Requirement already satisfied: pandas in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (2.2.0)\n", - "Requirement already satisfied: requests>=2.32.2 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (2.32.3)\n", - "Requirement already satisfied: xxhash in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (3.5.0)\n", - "Requirement already satisfied: multiprocess in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2024.6.1,>=2023.1.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from fsspec[http]<=2024.6.1,>=2023.1.0->datasets) (2024.2.0)\n", - "Requirement already satisfied: aiohttp in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (3.9.3)\n", - "Requirement already satisfied: huggingface-hub>=0.22.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (0.25.1)\n", - "Requirement already satisfied: packaging in 
/Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (23.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from datasets) (6.0.1)\n", - "Requirement already satisfied: click in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (8.1.7)\n", - "Requirement already satisfied: rich in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (13.7.0)\n", - "Requirement already satisfied: typer in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (0.12.5)\n", - "Requirement already satisfied: transformers in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (4.44.2)\n", - "Requirement already satisfied: torch in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (2.2.0)\n", - "Requirement already satisfied: tokenizers in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (0.19.1)\n", - "Requirement already satisfied: scikit-learn in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (1.5.1)\n", - "Requirement already satisfied: setuptools in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from model2vec) (69.2.0)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n", - 
"Requirement already satisfied: attrs>=17.3.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (23.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.1)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.5)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.4)\n", - "Requirement already satisfied: async-timeout<5.0,>=4.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from huggingface-hub>=0.22.0->datasets) (4.9.0)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.6)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (1.26.18)\n", - "Requirement already satisfied: certifi>=2017.4.17 in 
/Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2024.2.2)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n", - "Requirement already satisfied: tzdata>=2022.7 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n", - "Requirement already satisfied: markdown-it-py>=2.2.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from rich->model2vec) (3.0.0)\n", - "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from rich->model2vec) (2.17.2)\n", - "Requirement already satisfied: scipy>=1.6.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from scikit-learn->model2vec) (1.12.0)\n", - "Requirement already satisfied: joblib>=1.2.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from scikit-learn->model2vec) (1.4.2)\n", - "Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from scikit-learn->model2vec) (3.5.0)\n", - "Requirement already satisfied: sympy in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from torch->model2vec) (1.12)\n", - "Requirement already 
satisfied: networkx in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from torch->model2vec) (3.2.1)\n", - "Requirement already satisfied: jinja2 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from torch->model2vec) (3.1.2)\n", - "Requirement already satisfied: regex!=2019.12.17 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from transformers->model2vec) (2023.12.25)\n", - "Requirement already satisfied: safetensors>=0.4.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from transformers->model2vec) (0.4.2)\n", - "Requirement already satisfied: shellingham>=1.3.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from typer->model2vec) (1.5.4)\n", - "Requirement already satisfied: mdurl~=0.1 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->model2vec) (0.1.2)\n", - "Requirement already satisfied: six>=1.5 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from jinja2->torch->model2vec) (2.1.5)\n", - "Requirement already satisfied: mpmath>=0.19 in /Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages (from sympy->torch->model2vec) (1.3.0)\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - 
"/Users/thomasvandongen/Recommenders/repositories/recommenders-shared-tools/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], + "outputs": [], "source": [ - "!pip install datasets model2vec reach numpy tqdm \n", + "!pip install datasets model2vec reach numpy tqdm python-Levenshtein datasketch\n", "from datasets import load_dataset\n", "from model2vec import StaticModel\n", "from reach import Reach\n", "import numpy as np\n", - "from tqdm import tqdm\n", - "from time import perf_counter" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "asdasd" + "from tqdm import tqdm" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": 11, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ @@ -117,9 +35,16 @@ "texts = ds['text']" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will first try to find exact matches in the dataset as a baseline. Then, we will use Model2Vec to identify semantic duplicates." + ] + }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -128,18 +53,24 @@ "120000" ] }, - "execution_count": 13, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "#texts = [\"asd\", \"asd\", \"tads\", \"qwea\"]\n", "seen = set()\n", "deduplicated_text_indices = np.array([i for i, text in enumerate(texts) if text not in seen and not seen.add(text)])\n", "len(deduplicated_text_indices)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As can be seen, we find no duplicate instances using exact string matching. Now, let's use Model2Vec to embed our documents and identify duplicates." 
+ ] + }, { "cell_type": "code", "execution_count": 4, @@ -161,28 +92,29 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ - "def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> np.ndarray:\n", + "# Define a function to deduplicate embeddings\n", + "def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[np.ndarray, dict[int, int]]:\n", " \"\"\"\n", - " Deduplicate embeddings and return the deduplicated indices.\n", + " Deduplicate embeddings and return the deduplicated indices and a mapping of removed indices to their corresponding original indices.\n", " \n", " :param embedding_matrix: The embeddings to deduplicate.\n", " :param threshold: The similarity threshold to use for deduplication.\n", " :param batch_size: The batch size to use for similarity computation.\n", - " :return: The deduplicated indices.\n", + " :return: A tuple containing the deduplicated indices and a dictionary mapping removed indices to original indices.\n", " \"\"\"\n", " reach = Reach(vectors=embedding_matrix, items=[str(i) for i in range(len(embedding_matrix))])\n", " \n", " # Find similar documents\n", - " similarity_cutoff = 1 - threshold\n", " is_duplicate = np.zeros(len(embedding_matrix), dtype=bool)\n", - " \n", + " duplicate_to_original_mapping = {}\n", + "\n", " results = reach.threshold(\n", " [str(i) for i in range(len(embedding_matrix))], \n", - " threshold=similarity_cutoff, \n", + " threshold=threshold, \n", " batch_size=batch_size, \n", " show_progressbar=True\n", " )\n", @@ -192,25 +124,105 @@ " if is_duplicate[i]:\n", " continue # Skip already marked duplicates\n", "\n", + " # Similar items are returned as (index, score), we are only interested in the index\n", " similar_indices = [int(item[0]) for item in similar_items if int(item[0]) != i]\n", - " is_duplicate[similar_indices] = True\n", + " \n", + " # Mark similar 
documents as duplicates and map them to the original\n", + " for sim_idx in similar_indices:\n", + " is_duplicate[sim_idx] = True\n", + " duplicate_to_original_mapping[sim_idx] = i # Map duplicate to original\n", "\n", " deduplicated_indices = np.where(~is_duplicate)[0]\n", "\n", - " return deduplicated_indices" + " return deduplicated_indices, duplicate_to_original_mapping\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 99%|█████████▉| 117/118 [00:24<00:00, 4.77it/s]\n", + "100%|██████████| 120000/120000 [00:00<00:00, 945566.39it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "118769" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Deduplicate (with a high threshold)\n", + "deduplicated_indices, duplicate_to_original_mapping = deduplicate(embedding_matrix, threshold=0.99)\n", + "len(deduplicated_indices)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using Model2Vec with a very high threshold (0.99), we flag 1,231 of the 120,000 documents as duplicates in under 30 seconds. Now, let's look at a few examples to check whether these are indeed duplicates." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.\n", + "Duplicate text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market this week during the depth of the\\summer doldrums.\n", + "--------------------------------------------------\n", + "Original text: Oil and Economy Cloud Stocks' Outlook NEW YORK (Reuters) - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.\n", + "Duplicate text: Oil and Economy Cloud Stocks' Outlook NEW YORK (Reuters) - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market this week during the depth of the summer doldrums.\n", + "--------------------------------------------------\n", + "Original text: Phelps, Thorpe Advance in 200 Freestyle ATHENS, Greece - Michael Phelps took care of qualifying for the Olympic 200-meter freestyle semifinals Sunday, and then found out he had been added to the American team for the evening's 400 freestyle relay final. Phelps' rivals Ian Thorpe and Pieter van den Hoogenband and teammate Klete Keller were faster than the teenager in the 200 free preliminaries...\n", + "Duplicate text: Phelps, Thorpe Advance in 200 Freestyle ATHENS, Greece - Michael Phelps took care of qualifying for the Olympic 200-meter freestyle semifinals Sunday, and then found out he had been added to the American team for the evening's 400 freestyle relay final. 
Phelps' rivals Ian Thorpe and Pieter van den Hoogenband and teammate Klete Keller were faster than the teenager in the 200 free preliminaries...\n", + "--------------------------------------------------\n", + "Original text: Government Spending Up Sharply Locally Federal procurement spending in the Washington area rose last year at its highest rate since the 1980s, according to a study to be released today, creating tens of thousands of jobs and increasing economic growth disproportionately in Northern Virginia.\n", + "Duplicate text: Government Spending Up Sharply Locally Federal procurement spending in the Washington area rose last year at its highest rate since the 1980s, according to a study to be released today, creating tens of thousands of jobs and increasing economic growth disproportionately in Northern Virginia.\n", + "--------------------------------------------------\n", + "Original text: F.B.I. Goes Knocking for Political Troublemakers The F.B.I. has been questioning demonstrators in an effort to forestall violent protests at the Republican National Convention.\n", + "Duplicate text: F.B.I. Goes Knocking for Political Troublemakers The F.B.I. 
has been questioning demonstrators in an effort to forestall violent protests at the Republican convention.\n", + "--------------------------------------------------\n" + ] + } + ], + "source": [ + "# Show a few duplicates with their originals\n", + "num_examples = 5\n", + "for duplicate_idx, original_idx in list(duplicate_to_original_mapping.items())[:num_examples]:\n", + " print(f\"Original text: {texts[original_idx]}\")\n", + " print(f\"Duplicate text: {texts[duplicate_idx]}\")\n", + " print(\"-\" * 50)" + ] + }, + { + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "deduplicated_indices = deduplicate(embedding_matrix, threshold=0.9)\n", - "# Get deduplicated documents and embeddings\n", - "deduplicated_docs = [texts[i] for i in deduplicated_indices]\n", - "deduplicated_embeddings = embedding_matrix[deduplicated_indices]" + "The texts we found do indeed appear to be duplicates, nice! In a normal workflow where we already use Model2Vec to embed our documents, deduplicating our training corpus is essentially free. This gives us an easy-to-use, easy-to-integrate, fast alternative to other methods such as MinHash." ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {