From 41e086e84fdb26f1c7c81593e37d9d30841c8f1b Mon Sep 17 00:00:00 2001
From: Roque Lopez
Date: Tue, 19 Nov 2024 10:49:09 -0500
Subject: [PATCH] docs: Add documentation for value matching methods

---
 docs/source/index.rst          |  1 +
 docs/source/value-matching.rst | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)
 create mode 100644 docs/source/value-matching.rst

diff --git a/docs/source/index.rst b/docs/source/index.rst
index f93a377..17e51f3 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,3 +29,4 @@ You can find the source code in our `GitHub repository `_ (e.g., ``embedding`` and ``tfidf``) while others were originally implemented for bdikit (e.g., ``gpt``).
+To see how to use these methods, please refer to the documentation of :py:func:`~bdikit.api.match_values()` in the :py:mod:`~bdikit.api` module.
+
+.. ``bdikit module `.
+
+
+
+.. list-table:: bdikit methods
+    :header-rows: 1
+
+    * - Method
+      - Class
+      - Description
+    * - ``gpt``
+      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.GPTValueMatcher`
+      - | Leverages a large language model (GPT-4) to identify and select the most accurate value matches.
+
+.. list-table:: Methods from other libraries
+    :header-rows: 1
+
+    * - Method
+      - Class
+      - Description
+    * - ``tfidf``
+      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.TFIDFValueMatcher`
+      - | Employs a character-based n-gram TF-IDF approach to approximate edit distance by capturing the frequency and contextual importance of n-gram patterns within strings. This method uses Term Frequency-Inverse Document Frequency (TF-IDF) weighting to quantify the similarity between strings based on their shared n-gram features.
+    * - ``edit_distance``
+      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EditDistanceValueMatcher`
+      - | Computes the edit distance between lists of strings with a customizable scorer that supports various distance and similarity metrics.
+    * - ``embedding``
+      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EmbeddingValueMatcher`
+      - | Compares values by the cosine similarity of their embeddings. By default, it uses the ``bert-base-multilingual-cased`` model to generate contextualized embeddings, which enables multilingual matching.
+    * - ``fasttext``
+      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.FastTextValueMatcher`
+      - | Uses the cosine similarity of FastText embeddings to compare and align values, capturing both semantic and subword-level similarities.
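
Note on the ``tfidf`` entry in the table above: the character n-gram TF-IDF idea it describes can be sketched in plain Python. This is a minimal illustration under stated assumptions, not bdikit's implementation. The helper names (``char_ngrams``, ``tfidf_vectors``, ``match_values_sketch``), the trigram size, the space padding, and the smoothed IDF formula are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(values, n=3):
    """Build one L2-normalized TF-IDF vector (a dict) per input value."""
    counts = [Counter(char_ngrams(v, n)) for v in values]
    # Document frequency: in how many values each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(values)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vec = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def match_values_sketch(source_values, target_values, n=3):
    """For each source value, return (source, best target, similarity)."""
    vectors = tfidf_vectors(list(source_values) + list(target_values), n)
    src = vectors[:len(source_values)]
    tgt = vectors[len(source_values):]
    matches = []
    for s_val, s_vec in zip(source_values, src):
        scores = [cosine(s_vec, t_vec) for t_vec in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__)
        matches.append((s_val, target_values[best], scores[best]))
    return matches


if __name__ == "__main__":
    source = ["Adenocarcinoma", "squamous cell"]
    target = ["adenocarcinoma, NOS", "Squamous cell carcinoma", "melanoma"]
    for s, t, score in match_values_sketch(source, target):
        print(f"{s!r} -> {t!r} ({score:.2f})")
```

Values that share many character n-grams receive a high cosine similarity even when they differ by small edits, which is why this kind of scoring can stand in for edit distance on short strings.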