docs: Add documentation for value matching methods #90

Merged: 1 commit, Nov 19, 2024
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -29,3 +29,4 @@ You can find the source code in our `GitHub repository <https://github.com/VIDA-

api
schema-matching
value-matching
39 changes: 39 additions & 0 deletions docs/source/value-matching.rst
@@ -0,0 +1,39 @@
Value Matching Methods
======================

This page provides an overview of all value matching methods available in the ``bdikit`` library.
Some methods reuse implementations from other libraries such as `PolyFuzz <https://maartengr.github.io/PolyFuzz/>`_ (e.g., ``embedding`` and ``tfidf``), while others are implemented specifically for ``bdikit`` (e.g., ``gpt``).
To see how to use these methods, refer to the documentation of :py:func:`~bdikit.api.match_values()` in the :py:mod:`~bdikit.api` module.

.. ``bdikit module <api>`.



.. list-table:: bdikit methods
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``gpt``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.GPTValueMatcher`
      - | Leverages a large language model (GPT-4) to identify and select the most accurate value matches.

.. list-table:: Methods from other libraries
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``tfidf``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.TFIDFValueMatcher`
      - | Approximates edit distance using Term Frequency-Inverse Document Frequency (TF-IDF) weights over character n-grams. The weights capture how frequent and how distinctive each n-gram is, and strings are scored as similar based on the n-grams they share.
    * - ``edit_distance``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EditDistanceValueMatcher`
      - | Computes the edit distance between lists of strings using a customizable scorer that supports various distance and similarity metrics.
    * - ``embedding``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EmbeddingValueMatcher`
      - | Compares values using the cosine similarity of their embeddings. By default, it uses the ``bert-base-multilingual-cased`` model to generate contextualized embeddings, which enables multilingual matching.
    * - ``fasttext``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.FastTextValueMatcher`
      - | Compares and aligns values using the cosine similarity of FastText embeddings, capturing both semantic and subword-level similarities.
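
To make the idea behind the ``tfidf`` method more concrete, the following is a minimal, standard-library-only sketch of character n-gram TF-IDF matching with cosine similarity. It is not the code used by ``TFIDFValueMatcher`` or PolyFuzz. The function names (``char_ngrams``, ``tfidf_vectors``, ``cosine``, ``match_by_tfidf``), the trigram size, the space padding, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one sparse TF-IDF vector (dict of n-gram -> weight) per string."""
    grams = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(strings)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption of this sketch): n-grams that occur in
        # every string still keep a small, nonzero weight.
        vectors.append({
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
            for g, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_by_tfidf(source_values, target_values, n=3):
    """Map each source value to a (best target value, similarity) pair."""
    # Fit the weights on source and target values together so that both
    # sides share the same n-gram vocabulary and IDF statistics.
    vectors = tfidf_vectors(list(source_values) + list(target_values), n)
    src_vecs = vectors[:len(source_values)]
    tgt_vecs = vectors[len(source_values):]
    matches = {}
    for value, vec in zip(source_values, src_vecs):
        scores = [cosine(vec, t) for t in tgt_vecs]
        best = max(range(len(target_values)), key=scores.__getitem__)
        matches[value] = (target_values[best], scores[best])
    return matches


# Hypothetical example values, not taken from any bdikit dataset.
matches = match_by_tfidf(
    ["Stage IIa", "Stge III"],
    ["Stage I", "Stage IIA", "Stage III"],
)
for source_value, (target_value, score) in matches.items():
    print(f"{source_value!r} -> {target_value!r} (similarity {score:.2f})")
```

Because the sketch lowercases values before extracting n-grams, ``Stage IIa`` and ``Stage IIA`` produce identical vectors. The misspelled ``Stge III`` still pairs with ``Stage III`` because the two share the distinctive trailing n-grams, which is the sense in which n-gram overlap approximates edit distance.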