[Similarity module] Add more similarity measurements #124

FanwangM · 2023-05-29T08:03:24Z

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html in the similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or SciPy.

One question I have: should we separate the similarity and distance measurements? Some measures confuse me, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but AIMSim treats it as a similarity (https://vlachosgroup.github.io/AIMSim/implemented_metrics.html). If we decide to distinguish them, we may need separate similarity and distance modules instead of one module.

@PaulWAyers @FarnazH


PaulWAyers · 2023-05-30T12:21:37Z

As explained on Wikipedia there is a Tanimoto similarity and a Tanimoto distance. So both exist.

The easiest test is to compare an object to itself. Its similarity is greater than zero (often one) and the distance is zero.
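A minimal sketch of that self-comparison test, assuming binary fingerprints stored as NumPy 0/1 arrays. The names `tanimoto_similarity` and `tanimoto_distance` are hypothetical helpers for illustration, not functions from this package, and the distance is taken here as one minus the similarity:

```python
import numpy as np


def tanimoto_similarity(x, y):
    """Tanimoto (Jaccard) similarity of two binary fingerprints."""
    both_on = np.sum(x & y)     # bits set in both fingerprints
    either_on = np.sum(x | y)   # bits set in at least one fingerprint
    return both_on / either_on  # assumes at least one bit is on


def tanimoto_distance(x, y):
    """Tanimoto distance, taken here as 1 - similarity (an assumption)."""
    return 1.0 - tanimoto_similarity(x, y)


fp = np.array([1, 0, 1, 0, 1])
self_similarity = tanimoto_similarity(fp, fp)  # similarity of an object to itself
self_distance = tanimoto_distance(fp, fp)      # distance of an object to itself
```

Comparing `fp` with itself gives a similarity of one and a distance of zero, which is the quickest way to tell which kind of measure a given function returns.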

I feel like it is better to add AIMSim as a dependency. Implementing 30+ methods is a lot of work.

We may wish to have a few basic methods implemented; the most common distance metrics and similarity measures are already available in scikit-learn:
(distances) sklearn.metrics.DistanceMetric
(similarities and divergences) sklearn.metrics.pairwise

I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then treat interfacing to AIMSim as a follow-up task.
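As a rough illustration of what scikit-learn already provides, the sketch below computes a Jaccard (Tanimoto) distance matrix and a cosine similarity matrix for two example fingerprints. It assumes scikit-learn is installed; the fingerprint values are made up for the example, and this is not how the package itself calls scikit-learn:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Two binary fingerprints, one per row (boolean dtype for the Jaccard metric).
X = np.array([[1, 0, 1, 0, 1],
              [1, 1, 1, 0, 0]], dtype=bool)

# Jaccard (Tanimoto) distance matrix; the diagonal is zero.
jaccard_dist = pairwise_distances(X, metric="jaccard")

# Cosine similarity matrix; the diagonal is one.
cos_sim = cosine_similarity(X.astype(float))
```

The zero diagonal of `jaccard_dist` and the unit diagonal of `cos_sim` show directly which function returns a distance and which returns a similarity.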

I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them.
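One possible shape for such a converter, as a sketch only. The function names and the two specific mappings (`d = 1 - s` and `s = 1 / (1 + d)`) are assumptions chosen for illustration; other conversions exist, and these two are not inverses of each other:

```python
import numpy as np


def similarity_to_distance(s):
    """Map a similarity in [0, 1] to a distance in [0, 1] via d = 1 - s."""
    return 1.0 - np.asarray(s, dtype=float)


def distance_to_similarity(d):
    """Map a non-negative distance to a similarity in (0, 1] via s = 1 / (1 + d)."""
    return 1.0 / (1.0 + np.asarray(d, dtype=float))


# Identical objects (similarity 1) map to distance 0.
dists = similarity_to_distance(np.array([1.0, 0.5, 0.0]))

# Distance 0 maps to similarity 1; larger distances map to smaller similarities.
sims = distance_to_similarity(np.array([0.0, 1.0, 3.0]))
```

Keeping the direction in the function name makes it hard to pass a distance where a similarity is expected.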

FanwangM · 2023-06-02T11:23:58Z

Yes, we should make them distinguishable and as obvious as we can to avoid any ambiguity.

FarnazH · 2023-06-20T00:49:46Z

Update: We decided not to include any wrappers for functionality in other packages (reason: additional overhead and an unnecessary dependency). Instead, we will showcase how our package works with other libraries in notebooks/tutorials.

PaulWAyers · 2023-07-06T14:36:52Z

@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them.

ramirandaq · 2023-07-06T16:41:54Z

Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints.

sim_indices.txt

FanwangM · 2023-07-06T19:37:21Z

Thanks for sharing. I am copying @ramirandaq's code here for readability.

import numpy as np

# Pairwise similarity indices calculated over binary fingerprints

def indicators(x, y):
    """Calculating base descriptors
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p

# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russell-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

a, d, dis, p = indicators(x, y)

bub = ((a * d)**0.5 + a)/((a * d)**0.5 + a + dis)

fai = (a + 0.5 * d)/p

ja = (3 * a)/(3 * a + dis)

jt = a/(a + dis)

rt = (a + d)/(p + dis)

rr = a/p

sm = (a + d)/p

ss1 = a/(a + 2 * dis)

ss2 = (2 * (a + d))/(p + (a + d))

PaulWAyers · 2023-07-09T18:35:51Z

Just to clarify, all of these are "bitwise". We have:
a = logical "and" between bitstrings; the intersection between sets if, for each element, "1" or "on" means an element/feature is present.
d = logical "nor" between bitstrings (i.e. "not x and not y"); {universe} - {union} between sets if "1" or "on" means an element is present. So these are "features that are not present in either set".
dis = logical "exclusive or" between bitstrings; {union} - {intersection} if "1" or "on" means an element is present. So these are "features that are present in one item, but not present in the other".
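The three counts can be sketched with NumPy's bitwise operators on boolean arrays, using the same example fingerprints as the code above (the boolean dtype is a choice made for this sketch, not something the original snippet requires):

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1], dtype=bool)
y = np.array([1, 1, 1, 0, 0], dtype=bool)

a = np.count_nonzero(x & y)      # on in both: intersection
d = np.count_nonzero(~(x | y))   # off in both: universe minus union
dis = np.count_nonzero(x ^ y)    # on in exactly one: union minus intersection
p = x.size                       # fingerprint length

# The three counts partition the fingerprint.
assert a + d + dis == p
```

Because every bit falls into exactly one of the three groups, `dis` can also be recovered as `p - a - d`, which is what the `indicators` function does.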

As Ramon notes, most of these are just one-line formulas. For features that aren't "logical" (binary), there are obviously more complicated forms of similarity, though most will be some sort of Mahalanobis distance-related function.
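For real-valued features, a Mahalanobis distance can be sketched as follows. The function name, the example vectors, and the covariance matrices are all assumptions made for illustration; in practice the covariance would be estimated from the data:

```python
import numpy as np


def mahalanobis_distance(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff instead of forming the explicit inverse.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))


x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# With an identity covariance this reduces to the Euclidean distance.
d_identity = mahalanobis_distance(x, y, np.eye(2))

# Larger variances along each axis shrink the distance.
d_scaled = mahalanobis_distance(x, y, np.diag([9.0, 16.0]))
```

Like any distance, it returns zero when an object is compared with itself.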

FarnazH · 2024-08-20T21:35:02Z

@marco-2023, please:

Rename https://github.com/theochem/Selector/blob/main/selector/similarity.py to measures/similarity.py
Move diversity.py and convertor.py to the measures module.
Implement and test any similarity measures your heart desires (thanks!)

FanwangM added the "help wanted" label May 29, 2023

PaulWAyers assigned ramirandaq Jul 6, 2023

FanwangM mentioned this issue Dec 1, 2023

Add more similarity measurements, fixes #124 #188

Closed

FanwangM added the manuscript label Jun 25, 2024

FanwangM self-assigned this Jun 25, 2024

marco-2023 mentioned this issue Aug 21, 2024

Update package structure #251

Merged

4 tasks
