[Similarity module] Add more similarity measurements #124

Open
FanwangM opened this issue May 29, 2023 · 8 comments
Assignees
Labels
help wanted Extra attention is needed manuscript

Comments

@FanwangM
Collaborator

FanwangM commented May 29, 2023

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html as a similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or SciPy.

One question I have: shall we separate the similarity and distance measurements? Some measurements confuse me, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but AIMSim treats it as a similarity (https://vlachosgroup.github.io/AIMSim/implemented_metrics.html). If we decide to distinguish them, we may need separate similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

@FanwangM added the "help wanted" (Extra attention is needed) label May 29, 2023
@PaulWAyers
Member

As explained on Wikipedia, there is both a Tanimoto similarity and a Tanimoto distance.

The easiest test is to compare an object to itself: its similarity is greater than zero (often one) and its distance is zero.

I feel it is better to add AIMSim as a dependency. Implementing 30+ methods is a lot of work.

We may wish to implement a few basic methods; the most common distance metrics and similarity measures are already available in scikit-learn:
(distances) sklearn.metrics.DistanceMetric
(similarities and divergences) sklearn.metrics.pairwise

I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then consider interfacing to AIMSim as a follow-up task.

I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them.
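
A minimal sketch of the similarity/distance distinction and the "converter", assuming binary fingerprints stored as NumPy 0/1 arrays and a similarity bounded in [0, 1], so that distance = 1 - similarity is a reasonable conversion. The function names are hypothetical and not part of any existing module.

```python
import numpy as np


def tanimoto_similarity(x, y):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.count_nonzero(x | y)
    if union == 0:
        # Assumption: two all-zero fingerprints are treated as identical.
        return 1.0
    return np.count_nonzero(x & y) / union


def similarity_to_distance(s):
    """Convert a similarity in [0, 1] to a distance (assumes s = 1 means identical)."""
    return 1.0 - s


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

# Self-comparison test: the similarity is one and the distance is zero.
s_xx = tanimoto_similarity(x, x)
d_xx = similarity_to_distance(s_xx)

# Two different fingerprints give a similarity below one and a positive distance.
s_xy = tanimoto_similarity(x, y)
d_xy = similarity_to_distance(s_xy)
```

Other conversions may suit unbounded similarities better; the linear one is only the simplest choice for indices that already live in [0, 1].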

@FanwangM
Collaborator Author

FanwangM commented Jun 2, 2023

Yes, we should make them clearly distinguishable and as explicit as we can to avoid any ambiguity.

@FarnazH
Member

FarnazH commented Jun 20, 2023

Update: We decided not to include any wrappers around functionality from other packages (reason: additional overhead and an unnecessary dependency). Instead, we will showcase how our package works with other libraries in notebooks/tutorials.

@PaulWAyers
Member

@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them.

@ramirandaq
Collaborator

Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints.

sim_indices.txt

@FanwangM
Collaborator Author

FanwangM commented Jul 6, 2023

Thanks for sharing. I am copying @ramirandaq's code here for readability.

import numpy as np

# Pairwise similarity indices calculated over binary fingerprints

def indicators(x, y):
    """Calculating base descriptors
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p

# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russel-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

a, d, dis, p = indicators(x, y)

bub = ((a * d)**0.5 + a)/((a * d)**0.5 + a + dis)

fai = (a + 0.5 * d)/p

ja = (3 * a)/(3 * a + dis)

jt = a/(a + dis)

rt = (a + d)/(p + dis)

rr = a/p

sm = (a + d)/p

ss1 = a/(a + 2 * dis)

ss2 = (2 * (a + d))/(p + (a + d))
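
The script above can be wrapped into one reusable function. The following is only a sketch under stated assumptions: the function name and the dictionary keys are hypothetical, the inputs are 0/1 NumPy arrays of equal length, and no guard is included for degenerate inputs (e.g. two all-zero fingerprints, for which some denominators vanish).

```python
import numpy as np


def binary_similarities(x, y):
    """Return the nine pairwise indices for two 0/1 fingerprints."""
    x = np.asarray(x)
    y = np.asarray(y)
    p = x.size                      # fingerprint length
    a = int(np.dot(x, y))           # common on bits
    d = int(np.dot(1 - x, 1 - y))   # common off bits
    dis = p - a - d                 # 1-0 mismatches
    root = (a * d) ** 0.5
    return {
        "BUB": (root + a) / (root + a + dis),
        "Fai": (a + 0.5 * d) / p,
        "Ja": (3 * a) / (3 * a + dis),
        "JT": a / (a + dis),
        "RT": (a + d) / (p + dis),
        "RR": a / p,
        "SM": (a + d) / p,
        "SS1": a / (a + 2 * dis),
        "SS2": (2 * (a + d)) / (p + (a + d)),
    }


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])
sims = binary_similarities(x, y)
```

For these two fingerprints a = 2, d = 1, dis = 2 and p = 5, so for example JT = 2/4 and SM = 3/5.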

@PaulWAyers
Member

Just to clarify, all of these are "bitwise". We have:
a = logical "and" between bitstrings; intersection between sets if for each element, "1" or "on" means an element/feature is present.
d = logical "nor" (neither bit set, i.e. "not x and not y") between bitstrings; {universe} - {union} between sets if "1" or "on" means an element is present. So these are "features that are not present in either set".
dis = logical "exclusive or" between bitstrings. {union} - {intersection} if "1" or "on" means an element is present. So these are "features that are present in one item, but not present in the other".
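
The three counts can also be obtained directly with bitwise operators, which mirrors the set picture described here. This is a small sketch assuming the fingerprints are stored as boolean NumPy arrays; it uses the same example fingerprints as the code earlier in the thread.

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1], dtype=bool)
y = np.array([1, 1, 1, 0, 0], dtype=bool)

a = np.count_nonzero(x & y)      # on in both: intersection
d = np.count_nonzero(~(x | y))   # off in both: universe minus union
dis = np.count_nonzero(x ^ y)    # on in exactly one: union minus intersection
p = x.size

# Every position falls into exactly one of the three classes.
total = a + d + dis
```

The three classes partition the bit positions, which is why the dot-product version can compute dis as p - a - d.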

As Ramon notes, most of these are just one-line formulas. For data that aren't binary ("logical"), there are obviously more complicated forms of similarity, though most will be (some sort of) Mahalanobis-distance-related function.
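
For reference, a minimal sketch of a Mahalanobis distance for real-valued feature vectors, assuming the covariance matrix is supplied by the caller and is invertible. The function name and the example numbers are illustrative only.

```python
import numpy as np


def mahalanobis_distance(x, y, cov):
    """sqrt((x - y)^T cov^{-1} (x - y)) for two real-valued feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))


# Illustrative diagonal covariance (made-up values).
cov = np.array([[2.0, 0.0],
                [0.0, 0.5]])

d_xy = mahalanobis_distance([0.0, 0.0], [2.0, 1.0], cov)
d_xx = mahalanobis_distance([2.0, 1.0], [2.0, 1.0], cov)

# With an identity covariance the result reduces to the Euclidean distance.
d_euclid = mahalanobis_distance([0.0, 0.0], [3.0, 4.0], np.eye(2))
```

As with the binary indices, comparing an object to itself gives a distance of zero.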

@FarnazH
Member

FarnazH commented Aug 20, 2024

@marco-2023, please:
