[Similarity module] Add more similarity measurements #124
Comments
As explained on Wikipedia, there is a Tanimoto similarity and a Tanimoto distance, so both exist. The easiest test is to compare an object to itself: its similarity is greater than zero (often one) and its distance is zero. I feel like it is better to add AIMSim as a dependency. Implementing 30+ methods is a lot of work. We may wish to have a few basic methods implemented; the most common distance metrics and similarity measures are already there in scikit-learn. I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then consider interfacing to AIMSim as a follow-up task. I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them. |
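To make the self-comparison test and the "converter" idea concrete, here is a minimal sketch. The function names are hypothetical (not part of any package discussed here), and the `1 - s` conversion is an assumption that only makes sense for similarities bounded in [0, 1]; other conversions exist.

```python
import numpy as np


def tanimoto_similarity(x, y):
    """Tanimoto similarity of two binary vectors (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y)
    common = np.dot(x, y)                # bits on in both vectors
    union = x.sum() + y.sum() - common   # bits on in at least one vector
    # Assumption: two all-zero vectors are treated as identical.
    return common / union if union else 1.0


def similarity_to_distance(s):
    """Assumed converter for similarities bounded in [0, 1]."""
    return 1.0 - s


def distance_to_similarity(d):
    """Inverse of the assumed converter above."""
    return 1.0 - d


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

s_self = tanimoto_similarity(x, x)        # object compared to itself
d_self = similarity_to_distance(s_self)
s_xy = tanimoto_similarity(x, y)
d_xy = similarity_to_distance(s_xy)
print(s_self, d_self, s_xy, d_xy)
```

The self-comparison gives a similarity of one and a distance of zero, which is the quick check suggested above for telling the two kinds of measure apart.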
Yes, we should make them distinguishable and as obvious as we can to avoid any ambiguity. |
Update: We decided not to include any wrappers to support the functionality in other packages (reason: additional overhead and unnecessary dependencies). Instead, we showcase how our package works with other libraries in notebooks/tutorials. |
@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them. |
Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints. |
Thanks for sharing. I am copying @ramirandaq's code for readability.
import numpy as np
# Pairwise similarity indices calculated over binary fingerprints
def indicators(x, y):
    """Calculate base descriptors.
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p
# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russell-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n
x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])
a, d, dis, p = indicators(x, y)
bub = ((a * d)**0.5 + a)/((a * d)**0.5 + a + dis)
fai = (a + 0.5 * d)/p
ja = (3 * a)/(3 * a + dis)
jt = a/(a + dis)
rt = (a + d)/(p + dis)
rr = a/p
sm = (a + d)/p
ss1 = a/(a + 2 * dis)
ss2 = (2 * (a + d))/(p + (a + d)) |
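Since the indices above are all functions of `a`, `d`, `dis` and `p`, the same idea can be sketched for a whole set of fingerprints at once using matrix products. This is only an illustrative sketch: `pairwise_indicators` and `jaccard_tanimoto_matrix` are hypothetical names, and treating a pair of all-zero fingerprints as having similarity 1 is an assumption.

```python
import numpy as np


def pairwise_indicators(X):
    """Base descriptors for every pair of rows of a binary matrix X.

    Returns (a, d, dis, p), where a, d and dis are (n, n) arrays and
    p is the fingerprint length.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    a = X @ X.T                  # common on bits
    d = (1 - X) @ (1 - X).T      # common off bits
    dis = p - a - d              # 1-0 mismatches
    return a, d, dis, p


def jaccard_tanimoto_matrix(X):
    """Pairwise JT similarity, a / (a + dis), for the rows of X."""
    a, _, dis, _ = pairwise_indicators(X)
    denom = a + dis
    # Assumption: where a + dis == 0 (two all-zero fingerprints),
    # the similarity is set to 1.
    out = np.ones_like(a)
    np.divide(a, denom, out=out, where=denom > 0)
    return out


X = np.array([[1, 0, 1, 0, 1],
              [1, 1, 1, 0, 0]])
S = jaccard_tanimoto_matrix(X)
print(S)
```

The diagonal of `S` is one (each fingerprint compared to itself), and the off-diagonal entry matches the `jt` value computed for `x` and `y` in the code above.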
Just to clarify, all of these are "bitwise". We have: As Ramon notes, most of these are just one-line formulas. For things that aren't "logical", there are obviously more complicated forms of similarity, though most will be (some sort of) Mahalanobis-distance-related function. |
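For real-valued (non-bitwise) features, a minimal sketch of the Mahalanobis distance looks like the following. The helper name and the example covariance matrices are made up for illustration; nothing here is taken from the package under discussion.

```python
import numpy as np


def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# With the identity covariance this reduces to the Euclidean distance.
d_euclid = mahalanobis(x, y, np.eye(2))
# A diagonal covariance rescales each feature by its variance.
d_scaled = mahalanobis(x, y, np.diag([9.0, 16.0]))
print(d_euclid, d_scaled)
```

Note that this is a distance, not a similarity, so the same similarity/distance distinction discussed earlier in the thread applies here too.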
@marco-2023, please:
|
Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html as a similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or SciPy.
One question I have: shall we separate the similarity and distance measurements? I get confused by some measurements, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but they treat it as a similarity; see https://vlachosgroup.github.io/AIMSim/implemented_metrics.html. If we decide to distinguish them, we may need to make them into similarity and distance modules instead of one module. @PaulWAyers @FarnazH
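On the documentation request (which similarity corresponds to which scikit-learn or SciPy distance), the following is a minimal sketch, assuming SciPy is installed. It reuses the `indicators` function and example fingerprints posted in the comments and checks numerically that four of the indices are the complements of the SciPy boolean dissimilarities `jaccard`, `rogerstanimoto`, `russellrao` and `hamming`. The pairing is derived from the formulas and should be double-checked against the SciPy documentation before being written into the docs.

```python
import numpy as np
from scipy.spatial import distance


def indicators(x, y):
    """Base descriptors for two binary fingerprints (copied from the thread)."""
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])
a, d, dis, p = indicators(x, y)

# SciPy's boolean dissimilarities are called on boolean arrays here.
xb, yb = x.astype(bool), y.astype(bool)

# (similarity index, candidate SciPy distance) pairs
pairs = {
    "JT vs jaccard": (a / (a + dis), distance.jaccard(xb, yb)),
    "RT vs rogerstanimoto": ((a + d) / (p + dis), distance.rogerstanimoto(xb, yb)),
    "RR vs russellrao": (a / p, distance.russellrao(xb, yb)),
    "SM vs hamming": ((a + d) / p, distance.hamming(xb, yb)),
}

for name, (sim, dist) in pairs.items():
    # For these four pairs, similarity + distance should equal 1.
    print(f"{name}: similarity={sim:.4f}, distance={dist:.4f}, sum={sim + dist:.4f}")
```

If the sums are all one, then for these indices a separate distance module could simply return one minus the similarity, which is one argument for keeping a single implementation plus a converter.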