results of get_matches() are not sorted by similarity score for all the values #50

Open
ashutosh486 opened this issue Oct 28, 2022 · 3 comments

Comments

@ashutosh486

Hi,

I was running the PolyFuzz TF-IDF model to get matches, but in a few rows of the result the top_n matches were not sorted by similarity score.

tfidf_model = PolyFuzz(tfidf_matcher)
tfidf_model.match(from_list, to_list)
tfidf_model.get_matches()

eg:

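|    | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|----|------|----|------------|------|--------------|------|--------------|------|--------------|------|--------------|
| 21 | 3 IN 1 LAVENDER & CAMOMILE | 2 IN 1 LAVENDER & CAMOMILE | 0.938 | 3 IN 1 LAVENDER & CAMOMILE | 1 | 3 IN 1 LAVENDER | 0.771 | 3 IN 1 LAVENDER & CHAMOMILE | 0.831 | LAVENDER CAMOMILE | 0.764 |
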
@MaartenGr
Owner

Could you create a minimal example out of what you show here, with values for from_list and to_list and the value of top_n that you selected? That would make it easier for me to figure out what exactly is going on.

@ashutosh486
Author

Please find below a minimal example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]

# Same as test_tolist_1, but without "3 IN 1 LAVENDER & CHAMOMILE"
test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]

test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf")

# Match the same from_list against each to_list and read the "TF-IDF" result
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_1)["TF-IDF"]
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_2)["TF-IDF"]

Output for test_tolist_1:

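|   | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 | To_5 | Similarity_5 |
|---|------|----|------------|------|--------------|------|--------------|------|--------------|------|--------------|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | 3 IN 1 LAVENDER | 0.733 | LAVENDER CAMOMILE | 0.81 | 2 IN 1 LAVENDER & CAMOMILE | 0.887 | 3 IN 1 LAVENDER & CHAMOMILE | 0.696 |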

Output for test_tolist_2:

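|   | From | To | Similarity | To_2 | Similarity_2 | To_3 | Similarity_3 | To_4 | Similarity_4 |
|---|------|----|------------|------|--------------|------|--------------|------|--------------|
| 0 | 3 IN 1 LAVENDER & CAMOMILE | 3 IN 1 LAVENDER & CAMOMILE | 1 | LAVENDER CAMOMILE | 0.797 | 2 IN 1 LAVENDER & CAMOMILE | 0.893 | 3 IN 1 LAVENDER | 0.747 |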

Problems:

  1. The similarity scores are not sorted: To_2 through To_5 do not appear in descending order of similarity.
  2. Removing or adding text in the to_list changes the similarity scores of the other entries.

Just to add to this: I have commented out the following line of code, as I asked about in the previous issue, #48:

ngrams = [''.join(ngram) for ngram in ngrams if ' ' not in ngram]

@MaartenGr
Owner

Similarity score is not sorted

Did you install PolyFuzz through pip install polyfuzz[fast]? If so, I believe this happens because sparse_dot_topn does not return the similarities in sorted order. I would have to check what exactly goes on there.
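
Until the ordering is fixed upstream, the columns can be reordered after the fact. The following is only a minimal sketch: sort_matches is a hypothetical helper, not part of PolyFuzz, and it assumes the column naming shown in the outputs above (To, Similarity, To_2, Similarity_2, and so on). The example row is copied from the test_tolist_1 output in this thread.

```python
def sort_matches(row):
    """Reorder the (To_k, Similarity_k) pairs of one result row by
    descending similarity. Hypothetical helper, not a PolyFuzz API."""
    def suffix(k):
        # The first pair has no suffix; later pairs use _2, _3, ...
        return "" if k == 1 else f"_{k}"

    # Collect every (match, score) pair present in the row.
    pairs = []
    k = 1
    while f"To{suffix(k)}" in row:
        pairs.append((row[f"To{suffix(k)}"], row[f"Similarity{suffix(k)}"]))
        k += 1

    # Highest similarity first; missing scores (None) go last.
    pairs.sort(
        key=lambda p: p[1] if p[1] is not None else float("-inf"),
        reverse=True,
    )

    # Write the pairs back under the same column names.
    sorted_row = {"From": row["From"]}
    for k, (match, score) in enumerate(pairs, start=1):
        sorted_row[f"To{suffix(k)}"] = match
        sorted_row[f"Similarity{suffix(k)}"] = score
    return sorted_row


# Row taken from the test_tolist_1 output reported above.
row = {
    "From": "3 IN 1 LAVENDER & CAMOMILE",
    "To": "3 IN 1 LAVENDER & CAMOMILE", "Similarity": 1,
    "To_2": "3 IN 1 LAVENDER", "Similarity_2": 0.733,
    "To_3": "LAVENDER CAMOMILE", "Similarity_3": 0.81,
    "To_4": "2 IN 1 LAVENDER & CAMOMILE", "Similarity_4": 0.887,
    "To_5": "3 IN 1 LAVENDER & CHAMOMILE", "Similarity_5": 0.696,
}

sorted_row = sort_matches(row)
for key, value in sorted_row.items():
    print(f"{key}: {value}")
```

The helper works on a plain dict, so it could presumably be applied row by row to the DataFrame returned by get_matches() after converting each row to a dict.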

by removing or adding new text in the to_list the similarity score changes

The to_list is used together with the from_list to build the feature matrix in the TF-IDF calculation, so the term weights depend on both lists. As such, it is indeed possible for the similarity score to change when you add or remove entries. The more words you put in either list, the better the resulting feature matrix can generalize and the more accurate your similarity function becomes.
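
To illustrate why the score depends on the whole corpus, here is a simplified, standard-library-only sketch. It is not PolyFuzz's implementation: it assumes character 3-grams only (the example above used n_gram_range=(2, 5)), a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and plain cosine similarity. All function names are hypothetical. The lists are the ones from the minimal example in this thread.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute a smoothed IDF weight per n-gram over the whole corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc, n)))
    return {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build a sparse TF-IDF vector (dict) for one string."""
    counts = Counter(char_ngrams(text, n))
    return {g: c * idf[g] for g, c in counts.items() if g in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


def similarity(a, b, from_list, to_list):
    """Fit the IDF weights on both lists together, then compare a and b."""
    idf = fit_idf(to_list + from_list)
    return cosine(tfidf_vector(a, idf), tfidf_vector(b, idf))


test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]
test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

query = "3 IN 1 LAVENDER & CAMOMILE"
candidate = "2 IN 1 LAVENDER & CAMOMILE"

# Same pair of strings, scored against two different corpora.
sim_1 = similarity(query, candidate, test_fromlist, test_tolist_1)
sim_2 = similarity(query, candidate, test_fromlist, test_tolist_2)
print(f"with test_tolist_1: {sim_1:.3f}")
print(f"with test_tolist_2: {sim_2:.3f}")
```

The two printed scores differ even though the compared strings are identical, because dropping "3 IN 1 LAVENDER & CHAMOMILE" changes the document frequencies and therefore the weight of every n-gram. The exact values will not match PolyFuzz's output, since the n-gram range and vectorizer settings here are simplified.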
