results of get_matches() are not sorted by similarity score for all the values #50

ashutosh486 · 2022-10-28T12:00:43Z

Hi,

I was running the PolyFuzz TF-IDF model to get matches, but in a few rows of the result the matches were not sorted by similarity score across the top_n columns.

tfidf_model = PolyFuzz(tfidf_matcher)
tfidf_model.match(from_list, to_list)
tfidf_model.get_matches()

eg:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4	To_5	Similarity_5
21	3 IN 1 LAVENDER & CAMOMILE	2 IN 1 LAVENDER & CAMOMILE	0.938	3 IN 1 LAVENDER & CAMOMILE	1	3 IN 1 LAVENDER	0.771	3 IN 1 LAVENDER & CHAMOMILE	0.831	LAVENDER CAMOMILE	0.764

MaartenGr · 2022-10-29T05:26:55Z

Could you create a minimal example out of what you show here, with values for from_list and to_list and the value of top_n that you selected? That would make it a bit easier for me to figure out what exactly is going on.

ashutosh486 · 2022-11-04T11:26:07Z

Please find below a minimal example:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

test_tolist_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]

test_tolist_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
                 "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]

test_fromlist = ["3 IN 1 LAVENDER & CAMOMILE"]

test_model = TFIDF(n_gram_range=(2, 5), min_similarity=0, top_n=5, model_id="tfidf")
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_1)["TF-IDF"]
PolyFuzz(test_model).fit_transform(test_fromlist, test_tolist_2)["TF-IDF"]

Output for test_tolist_1:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4	To_5	Similarity_5
0	3 IN 1 LAVENDER & CAMOMILE	3 IN 1 LAVENDER & CAMOMILE	1	3 IN 1 LAVENDER	0.733	LAVENDER CAMOMILE	0.81	2 IN 1 LAVENDER & CAMOMILE	0.887	3 IN 1 LAVENDER & CHAMOMILE	0.696

Output for test_tolist_2:

	From	To	Similarity	To_2	Similarity_2	To_3	Similarity_3	To_4	Similarity_4
0	3 IN 1 LAVENDER & CAMOMILE	3 IN 1 LAVENDER & CAMOMILE	1	LAVENDER CAMOMILE	0.797	2 IN 1 LAVENDER & CAMOMILE	0.893	3 IN 1 LAVENDER	0.747

Problems:

The similarity scores are not sorted.
Removing or adding text in the to_list changes the similarity scores.

Just to add to this: I have commented out the following line of code, as I had asked about in the previous issue, #48:

PolyFuzz/polyfuzz/models/_tfidf.py

Line 130 in b26638f

ngrams = [''.join(ngram) for ngram in ngrams if ' ' not in ngram]

MaartenGr · 2022-11-06T09:23:52Z

Similarity score is not sorted

Did you install PolyFuzz through pip install polyfuzz[fast]? If so, then I believe this happens because sparse_dot_topn does not return the similarities in order. I would have to check what exactly goes on there.
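
In the meantime, one possible workaround is to reorder the matches of each row after the fact. The sketch below is only an illustration: sort_row_by_similarity is a hypothetical helper, not part of PolyFuzz, and it assumes each row is a plain dict using the column names shown in this issue (To, Similarity, To_2, Similarity_2, and so on). The sample row is copied from the output for test_tolist_1.

```python
import math


def sort_row_by_similarity(row: dict) -> dict:
    """Return a copy of one match row with its (To, Similarity) pairs ordered high to low."""
    # Column pairs in rank order: ("To", "Similarity"), ("To_2", "Similarity_2"), ...
    to_cols = [c for c in row if c == "To" or c.startswith("To_")]
    pairs = [(c, c.replace("To", "Similarity", 1)) for c in to_cols]

    def is_missing(score):
        # Treat None and NaN as "no match found" for that slot.
        return score is None or (isinstance(score, float) and math.isnan(score))

    candidates = [(row[t], row[s]) for t, s in pairs]
    # Missing scores sort last; everything else by descending similarity.
    candidates.sort(key=lambda p: (is_missing(p[1]), 0.0 if is_missing(p[1]) else -p[1]))

    result = dict(row)
    for (t, s), (name, score) in zip(pairs, candidates):
        result[t], result[s] = name, score
    return result


# Row taken from the unsorted output reported above.
row = {
    "From": "3 IN 1 LAVENDER & CAMOMILE",
    "To": "3 IN 1 LAVENDER & CAMOMILE", "Similarity": 1.0,
    "To_2": "3 IN 1 LAVENDER", "Similarity_2": 0.733,
    "To_3": "LAVENDER CAMOMILE", "Similarity_3": 0.81,
    "To_4": "2 IN 1 LAVENDER & CAMOMILE", "Similarity_4": 0.887,
    "To_5": "3 IN 1 LAVENDER & CHAMOMILE", "Similarity_5": 0.696,
}
sorted_row = sort_row_by_similarity(row)
```

To apply this to a full result, the rows could be pulled out of the DataFrame with to_dict("records"), passed through the helper one by one, and wrapped in a new pandas DataFrame.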

by removing or adding new text in the to_list the similarity score changes

The to_list is used together with the from_list in order to create the feature matrix as a result of the TF-IDF calculation. As such, it is indeed possible that the similarity score then changes. The more words you put in either list, the more the resulting feature matrix can generalize and the more accurate your similarity function becomes.
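
To make the dependence on to_list concrete, here is a small self-contained sketch. It is not PolyFuzz's actual code: it assumes character trigrams only, a smoothed IDF, and no string cleaning, and all function names are made up for illustration. It only shows that fitting the weights on from_list plus to_list means the same pair of strings can get a different score when to_list changes.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Fit a smoothed IDF on `docs` and return one L2-normalised TF-IDF dict per doc."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: each n-gram is counted once per document it occurs in.
    df = Counter(g for c in counts for g in c)
    idf = {g: math.log((1 + len(docs)) / (1 + k)) + 1 for g, k in df.items()}
    vectors = []
    for c in counts:
        vec = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({g: v / norm for g, v in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def similarity(from_text, to_text, to_list):
    # The IDF weights are fit on the from string and all to strings together.
    docs = [from_text] + to_list
    vecs = tfidf_vectors(docs)
    return cosine(vecs[0], vecs[docs.index(to_text, 1)])


to_list_1 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "3 IN 1 LAVENDER & CHAMOMILE", "LAVENDER CAMOMILE"]
to_list_2 = ["2 IN 1 LAVENDER & CAMOMILE", "3 IN 1 LAVENDER & CAMOMILE",
             "3 IN 1 LAVENDER", "LAVENDER CAMOMILE"]
source = "3 IN 1 LAVENDER & CAMOMILE"
target = "2 IN 1 LAVENDER & CAMOMILE"

score_1 = similarity(source, target, to_list_1)
score_2 = similarity(source, target, to_list_2)
exact_1 = similarity(source, source, to_list_1)
print(score_1, score_2, exact_1)
```

The two printed scores for the same source and target differ because removing one entry from to_list changes how many documents each trigram appears in, which changes its IDF weight. An exact match still scores 1 regardless of the list.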
