
Separate/together way of writing and synonymes aren't recognized #31

Open
e-orlov opened this issue Nov 15, 2021 · 3 comments

Comments

@e-orlov

e-orlov commented Nov 15, 2021

Keywords "trinkwasser test", "trinkwassertest" and "analyse trinkwasser" aren't clustered at all.

@MaartenGr
Owner

Which version of PolyFuzz are you using? Also, could you create a reproducible example? Since PolyFuzz can use many models, without any code it is difficult to see what is happening in your use case.

@e-orlov
Author

e-orlov commented Nov 15, 2021

I'm using TF-IDF, as implemented in https://share.streamlit.io/charlywargnier/keyword-clustering-app/main/app.py / https://github.com/searchsolved/search-solved-public-seo/blob/main/Keyword_Clustering_Tool/Keyword_Clustering_Tool_V2.ipynb (code block 12)

Keywords are here: https://docs.google.com/spreadsheets/d/1nkiFNO8JadbaFcL7BvYKCLNPYPB5ILJwk2K__2DOzdc/edit?usp=sharing

Maybe PolyFuzz is not the right tool for this. To put "trinkwasser test" and "trinkwassertest" into the same cluster, the words of each keyword would have to be permuted and the minimal Levenshtein distance between the permutations taken. But for "trinkwasser test" and "analyse trinkwasser" there should probably be some "real" synonym search, maybe even one based on a synonym vocabulary...
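
A minimal sketch of the permutation idea, in plain Python and independent of PolyFuzz. The function names (`levenshtein`, `joined_permutations`, `min_permutation_distance`) are hypothetical and chosen only for illustration. Each keyword is split on whitespace, every word order is joined without spaces, and the smallest edit distance between any two such variants is returned. The number of permutations grows factorially with the number of words, so this is only reasonable for short keywords.

```python
from itertools import permutations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def joined_permutations(keyword):
    """Every word order of a keyword, joined without spaces (hypothetical helper)."""
    words = keyword.lower().split()
    return {"".join(order) for order in permutations(words)}


def min_permutation_distance(left, right):
    """Smallest edit distance between any space-free word order of the two keywords."""
    return min(
        levenshtein(a, b)
        for a in joined_permutations(left)
        for b in joined_permutations(right)
    )


# The separate and joined spellings collapse to the same string.
print(min_permutation_distance("trinkwasser test", "trinkwassertest"))
# A synonym-like pair still differs, so string distance alone does not cluster it.
print(min_permutation_distance("trinkwasser test", "analyse trinkwasser"))
```

This handles the separate/together spelling, but it does nothing for the synonym case, which needs semantic information rather than string distance.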

@MaartenGr
Owner

Let me start by saying that I cannot give much support for that tool specifically as I did not create it. Having said that, I did try it out with PolyFuzz directly and it seems that "trinkwasser test" gets grouped with "trinkwassertest" but not with "analyse trinkwasser". Most likely, using TF-IDF they are simply not similar enough to each other. You can try to circumvent this issue by using a different technique than TF-IDF, as TF-IDF tries to mirror Levenshtein distance.
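
To illustrate why the third keyword ends up less similar, here is a simplified sketch that compares character trigram counts with cosine similarity. It assumes that the TF-IDF matcher works on character n-grams, leaves out the IDF weighting entirely, and is not PolyFuzz's actual implementation. The helper names `char_ngrams` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (trigrams by default)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(left, right):
    """Cosine similarity between the n-gram count vectors of two strings."""
    a, b = char_ngrams(left), char_ngrams(right)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


close = cosine_similarity("trinkwasser test", "trinkwassertest")
far = cosine_similarity("trinkwasser test", "analyse trinkwasser")
print(close > far)  # the spelling variants share more trigrams than the synonym-like pair
```

The two spelling variants share almost all of their trigrams, while "analyse trinkwasser" only shares the "trinkwasser" part, so any similarity threshold between the two scores separates them.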

You can implement or use any distance measure in PolyFuzz that you would like. However, if you are looking for semantic similarity rather than string similarity, then I would advise going for embedding-based methods such as BERT models, sentence-transformers, Hugging Face, or Flair.

You can find more information about that here and here.
