Experiment to find record similarity based on text analysis #8
Comments
I have used a Jupyter notebook (will add the code to the repository) to
The results are in the associated Excel file.
@pvgenuchten I did a first experiment (results and a description of the methodology are in the previous comment), but I was not able to set up pagination to extract more than 50 records. Which parameter do I have to use for the API to get to the next page? And is it correct that the limit is 50 records per page?
The parameters to manage pagination are
Ok, thanks @pvgenuchten!
The highest similarity is 0.9999997, for 2 records (68749995-c4bf-4f80-94e5-43c2291c99be and 5187f8c5-38ef-4b07-bc26-a5e257a8ef59) with different IDs, no title or description, and the same keywords; I guess this is a data quality issue at the source.
Going through the results, the top similarity scores look good, e.g.:
- similarity = 0.99996 for the combination 49bebaf8-bae4-4748-8e5c-ce80c0406953 and 56fcf114-1c1e-46ac-b21a-b43ff7441335, with titles "SUSALPS temperature and volumetric soil water content Graswang Subplot 2 in Fendt extensiv" and "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt extensiv" (description and keywords the same)
- similarity = 0.99991 for the combination 56fcf114-1c1e-46ac-b21a-b43ff7441335 and 07388e86-f38b-469a-9910-6e24af66bbf5, with titles "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt extensiv" and "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt intensiv" and descriptions "..... This dataset contains daily average soil temperature and volumetric soil water content in 5 and 15 cm depth. Treatment: Graswang Subplot 1 in Fendt extensiv" and ".... This dataset contains daily average soil temperature and volumetric soil water content in 5 and 15 cm depth. Treatment: Graswang Subplot 1 in Fendt intensiv Device: Decagon 5TM Timescale: Daily average Depths: 5 and 15 cm" (keywords the same)
However, in the lower rows the similarity still seems quite high in my opinion. This could be because many (scientific) 'niche' words are used that the current model (DistilBERT) does not cover; other transformers or LLMs might perform better in this respect.
It would be good to have @wbcbugfree or @Max-at-Vlaanderen also have a look at the code/approach for input/modifications/suggestions.
I guess next steps could be:
other ideas/requirements?
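For readers following along, here is a minimal sketch of how a ranked list of record pairs like the one discussed above could be produced. It assumes cosine similarity over one embedding vector per record; the notebook's actual metric and code are not shown in this thread, so treat this as an illustration only. The record names and vectors are made up, and `rank_pairs` is a hypothetical helper.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; treat it as not similar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pairs(embeddings):
    """Return (id_a, id_b, similarity) for every pair, most similar first."""
    pairs = [
        (id_a, id_b, cosine_similarity(embeddings[id_a], embeddings[id_b]))
        for id_a, id_b in combinations(sorted(embeddings), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy vectors standing in for model embeddings (assumption, not real data).
toy_embeddings = {
    "rec-a": [0.9, 0.1, 0.0],
    "rec-b": [0.9, 0.1, 0.05],
    "rec-c": [0.0, 0.2, 0.9],
}

ranked = rank_pairs(toy_embeddings)
for id_a, id_b, sim in ranked:
    print(f"{id_a} / {id_b}: {sim:.5f}")
```

Two records whose combined text is identical, such as records with no title or description and the same keywords, would get identical vectors under this setup and therefore a similarity of about 1.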
Hi, I took a quick look at the results. This looks like what we can expect from an embedding model. I do fear that we won't be able to use this approach directly for duplicate detection. The essence of these models is to generalise to a "topic vector" describing what the text is about. In my view, duplicate detection goes a bit further. Other than that, I think this approach will be ideal for search queries and recommendation systems.
Thanks @Max-at-Vlaanderen, I think your suggestion for duplicate detection is quite interesting! I also looked at Euclidean distance (I couldn't find the algorithm you suggested, @Max-at-Vlaanderen). I see some differences in the order of the similar pairs, but in my opinion they are not that big among the most similar pairs. @robknapen, do you have any knowledge of which similarity algorithm is most suitable for identifying the similarity between records based on title, description and keywords?
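One possible reason the two orderings differ only slightly: for vectors scaled to unit length, squared Euclidean distance equals 2 - 2 * cosine similarity, so both metrics rank pairs identically. Differences can only come from vector length. Whether the embeddings in the notebook were normalized is not stated in this thread, so the sketch below only illustrates the relationship on made-up vectors.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_by_cosine(query, candidates):
    # Higher cosine similarity means more similar.
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)


def rank_by_euclidean(query, candidates):
    # Smaller distance means more similar.
    return sorted(candidates, key=lambda k: euclidean(query, candidates[k]))


# Made-up vectors: "x" points the same way as the query but is much longer.
query = [1.0, 0.0]
candidates = {"x": [10.0, 0.0], "y": [1.0, 0.5]}

# On raw vectors the two metrics disagree about which candidate is closest.
raw_cosine_order = rank_by_cosine(query, candidates)
raw_euclidean_order = rank_by_euclidean(query, candidates)

# After normalization both metrics give the same order.
unit_query = normalize(query)
unit_candidates = {k: normalize(v) for k, v in candidates.items()}
unit_cosine_order = rank_by_cosine(unit_query, unit_candidates)
unit_euclidean_order = rank_by_euclidean(unit_query, unit_candidates)

print(raw_cosine_order, raw_euclidean_order)
print(unit_cosine_order, unit_euclidean_order)
```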
Hi Nick, I think the demo is great. But I have the same suggestion as Max, i.e. we should embed the authors, titles, and descriptions of metadata records and match them separately. This way, only record pairs for which all three items have a high similarity (above a certain threshold) are judged as duplicates; the others are judged as similar.
@pvgenuchten and @roblokers, I have worked further on the suggestion by Max and Beichem to calculate the similarity metrics for 'title', 'description' and 'keywords' separately, to be able to better evaluate duplicates. The notebook has been added to the repository. I analyzed 500 records and only kept pairs where at least one of the 3 fields had similarity 1 (to reduce the size of the file). The results are in the attached file.
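The two rules discussed in the comments above can be sketched as follows: judge a pair as a duplicate only when every field is above a threshold, and keep a pair in the output when at least one field has similarity 1. This is a minimal sketch and not the notebook's code. The threshold of 0.95, the tolerance used for "similarity 1", and the example scores are all assumptions chosen for illustration.

```python
FIELDS = ("title", "description", "keywords")


def classify_pair(field_sims, threshold=0.95):
    """Label a pair 'duplicate' only if every field reaches the threshold.

    The default threshold is a hypothetical value, not one from the thread.
    """
    if all(field_sims[field] >= threshold for field in FIELDS):
        return "duplicate"
    return "similar"


def has_identical_field(field_sims, tol=1e-6):
    """True if at least one field has similarity 1, within a small tolerance.

    The tolerance is an assumption: floating-point scores for identical
    text often come out slightly below 1 (for example 0.9999997).
    """
    return any(field_sims[field] >= 1.0 - tol for field in FIELDS)


# Made-up per-field scores for three record pairs.
pairs = {
    ("r1", "r2"): {"title": 0.99, "description": 0.98, "keywords": 1.0},
    ("r1", "r3"): {"title": 0.99, "description": 0.60, "keywords": 1.0},
    ("r2", "r3"): {"title": 0.70, "description": 0.65, "keywords": 0.80},
}

# Keep only pairs with at least one identical field, then label them.
kept = {pair: sims for pair, sims in pairs.items() if has_identical_field(sims)}
labels = {pair: classify_pair(sims) for pair, sims in kept.items()}
print(labels)
```

Keeping the per-field scores side by side, rather than collapsing them into one number, makes it easy to change the rule later, for instance by requiring an exact title match but only a high description similarity.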
Hi, I would also say that the different fields need to be processed and compared based on the characteristics/type of the data. If there is sufficient text in a field, an embedding might help to figure out semantic similarity (dot product is also an often-used algorithm, but it is similar to cosine). To find actual duplicates, other rules would probably apply than when you want to find similar records in a collection. There is knowledge (literature :-) ) on duplicate record removal from databases and how this can be done. But there is no perfect solution as far as I know; working with text will always be tricky due to its nature.
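As one illustration of comparing a field according to its type: keywords form a short list of terms rather than running text, so a set-overlap measure could be used instead of an embedding. The sketch below uses Jaccard similarity, which is my own choice for the example and not something proposed in this thread. The lower-casing and whitespace trimming are also assumptions, and the keyword lists are made up.

```python
def jaccard(keywords_a, keywords_b):
    """Share of distinct keywords that two records have in common (0 to 1)."""
    # Light normalization (assumption): ignore case and surrounding spaces.
    set_a = {k.strip().lower() for k in keywords_a if k.strip()}
    set_b = {k.strip().lower() for k in keywords_b if k.strip()}
    if not set_a and not set_b:
        # Two empty keyword lists carry no evidence of similarity.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


same = jaccard(
    ["soil temperature", "Soil moisture"],
    ["soil moisture", "soil temperature "],
)
partial = jaccard(
    ["soil temperature", "grassland"],
    ["soil temperature", "forest"],
)
print(same, partial)
```

Returning 0 for two empty lists is a deliberate choice here, so that records without keywords do not look like a perfect match on that field.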
Can be used to identify similar records (‘more like this’) or records describing the same source
DoD: