Papers: Abstract Entity Matching #73

Open
Jiros opened this issue Jul 21, 2020 · 0 comments
Assignees
Labels
Epic Priority: Normal This issue can be dealt with when possible Type: Data Analysis To identify an issue as data analysis

Comments


Jiros commented Jul 21, 2020

@tomasonjo:

I have been looking at ways to match the entities to abstracts. It seems that our database has some duplicate papers, and the same md5 is not always calculated for the same text. How is it possible that a single pubmed_id has two articles with the same title and abstract, one published in 1978 and the other in 2013? PubMed states it was published in 1978, though: https://pubmed.ncbi.nlm.nih.gov/30881/

While the abstract text is the same in both articles, they don't produce the same md5 hash :/ So matching by hash is not accurate, as even the import produces different hashes for the same text.

@motey:

Then they are not the same/equal :) Maybe the content is the same, but maybe one is "stripped()" and the other is not. If they were the same, they would already have been merged when loaded into neo4j.

But we have only 26 cases (in prd). Can't we just ignore the "problem"? :)
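To illustrate the "stripped()" guess, here is a minimal sketch of how two visually identical abstracts can produce different md5 hashes when only the surrounding whitespace differs, and how normalizing before hashing makes them agree. The function names and the sample strings are hypothetical, and this is not the project's actual import code; whether whitespace is really the cause here has not been confirmed in this thread.

```python
import hashlib


def md5_raw(text: str) -> str:
    """Hash the text exactly as given (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_normalized(text: str) -> str:
    """Hash the text after trimming it and collapsing whitespace runs.

    Assumption: whitespace differences are the only thing we want to ignore.
    """
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Placeholder strings, not real abstracts from the database.
abstract_a = "Same abstract text."
abstract_b = "  Same abstract text.\n"

# The raw hashes differ because the byte sequences differ.
print(md5_raw(abstract_a) == md5_raw(abstract_b))
# The normalized hashes match because both reduce to the same string.
print(md5_normalized(abstract_a) == md5_normalized(abstract_b))
```

If the hashes still differ after this kind of normalization, the texts differ in some other way (for example encoding or punctuation), and the two records would need to be compared character by character.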

// Count the pubmed_id nodes that are linked to more than one paper
MATCH (pm:PaperID {type: "pubmed_id"})<-[r:PAPER_HAS_PAPERID]-()
WITH count(r) AS cntr, pm
WHERE cntr > 1
RETURN count(pm)

count(pm) = 26

@tomasonjo:

I will ignore the duplicate papers.
For the abstracts, I will match the first abstract of the paperId. I've checked a few examples, and this workaround should be fine.
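The "first abstract of the paperId" workaround could be sketched roughly as follows. This is only an illustration under assumptions: the function name is hypothetical, the rows are placeholders rather than real data, and "first" is taken to mean the first row encountered for each id, which may not match how the actual import orders abstracts.

```python
from typing import Dict, Iterable, Tuple


def first_abstract_per_paper_id(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only the first abstract seen for each paper id (hypothetical helper).

    Assumption: rows arrive as (paper_id, abstract_text) pairs in the order
    that should define "first".
    """
    first: Dict[str, str] = {}
    for paper_id, abstract in rows:
        # setdefault stores the value only if the id has not been seen yet,
        # so later duplicates for the same id are ignored.
        first.setdefault(paper_id, abstract)
    return first


# Placeholder rows: the same pubmed_id appears twice, as in the duplicate case.
rows = [
    ("30881", "abstract text from the first record"),
    ("30881", "abstract text from the second record"),
    ("99999", "abstract text of another paper"),
]

chosen = first_abstract_per_paper_id(rows)
print(chosen["30881"])
```

With only a handful of duplicated ids, picking one abstract per id this way loses very little, provided the duplicates really carry the same text.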

I have imported the first version:
63k abstracts have 613k mentions of 260k entities.
I need to add external ids for MESH and other medical libraries, as the source file did not contain that information,
but it is available in their Google Drive: https://drive.google.com/drive/folders/12fq6ZjVYmKjQFpMruUXq2QihG22fBg9q

@Jiros Jiros added Priority: Normal This issue can be dealt with when possible Type: Feature To identify an issue as a feature labels Jul 21, 2020
@Jiros Jiros added the Epic label Jul 21, 2020
@Jiros Jiros added the Type: Data Analysis To identify an issue as data analysis label Jul 21, 2020
@Jiros Jiros removed the Type: Feature To identify an issue as a feature label Aug 4, 2020
3 participants