Find the missing RNAs reported by GeneCards #210
…in an earlier groupby. I think this is working as expected now: the lists of genes targeted by an miRNA contained duplicates (I don't know why), which led to conflicts on insertion to the database. This now produces FAR less data, which is somewhat concerning
…raise an exception when the providing database is missing for an unmapped region
I think this is ok, but I'm going to take a look at the query to see if that could
have some issues.
@@ -76,6 +78,9 @@ def build(cls, index, raw):
            "databases": clean_databases(raw["databases"]),
        }
        if not metadata["providing_databases"]:
            if not raw["was_mapped"]:
                print(raw)
Probably better to change the exception to f".. {raw}" to show what data has the
issue.
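The suggestion above could look roughly like the sketch below. It is only an illustration: the helper name `check_providing_databases` and the choice of `ValueError` are assumptions, not the code in this PR. Only the `metadata["providing_databases"]` and `raw["was_mapped"]` keys come from the diff.

```python
def check_providing_databases(metadata: dict, raw: dict) -> None:
    """Raise if an unmapped region has no providing database.

    Hypothetical helper: the name and the exception type are assumptions.
    """
    if not metadata["providing_databases"] and not raw["was_mapped"]:
        # Put the raw record in the message instead of printing it, so the
        # offending data shows up wherever the exception is reported.
        raise ValueError(f"No providing databases for unmapped region: {raw}")
```

Mapped regions with no providing database pass through silently here; only unmapped ones raise.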
Poking around a bit, I can't see any other obvious sources of problems, so it is probably a combination of what you found and some other issues.
Between releases 23 and 24, a large number of RNAs went missing from our genome coordinate FTP exports. GeneCards reported this for our human coordinates, and gave this table that summarises what went missing:
This amounts to a little over 400k sequences that have gone missing.
One possible reason is badly handled export of mapped vs received coordinates. When an expert db gives us coordinates, we import them, and on export they are associated with that database and have source=expert-database in the gff. Mapped sequences should have source=alignment in the gff, but inspecting the release 24 gff file shows 0 aligned sequences, which is clearly not right.
I've modified how we handle the mapped sequences in the gff writer so that we only throw an exception if :
This is able to produce gff files that contain ~273k aligned sequences that were previously missing. I think this is part of the missing data, but we still need to find ~100k more, so this is not completely finished yet.
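To check how many aligned sequences a given gff file contains, something like the following sketch could be used. It assumes that `source` is stored as a `key=value` pair in the ninth (attributes) column of each feature line. The function name `count_sources` and the `<missing>` placeholder are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_sources(lines: Iterable[str]) -> Counter:
    """Count feature lines per `source` attribute value.

    Assumes `source=...` lives in the ninth, semicolon-separated column.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        attributes = dict(
            part.split("=", 1) for part in fields[8].split(";") if "=" in part
        )
        counts[attributes.get("source", "<missing>")] += 1
    return counts
```

Running it over an open file handle, for example `count_sources(open(path))` with `path` pointing at an exported gff file, gives one count per source value, so a file with no `alignment` entries stands out immediately.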