Find the missing RNAs reported by GeneCards #210

Open
wants to merge 2 commits into
base: master


afg1
Contributor

@afg1 afg1 commented Dec 16, 2024

Between releases 23 and 24, a large number of RNAs went missing from our genome coordinate FTP exports. GeneCards reported this for our human coordinates and provided the following table summarising what went missing:

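| Database | Previous Count | Current Count | Percentage Decrease |
| --- | --- | --- | --- |
| ENA | 869,516 | 483,656 | 44% |
| Rfam | 47,153 | 7,229 | 84% |
| PDBe | 3,387 | 102 | 96% |
| lncRNAdb | 222 | 55 | 75% |
| SRPDB | 34 | 2 | 94% |
| CRW | 173 | 0 | 100% |
| RiboVision | 76 | 0 | 100% |
| 5SrRNAdb | 32 | 0 | 100% |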

This amounts to a little over 400k sequences that have gone missing.
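
As a quick sanity check, the per-database losses in the table can be summed directly. This is a minimal sketch that uses only the counts quoted above; the variable names are illustrative and not taken from the pipeline.

```python
# (previous, current) counts per database, copied from the GeneCards table.
counts = {
    "ENA": (869_516, 483_656),
    "Rfam": (47_153, 7_229),
    "PDBe": (3_387, 102),
    "lncRNAdb": (222, 55),
    "SRPDB": (34, 2),
    "CRW": (173, 0),
    "RiboVision": (76, 0),
    "5SrRNAdb": (32, 0),
}

# Number of entries lost per database, and overall.
missing = {db: prev - cur for db, (prev, cur) in counts.items()}
total_missing = sum(missing.values())

for db, (prev, cur) in counts.items():
    pct = 100 * (prev - cur) / prev
    print(f"{db:<12}{missing[db]:>9,}  {pct:5.1f}%")
print(f"total missing: {total_missing:,}")
```

The total comes to 429,549, which is consistent with "a little over 400k", and ENA alone accounts for 385,860 of it.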

One possible reason is badly handled export of mapped versus received coordinates. When an expert database gives us coordinates, we import them; on export they are associated with that database and have source=expert-database in the GFF. Mapped sequences should have source=alignment in the GFF. Inspecting the release 24 GFF file shows 0 aligned sequences, which is clearly not right.
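
One way to reproduce that inspection is to tally the `source` value over every feature line. The sketch below assumes `source` is stored as a `key=value` pair in the GFF3 attributes column (column 9); if the export encodes it elsewhere, the parsing would need adjusting. The function name and the two example lines are made up for illustration.

```python
from collections import Counter


def count_source_attribute(lines):
    """Tally the `source` attribute across GFF3 feature lines."""
    tally = Counter()
    for line in lines:
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        # Attributes are `;`-separated key=value pairs in column 9.
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        tally[attributes.get("source", "<none>")] += 1
    return tally


# Made-up feature lines, only to show the expected input shape.
example = [
    "##gff-version 3",
    "chr1\tRNAcentral\ttranscript\t100\t200\t.\t+\t.\tID=example1;source=alignment",
    "chr1\tRNAcentral\ttranscript\t300\t400\t.\t-\t.\tID=example2;source=expert-database",
]
result = count_source_attribute(example)
print(result)
```

For a real export, the same function can be fed an open file handle, since it only iterates over lines.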

I've modified how we handle the mapped sequences in the GFF writer so that we only throw an exception if:

  1. The database associated with a region genuinely cannot be looked up (mis-spelled or whatever)
  2. We do not have a providing database for a provided sequence
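
The two conditions above can be sketched roughly as follows. This is not the actual writer code: `KNOWN_DATABASES`, the exception classes and `resolve_source` are hypothetical names, and the shape of `raw` is only inferred from the keys visible in the diff (`databases`, `providing_databases`, `was_mapped`).

```python
# Hypothetical lookup table of database names the exporter recognises.
KNOWN_DATABASES = {"ENA", "Rfam", "PDBe", "lncRNAdb"}


class UnknownDatabase(Exception):
    """A database name on the region cannot be looked up."""


class MissingProvidingDatabase(Exception):
    """A provided (unmapped) region has no providing database."""


def resolve_source(raw):
    """Return the GFF `source` value for a region, or raise on bad data."""
    # Condition 1: every associated database must be resolvable.
    for name in list(raw["databases"]) + list(raw["providing_databases"]):
        if name not in KNOWN_DATABASES:
            raise UnknownDatabase(f"Cannot look up database {name!r}: {raw}")
    if raw["providing_databases"]:
        return "expert-database"
    if raw["was_mapped"]:
        # Mapped regions legitimately have no providing database.
        return "alignment"
    # Condition 2: provided coordinates must come from somewhere.
    raise MissingProvidingDatabase(f"No providing database for region: {raw}")


mapped = {"databases": ["ENA"], "providing_databases": [], "was_mapped": True}
print(resolve_source(mapped))
```

The point of the ordering is that a mapped region with an empty `providing_databases` list is treated as normal output rather than as an error.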

This is able to produce gff files that contain ~273k aligned sequences that were previously missing. I think this is part of the missing data, but we still need to find ~100k more, so this is not completely finished yet.

afg1 added 2 commits December 16, 2024 11:49
…in an earlier groupby

I think this is working as expected now - the lists of genes targeted by an miRNA contained duplicates (I don't know why), which led to conflicts on insertion to the database.

This now produces FAR less data, which is somewhat concerning
…raise an exception when the providing database is missing for an unmapped region
@afg1 afg1 added the bug label Dec 16, 2024
@afg1 afg1 requested a review from blakesweeney December 16, 2024 16:14
@afg1 afg1 self-assigned this Dec 16, 2024
@blakesweeney
Member

Nice work on tracking these down. The missing 100k are a little surprising. I'd suggest grabbing those old files and checking if there were ids which were mapped and had coordinates provided. We made the change to the database we did because it had duplicate data on that front. So checking that would be worthwhile too. There is also the precompute issue that could have affected this as well. Though we don't know what that one is yet either.
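
A rough sketch of that check, assuming each export has already been reduced to `(id, source)` pairs, where `source` is either `alignment` or `expert-database`. The function names and the ids in the example are invented for illustration.

```python
from collections import defaultdict


def ids_with_both_sources(records):
    """Ids that appear both as mapped and as provided coordinates."""
    sources_by_id = defaultdict(set)
    for rna_id, source in records:
        sources_by_id[rna_id].add(source)
    wanted = {"alignment", "expert-database"}
    return {rna_id for rna_id, seen in sources_by_id.items() if wanted <= seen}


def ids_missing_from_new(old_records, new_records):
    """Ids present in the old export but absent from the new one."""
    return {rna_id for rna_id, _ in old_records} - {rna_id for rna_id, _ in new_records}


# Invented ids, only to show the expected input shape.
old = [
    ("id_A", "alignment"),
    ("id_A", "expert-database"),
    ("id_B", "alignment"),
    ("id_C", "expert-database"),
]
new = [("id_A", "expert-database"), ("id_C", "expert-database")]

both = ids_with_both_sources(old)
gone = ids_missing_from_new(old, new)
print(sorted(both), sorted(gone))
```

Ids that had both sources in the old file are not counted as missing by id alone, even if one of their two entries was dropped, so comparing feature counts and id counts separately may help explain part of the gap.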

Member

@blakesweeney blakesweeney left a comment

I think this is ok, but I'm going to take a look at the query to see if that could
have some issues.

@@ -76,6 +78,9 @@ def build(cls, index, raw):
            "databases": clean_databases(raw["databases"]),
        }
        if not metadata["providing_databases"]:
            if not raw["was_mapped"]:
                print(raw)
Member

Probably better to change the exception to f".. {raw}" to show what data has the
issue.

@blakesweeney
Member

Poking around a bit and I can't see any other obvious sources of problems. So probably a combination of what you found and some other issues.


2 participants