Find the missing RNAs reported by GeneCards #210

afg1 · 2024-12-16T16:14:30Z

Between releases 23 and 24, a large number of RNAs went missing from our genome coordinate FTP exports. GeneCards reported this for our human coordinates, and gave this table that summarises what went missing:

Database	Previous Count	Current Count	Percentage Decrease
ENA	869,516	483,656	44%
Rfam	47,153	7,229	84%
PDBe	3,387	102	96%
lncRNAdb	222	55	75%
SRPDB	34	2	94%
CRW	173	0	100%
RiboVision	76	0	100%
5SrRNAdb	32	0	100%

This amounts to a little over 400k sequences that have gone missing.
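As a quick check of that figure, the per-database drops can be summed directly. The counts below are copied from the table; the variable names are only for illustration, and the sum assumes no sequence is counted under more than one database, which may not hold.

```python
# (previous, current) counts copied from the GeneCards table above.
counts = {
    "ENA": (869_516, 483_656),
    "Rfam": (47_153, 7_229),
    "PDBe": (3_387, 102),
    "lncRNAdb": (222, 55),
    "SRPDB": (34, 2),
    "CRW": (173, 0),
    "RiboVision": (76, 0),
    "5SrRNAdb": (32, 0),
}

# Drop per database, then the overall total.
missing = {db: prev - cur for db, (prev, cur) in counts.items()}
total_missing = sum(missing.values())

print(total_missing)  # 429549
```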

One possible reason is that the export handles mapped and received coordinates badly. When an expert db gives us coordinates, we import them, and on export they are associated with that database and get source=expert-database in the gff. Mapped sequences should have source=alignment in the gff, but the release 24 gff file contains 0 aligned sequences, which is clearly not right.
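One way to confirm this on a given export is to tally the source column (column 2) of the gff file. This is a minimal sketch, assuming a standard tab-separated GFF3 layout; the sample lines and IDs are made up, and only the two source values come from the description above.

```python
from collections import Counter


def count_sources(lines):
    """Count GFF3 feature lines by their source column (column 2)."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and directives/comments such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        counts[fields[1]] += 1
    return counts


# Hypothetical lines shaped like a release with no aligned sequences.
example = [
    "##gff-version 3\n",
    "chr1\texpert-database\ttranscript\t100\t200\t.\t+\t.\tID=EXAMPLE_1\n",
    "chr2\texpert-database\ttranscript\t300\t450\t.\t-\t.\tID=EXAMPLE_2\n",
]

source_counts = count_sources(example)
print(source_counts["expert-database"], source_counts["alignment"])
```

For a real file, pass an open file handle instead of the list, for example `count_sources(open(path))`.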

I've modified how we handle the mapped sequences in the gff writer so that we only throw an exception if:

- The database associated with a region genuinely cannot be looked up (misspelled or whatever)
- We do not have a providing database for a provided sequence
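The two conditions can be sketched roughly as below. This is not the actual writer code: `KNOWN_DATABASES`, `gff_source`, and the use of `ValueError` are hypothetical, and treating every mapped region as source=alignment is an assumption based on the description above.

```python
# Hypothetical lookup table; the real pipeline resolves databases elsewhere.
KNOWN_DATABASES = {"ENA", "Rfam", "PDBe"}


def gff_source(was_mapped, providing_databases):
    """Return the gff source for a region, raising only in the two cases listed."""
    # Case 1: a database name that genuinely cannot be looked up.
    for db in providing_databases:
        if db not in KNOWN_DATABASES:
            raise ValueError(f"Unknown database: {db!r}")
    # Mapped regions need no providing database (assumption, see lead-in).
    if was_mapped:
        return "alignment"
    # Case 2: a provided (unmapped) region with no providing database.
    if not providing_databases:
        raise ValueError("No providing database for a provided region")
    return "expert-database"


print(gff_source(True, []))         # alignment
print(gff_source(False, ["Rfam"]))  # expert-database
```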

With this change the writer produces gff files that contain ~273k aligned sequences that were previously missing. I think this accounts for part of the missing data, but we still need to find ~100k more, so this is not finished yet.

…in an earlier groupby I think this is working as expected now - the lists of genes targeted by an miRNA contained duplicates (I don't know why), which led to conflicts on insertion to the database. This now produces FAR less data, which is somewhat concerning
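For illustration only, removing duplicates from per-miRNA gene lists before insertion could look like the sketch below. The data, the `dedupe` helper, and the dictionary layout are hypothetical; the actual commit is not shown here.

```python
def dedupe(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Made-up target lists, one of which contains a duplicate gene.
targets = {
    "miRNA-1": ["GENE_A", "GENE_B", "GENE_A"],
    "miRNA-2": ["GENE_C"],
}

unique_targets = {mirna: dedupe(genes) for mirna, genes in targets.items()}
print(unique_targets["miRNA-1"])  # ['GENE_A', 'GENE_B']
```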

…raise an exception when the providing database is missing for an unmapped region

blakesweeney · 2024-12-19T17:47:05Z

Nice work on tracking these down. The missing 100k are a little surprising. I'd suggest grabbing those old files and checking if there were ids which were mapped and had coordinates provided. We made the change to the database we did because it had duplicate data on that front, so checking that would be worthwhile too. There is also the precompute issue that could have affected this as well, though we don't know what that one is yet either.
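A minimal sketch of that check, assuming the old gff files use the same source values (alignment, expert-database) and that the `ID` attribute in column 9 identifies the sequence. Both are assumptions, and the sample lines are made up.

```python
def ids_by_source(lines):
    """Map each gff source value to the set of IDs seen with it."""
    seen = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        # Column 9 holds "key=value" pairs separated by ";".
        attrs = dict(
            part.split("=", 1) for part in fields[8].split(";") if "=" in part
        )
        if "ID" in attrs:
            seen.setdefault(fields[1], set()).add(attrs["ID"])
    return seen


def mapped_and_provided(lines):
    """IDs that appear both as aligned and as provided by an expert database."""
    seen = ids_by_source(lines)
    return seen.get("alignment", set()) & seen.get("expert-database", set())


old_release = [
    "chr1\talignment\ttranscript\t100\t200\t.\t+\t.\tID=EXAMPLE_1;Name=a\n",
    "chr1\texpert-database\ttranscript\t100\t200\t.\t+\t.\tID=EXAMPLE_1;Name=a\n",
    "chr2\talignment\ttranscript\t300\t450\t.\t-\t.\tID=EXAMPLE_2;Name=b\n",
]

both = sorted(mapped_and_provided(old_release))
print(both)  # ['EXAMPLE_1']
```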

blakesweeney

I think this is ok, but I'm going to take a look at the query to see if that could
have some issues.

blakesweeney · 2024-12-19T18:44:01Z

rnacentral_pipeline/rnacentral/ftp_export/coordinates/data.py

@@ -76,6 +78,9 @@ def build(cls, index, raw):
            "databases": clean_databases(raw["databases"]),
        }
        if not metadata["providing_databases"]:
+            if not raw["was_mapped"]:
+                print(raw)


Probably better to change the exception to f".. {raw}" to show what data has the
issue.
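Something along these lines, perhaps. The exception raised after the `print` is not visible in the hunk, so the `ValueError` type, the message wording, and the `check_providing_databases` helper are assumptions; only the two conditions and the `{raw}` interpolation come from the diff and the comment above.

```python
def check_providing_databases(raw, metadata):
    """Raise with the offending row in the message instead of printing it."""
    if not metadata["providing_databases"]:
        if not raw["was_mapped"]:
            # Including raw in the message shows which data has the issue.
            raise ValueError(f"No providing databases for unmapped region: {raw}")


# Hypothetical rows: a mapped region without a providing database is allowed.
check_providing_databases({"was_mapped": True}, {"providing_databases": []})

try:
    check_providing_databases({"was_mapped": False}, {"providing_databases": []})
except ValueError as err:
    message = str(err)
    print(message)
```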

blakesweeney · 2024-12-19T19:06:54Z

Poking around a bit and I can't see any other obvious sources of problems. So probably a combination of what you found and some other issues.

afg1 added 2 commits December 16, 2024 11:49

Handle providing database being none when writing the gff file. Only …

d2f9cec

…raise an exception when the providing database is missing for an unmapped region

afg1 added the bug label Dec 16, 2024

afg1 requested a review from blakesweeney December 16, 2024 16:14

afg1 self-assigned this Dec 16, 2024

blakesweeney reviewed Dec 19, 2024
