Address and Filter NCBI Gene IDs misassigned due to read-through transcripts #5

Closed
ACastanza opened this issue Oct 12, 2021 · 2 comments

Comments

@ACastanza
Contributor

ACastanza commented Oct 12, 2021

There seems to be an issue, apparently originating on the NCBI side, whereby the NCBI Gene ID of a read-through transcript can end up assigned to one of(?) its parent Ensembl genes.

Here's an example from BioMart (Ensembl 103) that demonstrates the issue:

| Ensembl Gene ID | NCBI Gene ID | HGNC Gene ID | Gene Symbol | Gene Title |
| --- | --- | --- | --- | --- |
| ENSG00000278232 | 1394 | HGNC:2357 | CRHR1 | corticotropin releasing hormone receptor 1 [Source:HGNC Symbol;Acc:HGNC:2357] |
| ENSG00000278232 | 104909134 | HGNC:51483 | CRHR1 | corticotropin releasing hormone receptor 1 [Source:HGNC Symbol;Acc:HGNC:2357] |
| ENSG00000282456 | 104909134 | HGNC:51483 | LINC02210-CRHR1 | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |
| ENSG00000204650 | 147081 | HGNC:26327 | LINC02210 | long intergenic non-protein coding RNA 2210 [Source:HGNC Symbol;Acc:HGNC:26327] |

https://www.ncbi.nlm.nih.gov/gene/?term=1394
https://www.ncbi.nlm.nih.gov/gene/?term=104909134

From a quick look at your genes sheet:

| ensembl_gene_id | ensembl_gene_version | gene_symbol | gene_symbol_source_db | gene_symbol_source | gene_biotype | gene_description |
| --- | --- | --- | --- | --- | --- | --- |
| ENSG00000278232 | 4 | LINC02210-CRHR1 | HGNC | HGNC:51483 | protein_coding | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |
| ENSG00000282456 | 1 | LINC02210-CRHR1 | HGNC | HGNC:51483 | lncRNA | LINC02210-CRHR1 readthrough [Source:HGNC Symbol;Acc:HGNC:51483] |

it would seem to be affected as well: ENSG00000278232 carries the read-through symbol LINC02210-CRHR1 rather than CRHR1.
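
One way to spot this kind of mapping is to flag every Ensembl gene that ends up with more than one NCBI Gene ID. The snippet below is only a minimal sketch using the BioMart rows quoted above; the function name `ambiguous_ensembl_genes` and the tuple layout are illustrative assumptions, not part of this project's code.

```python
from collections import defaultdict

# Rows copied from the BioMart example above:
# (Ensembl gene ID, NCBI gene ID, gene symbol)
rows = [
    ("ENSG00000278232", "1394", "CRHR1"),
    ("ENSG00000278232", "104909134", "CRHR1"),
    ("ENSG00000282456", "104909134", "LINC02210-CRHR1"),
    ("ENSG00000204650", "147081", "LINC02210"),
]


def ambiguous_ensembl_genes(rows):
    """Return {ensembl_gene_id: sorted NCBI IDs} for genes with >1 NCBI ID.

    Hypothetical helper for illustration only.
    """
    ncbi_ids = defaultdict(set)
    for ensembl_id, ncbi_id, _symbol in rows:
        ncbi_ids[ensembl_id].add(ncbi_id)
    # IDs are strings, so sorting is lexicographic, not numeric.
    return {gene: sorted(ids) for gene, ids in ncbi_ids.items() if len(ids) > 1}


flagged = ambiguous_ensembl_genes(rows)
print(flagged)
```

On these four rows only ENSG00000278232 is flagged, since it is paired with both 1394 and 104909134. A flag like this says nothing about which mapping is correct; it only narrows down where to look.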

ACastanza changed the title from "Address and Filter NCBI Gene IDs miss-assigned due to read-through transcripts" to "Address and Filter NCBI Gene IDs misassigned due to read-through transcripts" on Oct 12, 2021
@dhimmel
Member

dhimmel commented Feb 1, 2022

Here's an example where ENSG00000004866 (ST7) maps to both ST7 and ST7-OT3 in ncbigene:

| ensembl_representative_gene_id | ensembl_gene_id | gene_symbol | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation | xref_curie |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000004866 | ENSG00000004866 | ST7 | EntrezGene | 7982 | ST7 | suppression of tumorigenicity 7 | DEPENDENT | None | ncbigene:7982 |
| ENSG00000004866 | ENSG00000004866 | ST7 | EntrezGene | 93655 | ST7-OT3 | ST7 overlapping transcript 3 | DEPENDENT | None | ncbigene:93655 |

I don't think "ST7 overlapping transcript 3" is a read-through, although this may be a similar situation where we don't want to map to both NCBI genes.

@dhimmel
Member

dhimmel commented Feb 1, 2022

Here's an example of a read-through:

| ensembl_representative_gene_id | ensembl_gene_id | gene_symbol | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation | xref_curie |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000088298 | ENSG00000088298 | EDEM2 | EntrezGene | 111089941 | MMP24-AS1-EDEM2 | MMP24-AS1-EDEM2 readthrough | DEPENDENT | None | ncbigene:111089941 |
| ENSG00000088298 | ENSG00000088298 | EDEM2 | EntrezGene | 55741 | EDEM2 | ER degradation enhancing alpha-mannosidase lik... | DEPENDENT | None | ncbigene:55741 |

What I'm thinking is that, for a given representative Ensembl gene, we can pick the NCBI gene mapping whose symbol matches the gene's own symbol. That would address these issues as well as #10.
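
The same-symbol idea could look roughly like the sketch below. It reuses the column names from the xref tables in this thread, but the function `pick_same_symbol` and the fallback for genes with no matching symbol are assumptions made for illustration; this is not necessarily what the closing commit does.

```python
from collections import defaultdict

# Subset of columns from the ST7 and EDEM2 xref examples in this thread.
xrefs = [
    {"ensembl_representative_gene_id": "ENSG00000004866", "gene_symbol": "ST7",
     "xref_accession": "7982", "xref_label": "ST7"},
    {"ensembl_representative_gene_id": "ENSG00000004866", "gene_symbol": "ST7",
     "xref_accession": "93655", "xref_label": "ST7-OT3"},
    {"ensembl_representative_gene_id": "ENSG00000088298", "gene_symbol": "EDEM2",
     "xref_accession": "111089941", "xref_label": "MMP24-AS1-EDEM2"},
    {"ensembl_representative_gene_id": "ENSG00000088298", "gene_symbol": "EDEM2",
     "xref_accession": "55741", "xref_label": "EDEM2"},
]


def pick_same_symbol(xrefs):
    """Keep, per representative Ensembl gene, the xrefs whose label equals the gene symbol.

    Hypothetical helper. Assumption: if no xref matches the symbol,
    all xrefs for that gene are kept rather than dropped.
    """
    by_gene = defaultdict(list)
    for xref in xrefs:
        by_gene[xref["ensembl_representative_gene_id"]].append(xref)
    kept = []
    for group in by_gene.values():
        matches = [x for x in group if x["xref_label"] == x["gene_symbol"]]
        kept.extend(matches if matches else group)
    return kept


selected = pick_same_symbol(xrefs)
accessions = [x["xref_accession"] for x in selected]
print(accessions)
```

On the rows above this keeps ncbigene:7982 for ST7 and ncbigene:55741 for EDEM2, dropping the ST7-OT3 and MMP24-AS1-EDEM2 mappings. Whether to keep or drop unmatched genes is a design choice the sketch leaves open.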

@dhimmel dhimmel closed this as completed in 3844c02 Feb 1, 2022