homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene #10

Closed
dhimmel opened this issue Nov 22, 2021 · 4 comments

@dhimmel

dhimmel commented Nov 22, 2021

In the homo_sapiens_core_104_38 database, ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all ensembl gene mappings to ncbigenes for SMN1 & SMN2:

| ensembl_gene_id | gene_symbol | ensembl_representative_gene_id | is_representative | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation |
|---|---|---|---|---|---|---|---|---|---|
| ENSG00000172062 | SMN1 | ENSG00000172062 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000275349 | SMN1 | ENSG00000172062 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |

Some notes from the table:

  • ENSG00000172062 / SMN1 only maps to SMN1 in ncbigene and not SMN2
  • ENSG00000172062 / SMN1 has a single non-representative alt-allele, which is ENSG00000275349
  • ENSG00000205571 / SMN2 has two non-representative alt-alleles, which are ENSG00000273772 and ENSG00000277773.
  • Alt-alleles have the same mappings as their representative gene, so any fix to the mappings of ENSG00000205571 should also be applied to its alt-alleles.

I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.

Python code to generate the table above:

import pandas as pd

# Pin the export to a specific commit so the table is reproducible.
commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
xrefs_df = pd.read_parquet(url)

# Keep only EntrezGene (NCBI gene) xrefs whose label is SMN1 or SMN2.
smn_symbols = {"SMN1", "SMN2"}
smn_df = (
    xrefs_df
    .query("xref_source == 'EntrezGene'")
    .query("xref_label in @smn_symbols")
)

# Attach gene symbols and flag whether each gene is its own representative.
smn_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(smn_df)  # joins on the shared ensembl_gene_id column
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
smn_df

@michalszpak

These genes code for the same protein product (also reflected by the UniProt mappings). The cross-reference pipeline attempts to compare exon structure and position when mapping RefSeq transcripts. It allows for some mismatches but if a RefSeq mRNA has matching exons with an Ensembl transcript, then they’ll be matched.

SMN1
https://www.ncbi.nlm.nih.gov/gene/6606
survival motor neuron protein isoform d

NP_000335.1 survival motor neuron protein isoform d [Homo sapiens]
MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

SMN2
https://www.ncbi.nlm.nih.gov/gene/6607
survival motor neuron protein isoform d

NP_059107.1 survival motor neuron protein isoform d [Homo sapiens]
MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

Cross-references on haplotypes and patches are projected from the alt_allele on the primary assembly.
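
The projection described above can be checked against the table from the first comment. The following is only a sketch: the rows are copied by hand from that table (reduced to three columns), and storing `xref_accession` as a string is an assumption about the data type, not something confirmed in this thread.

```python
import pandas as pd

# Rows copied from the SMN table in the first comment (three columns only).
smn_df = pd.DataFrame(
    [
        ("ENSG00000172062", "ENSG00000172062", "6606"),
        ("ENSG00000275349", "ENSG00000172062", "6606"),
        ("ENSG00000205571", "ENSG00000205571", "6606"),
        ("ENSG00000205571", "ENSG00000205571", "6607"),
        ("ENSG00000273772", "ENSG00000205571", "6606"),
        ("ENSG00000273772", "ENSG00000205571", "6607"),
        ("ENSG00000277773", "ENSG00000205571", "6606"),
        ("ENSG00000277773", "ENSG00000205571", "6607"),
    ],
    columns=["ensembl_gene_id", "ensembl_representative_gene_id", "xref_accession"],
)

# Set of NCBI gene ids attached to each Ensembl gene.
xrefs_by_gene = smn_df.groupby("ensembl_gene_id")["xref_accession"].agg(frozenset)

# One row per Ensembl gene, with its own xref set and its representative's xref set.
check_df = (
    smn_df[["ensembl_gene_id", "ensembl_representative_gene_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
check_df["xrefs"] = check_df["ensembl_gene_id"].map(xrefs_by_gene)
check_df["representative_xrefs"] = check_df["ensembl_representative_gene_id"].map(xrefs_by_gene)

# True when a gene carries exactly the same NCBI xrefs as its representative.
check_df["matches_representative"] = [
    own == rep for own, rep in zip(check_df["xrefs"], check_df["representative_xrefs"])
]
print(check_df[["ensembl_gene_id", "matches_representative"]])
```

For these rows every alt-allele matches its representative, which is consistent with xrefs being projected rather than computed independently for each alt-allele.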

@dhimmel

dhimmel commented Dec 1, 2021

Thanks @michalszpak for your help! Much appreciated.

> These genes code for the same protein product

Fascinating! I read a bit more about it:

> The full-size protein made from the SMN2 gene is identical to the protein made from a similar gene called SMN1; however, only 10 to 15 percent of all functional SMN protein is produced from the SMN2 gene (the rest is produced from the SMN1 gene). Typically, people have two copies of the SMN1 gene and one to two copies of the SMN2 gene in each cell. However, the number of copies of the SMN2 gene varies, with some people having up to eight copies.

So ensembl genes are mapped to NCBI genes using a transcript matching approach, which in the case of ensembl:ENSG00000205571-to-ncbigene:6606 creates a spurious mapping.

I wonder whether this repository should pick a "primary" mapped NCBI gene for each ensembl gene. When an ensembl gene maps to multiple ncbi genes, we'd compare the ensembl and ncbi gene symbols (gene_symbol and xref_label columns above) to select the primary-mapped-ncbigene for each ensembl gene. Any other heuristics we could use to select the most similar ncbi gene from many? Would this work for human, rat, mouse, and beyond?

Another motivation besides removing spurious mappings is that many use cases for mappings benefit from one-to-one mappings. The proposed approach would create many-to-one mappings, which is still preferable to the current many-to-many.
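
A rough sketch of the symbol-matching heuristic proposed above, to make the idea concrete. The function name `select_primary_ncbi_gene` is hypothetical, the rows are copied from the SMN table in the first comment, and the tie-break on lowest `xref_accession` when no symbol matches is an arbitrary assumption, not a recommendation.

```python
import pandas as pd

def select_primary_ncbi_gene(xref_df: pd.DataFrame) -> pd.DataFrame:
    """Keep one NCBI gene per Ensembl gene, preferring a matching gene symbol.

    Hypothetical helper: ties (no match, or several matches) are broken by the
    lowest xref_accession, which is an arbitrary choice made for determinism.
    """
    ranked = xref_df.assign(symbol_match=xref_df["gene_symbol"] == xref_df["xref_label"])
    # Within each Ensembl gene, put symbol matches first, then the lowest accession.
    ranked = ranked.sort_values(
        ["ensembl_gene_id", "symbol_match", "xref_accession"],
        ascending=[True, False, True],
    )
    return ranked.drop_duplicates("ensembl_gene_id", keep="first").reset_index(drop=True)

# Rows copied from the SMN table in the first comment (four columns only).
smn_df = pd.DataFrame(
    [
        ("ENSG00000172062", "SMN1", "6606", "SMN1"),
        ("ENSG00000275349", "SMN1", "6606", "SMN1"),
        ("ENSG00000205571", "SMN2", "6606", "SMN1"),
        ("ENSG00000205571", "SMN2", "6607", "SMN2"),
        ("ENSG00000273772", "SMN2", "6606", "SMN1"),
        ("ENSG00000273772", "SMN2", "6607", "SMN2"),
        ("ENSG00000277773", "SMN2", "6606", "SMN1"),
        ("ENSG00000277773", "SMN2", "6607", "SMN2"),
    ],
    columns=["ensembl_gene_id", "gene_symbol", "xref_accession", "xref_label"],
)

primary_df = select_primary_ncbi_gene(smn_df)
print(primary_df[["ensembl_gene_id", "gene_symbol", "xref_accession", "xref_label"]])
```

On these rows the result is many-to-one: every SMN2 Ensembl gene keeps only 6607, and every SMN1 Ensembl gene keeps 6606.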

@dhimmel

dhimmel commented Dec 1, 2021

Here are all the instances where the ensembl gene_symbol does not match the xref_label (ncbi symbol) for human release 104: ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx. This dataset is helpful for this issue and #5.

Source code:

import pandas as pd

# Pin the export to a specific commit so the spreadsheet is reproducible.
commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
# Keep only EntrezGene (NCBI gene) xrefs.
ncbi_xref_df = pd.read_parquet(url).query("xref_source == 'EntrezGene'")

# Attach gene symbols and descriptions, and flag representative genes.
ncbi_xref_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "gene_description", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(ncbi_xref_df)  # joins on the shared ensembl_gene_id column
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)

# Export the mappings where the Ensembl symbol differs from the NCBI symbol.
(
    ncbi_xref_df
    .query("gene_symbol != xref_label")
    .to_excel("ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx", freeze_panes=(1, 0), index=False)
)

@michalszpak

Essentially, Ensembl features are mapped to NCBI features based on sequence matching and mRNA location information, which improves the accuracy of the mapping. Due to intrinsic differences between these annotations and the fact that different loci in the genome might code for the same product, the relationship between Ensembl and NCBI features is not necessarily 1-to-1. If you'd like to further filter these mappings then you'll need to use your own judgement, but it will certainly result in information loss, as some mappings might be equally good (100% sequence identity). Please bear in mind that assigned gene symbols are also external mappings and might be unstable or missing (especially in non-human species). I'd suggest taking into account the location information.
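
One way to read the suggestion about location information is to keep only mappings where the Ensembl and NCBI gene coordinates overlap. The sketch below is illustrative only: the gene ids and coordinates are made up (not real annotation), all column names and the `overlap_bp` helper are hypothetical, and it assumes both coordinate sets are 1-based, closed intervals on the same assembly.

```python
import pandas as pd

# Toy candidate mappings with made-up ids and coordinates (NOT real annotation).
candidates = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_A", "ENSG_A", "ENSG_B"],
        "ncbi_gene_id": ["1", "2", "1"],
        "ensembl_chrom": ["5", "5", "5"],
        "ensembl_start": [1000, 1000, 9000],
        "ensembl_end": [2000, 2000, 9800],
        "ncbi_chrom": ["5", "5", "5"],
        "ncbi_start": [9000, 1100, 9000],
        "ncbi_end": [9800, 2050, 9800],
    }
)

def overlap_bp(df: pd.DataFrame) -> pd.Series:
    """Base pairs shared by the Ensembl and NCBI intervals (0 if disjoint)."""
    same_chrom = df["ensembl_chrom"] == df["ncbi_chrom"]
    start = df[["ensembl_start", "ncbi_start"]].max(axis=1)
    end = df[["ensembl_end", "ncbi_end"]].min(axis=1)
    # Closed intervals: length is end - start + 1; negative lengths mean no overlap.
    return (end - start + 1).clip(lower=0).where(same_chrom, 0)

candidates["overlap_bp"] = overlap_bp(candidates)

# Drop mappings whose two genes do not overlap at all.
located_df = candidates.query("overlap_bp > 0").reset_index(drop=True)
print(located_df[["ensembl_gene_id", "ncbi_gene_id", "overlap_bp"]])
```

In this toy input ENSG_A maps to two NCBI genes with equally good symbol or sequence evidence, and only the co-located one survives. Unlike symbol matching, this does not depend on gene symbols being present or stable.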
