homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene #10
Comments
These genes code for the same protein product (also reflected by the UniProt mappings). The cross-reference pipeline attempts to compare exon structure and position when mapping RefSeq transcripts. It allows for some mismatches, but if a RefSeq mRNA has matching exons with an Ensembl transcript, then they’ll be matched.
SMN1
SMN2
Cross-references on haplotypes and patches are projected from the alt_allele on the primary assembly.
Thanks @michalszpak for your help! Much appreciated.
Fascinating! I read a bit more about it:
So Ensembl genes are mapped to NCBI genes using a transcript matching approach, which in the case of
I wonder whether this repository should pick a "primary" mapped NCBI gene for each Ensembl gene. When an Ensembl gene maps to multiple NCBI genes, we'd compare the Ensembl and NCBI gene symbols (
Another motivation besides removing spurious mappings is that many use cases for mappings benefit from one-to-one mappings. The proposed approach would create many-to-one mappings, which is still preferable to the current many-to-many.
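A minimal sketch of what such a "primary" selection could look like, on toy rows built from the SMN1/SMN2 IDs in this issue. The column names `ensembl_gene_id`, `gene_symbol` and `xref_label` appear in this repository's tables; `xref_accession` (the NCBI gene ID) and the helper `pick_primary` are assumptions made for illustration only, not existing code.

```python
import pandas as pd

# Toy rows taken from the SMN1/SMN2 example in this issue.
# xref_accession is an assumed column name for the NCBI gene ID.
xrefs = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000205571", "ENSG00000205571", "ENSG00000172062"],
        "gene_symbol": ["SMN2", "SMN2", "SMN1"],
        "xref_accession": ["6606", "6607", "6606"],
        "xref_label": ["SMN1", "SMN2", "SMN1"],
    }
)


def pick_primary(df: pd.DataFrame) -> pd.DataFrame:
    """Keep symbol-matching xrefs for genes that have one; otherwise keep all rows."""
    symbol_match = df["gene_symbol"] == df["xref_label"]
    # True on every row of an Ensembl gene that has at least one symbol match.
    has_match = symbol_match.groupby(df["ensembl_gene_id"]).transform("any")
    # Genes without any symbol match are left untouched rather than dropped.
    return df[symbol_match | ~has_match].reset_index(drop=True)


primary = pick_primary(xrefs)
print(primary)
```

Genes with no symbol match keep all of their xrefs here, so the filter only removes rows when a better-supported candidate exists. Whether symbol agreement is a safe tie-breaker is a separate question, since gene symbols are themselves external mappings.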
Here's all the instances where the ensembl

import pandas as pd

# Pin the data to the commit that holds the homo_sapiens_core_104_38 export
commit = "c87a3194704e073db841c0643f566bc5036e9f75"  # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
# Keep only the cross-references to NCBI (Entrez) genes
ncbi_xref_df = pd.read_parquet(url).query("xref_source == 'EntrezGene'")
# Attach gene-level annotations to each xref (merges on the shared columns)
ncbi_xref_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "gene_description", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(ncbi_xref_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
# Export the rows where the Ensembl symbol differs from the NCBI xref label
(
    ncbi_xref_df
    .query("gene_symbol != xref_label")
    .to_excel("ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx", freeze_panes=(1, 0), index=False)
)
Essentially, Ensembl features are mapped to NCBI features based on sequence matching and mRNA location information, which improves the accuracy of the mapping. Due to intrinsic differences between these annotations and the fact that different loci in the genome might code for the same product, the relationship between Ensembl and NCBI features is not necessarily 1-to-1. If you'd like to further filter these mappings then you'll need to use your own judgement, but it will certainly result in information loss, as some mappings might be equally good (100% sequence identity). Please bear in mind that assigned gene symbols are also external mappings and might be unstable or missing (especially in non-human species). I'd suggest taking into account the location information.
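To see how far a set of mappings is from 1-to-1, one could count distinct partners in each direction. This is only a sketch on toy rows from the SMN1/SMN2 case; `xref_accession` is an assumed column name for the NCBI gene ID, and the variable names are illustrative.

```python
import pandas as pd

# Toy Ensembl-to-NCBI mappings from the SMN1/SMN2 case discussed in this issue.
# xref_accession is an assumed column name for the NCBI gene ID.
xrefs = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000205571", "ENSG00000205571", "ENSG00000172062"],
        "xref_accession": ["6606", "6607", "6606"],
    }
)

# Number of distinct NCBI genes per Ensembl gene, and the reverse direction.
ncbi_per_ensembl = xrefs.groupby("ensembl_gene_id")["xref_accession"].nunique()
ensembl_per_ncbi = xrefs.groupby("xref_accession")["ensembl_gene_id"].nunique()

# Anything above 1 in either direction breaks the 1-to-1 relationship.
multi_ncbi = ncbi_per_ensembl[ncbi_per_ensembl > 1]
multi_ensembl = ensembl_per_ncbi[ensembl_per_ncbi > 1]
print(multi_ncbi)
print(multi_ensembl)
```

Counting in both directions distinguishes one-to-many from many-to-one cases, which matters when deciding which rows a filter would have to drop.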
In the homo_sapiens_core_104_38 database, Ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all Ensembl gene mappings to ncbigenes for SMN1 & SMN2:

Some notes from the table:
- ENSG00000172062 / SMN1 only maps to SMN1 in ncbigene and not SMN2
- ENSG00000172062 / SMN1 has a single non-representative alt-allele, which is ENSG00000275349
- ENSG00000205571 / SMN2 has two non-representative alt-alleles, which are ENSG00000273772 and ENSG00000277773.
- ENSG00000205571 should also be applied to the alt alleles.

I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.
Python code to generate the table above: