In the `protein_homolog_clusters.csv` dataset we are replacing the MGI data (`mgi_homologs.tsv`) with the Alliance of Genome Resources (AGR) ortholog data. Please change how the `protein_homolog_clusters.csv` dataset is processed by excluding the MGI data and adding the AGR data.
`protein_homolog_clusters.csv` update details:
1. Exclude the MGI homolog data from protein_homolog_clusters.csv but keep the mouse_protein_xref_mgi.csv dataset
Update your processing script for the `protein_homolog_clusters.csv` dataset so that it no longer extracts data from the `mgi_homologs.tsv` file; this file has not been downloaded for release 2.8 and will not be downloaded for future releases. However, please still create the `mouse_protein_xref_mgi.csv` dataset, which is generated from the mouse nt file, because MGI is the model organism database for mouse and is shown in the organism-specific section of the mouse protein pages.
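2. Add AGR data to protein_homolog_clusters.csv
Source = `downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv`
Mapping files:
- `ebi/current/uniprot-proteome-$taxa.nt` files
- `misc/species_info.csv`
- `unreviewed/$taxa_protein_masterlist.csv`

AGR dataset details: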
In the OMA and MGI datasets, all orthologs are part of the same cluster. In the AGR dataset (`ORTHOLOGY-ALLIANCE_COMBINED.tsv`), orthologs are instead mapped to a single gene, which has its own homolog ID (in the `Gene1ID` field). The protein corresponding to the `Gene1ID` is therefore the protein page to which the ortholog data will be added.
In the download, the Gene1 fields (`Gene1ID`, `Gene1Symbol`, `Gene1SpeciesTaxonID`, `Gene1SpeciesName`) contain information on the species gene/protein, and the Gene2 fields (`Gene2ID`, `Gene2Symbol`, `Gene2SpeciesTaxonID`, `Gene2SpeciesName`) describe the orthologs mapped to this gene. For example, the `Gene1ID` "HGNC:9546" has multiple rows, one for each ortholog (`Gene2ID`) that is mapped to this gene/protein (see https://www.alliancegenome.org/gene/HGNC:9546#orthology).
What data to extract and mapping files
Only extract rows:
- that contain data from GlyGen species, i.e. rows in which the taxa IDs in both the `Gene1SpeciesTaxonID` and the mapped ortholog `Gene2SpeciesTaxonID` are GlyGen species. In the `Gene1SpeciesTaxonID` and `Gene2SpeciesTaxonID` fields you can extract the taxa ID after the colon, e.g. “9606” from “NCBITaxon:9606”, and check that it is a species in GlyGen using `misc/species_info.csv`.
- in which the `Gene1ID` is mapped to a canonical protein. The AGR IDs, i.e. `Gene1ID` and `Gene2ID`, are already mapped to the UniProt AC in the taxa nt files (see triples below extracted from different nt files). You can map the AGR IDs to their UniProt ACs using these triples, and then filter for `Gene1ID` canonical proteins using the appropriate `$taxa_protein_masterlist.csv`.
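The species filter described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`taxon_id`, `filter_glygen_rows`), the hard-coded species set, and the placeholder `Gene2ID` values are hypothetical. In the real script the set of GlyGen taxa IDs would come from `misc/species_info.csv`, whose column layout is not given here.

```python
import csv
import io


def taxon_id(curie: str) -> str:
    """Return the text after the last colon, e.g. 'NCBITaxon:9606' -> '9606'."""
    return curie.rpartition(":")[2]


def filter_glygen_rows(rows, glygen_tax_ids):
    """Keep rows where both the Gene1 and Gene2 species are GlyGen species."""
    return [
        row
        for row in rows
        if taxon_id(row["Gene1SpeciesTaxonID"]) in glygen_tax_ids
        and taxon_id(row["Gene2SpeciesTaxonID"]) in glygen_tax_ids
    ]


# Illustrative input only: the Gene2ID values are placeholders, not real AGR IDs,
# and only the columns needed for the filter are included.
SAMPLE_TSV = (
    "Gene1ID\tGene1SpeciesTaxonID\tGene2ID\tGene2SpeciesTaxonID\n"
    "HGNC:9546\tNCBITaxon:9606\tEXAMPLE:mouse_gene\tNCBITaxon:10090\n"
    "HGNC:9546\tNCBITaxon:9606\tEXAMPLE:frog_gene\tNCBITaxon:8364\n"
)

# Assumption: hard-coded here; in practice, load from misc/species_info.csv.
GLYGEN_TAX_IDS = {"9606", "10090"}

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept = filter_glygen_rows(rows, GLYGEN_TAX_IDS)
print([row["Gene2ID"] for row in kept])  # ['EXAMPLE:mouse_gene']
```

The canonical-protein filter on `Gene1ID` is not shown because it depends on the nt triples and the masterlist files, whose formats are not reproduced here.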
Adding data from download to the protein_homolog_clusters.csv file

| Source Field | protein_homolog_clusters Field | Notes |
| --- | --- | --- |
| Gene1ID | homolog_cluster_id | e.g. “WB:WBGene00003947” or “ZFIN:ZDB-GENE-990415-216” |
| Gene2ID | uniprotkb_canonical_ac | Use the appropriate taxa nt file (see "tax_id" to determine the species and the correct nt file) to map the “Gene2ID” to the UniProt AC (see further details above) |
| Gene2SpeciesTaxonID | tax_id | Extract the taxa ID after the colon, e.g. “9606” from “NCBITaxon:9606” |
| | xref_key | Add “protein_xref_alliance_genome_resources” to all rows originating from this source file |

Example
Input file: `downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv`
Output data: `/unreviewed/protein_homolog_clusters.csv`
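The field mapping above could be sketched roughly as below. The function name `to_cluster_rows`, the `agr_to_uniprot` dictionary, and the placeholder IDs are all hypothetical. The dictionary stands in for the AGR ID to UniProt AC mapping built from the taxa nt files, and skipping rows with no mapped UniProt AC is an assumption rather than a stated requirement.

```python
XREF_KEY = "protein_xref_alliance_genome_resources"


def to_cluster_rows(agr_rows, agr_to_uniprot):
    """Convert filtered AGR rows into protein_homolog_clusters.csv rows."""
    out = []
    for row in agr_rows:
        ac = agr_to_uniprot.get(row["Gene2ID"])
        if ac is None:
            # Assumption: orthologs without a UniProt AC are skipped.
            continue
        out.append(
            {
                "homolog_cluster_id": row["Gene1ID"],
                "uniprotkb_canonical_ac": ac,
                # Taxa ID after the colon, e.g. '9606' from 'NCBITaxon:9606'.
                "tax_id": row["Gene2SpeciesTaxonID"].rpartition(":")[2],
                "xref_key": XREF_KEY,
            }
        )
    return out


# Placeholder data for illustration; these are not real IDs or accessions.
example_rows = [
    {
        "Gene1ID": "HGNC:9546",
        "Gene2ID": "EXAMPLE:mouse_gene",
        "Gene2SpeciesTaxonID": "NCBITaxon:10090",
    },
    {
        "Gene1ID": "HGNC:9546",
        "Gene2ID": "EXAMPLE:unmapped_gene",
        "Gene2SpeciesTaxonID": "NCBITaxon:10116",
    },
]
example_mapping = {"EXAMPLE:mouse_gene": "EXAMPLE_AC-1"}

cluster_rows = to_cluster_rows(example_rows, example_mapping)
```

Each output dictionary has exactly the four target fields from the table, so it can be written out with `csv.DictWriter` using those field names.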
3. Add AGR data from protein_homolog_clusters.csv to the correct protein pages within the homologs section
To identify the protein page to which the data needs to be added, map the `homolog_cluster_id` to the UniProt AC using the taxa's uniprot-proteome nt file (as described above in the extracting rows section). I would expect data to be added to Homo sapiens, Rattus norvegicus, Mus musculus, Danio rerio, Drosophila melanogaster and Saccharomyces cerevisiae GlyGen proteins. I will create a separate ticket for making the xref dataset for this data, `protein_xref_alliance_genome_resources.csv`.
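As a rough sketch of this last step, the rows can be grouped by the protein page that the `homolog_cluster_id` resolves to. The function name `homologs_by_page`, the `agr_to_uniprot` dictionary (again standing in for the mapping built from the nt files), and the placeholder accessions are hypothetical. Ignoring clusters whose `homolog_cluster_id` has no UniProt AC is an assumption.

```python
from collections import defaultdict


def homologs_by_page(cluster_rows, agr_to_uniprot):
    """Group ortholog ACs under the UniProt AC of the Gene1 (page) protein."""
    pages = defaultdict(list)
    for row in cluster_rows:
        page_ac = agr_to_uniprot.get(row["homolog_cluster_id"])
        if page_ac is None:
            # Assumption: clusters that cannot be mapped to a page are ignored.
            continue
        pages[page_ac].append(row["uniprotkb_canonical_ac"])
    return dict(pages)


# Placeholder rows and accessions for illustration only.
page_rows = [
    {"homolog_cluster_id": "HGNC:9546", "uniprotkb_canonical_ac": "EXAMPLE_AC-1"},
    {"homolog_cluster_id": "HGNC:9546", "uniprotkb_canonical_ac": "EXAMPLE_AC-2"},
    {"homolog_cluster_id": "EXAMPLE:unmapped", "uniprotkb_canonical_ac": "EXAMPLE_AC-3"},
]
page_mapping = {"HGNC:9546": "EXAMPLE_PAGE_AC"}

pages = homologs_by_page(page_rows, page_mapping)
```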
katewarner changed the title from "Dataset instructions for AGR orthologs" to "Instructions for adding new data to homolog dataset" on Jan 29, 2025
katewarner changed the title from "Instructions for adding new data to homolog dataset" to "Instructions for adding new data to protein_homolog_clusters.csv dataset" on Jan 30, 2025