Instructions for adding new data to protein_homolog_clusters.csv dataset #2136

katewarner · 2025-01-29T20:52:20Z

In the protein_homolog_clusters.csv dataset we are replacing the MGI data (mgi_homologs.tsv) with the Alliance of Genome Resources (AGR) ortholog data. Please change how the protein_homolog_clusters.csv dataset is processed by excluding the MGI data and adding the AGR data.

protein_homolog_clusters.csv update details:

1. Exclude the MGI homolog data from protein_homolog_clusters.csv but keep the mouse_protein_xref_mgi.csv dataset
Update your processing script for the protein_homolog_clusters.csv dataset so that it does not extract data from the mgi_homologs.tsv file; it has not been downloaded for 2.8 and will not be downloaded for future releases. However, please still create the mouse_protein_xref_mgi.csv dataset, which is generated from the mouse nt file, because MGI is the model organism database for mouse and appears in the organism-specific section of the mouse protein pages.

2. Add AGR data to protein_homolog_clusters.csv

Source = downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Mapping files:
ebi/current/uniprot-proteome-$taxa.nt files
misc/species_info.csv
unreviewed/$taxa_protein_masterlist.csv

AGR dataset details:
Rather than all the orthologs being part of the same cluster, as in the OMA and MGI datasets, in the AGR dataset (ORTHOLOGY-ALLIANCE_COMBINED.tsv) orthologs are mapped to a single gene, which has its own homolog ID (in the Gene1ID field). So the protein corresponding to the Gene1ID is the protein page to which the ortholog data will be added.
In the download, the Gene1 fields (Gene1ID, Gene1Symbol, Gene1SpeciesTaxonID, Gene1SpeciesName) contain info on the species gene/protein, and the Gene2 fields (Gene2ID, Gene2Symbol, Gene2SpeciesTaxonID, Gene2SpeciesName) describe the orthologs mapped to this gene. For example, the Gene1ID "HGNC:9546" has multiple rows, one for each ortholog (Gene2ID) that is mapped to this gene/protein (see below).

/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Gene1ID | Gene1Symbol | Gene1SpeciesTaxonID | Gene1SpeciesName | Gene2ID | Gene2Symbol | Gene2SpeciesTaxonID | Gene2SpeciesName
-- | -- | -- | -- | -- | -- | -- | --
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | SGD:S000003538 | PRE3 | NCBITaxon:559292 | Saccharomyces cerevisiae
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | FB:FBgn0010590 | Prosβ1 | NCBITaxon:7227 | Drosophila melanogaster
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-481018 | psmb9 | NCBITaxon:8364 | Xenopus tropicalis
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-865433 | psmb9.S | NCBITaxon:8355 | Xenopus laevis
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-001208-3 | psmb9b | NCBITaxon:7955 | Danio rerio
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-990415-140 | psmb9a | NCBITaxon:7955 | Danio rerio
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | RGD:3427 | Psmb9 | NCBITaxon:10116 | Rattus norvegicus
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | WB:WBGene00003947 | pbs-1 | NCBITaxon:6239 | Caenorhabditis elegans
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | MGI:1346526 | Psmb9 | NCBITaxon:10090 | Mus musculus

https://www.alliancegenome.org/gene/HGNC:9546#orthology
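
The Gene1/Gene2 layout described above amounts to a one-to-many mapping from each Gene1ID to its orthologs. A minimal sketch of reading it that way, assuming the real download is tab-delimited with a header row and that any metadata lines start with `#` (both are assumptions; `group_orthologs` is a hypothetical helper name):

```python
import csv
import io
from collections import defaultdict


def group_orthologs(handle):
    """Group (Gene2ID, Gene2SpeciesTaxonID) pairs under their Gene1ID."""
    # Skip blank lines and '#' metadata lines (assumed convention) before the header.
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Gene1ID"]].append((row["Gene2ID"], row["Gene2SpeciesTaxonID"]))
    return dict(groups)


# In-memory sample built from three of the rows shown in this issue.
sample = (
    "Gene1ID\tGene1Symbol\tGene1SpeciesTaxonID\tGene1SpeciesName\t"
    "Gene2ID\tGene2Symbol\tGene2SpeciesTaxonID\tGene2SpeciesName\n"
    "HGNC:9546\tPSMB9\tNCBITaxon:9606\tHomo sapiens\t"
    "RGD:3427\tPsmb9\tNCBITaxon:10116\tRattus norvegicus\n"
    "HGNC:9546\tPSMB9\tNCBITaxon:9606\tHomo sapiens\t"
    "MGI:1346526\tPsmb9\tNCBITaxon:10090\tMus musculus\n"
    "ZFIN:ZDB-GENE-991019-6\taanat2\tNCBITaxon:7955\tDanio rerio\t"
    "HGNC:19\tAANAT\tNCBITaxon:9606\tHomo sapiens\n"
)
orthologs_by_gene1 = group_orthologs(io.StringIO(sample))
```

Each key of `orthologs_by_gene1` corresponds to one protein page candidate, and its list holds the orthologs that would be shown there.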

What data to extract and mapping files

Only extract rows:

- that contain data from GlyGen species, i.e. rows in which the taxa IDs in both Gene1SpeciesTaxonID and the mapped ortholog's Gene2SpeciesTaxonID are GlyGen species. In these two fields you can extract the taxa ID after the colon, e.g. “9606” from “NCBITaxon:9606”, and check that it is a species in GlyGen using misc/species_info.csv.
- whose Gene1ID maps to a canonical protein. The AGR IDs (Gene1ID and Gene2ID) are already mapped to UniProt ACs in the taxa nt files (see the triples below, extracted from different nt files). You can map the AGR IDs to their UniProt ACs using these triples, and then keep only the rows whose Gene1ID maps to a canonical protein, using the appropriate $taxa_protein_masterlist.csv.

<http://purl.uniprot.org/uniprot/A0A096MIX7> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/RGD:1561261> .
<http://purl.uniprot.org/uniprot/A0AQH0> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/FB:FBgn0010590> .
<http://purl.uniprot.org/uniprot/A0A075F5C6> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/MGI:96238> .
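
The two row filters above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the set of GlyGen taxa IDs is passed in directly rather than read from misc/species_info.csv (whose column layout is not shown here), and each AGR ID maps to a set of UniProt ACs on the assumption that a gene may be cross-referenced from more than one UniProt entry.

```python
import re

# Matches seeAlso triples of the form shown above:
# <.../uniprot/AC> <...rdf-schema#seeAlso> <.../agr/AGR_ID> .
AGR_TRIPLE = re.compile(
    r"^<http://purl\.uniprot\.org/uniprot/(?P<ac>[^>]+)>\s+"
    r"<http://www\.w3\.org/2000/01/rdf-schema#seeAlso>\s+"
    r"<http://purl\.uniprot\.org/agr/(?P<agr_id>[^>]+)>\s*\.\s*$"
)


def build_agr_to_ac(nt_lines):
    """Return {AGR ID: set of UniProt ACs} from the lines of an nt file."""
    mapping = {}
    for line in nt_lines:
        match = AGR_TRIPLE.match(line.strip())
        if match:  # every other kind of triple is ignored
            mapping.setdefault(match.group("agr_id"), set()).add(match.group("ac"))
    return mapping


def taxon_id(curie):
    """Extract '9606' from 'NCBITaxon:9606'."""
    return curie.split(":", 1)[1]


def is_glygen_pair(row, glygen_tax_ids):
    """True when both the Gene1 and Gene2 species are GlyGen species."""
    return (
        taxon_id(row["Gene1SpeciesTaxonID"]) in glygen_tax_ids
        and taxon_id(row["Gene2SpeciesTaxonID"]) in glygen_tax_ids
    )


triples = [
    "<http://purl.uniprot.org/uniprot/A0A096MIX7> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/agr/RGD:1561261> .",
    "<http://purl.uniprot.org/uniprot/A0A075F5C6> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/agr/MGI:96238> .",
]
agr_to_ac = build_agr_to_ac(triples)
```

The canonical-protein check is left out of this sketch because it depends on the layout of $taxa_protein_masterlist.csv.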

Adding data from download to the protein_homolog_clusters.csv file

Source Field	protein_homolog_clusters Field	Notes
Gene1ID	homolog_cluster_id	e.g. “WB:WBGene00003947” or “ZFIN:ZDB-GENE-990415-216”
Gene2ID	uniprotkb_canonical_ac	Use the appropriate taxa nt file (see "tax_id" to determine species and correct nt file) to map the “Gene2ID” to the UniProt ac (see further details above)
Gene2SpeciesTaxonID	tax_id	Extract taxa ID after colon e.g. “9606” from “NCBITaxon:9606”
	xref_key	Add “protein_xref_alliance_genome_resources” to all rows originating from this source file
Gene1ID	xref_id
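
As a rough illustration of the field mapping in this table, the sketch below turns one AGR row into output records. Everything named here is hypothetical: `to_cluster_rows`, the `agr_to_ac` dictionary (AGR ID to a set of UniProt ACs, built from the nt files) and the `ac_to_canonical` dictionary (UniProt AC to canonical AC, assumed to be derived from the masterlist files). It also assumes that Gene2 proteins are resolved to canonical ACs in the same way as Gene1, and it takes `xref_key` as a parameter rather than hard-coding it.

```python
def to_cluster_rows(row, agr_to_ac, ac_to_canonical, xref_key):
    """Map one AGR ortholog row to protein_homolog_clusters records (possibly none)."""
    # Skip the row unless Gene1ID maps to at least one canonical protein.
    gene1_acs = agr_to_ac.get(row["Gene1ID"], ())
    if not any(ac in ac_to_canonical for ac in gene1_acs):
        return []

    tax_id = row["Gene2SpeciesTaxonID"].split(":", 1)[1]  # "NCBITaxon:10090" -> "10090"
    records = []
    for ac in sorted(agr_to_ac.get(row["Gene2ID"], ())):
        canonical = ac_to_canonical.get(ac)
        if canonical is None:
            continue  # Gene2 protein is not canonical (assumed filter)
        records.append(
            {
                "homolog_cluster_id": row["Gene1ID"],
                "uniprotkb_canonical_ac": canonical,
                "tax_id": tax_id,
                "xref_key": xref_key,
                "xref_id": row["Gene1ID"],
            }
        )
    return records


# "GENE1_AC" is a placeholder, not a real accession; the mouse values come
# from the example output in this issue.
agr_to_ac = {"HGNC:9546": {"GENE1_AC"}, "MGI:1346526": {"P28076"}}
ac_to_canonical = {"GENE1_AC": "GENE1_AC-1", "P28076": "P28076-1"}
row = {
    "Gene1ID": "HGNC:9546",
    "Gene2ID": "MGI:1346526",
    "Gene2SpeciesTaxonID": "NCBITaxon:10090",
}
records = to_cluster_rows(
    row, agr_to_ac, ac_to_canonical, "protein_xref_alliance_genome_resources"
)
```

Returning a list keeps the case where one Gene2ID maps to several canonical proteins, or to none, explicit for the caller.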

Example

Input file:
downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Gene1ID | Gene1Symbol | Gene1SpeciesTaxonID | Gene1SpeciesName | Gene2ID | Gene2Symbol | Gene2SpeciesTaxonID | Gene2SpeciesName | Algorithms | AlgorithmsMatch | OutOfAlgorithms | IsBestScore | IsBestRevScore
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | SGD:S000003538 | PRE3 | NCBITaxon:559292 | Saccharomyces cerevisiae | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OrthoInspector\|InParanoid | 8 | 9 | Yes | No
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | FB:FBgn0010590 | Prosβ1 | NCBITaxon:7227 | Drosophila melanogaster | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|OrthoInspector | 6 | 9 | Yes | No
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-481018 | psmb9 | NCBITaxon:8364 | Xenopus tropicalis | PhylomeDB\|Ensembl Compara\|OrthoFinder\|PANTHER\|OMA | 5 | 9 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-865433 | psmb9.S | NCBITaxon:8355 | Xenopus laevis | Xenbase | 1 | 1 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-001208-3 | psmb9b | NCBITaxon:7955 | Danio rerio | ZFIN | 1 | 10 | No | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-990415-140 | psmb9a | NCBITaxon:7955 | Danio rerio | Ensembl Compara\|PhylomeDB\|OrthoFinder\|PANTHER\|SonicParanoid\|ZFIN\|OMA\|OrthoInspector\|InParanoid | 9 | 10 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | RGD:3427 | Psmb9 | NCBITaxon:10116 | Rattus norvegicus | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|HGNC\|OrthoInspector\|InParanoid | 10 | 10 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | WB:WBGene00003947 | pbs-1 | NCBITaxon:6239 | Caenorhabditis elegans | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|OrthoInspector\|InParanoid | 9 | 9 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | MGI:1346526 | Psmb9 | NCBITaxon:10090 | Mus musculus | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|HGNC\|OrthoInspector\|InParanoid | 10 | 10 | Yes | Yes
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | SGD:S000002478 | PAA1 | NCBITaxon:559292 | Saccharomyces cerevisiae | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|InParanoid | 7 | 9 | Yes | Yes
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | HGNC:19 | AANAT | NCBITaxon:9606 | Homo sapiens | PANTHER\|SonicParanoid\|ZFIN\|OrthoInspector\|InParanoid | 5 | 10 | Yes | No
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | MGI:1328365 | Aanat | NCBITaxon:10090 | Mus musculus | PANTHER\|OrthoInspector\|ZFIN | 3 | 10 | Yes | No

Output data:
/unreviewed/protein_homolog_clusters.csv

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"HGNC:9546","P38624-1","559292","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","A0AQH0-1","7227","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","Q9PUS3-1","8355","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","P28077-1","10116","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","P28076-1","10090","protein_xref_genome_alliance_homologset","HGNC:9546"
"ZFIN:ZDB-GENE-991019-6","Q12447-1","559292","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"
"ZFIN:ZDB-GENE-991019-6","Q16613-1","9606","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"
"ZFIN:ZDB-GENE-991019-6","O88816-1","10090","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"

3. Add AGR data from protein_homolog_clusters.csv to the correct protein pages within the homologs section
To identify the protein page to which the data needs to be added, map the homolog_cluster_id to the UniProt AC using the taxa's uniprot-proteome nt file (as described above in the extracting rows section). I would expect data to be added to Homo sapiens, Rattus norvegicus, Mus musculus, Danio rerio, Drosophila melanogaster and Saccharomyces cerevisiae GlyGen proteins. I will create a ticket for making the xref dataset for this data (protein_xref_alliance_genome_resources.csv).

katewarner added this to the 2.8 milestone Jan 29, 2025

katewarner self-assigned this Jan 29, 2025

katewarner changed the title ~~Dataset instructions for AGR orthologs~~ Instructions for adding new data to homolog dataset Jan 29, 2025

katewarner changed the title ~~Instructions for adding new data to homolog dataset~~ Instructions for adding new data to protein_homolog_clusters.csv dataset Jan 30, 2025

katewarner mentioned this issue Jan 30, 2025

xref datasets for AGR homolog data #2137

Open

katewarner assigned rykahsay and unassigned katewarner Jan 30, 2025
