Instructions for adding new data to protein_homolog_clusters.csv dataset #2136

Open
katewarner opened this issue Jan 29, 2025 · 0 comments
katewarner commented Jan 29, 2025

In the protein_homolog_clusters.csv dataset we are replacing the MGI data (mgi_homologs.tsv) with the Alliance Genome Resource (AGR) ortholog data. Please change how the protein_homolog_clusters.csv dataset is processed by excluding the MGI data and adding the AGR data.

protein_homolog_clusters.csv update details:

1. Exclude the MGI homolog data from protein_homolog_clusters.csv but keep the mouse_protein_xref_mgi.csv dataset
Update your processing script for the protein_homolog_clusters.csv dataset so that it does not extract data from the mgi_homologs.tsv file; it has not been downloaded for 2.8 and will not be downloaded for future releases. However, please still create the mouse_protein_xref_mgi.csv dataset, which is generated from the mouse nt file, because MGI is the model organism database for mouse and appears in the organism-specific section of the mouse protein pages.

2. Add AGR data to protein_homolog_clusters.csv

Source = downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Mapping files:
ebi/current/uniprot-proteome-$taxa.nt files
misc/species_info.csv
unreviewed/$taxa_protein_masterlist.csv

AGR dataset details:
Rather than all the orthologs being part of the same cluster as in the OMA and MGI datasets, in the AGR dataset (ORTHOLOGY-ALLIANCE_COMBINED.tsv) orthologs are mapped to a single gene, which has its own homolog ID (in the Gene1ID field). The protein corresponding to the Gene1ID is therefore the protein page to which the ortholog data will be added.
In the download, the Gene1 fields (Gene1ID, Gene1Symbol, Gene1SpeciesTaxonID, Gene1SpeciesName) contain info on the species gene/protein, and the Gene2 fields (Gene2ID, Gene2Symbol, Gene2SpeciesTaxonID, Gene2SpeciesName) are the orthologs mapped to this gene. For example, the Gene1ID "HGNC:9546" has multiple rows, one for each ortholog (Gene2ID) that is mapped to this gene/protein (see below).

/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Gene1ID | Gene1Symbol | Gene1SpeciesTaxonID | Gene1SpeciesName | Gene2ID | Gene2Symbol | Gene2SpeciesTaxonID | Gene2SpeciesName
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | SGD:S000003538 | PRE3 | NCBITaxon:559292 | Saccharomyces cerevisiae
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | FB:FBgn0010590 | Prosβ1 | NCBITaxon:7227 | Drosophila melanogaster
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-481018 | psmb9 | NCBITaxon:8364 | Xenopus tropicalis
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-865433 | psmb9.S | NCBITaxon:8355 | Xenopus laevis
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-001208-3 | psmb9b | NCBITaxon:7955 | Danio rerio
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-990415-140 | psmb9a | NCBITaxon:7955 | Danio rerio
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | RGD:3427 | Psmb9 | NCBITaxon:10116 | Rattus norvegicus
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | WB:WBGene00003947 | pbs-1 | NCBITaxon:6239 | Caenorhabditis elegans
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | MGI:1346526 | Psmb9 | NCBITaxon:10090 | Mus musculus

https://www.alliancegenome.org/gene/HGNC:9546#orthology
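
The one-gene-to-many-orthologs layout described above can be sketched in Python as follows. This is only an illustration: the function name `group_orthologs` is hypothetical, and skipping lines that start with `#` is an assumption about metadata lines at the top of the download, not something stated in this ticket.

```python
import csv
import io
from collections import defaultdict


def group_orthologs(handle):
    """Return {Gene1ID: [Gene2ID, ...]} from a tab-separated AGR orthology file."""
    # Assumption: any metadata lines before the header start with '#'.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Gene1ID"]].append(row["Gene2ID"])
    return dict(groups)


# Two rows copied from the HGNC:9546 example above, re-joined with tabs.
sample = (
    "Gene1ID\tGene1Symbol\tGene1SpeciesTaxonID\tGene1SpeciesName\t"
    "Gene2ID\tGene2Symbol\tGene2SpeciesTaxonID\tGene2SpeciesName\n"
    "HGNC:9546\tPSMB9\tNCBITaxon:9606\tHomo sapiens\t"
    "RGD:3427\tPsmb9\tNCBITaxon:10116\tRattus norvegicus\n"
    "HGNC:9546\tPSMB9\tNCBITaxon:9606\tHomo sapiens\t"
    "MGI:1346526\tPsmb9\tNCBITaxon:10090\tMus musculus\n"
)
groups = group_orthologs(io.StringIO(sample))
print(groups)
```

Each key is the gene whose protein page receives the data, and each value lists the orthologs attached to it.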

What data to extract and mapping files

Only extract rows:

  • that contain data from GlyGen species, i.e. rows in which both the Gene1SpeciesTaxonID and the mapped ortholog Gene2SpeciesTaxonID are GlyGen species. In the Gene1SpeciesTaxonID and Gene2SpeciesTaxonID fields you can extract the taxa ID after the colon, e.g. “9606” from “NCBITaxon:9606”, and check that it is a species in GlyGen using misc/species_info.csv.
  • that are Gene1IDs mapped to canonical proteins. The AGR IDs, i.e. Gene1ID and Gene2ID, are already mapped to the UniProt AC in the taxa nt files (see the triples below, extracted from different nt files). You can map the AGR IDs to their UniProt ACs using these triples, and then filter for Gene1ID canonical proteins using the appropriate $taxa_protein_masterlist.csv
<http://purl.uniprot.org/uniprot/A0A096MIX7> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/RGD:1561261> .
<http://purl.uniprot.org/uniprot/A0AQH0> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/FB:FBgn0010590> .
<http://purl.uniprot.org/uniprot/A0A075F5C6> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/agr/MGI:96238> .
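
A minimal Python sketch of the two checks above, under stated assumptions. The helper names (`taxon_number`, `is_glygen_pair`, `agr_to_uniprot`) are hypothetical, and the set of GlyGen taxa IDs is an illustrative subset that would in practice be loaded from misc/species_info.csv. Filtering against $taxa_protein_masterlist.csv is left out because its column layout is not shown in this ticket.

```python
import re

# Pattern for the UniProt -> AGR rdfs:seeAlso triples quoted above.
AGR_TRIPLE = re.compile(
    r"<http://purl\.uniprot\.org/uniprot/(?P<ac>[^>]+)>\s+"
    r"<http://www\.w3\.org/2000/01/rdf-schema#seeAlso>\s+"
    r"<http://purl\.uniprot\.org/agr/(?P<agr_id>[^>]+)>\s*\."
)


def taxon_number(field):
    """'NCBITaxon:9606' -> '9606' (the text after the last colon)."""
    return field.rpartition(":")[2]


def is_glygen_pair(row, glygen_tax_ids):
    """True when both the Gene1 and Gene2 species are in the given taxa set."""
    return (
        taxon_number(row["Gene1SpeciesTaxonID"]) in glygen_tax_ids
        and taxon_number(row["Gene2SpeciesTaxonID"]) in glygen_tax_ids
    )


def agr_to_uniprot(nt_lines):
    """Build {AGR ID: {UniProt AC, ...}} from the lines of an nt file."""
    mapping = {}
    for line in nt_lines:
        match = AGR_TRIPLE.match(line.strip())
        if match:
            mapping.setdefault(match.group("agr_id"), set()).add(match.group("ac"))
    return mapping


nt_lines = [
    "<http://purl.uniprot.org/uniprot/A0A096MIX7> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/agr/RGD:1561261> .",
    "<http://purl.uniprot.org/uniprot/A0A075F5C6> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/agr/MGI:96238> .",
]
mapping = agr_to_uniprot(nt_lines)
row = {"Gene1SpeciesTaxonID": "NCBITaxon:9606", "Gene2SpeciesTaxonID": "NCBITaxon:10090"}
print(mapping["MGI:96238"])
print(is_glygen_pair(row, {"9606", "10090"}))  # illustrative taxa set
```

The mapping keeps a set of ACs per AGR ID because one gene may be linked to several UniProt entries; the masterlist filter would then pick the canonical one.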

Adding data from download to the protein_homolog_clusters.csv file

Source Field | protein_homolog_clusters Field | Notes
-- | -- | --
Gene1ID | homolog_cluster_id | e.g. “WB:WBGene00003947” or “ZFIN:ZDB-GENE-990415-216”
Gene2ID | uniprotkb_canonical_ac | Use the appropriate taxa nt file (see "tax_id" to determine species and correct nt file) to map the “Gene2ID” to the UniProt ac (see further details above)
Gene2SpeciesTaxonID | tax_id | Extract taxa ID after colon e.g. “9606” from “NCBITaxon:9606”
(none) | xref_key | Add “protein_xref_alliance_genome_resources” to all rows originating from this source file
Gene1ID | xref_id |
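
The field mapping above could look roughly like this in Python. It is a sketch, not the processing script itself: `to_output_row` and `write_rows` are hypothetical names, the canonical AC is assumed to have been resolved from the nt file beforehand, and quoting every field is an assumption based on the sample output in the Example section.

```python
import csv
import io

FIELDS = ["homolog_cluster_id", "uniprotkb_canonical_ac", "tax_id", "xref_key", "xref_id"]
XREF_KEY = "protein_xref_alliance_genome_resources"  # value given in the table above


def to_output_row(source_row, canonical_ac, xref_key=XREF_KEY):
    """Map one source row plus its resolved Gene2 canonical AC to an output row."""
    gene1_id = source_row["Gene1ID"]
    return {
        "homolog_cluster_id": gene1_id,
        "uniprotkb_canonical_ac": canonical_ac,
        # Keep only the number after the colon, e.g. 'NCBITaxon:10116' -> '10116'.
        "tax_id": source_row["Gene2SpeciesTaxonID"].rpartition(":")[2],
        "xref_key": xref_key,
        "xref_id": gene1_id,
    }


def write_rows(rows, handle):
    """Write rows as CSV with a header and every field quoted."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


source_row = {
    "Gene1ID": "HGNC:9546",
    "Gene2ID": "RGD:3427",
    "Gene2SpeciesTaxonID": "NCBITaxon:10116",
}
# "P28077-1" stands in for the AC resolved for RGD:3427 via the rat nt file.
out_row = to_output_row(source_row, "P28077-1")
buffer = io.StringIO()
write_rows([out_row], buffer)
print(buffer.getvalue())
```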

Example

Input file:
downloads/alliance_genome/current/ORTHOLOGY-ALLIANCE_COMBINED.tsv

Gene1ID | Gene1Symbol | Gene1SpeciesTaxonID | Gene1SpeciesName | Gene2ID | Gene2Symbol | Gene2SpeciesTaxonID | Gene2SpeciesName | Algorithms | AlgorithmsMatch | OutOfAlgorithms | IsBestScore | IsBestRevScore
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | SGD:S000003538 | PRE3 | NCBITaxon:559292 | Saccharomyces cerevisiae | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OrthoInspector\|InParanoid | 8 | 9 | Yes | No
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | FB:FBgn0010590 | Prosβ1 | NCBITaxon:7227 | Drosophila melanogaster | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|OrthoInspector | 6 | 9 | Yes | No
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-481018 | psmb9 | NCBITaxon:8364 | Xenopus tropicalis | PhylomeDB\|Ensembl Compara\|OrthoFinder\|PANTHER\|OMA | 5 | 9 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | Xenbase:XB-GENE-865433 | psmb9.S | NCBITaxon:8355 | Xenopus laevis | Xenbase | 1 | 1 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-001208-3 | psmb9b | NCBITaxon:7955 | Danio rerio | ZFIN | 1 | 10 | No | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | ZFIN:ZDB-GENE-990415-140 | psmb9a | NCBITaxon:7955 | Danio rerio | Ensembl Compara\|PhylomeDB\|OrthoFinder\|PANTHER\|SonicParanoid\|ZFIN\|OMA\|OrthoInspector\|InParanoid | 9 | 10 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | RGD:3427 | Psmb9 | NCBITaxon:10116 | Rattus norvegicus | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|HGNC\|OrthoInspector\|InParanoid | 10 | 10 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | WB:WBGene00003947 | pbs-1 | NCBITaxon:6239 | Caenorhabditis elegans | PhylomeDB\|Ensembl Compara\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|OrthoInspector\|InParanoid | 9 | 9 | Yes | Yes
HGNC:9546 | PSMB9 | NCBITaxon:9606 | Homo sapiens | MGI:1346526 | Psmb9 | NCBITaxon:10090 | Mus musculus | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|OMA\|HGNC\|OrthoInspector\|InParanoid | 10 | 10 | Yes | Yes
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | SGD:S000002478 | PAA1 | NCBITaxon:559292 | Saccharomyces cerevisiae | Ensembl Compara\|PhylomeDB\|OrthoFinder\|Hieranoid\|PANTHER\|SonicParanoid\|InParanoid | 7 | 9 | Yes | Yes
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | HGNC:19 | AANAT | NCBITaxon:9606 | Homo sapiens | PANTHER\|SonicParanoid\|ZFIN\|OrthoInspector\|InParanoid | 5 | 10 | Yes | No
ZFIN:ZDB-GENE-991019-6 | aanat2 | NCBITaxon:7955 | Danio rerio | MGI:1328365 | Aanat | NCBITaxon:10090 | Mus musculus | PANTHER\|OrthoInspector\|ZFIN | 3 | 10 | Yes | No

Output data:
/unreviewed/protein_homolog_clusters.csv

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"HGNC:9546","P38624-1","559292","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","A0AQH0-1","7227","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","Q9PUS3-1","8355","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","P28077-1","10116","protein_xref_genome_alliance_homologset","HGNC:9546"
"HGNC:9546","P28076-1","10090","protein_xref_genome_alliance_homologset","HGNC:9546"
"ZFIN:ZDB-GENE-991019-6","Q12447-1","559292","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"
"ZFIN:ZDB-GENE-991019-6","Q16613-1","9606","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"
"ZFIN:ZDB-GENE-991019-6","O88816-1","10090","protein_xref_genome_alliance_homologset","ZFIN:ZDB-GENE-991019-6"

3. Add AGR data from protein_homolog_clusters.csv to the correct protein pages within the homologs section
To identify the protein page to which the data needs to be added, map the homolog_cluster_id to the UniProt AC using the taxa's uniprot-proteome nt file (as described above in the section on which rows to extract). I would expect data to be added to Homo sapiens, Rattus norvegicus, Mus musculus, Danio rerio, Drosophila melanogaster and Saccharomyces cerevisiae GlyGen proteins. I will create a separate ticket for making the xref dataset for this data, protein_xref_alliance_genome_resources.csv.
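
One way to picture this step is the Python sketch below, which groups dataset rows under the protein page they belong to. The function name `homologs_by_page` and the lookup dict are hypothetical, and `EXAMPLE_AC-1` is a placeholder, not a real UniProt accession; in practice the lookup would come from the nt-file mapping filtered to canonical proteins.

```python
from collections import defaultdict


def homologs_by_page(rows, cluster_to_page_ac):
    """Group (ortholog AC, tax_id) pairs under the page AC of each cluster gene."""
    pages = defaultdict(list)
    for row in rows:
        page_ac = cluster_to_page_ac.get(row["homolog_cluster_id"])
        if page_ac is None:
            continue  # cluster gene has no canonical protein in the lookup
        pages[page_ac].append((row["uniprotkb_canonical_ac"], row["tax_id"]))
    return dict(pages)


# Rows taken from the sample output above (xref columns omitted for brevity).
rows = [
    {"homolog_cluster_id": "HGNC:9546", "uniprotkb_canonical_ac": "P28077-1", "tax_id": "10116"},
    {"homolog_cluster_id": "HGNC:9546", "uniprotkb_canonical_ac": "P28076-1", "tax_id": "10090"},
    {"homolog_cluster_id": "ZFIN:ZDB-GENE-991019-6", "uniprotkb_canonical_ac": "Q16613-1", "tax_id": "9606"},
]
# Placeholder lookup: only HGNC:9546 resolves to a page in this illustration.
cluster_to_page_ac = {"HGNC:9546": "EXAMPLE_AC-1"}
pages = homologs_by_page(rows, cluster_to_page_ac)
print(pages)
```

Rows whose cluster gene does not resolve to a canonical protein are skipped here, which is one possible way to handle them; the ticket does not say what should happen in that case.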

@katewarner katewarner added this to the 2.8 milestone Jan 29, 2025
@katewarner katewarner self-assigned this Jan 29, 2025
@katewarner katewarner changed the title from "Dataset instructions for AGR orthologs" to "Instructions for adding new data to homolog dataset" Jan 29, 2025
@katewarner katewarner changed the title from "Instructions for adding new data to homolog dataset" to "Instructions for adding new data to protein_homolog_clusters.csv dataset" Jan 30, 2025
@katewarner katewarner assigned rykahsay and unassigned katewarner Jan 30, 2025