Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added GPAD annotation notebook #253

Merged
merged 1 commit into from
Nov 30, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
255 changes: 255 additions & 0 deletions notebooks/annotations_gpad_file.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,255 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading a _Gene Product Association Data_ (GPAD) file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Download a GPAD file\n",
"2. Load the GPAD file into the GpadReader\n",
"3. Get Annotations\n",
"\n",
"**Bonus: Each line in the GPAD file is stored in a namedtuple**:\n",
" * Namedtuple fields\n",
" * Print a subset of the namedtuple fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Download a GPAD file and GO Dag file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from os.path import exists\n",
"if not exists('goa_human.gpad'):\n",
" !rm -f goa_human.gpad.gz\n",
" !wget http://current.geneontology.org/annotations/goa_human.gpad.gz\n",
" !gunzip goa_human.gpad.gz"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"if not exists('go-basic.obo'):\n",
" !wget http://geneontology.org/ontology/go-basic.obo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Load the GPAD file into the GpadReader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" EXISTS: go-basic.obo\n",
"go-basic.obo: fmt(1.2) rel(2022-11-03) 46,800 Terms\n",
"HMS:0:00:16.880690 635,765 annotations READ: goa_human.gpad \n"
]
}
],
"source": [
"from goatools.base import get_godag\n",
"from goatools.anno.gpad_reader import GpadReader\n",
"\n",
"godag = get_godag(\"go-basic.obo\")\n",
"anno = GpadReader(\"goa_human.gpad\", godag=godag)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Get Annotations\n",
"The annotations will be stored in three dicts, one for each GODAG branch, where:\n",
" * the key is the protein ID and \n",
" * the value is a list of GO IDs associated with the protein."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"ns2assc = anno.get_ns2assc()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MF A0A024RBG1 : GO:0000298 GO:0008486 GO:0034431 GO:0034432 GO:0046872 GO:0050072\n",
"MF A0A075B6Q5 : GO:0003823 GO:0034987\n",
"MF A0A075B6R2 : GO:0003823 GO:0034987\n",
"CC A0A024RBG1 : GO:0005634 GO:0005737 GO:0005829\n",
"CC A0A075B6H5 : GO:0005886\n",
"CC A0A075B6H7 : GO:0005615 GO:0005886 GO:0019814\n",
"BP A0A024RBG1 : GO:0071543 GO:1901907 GO:1901909 GO:1901911\n",
"BP A0A075B6H5 : GO:0007166\n",
"BP A0A075B6H7 : GO:0002250 GO:0002377 GO:0006955\n"
]
}
],
"source": [
"for namespace, associations in ns2assc.items():\n",
" for protein_id, go_ids in sorted(associations.items())[:3]:\n",
" print(\"{NS} {PROT:7} : {GOs}\".format(\n",
" NS=namespace,\n",
" PROT=protein_id,\n",
" GOs=' '.join(sorted(go_ids))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bonus: The GPAD is stored as a list of named tuples\n",
"The list of namedtuples is stored in the **GpadReader** data member named **_associations_**.\n",
"\n",
"Each namedtuple stores data for one line in the GPAD file."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ntgpadobj(DB='UniProtKB', DB_ID='A0A024RBG1', Qualifier={'enables'}, GO_ID='GO:0046872', DB_Reference={'GO_REF:0000043'}, ECO='ECO:0000501', Evidence_Code='IEA', With_From={'UniProtKB-KW:KW-0479'}, Taxon=None, Date=datetime.date(2022, 9, 7), Assigned_By='UniProt', Extension=None, Properties={}, NS='MF')\n"
]
}
],
"source": [
"# Sort the list of GPAD namedtuples by ID\n",
"nts = sorted(anno.associations, key=lambda nt:nt.DB_ID)\n",
"\n",
"# Print one namedtuple\n",
"print(nts[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Namedtuple fields\n",
"```\n",
"DB # 0 required 1 UniProtKB\n",
"DB_ID # 1 required 1 P12345\n",
"DB_Symbol # 2 required 1 PHO3\n",
"Qualifier # 3 optional 0 or greater NOT\n",
"GO_ID # 4 required 1 GO:0003993\n",
"DB_Reference # 5 required 1 or greater PMID:2676709\n",
"Evidence_Code # 6 required 1 IMP\n",
"With_From # 7 optional 0 or greater GO:0000346\n",
"Aspect # 8 required 1 F\n",
"DB_Name # 9 optional 0 or 1 Toll-like receptor 4\n",
"DB_Synonym # 10 optional 0 or greater hToll|Tollbooth\n",
"DB_Type # 11 required 1 protein\n",
"Taxon # 12 required 1 or 2 taxon:9606\n",
"Date # 13 required 1 20090118\n",
"Assigned_By # 14 required 1 SGD\n",
"Annotation_Extension # 15 optional 0 or greater part_of(CL:0000576)\n",
"Gene_Product_Form_ID # 16 optional 0 or 1 UniProtKB:P12345-2\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Print a subset of the namedtuple fields"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MF A0A024RBG1 GO:0046872 IEA 2022-09-07 UniProt\n",
"CC A0A024RBG1 GO:0005829 IDA 2016-12-04 HPA\n",
"BP A0A024RBG1 GO:1901907 IBA 2017-02-28 GO_Central\n",
"MF A0A024RBG1 GO:0000298 IBA 2017-02-28 GO_Central\n",
"MF A0A024RBG1 GO:0008486 IBA 2021-05-30 GO_Central\n",
"CC A0A024RBG1 GO:0005634 IBA 2017-02-28 GO_Central\n",
"BP A0A024RBG1 GO:1901911 IBA 2017-02-28 GO_Central\n",
"BP A0A024RBG1 GO:1901909 IBA 2017-02-28 GO_Central\n",
"MF A0A024RBG1 GO:0034432 IBA 2017-02-28 GO_Central\n",
"CC A0A024RBG1 GO:0005737 IBA 2017-02-28 GO_Central\n"
]
}
],
"source": [
"fmtpat = '{NS} {DB_ID} {GO_ID} {Evidence_Code} {Date} {Assigned_By}'\n",
"for nt_line in nts[:10]:\n",
" print(fmtpat.format(**nt_line._asdict()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (C) 2010-present, DV Klopfenstein PhD, Haibao Tang. All rights reserved."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}