This is a benchmarking repository accompanying the obnb
Python package.
```bash
conda create -n obnb python=3.8 -y && conda activate obnb

# Install PyTorch and PyG with CUDA 11.7
conda install pytorch=2.0.1 torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install pyg=2.3.0 -c pyg -y

pip install obnb[ext]==0.1.0           # install obnb with extension modules (PecanPy, GraPE, ...)
pip install -r requirements_extra.txt  # extra dependencies for benchmarking

conda clean --all -y  # clean up
```
The extra dependencies include, for example:

- Hydra for managing experiments.
- Lightning for organizing model training.
- WandB for logging metrics.
Note: if you only want to explore our benchmarking results in one of the notebooks and do not need to run the benchmarking experiments, you can skip the PyTorch and PyG installation:

```bash
pip install obnb[ext]==0.1.0
```
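As a quick, optional sanity check after installation, the snippet below verifies that the core packages import correctly; the `torch`/`torch_geometric` checks only apply to the full setup.

```python
# Optional sanity check for the full setup; skip the torch / torch_geometric
# lines if you only installed obnb for the notebooks.
from importlib.metadata import version

import torch
import torch_geometric

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)
print("obnb:", version("obnb"))
```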
Run `get_data.py` to download and set up the data for all experiments. Data is saved under the `datasets/` directory by default and takes up approximately 6 GB of space.

```bash
python get_data.py
```
This step is completely optional; running the training script directly will work fine. However, running `get_data.py` once before training prevents multiple parallel jobs from repeating the same data preprocessing work when the processed data is not yet available.
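Conceptually, this pre-download step just instantiates each network and label collection once so that the processed files end up cached under `datasets/`. The sketch below illustrates that idea using obnb's data constructors; the exact class names, arguments, and dataset list used by `get_data.py` may differ, so treat it as an illustration rather than the script's actual contents.

```python
# Illustrative sketch of the caching idea behind get_data.py; the actual script
# and the exact obnb data constructors/arguments may differ.
import obnb.data

ROOT = "datasets"  # default data directory in this repo

# Subset for illustration only; get_data.py covers all networks and labels.
for name in ["BioGRID", "HumanNet", "STRING", "DisGeNET"]:
    # Constructing a data object downloads and processes it if not cached yet.
    getattr(obnb.data, name)(ROOT)
```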
After setting up the data, one can run a single experiment by specifying the choices of network, label, and model:

```bash
python main.py dataset.network=BioGRID dataset.label=DisGeNET model=GCN
```
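The `dataset.network=...`, `dataset.label=...`, and `model=...` arguments are Hydra command-line overrides that get merged into the experiment config. Below is a minimal, illustrative sketch of how a Hydra entry point consumes such overrides; the `config_name` and the exact config fields are assumptions, not this repo's actual `main.py`.

```python
# Minimal Hydra entry-point sketch (illustrative only, not this repo's main.py).
# Overrides such as `dataset.network=BioGRID model=GCN` are merged into `cfg`.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # resolved config after command-line overrides
    network = cfg.dataset.network  # e.g. "BioGRID"
    label = cfg.dataset.label      # e.g. "DisGeNET"
    # ... build the dataset and model from the config and train ...


if __name__ == "__main__":
    main()
```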
Check out the `conf/model/` directory for all available model presets. The main model presets are:

- `GCN`
- `GAT`
- `GCN+BoT`
- `GAT+BoT`
- `LogReg+Adj`
- `LogReg+Node2vec`
- `LogReg+Walklets`
```bash
cd run

# GNN node feature ablation (example: running GCN with node2vec features on BioGRID)
sh run_abl_gnn_feature.sh GCN BioGRID Node2vec

# C&S ablation (example: running GCN with C&S post-processing on BioGRID)
sh run_abl_cs.sh GCN BioGRID

# GNN label reuse ablation (example: running GCN with label reuse on BioGRID)
sh run_abl_gnn_label.sh GCN BioGRID

# GNN label reuse with C&S ablation (example: running GCN with label reuse and C&S on BioGRID)
sh run_abl_gnn_cs_label.sh GCN BioGRID

# GNN with bag of tricks, i.e., node2vec node features + label reuse + C&S
sh run_gnn_bot.sh GCN BioGRID
```
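If you want to sweep one of these ablations over several networks without writing another shell script, a small Python driver can loop over the run scripts above. This is a convenience sketch; the script name and argument order follow the examples shown, and it should be executed from the `run/` directory.

```python
# Convenience sketch: sweep the C&S ablation over several networks.
# Run from the `run/` directory; script name and argument order follow the
# examples above.
import subprocess

for network in ["BioGRID", "HumanNet", "STRING"]:  # illustrative subset
    subprocess.run(["sh", "run_abl_cs.sh", "GCN", network], check=True)
```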
To run all experiments presented in the paper (may take several days):

```bash
sh run_all.sh
```
First create a sweep, e.g., for BioGRID-DisGeNET-GCN:

```bash
wandb sweep conf/tune/BioGRID-DisGeNET-GCN.yaml
```

Then, follow the instructions printed by the command above to spawn sweep agents that automatically tune the model configuration on that particular dataset.
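For reference, `wandb sweep` prints a sweep ID, and each agent is started with `wandb agent <sweep_id>`. The snippet below is a small, optional helper for launching a couple of agents in parallel from Python; the placeholder sweep ID must be replaced with the value printed by the command above.

```python
# Optional helper: launch two sweep agents in parallel. Replace SWEEP_ID with
# the entity/project/id value printed by `wandb sweep`.
import subprocess

SWEEP_ID = "<entity>/<project>/<sweep_id>"  # placeholder, do not use as-is

agents = [subprocess.Popen(["wandb", "agent", SWEEP_ID]) for _ in range(2)]
for agent in agents:
    agent.wait()
```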
To run the notebooks, first download our benchmarking results (or you can rerun all the benchmarking experiments yourself using our run scripts described above):

```bash
wget -O results/main.csv.gz https://zenodo.org/record/8048305/files/main.csv.gz
```
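Once downloaded, the results table can be loaded directly with pandas, which reads the gzip-compressed CSV transparently; inspect the frame to see the available columns, since they are not spelled out in this README.

```python
# Load the downloaded benchmarking results for use in the notebooks.
import pandas as pd

results = pd.read_csv("results/main.csv.gz")  # gzip handled via the .gz extension
print(results.shape)
print(results.head())
```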
Network | Weighted | Num. nodes | Num. edges | Density | Category |
---|---|---|---|---|---|
HumanBaseTopGlobal | ✅ | 25,689 | 77,807,094 | 0.117908 | Large & Dense |
HuMAP | ✅ | 15,433 | 35,052,604 | 0.147180 | Large & Dense |
STRING | ✅ | 18,480 | 11,019,492 | 0.032269 | Large |
ConsensusPathDB | ✅ | 17,735 | 10,611,416 | 0.033739 | Large |
FunCoup | ✅ | 17,892 | 10,037,478 | 0.031357 | Large |
PCNet | ❌ | 18,544 | 5,365,116 | 0.015603 | Large |
BioGRID | ❌ | 19,765 | 1,554,790 | 0.003980 | Medium |
HumanNet | ✅ | 18,591 | 2,250,780 | 0.006513 | Medium |
HIPPIE | ✅ | 19,338 | 1,542,044 | 0.004124 | Medium |
ComPPIHumanInt | ✅ | 17,015 | 699,620 | 0.002417 | Medium |
OmniPath | ❌ | 16,325 | 289,134 | 0.001085 | Small |
ProteomeHD | ❌ | 2,471 | 125,172 | 0.020509 | Small |
HuRI | ❌ | 8,100 | 103,188 | 0.001573 | Small |
BioPlex | ❌ | 8,108 | 71,004 | 0.001080 | Small |
SIGNOR | ❌ | 5,291 | 28,676 | 0.001025 | Small |
Network | Num. tasks | Num. pos. avg. | Num. pos. std. | Num. pos. med. |
---|---|---|---|---|
BioGRID | 145 | 178.1 | 137.4 | 127.0 |
BioPlex | 72 | 123.8 | 64.4 | 101.5 |
ComPPIHumanInt | 145 | 174.6 | 134.5 | 125.0 |
ConsensusPathDB | 144 | 177.4 | 137.5 | 126.0 |
FunCoup | 145 | 177.1 | 135.1 | 127.0 |
HIPPIE | 143 | 178.1 | 137.6 | 127.0 |
HuMAP | 123 | 168.0 | 119.2 | 120.0 |
HuRI | 50 | 130.3 | 56.7 | 112.5 |
HumanBaseTopGlobal | 149 | 178.5 | 137.7 | 129.0 |
HumanNet | 142 | 179.0 | 136.9 | 127.0 |
OmniPath | 135 | 180.2 | 131.1 | 131.0 |
PCNet | 143 | 171.8 | 130.6 | 122.0 |
ProteomeHD | 15 | 76.9 | 22.4 | 70.0 |
SIGNOR | 89 | 144.6 | 89.4 | 117.0 |
STRING | 146 | 175.4 | 135.6 | 126.0 |
Network | Num. tasks | Num. pos. avg. | Num. pos. std. | Num. pos. med. |
---|---|---|---|---|
BioGRID | 305 | 208.3 | 143.1 | 159.0 |
BioPlex | 189 | 138.6 | 71.4 | 111.0 |
ComPPIHumanInt | 301 | 204.1 | 138.7 | 159.0 |
ConsensusPathDB | 298 | 207.4 | 140.8 | 161.5 |
FunCoup | 299 | 204.7 | 139.4 | 158.0 |
HIPPIE | 306 | 208.1 | 142.9 | 159.5 |
HuMAP | 279 | 194.3 | 126.7 | 155.0 |
HuRI | 152 | 122.9 | 54.7 | 108.0 |
HumanBaseTopGlobal | 287 | 219.7 | 145.7 | 173.0 |
HumanNet | 302 | 204.2 | 140.3 | 158.5 |
OmniPath | 298 | 199.6 | 136.0 | 153.5 |
PCNet | 292 | 202.1 | 135.5 | 159.0 |
ProteomeHD | 56 | 78.0 | 24.8 | 71.0 |
SIGNOR | 219 | 147.3 | 81.9 | 124.0 |
STRING | 296 | 208.0 | 140.6 | 162.0 |
Network | Num. tasks | Num. pos. avg. | Num. pos. std. | Num. pos. med. |
---|---|---|---|---|
BioGRID | 114 | 89.5 | 37.1 | 76.0 |
BioPlex | 38 | 77.6 | 22.6 | 76.0 |
ComPPIHumanInt | 104 | 91.8 | 37.0 | 77.5 |
ConsensusPathDB | 112 | 90.1 | 37.0 | 76.5 |
FunCoup | 114 | 87.8 | 36.7 | 74.0 |
HIPPIE | 111 | 89.2 | 37.1 | 76.0 |
HuMAP | 96 | 84.6 | 32.3 | 74.0 |
HuRI | 27 | 69.9 | 16.0 | 65.0 |
HumanBaseTopGlobal | 115 | 89.2 | 37.3 | 76.0 |
HumanNet | 117 | 88.6 | 36.9 | 75.0 |
OmniPath | 106 | 88.7 | 36.2 | 74.0 |
PCNet | 105 | 89.0 | 36.0 | 77.0 |
ProteomeHD | 5 | 80.4 | 22.6 | 70.0 |
SIGNOR | 41 | 81.3 | 22.7 | 78.0 |
STRING | 116 | 88.9 | 36.6 | 75.0 |