A network-based approach for isolating the chronic inflammation gene signatures underlying complex diseases towards finding new treatment opportunities
This GitHub repository contains all code used for reproducing results from the manuscript "A network-based approach for isolating the chronic inflammation gene signatures underlying complex diseases towards finding new treatment opportunities", which can be found [here] (Get link)
This markdown document provides instructions on how to run the code for a sample disease and how to recreate the results.
- unix or unix-like OS
- Anaconda3 distribution
- R version >=4.0.0
- Slurm workload manager (To recreate project results)
R libraries needed:
- tidyverse 1.3.1
- parallel
- mccf1 1.1
- grid
- org.Hs.eg.db 3.12.0
- igraph 1.2.9
- topGO 2.42.0
Required data, as well as copies of pertinent results, are included in our Zenodo record.
Note that this record contains ~45 GB.
These can be downloaded with the script get_data.sh. Run this script in the repo so the Zenodo folder appears in this default filepath.
This folder, data_Zenodo, will include:
- GenePlexus: A local instance of GenePlexus
- GenePlexus_parameter_checks: Output for running GenePlexus with different arguments
- GenePlexus_String_Adjacency: Record of output from pipeline and analysis
- pascal_out: Pascal output files for UK BioBank traits we used
- clinical_trials: Clinical trial data for analysis
- biogrid: Biogrid network
- ConsensusPathDB: ConsensusPathDB network
- string: String network
- string-exp: String-exp network
- prediction_clusters_same_graph: Compendium of results to create paper figures
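Because the download is large, it can help to confirm that the folder arrived complete before starting the pipeline. The following is a minimal sketch, not part of the repository: it only checks that the subfolders named in this readme exist, and the helper name missing_subfolders is hypothetical.

```python
from pathlib import Path

# Subfolder names exactly as listed in this readme for data_Zenodo.
EXPECTED_SUBFOLDERS = [
    "GenePlexus",
    "GenePlexus_parameter_checks",
    "GenePlexus_String_Adjacency",
    "pascal_out",
    "clinical_trials",
    "biogrid",
    "ConsensusPathDB",
    "string",
    "string-exp",
    "prediction_clusters_same_graph",
]


def missing_subfolders(zenodo_dir):
    """Return the expected subfolder names that are absent from zenodo_dir."""
    root = Path(zenodo_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


# Assumes this is run from the repo root, where get_data.sh places data_Zenodo.
missing = missing_subfolders("data_Zenodo")
print("Missing subfolders:", missing if missing else "none")
```

An empty list means every subfolder named above is present; it says nothing about whether the files inside them are complete.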
The 'data' directory has some required data. This includes:
- disease_gene_files: Seed gene lists, generated in the pipeline
- drugcentral: Data from DrugCentral and scripts used to format it
- dgidb: Data from DGIdb and scripts used to download/format it
Scripts used to run the pipeline are located in src.
src contains chronic_inflammation_functions.R, which holds utility functions used by most scripts in the pipeline.
run contains Slurm submission scripts that were used to do our analysis. This readme has instructions for how to run each script without Slurm for a sample disease.
figures contains Markdown notebooks used to analyze final results. These results can be found in the Zenodo record.
The instructions in this readme show how to run the pipeline with one sample disease. Running all traits used in this project requires the Slurm workload manager and is impractical without it.
Script:
prep_disease_gene_dfs.R
Purpose: Creates the seed gene files in data/disease_gene_files for specified diseases from DisGeNet.
Arguments:
- Text file with one column (no header) containing the disease IDs of interest from DisGeNet
- Directory where "disease_gene_files" folder will go
Run:
Rscript prep_disease_gene_dfs.R \
../data/chronic_inflammation_diseases_non-ovlp_cuid.txt \
../data
Script:
getInflammationGenesFrom_org.HS.eg.db.R
Purpose: Takes the human genome and gets genes from inflammation related GO terms
Arguments: N/A
Run:
Rscript getInflammationGenesFrom_org.HS.eg.db.R
Script:
prepEdgelist.R
Purpose: Formats and creates an edgelist and Rdata object for a given network
Arguments:
- Path to tab delimited edgelist
- Path to output dir
- Network name
- True/False, keep edge weights or not
Run:
Rscript prepEdgelist.R \
../data_Zenodo/biogrid/biogrid_entrez_edgelist.txt \
../data_Zenodo/biogrid/ \
bioGRID \
FALSE
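The last argument controls whether edge weights are kept. The sketch below illustrates that choice in Python; it is not the logic of prepEdgelist.R. It assumes the input has no header and its columns are source, target, and an optional weight. The function name read_edgelist is hypothetical.

```python
import csv
import io


def read_edgelist(handle, keep_weights):
    """Parse a tab-delimited edgelist into (source, target[, weight]) tuples.

    Assumed layout (not confirmed by the source): no header row, columns are
    source, target, and an optional numeric weight.
    """
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0], row[1]
        if keep_weights and len(row) >= 3:
            edges.append((source, target, float(row[2])))
        else:
            # Weights are dropped, or the line never had one.
            edges.append((source, target))
    return edges


# Tiny made-up edgelist, used only to show the two modes.
sample_text = "1\t2\t0.9\n2\t3\t0.4\n"
print(read_edgelist(io.StringIO(sample_text), keep_weights=False))
print(read_edgelist(io.StringIO(sample_text), keep_weights=True))
```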
Script:
getNegativeControls.R
Purpose: Creates the seed gene files in data/disease_gene_files for the UK BioBank traits used in this project.
Arguments:
- File from Zenodo of UK BioBank traits of interest
- Location of Pascal output in Zenodo for each trait
- Location of disease_gene_files, where files will be output
Run:
Rscript getNegativeControls.R \
../data_Zenodo/our_ukbb_traits_description.tsv \
../data_Zenodo/pascal_out \
../data/
Script:
bin/GenePlexus/example_run.py
Purpose: Runs GenePlexus on a trait of interest and outputs the results. This project used ConsensusPathDB, Adjacency, and DisGeNet for its final results.
Arguments:
-i : Disease seed genes
-j : Job name
-n : Network, options are BioGRID, STRING-EXP, STRING, ConsensusPathDB
-f : Features, options are Embedding, Adjacency, Influence
-g : GSC type, options are GO or DisGeNet
-s : Output directory
-fl : Option for how to run, always use local for this project
Run:
python example_run.py \
-i ../../data/disease_gene_files/Chronic_Obstructive_Airway_Disease.txt \
-j Chronic_Obstructive_Airway_Disease--ConsensusPathDB--Adjacency--DisGeNet \
-n ConsensusPathDB \
-f Adjacency \
-g DisGeNet \
-s ../../results/GenePlexus_output/ \
-fl local
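In the example above, the job name joins the disease, network, features, and GSC with double hyphens. When running several combinations, a small helper can assemble the arguments consistently. This is a sketch under that naming assumption, using only the flags and option values listed above. The function geneplexus_command is hypothetical and is not part of GenePlexus.

```python
import shlex

# Option values copied from the argument list in this readme.
NETWORKS = {"BioGRID", "STRING-EXP", "STRING", "ConsensusPathDB"}
FEATURES = {"Embedding", "Adjacency", "Influence"}
GSCS = {"GO", "DisGeNet"}


def geneplexus_command(seed_file, disease, network, features, gsc, out_dir):
    """Build the example_run.py argument list for one disease and setting."""
    if network not in NETWORKS:
        raise ValueError(f"unknown network: {network}")
    if features not in FEATURES:
        raise ValueError(f"unknown features: {features}")
    if gsc not in GSCS:
        raise ValueError(f"unknown GSC type: {gsc}")
    # Job name pattern inferred from the example run above.
    job_name = "--".join([disease, network, features, gsc])
    return [
        "python", "example_run.py",
        "-i", seed_file,
        "-j", job_name,
        "-n", network,
        "-f", features,
        "-g", gsc,
        "-s", out_dir,
        "-fl", "local",
    ]


cmd = geneplexus_command(
    "../../data/disease_gene_files/Chronic_Obstructive_Airway_Disease.txt",
    "Chronic_Obstructive_Airway_Disease",
    "ConsensusPathDB",
    "Adjacency",
    "DisGeNet",
    "../../results/GenePlexus_output/",
)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run from the directory that holds example_run.py.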
Script:
summarizeGeneplexusPredictions.R
Purpose: Returns multiple figures showing results for the network combination, along with a summarized Rdata file that has pertinent disease results used in later parts of the pipeline
Arguments:
- Path to directory with GenePlexus predictions
- Output directory
- Average cv threshold
Run:
Rscript summarizeGeneplexusPredictions.R \
../results/GenePlexus_output/ \
../results/GenePlexus_parameters \
1.0
Script:
filterAndClusterGeneplexusPredictions.R
Purpose: Takes the GenePlexus predictions and assigns genes to clusters for each disease.
Arguments:
- GenePlexus prediction path
- Prediction threshold, either mccf1 or a number < 1
- Path to igraph object containing network for clustering
- Leiden algorithm partition type
- Resolution parameter
- GenePlexus results path
- True/False, Is the network weighted?
Run:
Rscript filterAndClusterGeneplexusPredictions.R \
../results/GenePlexus_output/Chronic_Obstructive_Airway_Disease--ConsensusPathDB--Adjacency--DisGeNet--predictions.tsv \
0.8 \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
ModularityVertexPartition \
0.1 \
../results/prediction_clusters_same_graph \
TRUE
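The threshold argument accepts either the string mccf1 or a number below 1. The sketch below shows one way to interpret that argument and apply a numeric cutoff before clustering. It is illustrative only: the (gene, probability) pair layout, the inclusive comparison, and both function names are assumptions. It does not cover the mccf1 option or the Leiden clustering step.

```python
def parse_threshold(arg):
    """Interpret the threshold argument: the string 'mccf1' or a number < 1."""
    if arg == "mccf1":
        return "mccf1"
    value = float(arg)
    if value >= 1:
        raise ValueError(f"threshold must be 'mccf1' or a number < 1, got {arg!r}")
    return value


def genes_above(predictions, threshold):
    """Keep genes whose predicted probability is at or above the threshold.

    predictions is a list of (gene, probability) pairs. Whether the real
    script uses >= or > is not stated in the source; >= is an assumption.
    """
    return [gene for gene, prob in predictions if prob >= threshold]


# Made-up predictions, used only to show the filtering step.
example_predictions = [("GENE_A", 0.95), ("GENE_B", 0.80), ("GENE_C", 0.20)]
print(genes_above(example_predictions, parse_threshold("0.8")))
```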
Script:
clusterInflammationGenes.R
Purpose: Assigns the inflammation genes to clusters
Arguments:
- Path to inflammation genes
- Path to igraph object that has network for clustering
- Partition type
- Resolution parameter
- Results path
- True/False, Is the network weighted?
Run:
Rscript clusterInflammationGenes.R \
../data/disease_gene_files/chronic_inflammatory_response_GO2ALLEGS.txt \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
ModularityVertexPartition \
0.1 \
../results/prediction_clusters_same_graph/ \
TRUE
Script:
clusterRandomGenes.R
Purpose: Takes the 5000 fake traits that were randomly generated from a disease and assigns the genes to clusters
Arguments:
- Path to data containing all fake traits generated
- Disease of interest
- Path to igraph object containing network for clustering
- Partition type
- Resolution parameter
- Results path
- True/False, Is the network weighted?
Run:
Rscript clusterRandomGenes.R \
../data_Zenodo/5000Expandedfaketraits_ConsensusPathDB.tsv \
Chronic_Obstructive_Airway_Disease \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
ModularityVertexPartition \
0.1 \
../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB \
TRUE
Script:
find_GOBP_enriched_clusters_GenePlexus.R
Purpose: Finds GOBPs that are enriched in each cluster of a disease
Arguments:
- Path to cluster file
- Background genes from network
- Output directory
Run:
Rscript find_GOBP_enriched_clusters_GenePlexus.R \
../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB/Chronic_Obstructive_Airway_Disease--threshold--0.8--PredictionGraph--ConsensusPathDB--ClusterGraph--ConsensusPathDB_clusters.csv \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv \
../results/prediction_clusters_same_graph/GOBP_enrichment
Script:
find_GOBP_enriched_inflammation_clusters.R
Purpose: Finds GOBPs that are enriched in each cluster of the inflammation genes
Arguments:
- Path to cluster file
- Background genes from network
- Output directory
Run:
Rscript find_GOBP_enriched_inflammation_clusters.R \
../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB/Chronic_Obstructive_Airway_Disease--threshold--0.8--PredictionGraph--ConsensusPathDB--ClusterGraph--ConsensusPathDB_clusters.csv \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv \
../results/prediction_clusters_same_graph/GOBP_enrichment
Script:
scoreClusterOverlaps_GenePlexus.R
Purpose: For a disease, outputs a file with the overlap score between each real or fake trait cluster that has >=5 genes and the chronic inflammation genes.
Also outputs a file with the genes shared between each real or fake trait cluster and the chronic inflammation genes.
Arguments:
- Path to folder containing Leiden cluster output files
- Path to chronic inflammation prediction file
- Path to output directory
- Disease of interest
- Chronic inflammation prediction threshold
- List of genes in network the disease genes were clustered on
Run:
Rscript scoreClusterOverlaps_GenePlexus.R \
../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB \
../results/GenePlexus_output/inflammatory_response_GO2EG_expr--ConsensusPathDB--Adjacency--GO--predictions.tsv \
../results/prediction_clusters_same_graph \
Chronic_Obstructive_Airway_Disease \
0.8 \
../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv
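This readme does not say how the overlap score is computed. As a stand-in for illustration only, the sketch below scores each cluster against a reference gene set with a one-sided hypergeometric test. It uses the network gene list as the background and skips clusters with fewer than 5 genes. The function names and toy gene IDs are hypothetical, and the actual script may use a different statistic.

```python
from math import comb

MIN_CLUSTER_SIZE = 5  # clusters need >= 5 genes, as stated in the Purpose


def hypergeom_pvalue(cluster, reference, background):
    """P(overlap >= observed) when drawing len(cluster) genes from background."""
    background = set(background)
    cluster = set(cluster) & background
    reference = set(reference) & background
    n_bg, n_ref, n_cl = len(background), len(reference), len(cluster)
    observed = len(cluster & reference)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_ref, i) * comb(n_bg - n_ref, n_cl - i)
        for i in range(observed, min(n_ref, n_cl) + 1)
    )
    return tail / comb(n_bg, n_cl)


def score_clusters(clusters, reference, background):
    """Score every cluster that meets the minimum size; smaller ones are skipped."""
    return {
        name: hypergeom_pvalue(genes, reference, background)
        for name, genes in clusters.items()
        if len(set(genes)) >= MIN_CLUSTER_SIZE
    }


# Toy data: 10 background genes, 5 of them in the reference set.
toy_background = [f"g{i}" for i in range(10)]
toy_reference = toy_background[:5]
toy_clusters = {"big": toy_background[:5], "small": toy_background[:2]}
print(score_clusters(toy_clusters, toy_reference, toy_background))
```

Restricting every set to the background keeps the test from counting genes that are absent from the network the clusters were built on.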
Script:
filterSignificantOverlaps_GenePlexus.R
Purpose: Filters the FDRs for significant values, outputting the significant clusters and real and fake cluster assignments
Arguments:
- overlap_results.Rdata location
- FDR cutoff
- Output directory
- Path to gene cluster assignment files
Run:
Rscript filterSignificantOverlaps_GenePlexus.R \
../results/prediction_clusters_same_graph/scores/chronic_inflammatory_response_GO2ALLEGS_thresh=0.8_predicted_with_ConsensusPathDB_clusteredOn_ConsensusPathDB_overlap_results.Rdata \
.05 \
../results/prediction_clusters_same_graph/ \
../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB
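The second argument is the FDR cutoff. For readers unfamiliar with the idea, the sketch below shows a standard Benjamini-Hochberg adjustment followed by a cutoff filter. It is a generic illustration, not the code in this script: the source says only that FDRs are filtered, so how they are computed, the strict < comparison, and the function names are all assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_clusters(fdr_by_cluster, cutoff=0.05):
    """Names of clusters whose FDR falls below the cutoff (strict < assumed)."""
    return sorted(name for name, fdr in fdr_by_cluster.items() if fdr < cutoff)


# Made-up p-values for four hypothetical clusters.
names = ["c1", "c2", "c3", "c4"]
raw = [0.01, 0.04, 0.03, 0.5]
fdrs = dict(zip(names, benjamini_hochberg(raw)))
print(significant_clusters(fdrs, cutoff=0.05))
```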
Script:
prepForClusterSaverunner.R
Purpose: Sets up files for running SAveRUNNER with cluster genes. This instance is stored in data_Zenodo/prediction_clusters_same_graph/SAveRUNNER
Arguments:
- SAveRUNNER input directory
- Path to interactome edgelist
- Path to "final for alex" file with significant clusters
- Path to gene cluster assignments
Run:
Rscript prepForClusterSaverunner.R \
../data_Zenodo/prediction_clusters_same_graph/SAveRUNNER/code/input_files \
../data_Zenodo/prediction_clusters_same_graph/SAveRUNNER/code/input_files/interactome.txt \
../data_Zenodo/prediction_clusters_same_graph/chronic_inflammation_gene_shot_pubs_greater10--predictedWith--ConsensusPathDB--clusteredOn--ConsensusPathDB_final_for_alex.csv \
../data_Zenodo/prediction_clusters_same_graph/chronic_inflammation_gene_shot_pubs_greater10--predictedWith--ConsensusPathDB--clusteredOn--ConsensusPathDB_relevant_gene_cluster_assigments.csv
Drugs were obtained using the SAveRUNNER software, located at https://github.com/sportingCode/SAveRUNNER.
The instance is stored in data_Zenodo/drugs/SAveRUNNER.
The figures directory has R Markdown scripts that recreate the figures (including supplemental) in the paper.
The relevant files from DrugCentral are included in data/drugcentral. They came from a local PostgreSQL instance of the DrugCentral database, which can be obtained from DrugCentral.
Relevant files from DGIdb are included in data/dgidb.
Script:
data/drugcentral/getDrugCentralEntrez.R
Purpose: This script takes tables from DrugCentral and returns a mapping of Drugs and Entrez targets for humans. This output is already provided.
Run:
Rscript getDrugCentralEntrez.R