KnowEnG's Gene Prioritization Pipeline

This is the Gene Prioritization Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline ranks the rows of a given spreadsheet, where the spreadsheet's rows correspond to gene labels and its columns correspond to sample labels. The ranking is based on correlating gene expression data (optionally network smoothed) against phenotype data.
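As a rough illustration of correlation-based ranking, the sketch below scores each gene row against a phenotype vector with the Pearson coefficient and sorts the genes by the absolute value of that score. The function names and the toy data are hypothetical, and sorting by absolute value is an assumption; the pipeline's own scoring may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant row carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def rank_genes(spreadsheet, phenotype):
    """Return gene labels sorted by descending |Pearson r| (an assumed criterion)."""
    scores = {gene: abs(pearson(row, phenotype)) for gene, row in spreadsheet.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: three genes (rows) measured in four samples (columns).
spreadsheet = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 1.0, 2.0, 1.0],
    "gene_c": [4.0, 3.0, 2.0, 0.0],
}
phenotype = [0.1, 0.2, 0.3, 0.4]
ranking = rank_genes(spreadsheet, phenotype)
```

Here `ranking` lists the gene labels from the strongest to the weakest association with the phenotype.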

There are four prioritization methods, each using either Pearson correlation or a t-test as the measure of correlation:

Options	Method	Parameters
Simple Correlation	simple correlation	correlation
Bootstrap Correlation	bootstrap sampling correlation	bootstrap_correlation
Correlation with network regularization	network-based correlation	net_correlation
Bootstrap Correlation with network regularization	bootstrapping with network correlation	bootstrap_net_correlation

Note: every method listed above uses either the Pearson or the t-test correlation measure.
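The bootstrap methods repeat the correlation on random subsets of the spreadsheet columns. A minimal sketch of that sampling step follows, using the run parameter names number_of_bootstraps and cols_sampling_fraction from this README. Sampling without replacement, the fixed seed, and the helper names are assumptions for illustration, and how the pipeline combines the per-sample rankings is not shown.

```python
import random

def sample_columns(n_columns, fraction, rng):
    """Pick a random subset of column indices, without replacement (assumed)."""
    k = max(1, int(round(fraction * n_columns)))
    return sorted(rng.sample(range(n_columns), k))

def bootstrap_column_sets(n_columns, number_of_bootstraps, cols_sampling_fraction, seed=0):
    """Return one sampled list of column indices per bootstrap."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [
        sample_columns(n_columns, cols_sampling_fraction, rng)
        for _ in range(number_of_bootstraps)
    ]

# Five bootstraps, each keeping 90% of ten sample columns.
column_sets = bootstrap_column_sets(
    n_columns=10, number_of_bootstraps=5, cols_sampling_fraction=0.9
)
```

Each entry of `column_sets` could then select the spreadsheet and phenotype columns used for one round of correlation.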

How to run this pipeline with Our data

1. Clone the Gene_Prioritization_Pipeline Repo

git clone https://github.com/KnowEnG/Gene_Prioritization_Pipeline.git

2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.19.1
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage

3. Change directory to Gene_Prioritization_Pipeline

cd Gene_Prioritization_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a gene prioritization option:

Command	Option
make run_pearson	pearson correlation
make run_bootstrap_pearson	bootstrap sampling with pearson correlation
make run_net_pearson	pearson correlation with network regularization
make run_bootstrap_net_pearson	bootstrap pearson correlation with network regularization
make run_t_test	t-test correlation
make run_bootstrap_t_test	bootstrap sampling with t-test correlation
make run_net_t_test	t-test correlation with network regularization
make run_bootstrap_net_t_test	bootstrap t-test correlation with network regularization
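For the t-test options, one plausible reading is that a binary phenotype splits the samples into two groups and each gene is scored by a two-sample t statistic. The sketch below computes Welch's unequal-variance t statistic for one gene. Whether the pipeline uses this exact variant is an assumption, and the function name and toy data are hypothetical.

```python
import math
import statistics

def welch_t_statistic(values, labels):
    """Welch's t statistic comparing samples labelled 1 against samples labelled 0."""
    group_1 = [v for v, label in zip(values, labels) if label == 1]
    group_0 = [v for v, label in zip(values, labels) if label == 0]
    # statistics.variance is the sample variance (n - 1 denominator).
    spread = (statistics.variance(group_1) / len(group_1)
              + statistics.variance(group_0) / len(group_0))
    if spread == 0:
        return 0.0  # both groups constant: no usable statistic
    return (statistics.mean(group_1) - statistics.mean(group_0)) / math.sqrt(spread)

# Toy data: one gene across six samples, with a binary phenotype.
gene_values = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
phenotype = [1, 1, 1, 0, 0, 0]
t_value = welch_t_statistic(gene_values, phenotype)
```

Each group needs at least two samples, since the sample variance is undefined otherwise.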

How to run this pipeline with Your data

Follow steps 1-3 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in ./Gene_Prioritization_Pipeline/data/run_files/zTEMPLATE_GP_BENCHMARKS.yml

* Modify run_parameters file (YAML Format)

Set the spreadsheet, network and phenotype data file names to point to your data

* Run the Gene Prioritization Pipeline:

Update PYTHONPATH environment variable

export PYTHONPATH='../src':$PYTHONPATH

Run (in test directory with env_setup as described above)

python3 ../src/gene_prioritization.py -run_directory ./run_dir -run_file zTEMPLATE_GP_BENCHMARKS.yml
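The steps above that create and modify the run_parameters file can also be scripted. The sketch below writes a flat "key: value" file, which is valid YAML for simple scalar values, using the keys described in this README. The file paths, the output file name, and the helper function are placeholders, not files shipped with the repository.

```python
import os
import tempfile

def write_run_parameters(path, parameters):
    """Write a flat mapping as 'key: value' lines (valid YAML for simple scalars)."""
    with open(path, "w") as handle:
        for key, value in parameters.items():
            handle.write(f"{key}: {value}\n")

# Placeholder paths: replace them with the locations of your own files.
parameters = {
    "method": "bootstrap_net_correlation",
    "correlation_measure": "pearson",
    "gg_network_name_full_path": "/path/to/my_network.edge",
    "spreadsheet_name_full_path": "/path/to/my_spreadsheet.df",
    "phenotype_name_full_path": "/path/to/my_phenotype.txt",
    "results_directory": "./results_directory",
    "number_of_bootstraps": 5,
    "cols_sampling_fraction": 0.9,
    "rwr_max_iterations": 100,
    "rwr_convergence_tolerence": 1.0e-2,
    "rwr_restart_probability": 0.5,
    "top_beta_of_sort": 100,
    "top_gamma_of_sort": 50,
    "max_cpu": 4,
}

# Written to a temporary directory here; in practice use your run directory.
run_file = os.path.join(tempfile.mkdtemp(), "my_run_file.yml")
write_run_parameters(run_file, parameters)
```

The key rwr_convergence_tolerence is spelled as it appears in the parameter description, since the pipeline presumably looks it up by that name.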

Description of "run_parameters" file

Key	Value	Comments
method	correlation or net_correlation or bootstrap_correlation or bootstrap_net_correlation	Choose gene prioritization method
correlation_measure	pearson or t_test	Choose correlation measure method
gg_network_name_full_path	directory+gg_network_name	Path and file name of the 4 col network file
spreadsheet_name_full_path	directory+spreadsheet_name	Path and file name of user supplied gene sets
phenotype_name_full_path	directory+phenotype_response	Path and file name of user supplied phenotype response file
results_directory	directory	Directory to save the output files
number_of_bootstraps	5	Number of random samplings
cols_sampling_fraction	0.9	Select 90% of spreadsheet columns
rwr_max_iterations	100	Maximum number of iterations without convergence in random walk with restart
rwr_convergence_tolerence	1.0e-2	Frobenius norm tolerance of spreadsheet vector in random walk
rwr_restart_probability	0.5	alpha in `V_(n+1) = alpha * N * Vn + (1-alpha) * Vo`
top_beta_of_sort	100	Number of top genes selected
top_gamma_of_sort	50	Number of top genes reported
max_cpu	4	Maximum number of processors to use in the parallel correlation section

gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = CCLE_Expression_ensembl.df
phenotype_name = CCLE_drug_ec50_cleaned_NAremoved_pearson.txt
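The three rwr_* parameters control the random walk with restart used for network smoothing. The sketch below iterates the update `V_(n+1) = alpha * N * Vn + (1-alpha) * Vo` exactly as written in the table, on a single vector and a small dense matrix. The function name, the toy network, and the plain-list representation are assumptions; the pipeline itself presumably works on full spreadsheets with matrix libraries.

```python
import math

def random_walk_with_restart(network, restart_vector, restart_probability=0.5,
                             max_iterations=100, tolerance=1.0e-2):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0 until the change is small."""
    alpha = restart_probability
    current = list(restart_vector)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        updated = [
            alpha * sum(n_ij * v_j for n_ij, v_j in zip(row, current)) + (1 - alpha) * v0_i
            for row, v0_i in zip(network, restart_vector)
        ]
        # For a single vector the Frobenius norm is the Euclidean norm.
        change = math.sqrt(sum((u - c) ** 2 for u, c in zip(updated, current)))
        current = updated
        if change < tolerance:
            break
    return current, iteration

# Toy column-normalized network of three fully connected genes.
network = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
restart_vector = [1.0, 0.0, 0.0]
smoothed, iterations_used = random_walk_with_restart(network, restart_vector)
```

With a column-normalized network the entries of the smoothed vector keep the same total as the restart vector, while signal spreads from the first gene to its neighbours.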

Description of Output files saved in results directory

Every method saves a separate file per phenotype, named {phenotype}_{method}_{correlation_measure}_{timestamp}_viz.tsv. Genes are sorted in descending order of visualization_score.

Response	Gene_ENSEMBL_ID	quantitative_sorting_score	visualization_score	baseline_score
phenotype 1	gene 1	float	float	float
...	...	...	...	...
phenotype 1	gene n	float	float	float
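A file with this layout can be read with Python's standard csv module. The sketch below parses tab-separated text with the five columns shown above and re-sorts the rows by visualization_score. The sample values are made up for illustration, and the helper name is hypothetical.

```python
import csv
import io

# Toy content with the column layout shown above; the values are made up.
viz_text = (
    "Response\tGene_ENSEMBL_ID\tquantitative_sorting_score\t"
    "visualization_score\tbaseline_score\n"
    "phenotype_1\tgene_1\t0.2\t0.4\t0.2\n"
    "phenotype_1\tgene_2\t0.7\t1.0\t0.7\n"
    "phenotype_1\tgene_3\t0.5\t0.8\t0.5\n"
)

SCORE_COLUMNS = ("quantitative_sorting_score", "visualization_score", "baseline_score")

def read_viz_rows(handle):
    """Parse a *_viz.tsv style file into dicts with float score columns."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        for column in SCORE_COLUMNS:
            row[column] = float(row[column])
    return rows

rows = read_viz_rows(io.StringIO(viz_text))
rows.sort(key=lambda row: row["visualization_score"], reverse=True)
top_genes = [row["Gene_ENSEMBL_ID"] for row in rows]
```

To read a real results file, pass an open file handle instead of the `io.StringIO` wrapper.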

Every method saves the sorted genes for each phenotype in a file named ranked_genes_per_phenotype_{method}_{correlation_measure}_{timestamp}_download.tsv.

Ranking	phenotype 1	phenotype 2	...	phenotype n
1	gene (most significant)	gene (most significant)	...	gene (most significant)
...	...	...	...	...
n	gene (least significant)	gene (least significant)	...	gene (least significant)

Every method saves a spreadsheet of the top ranked genes per phenotype in a file named top_genes_per_phenotype_{method}_{correlation_measure}_{timestamp}_download.tsv.

Genes	phenotype 1	...	phenotype n
gene 1	1/0	...	1/0
...	...	...	...
gene n	1/0	...	1/0
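One way to read the 1/0 entries is as a membership flag: 1 if the gene is among the top ranked genes for that phenotype, 0 otherwise. The sketch below builds such a table from per-phenotype rankings. This interpretation, the `top_n` cut-off, and the function name are assumptions for illustration.

```python
def top_gene_indicators(ranked_genes, top_n):
    """Map each gene to {phenotype: 1 or 0}, where 1 means 'in the top_n'."""
    all_genes = sorted({gene for ranking in ranked_genes.values() for gene in ranking})
    table = {}
    for gene in all_genes:
        table[gene] = {
            phenotype: int(gene in ranking[:top_n])
            for phenotype, ranking in ranked_genes.items()
        }
    return table

# Toy rankings: most significant gene first, for two phenotypes.
ranked_genes = {
    "phenotype_1": ["gene_b", "gene_a", "gene_c"],
    "phenotype_2": ["gene_c", "gene_b", "gene_a"],
}
indicators = top_gene_indicators(ranked_genes, top_n=2)
```

In this toy case `gene_b` is flagged for both phenotypes, while `gene_a` and `gene_c` are each flagged for one.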

References:

The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity https://www.nature.com/articles/nature11003
