This is the General Clustering Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.
This pipeline clusters the columns of a spreadsheet using one of the following methods:
Options | Method | Parameters |
---|---|---|
K-means | K Means | kmeans |
hierarchical clustering | hierarchical clustering | hclust |
Linked hierarchical clustering | hierarchical clustering constraint | link_hclust |
Bootstrapped hierarchical clustering | consensus hierarchical clustering | cc_hclust |
Bootstrapped K-means | consensus K Means | cc_kmeans |
Bootstrapped Linked hierarchical clustering | consensus linked hierarchical clustering | cc_link_hclust |
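The bootstrapped (consensus) variants can be pictured with the sketch below. It shows the general idea behind consensus clustering, namely counting how often two items land in the same cluster across the bootstrap runs in which both were sampled. It is not taken from the pipeline's source; the function name `consensus_matrix` and the input layout are assumptions made for illustration.

```python
from itertools import combinations

def consensus_matrix(n_items, bootstrap_runs):
    """Hypothetical helper: fraction of shared runs in which two items co-cluster.

    bootstrap_runs is a list of {item_index: cluster_label} dicts, one per
    bootstrap, containing only the items sampled in that bootstrap.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for run in bootstrap_runs:
        for i, j in combinations(sorted(run), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if run[i] == run[j]:
                together[i][j] += 1
                together[j][i] += 1
    matrix = []
    for i in range(n_items):
        row = []
        for j in range(n_items):
            if i == j:
                row.append(1.0)  # an item always agrees with itself
            elif sampled[i][j]:
                row.append(together[i][j] / sampled[i][j])
            else:
                row.append(0.0)  # the pair was never sampled together
        matrix.append(row)
    return matrix

# Made-up labels from three bootstrap runs over three items.
runs = [{0: 0, 1: 0, 2: 1}, {0: 1, 1: 1}, {1: 0, 2: 0}]
consensus = consensus_matrix(3, runs)
```

A final clustering is then typically obtained by clustering this matrix itself; how the pipeline does that step is not described here.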
git clone https://github.com/KnowEnG-Research/General_Clustering_Pipeline.git
apt-get install -y python3-pip libfreetype6-dev libxft-dev libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install pyyaml knpackage scipy==0.19.1 numpy==1.11.1 pandas==0.18.1 matplotlib==1.4.2 scikit-learn==0.17.1
cd General_Clustering_Pipeline
cd test
make env_setup
Command | Option |
---|---|
make run_kmeans_binary | Clustering with k-means |
make run_kmeans_continuous | |
make run_hclust_binary | Hierarchical Clustering |
make run_hclust_continuous | |
make run_link_hclust_binary | Hierarchical linkage Clustering |
make run_link_hclust_continuous | |
make run_cc_kmeans_binary | Consensus Clustering with k-means |
make run_cc_kmeans_continuous | |
make run_cc_hclust_binary | Consensus Hierarchical Clustering |
make run_cc_hclust_continuous | |
make run_cc_link_hclust_binary | Consensus Hierarchical linkage Clustering |
Follow steps 1-5 above then do the following:
mkdir run_dir
cd run_dir
mkdir results
Example run_parameters files are in General_Clustering_Pipeline/data/run_files, for example zTEMPLATE_cc_hclust.yml.
Set processing_method to serial or parallel, depending on your machine.
processing_method: serial
Set the data file targets to the files you want to run, and set the parameters as appropriate for your data.
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run
python3 ../src/general_clustering.py -run_directory ./run_dir -run_file zTEMPLATE_cc_hclust.yml
Key | Value | Comments |
---|---|---|
method | kmeans, hclust, link_hclust, cc_kmeans, cc_hclust, cc_link_hclust | Choose clustering method |
affinity_metric | euclidean, manhattan, jaccard | Choose clustering affinity |
linkage_criterion | ward, complete, average | Choose clustering linkage criterion |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
results_directory | directory | Directory to save the output files |
tmp_directory | ./run_dir/tmp | Directory to save the temporary files |
number_of_clusters | 3 | Estimated number of clusters |
number_of_bootstraps | 4 | Number of bootstraps for cc_kmeans, cc_hclust and cc_link_hclust |
rows_sampling_fraction | 0.8 | Select 80% of spreadsheet rows |
cols_sampling_fraction | 0.8 | Select 80% of spreadsheet columns |
top_number_of_rows | 10 | Top number of features to analyze |
processing_method | serial or parallel or distribute | Choose processing method |
parallelism | number of cores | Set number of cores for speed or memory |
threshold | 10 | Threshold to define categorical data and continuous data in evaluation toolbox |
nearest_neighbors | 10 | Number of Nearest Neighbors in cc_link_hclust method |
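As a rough illustration of how these keys fit together, the sketch below checks a parameter dictionary against the values listed in the table and renders it as flat `key: value` lines. The helper names (`check_run_parameters`, `format_run_file`) are hypothetical, the spreadsheet path is a placeholder, and the checks cover only what the table implies rather than the pipeline's own validation.

```python
# Allowed values copied from the table above; helper names are hypothetical.
ALLOWED_VALUES = {
    "method": {"kmeans", "hclust", "link_hclust",
               "cc_kmeans", "cc_hclust", "cc_link_hclust"},
    "affinity_metric": {"euclidean", "manhattan", "jaccard"},
    "linkage_criterion": {"ward", "complete", "average"},
    "processing_method": {"serial", "parallel", "distribute"},
}

def check_run_parameters(params):
    """Return a list of problems found (empty when nothing looks wrong)."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key}={params[key]!r} is not one of {sorted(allowed)}")
    for key in ("rows_sampling_fraction", "cols_sampling_fraction"):
        if key in params and not 0 < params[key] <= 1:
            problems.append(f"{key} should be greater than 0 and at most 1")
    return problems

def format_run_file(params):
    """Render flat `key: value` lines, readable as a simple YAML mapping."""
    return "".join(f"{key}: {value}\n" for key, value in params.items())

run_parameters = {
    "method": "cc_hclust",
    "affinity_metric": "euclidean",
    "linkage_criterion": "ward",
    # Placeholder path: point this at your own spreadsheet.
    "spreadsheet_name_full_path": "path/to/EXPR_GSE_METABRIC_lymphN_binary.tsv.gz",
    "results_directory": "./run_dir/results",
    "tmp_directory": "./run_dir/tmp",
    "number_of_clusters": 3,
    "number_of_bootstraps": 4,
    "rows_sampling_fraction": 0.8,
    "cols_sampling_fraction": 0.8,
    "top_number_of_rows": 10,
    "processing_method": "serial",
    "threshold": 10,
    "nearest_neighbors": 10,
}

problems = check_run_parameters(run_parameters)
run_file_text = format_run_file(run_parameters)
```

Writing `run_file_text` to a `.yml` file in the run directory would give a starting point to compare against the templates in data/run_files.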
spreadsheet_name = EXPR_GSE_METABRIC_lymphN_binary.tsv.gz
- All methods save the per-row variance of the row-by-column heatmap in a file named row_variance_{method}_{timestamp}_viz.tsv.
| | variance |
---|---|
row 1 | float |
... | ... |
row m | float |
- All methods save the row-by-column heatmap in a file named row_by_col_heatmp_{method}_{timestamp}_viz.tsv.
| | col 1 | ... | col n |
---|---|---|---|
row 1 | float | ... | float |
... | ... | ... | ... |
row m | float | ... | float |
- All methods save the column-to-cluster map in a file named col_labeled_by_cluster_{method}_{timestamp}_viz.tsv.
| | cluster |
---|---|
col 1 | int |
... | ... |
col n | int |
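A file in this layout can be turned into per-cluster column lists with a few lines of standard-library Python. The sketch below assumes the file is tab-separated with a header row whose first cell is empty, as the table suggests; the function name `columns_by_cluster` and the example content are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def columns_by_cluster(tsv_text):
    """Group column names by their integer cluster label."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row (empty first cell, then "cluster")
    groups = defaultdict(list)
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        name, label = row
        groups[int(label)].append(name)
    return dict(groups)

# Made-up content in the layout shown above.
example = "\tcluster\ncol 1\t0\ncol 2\t1\ncol 3\t0\n"
clusters = columns_by_cluster(example)
```

For a real output file, read its text with `open(path).read()` and pass it to the function.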
- All methods save the row scores for each cluster in a file named row_averages_by_cluster_{method}_{timestamp}_viz.tsv.
| | cluster 1 | ... | cluster k |
---|---|---|---|
row 1 | float | ... | float |
... | ... | ... | ... |
row m | float | ... | float |
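One plausible reading of this file, suggested by its name, is that each entry is the mean of a row's heatmap values over the columns assigned to that cluster. The sketch below computes that under this assumption; the function name `row_averages_by_cluster` and the sample numbers are illustrative only.

```python
from statistics import mean

def row_averages_by_cluster(rows, labels):
    """Average each row over the columns belonging to each cluster.

    rows maps a row name to its list of values, one per column;
    labels gives the cluster label of each column, in the same order.
    """
    clusters = sorted(set(labels))
    averages = {}
    for name, values in rows.items():
        averages[name] = [
            mean(v for v, lab in zip(values, labels) if lab == c)
            for c in clusters
        ]
    return clusters, averages

# Made-up heatmap with four columns split into two clusters.
heatmap = {"row 1": [1.0, 3.0, 5.0, 7.0], "row 2": [0.0, 0.0, 2.0, 4.0]}
labels = [0, 0, 1, 1]
cluster_ids, averages = row_averages_by_cluster(heatmap, labels)
```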
- All methods save a spreadsheet marking the top-ranked rows for each cluster in a file named top_row_by_cluster_{method}_{timestamp}_download.tsv.
| | cluster 1 | ... | cluster k |
---|---|---|---|
row 1 | 1/0 | ... | 1/0 |
... | ... | ... | ... |
row m | 1/0 | ... | 1/0 |
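The 1/0 layout can be reproduced from per-cluster row scores as sketched below. The ranking rule used here (largest score first, keeping `top_n` rows per cluster, in the spirit of top_number_of_rows) is an assumption, and `top_rows_by_cluster` is a hypothetical helper rather than a pipeline function.

```python
def top_rows_by_cluster(scores, top_n):
    """Flag the top_n highest-scoring rows in each cluster with 1, others with 0.

    scores maps a row name to its list of scores, one per cluster.
    """
    names = list(scores)
    n_clusters = len(next(iter(scores.values())))
    flags = {name: [0] * n_clusters for name in names}
    for c in range(n_clusters):
        # Assumed rule: rank rows by descending score within the cluster.
        ranked = sorted(names, key=lambda name: scores[name][c], reverse=True)
        for name in ranked[:top_n]:
            flags[name][c] = 1
    return flags

# Made-up scores for three rows and two clusters.
scores = {"row 1": [2.0, 6.0], "row 2": [0.0, 3.0], "row 3": [5.0, 1.0]}
flags = top_rows_by_cluster(scores, top_n=2)
```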
- All methods save three silhouette scores (overall, per cluster, and per sample) with name silhouette_{method}_{timestamp}_viz.tsv.
- Overall silhouette score file: | number of clusters | silhouette score |
- Per-cluster silhouette score file: | cluster i | corresponding silhouette score |
- Per-sample silhouette score file: | sample i | corresponding silhouette score |
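For readers unfamiliar with the three levels, the sketch below computes them from the standard silhouette definition, s = (b - a) / max(a, b), where a is a sample's mean distance to the other members of its cluster and b is its mean distance to the nearest other cluster. Euclidean distance, the helper names, and the sample points are assumptions for illustration; this is not the pipeline's own code.

```python
from statistics import mean

def euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def silhouette_per_sample(points, labels):
    """Silhouette score of every sample; needs at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        own = [euclidean(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        a = mean(own)
        b = min(
            mean(euclidean(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

# Made-up data: two tight, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

per_sample = silhouette_per_sample(points, labels)
per_cluster = {
    c: mean(s for s, lab in zip(per_sample, labels) if lab == c)
    for c in sorted(set(labels))
}
overall = mean(per_sample)
```

Scores close to 1 indicate samples that sit well inside their own cluster, while scores near 0 or below suggest overlapping or misassigned samples.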