Benchmarking GRN inference methods

The full documentation is hosted on ReadTheDocs.
Path to source:
src
You need to have Docker, Java, and Viash installed. Follow these instructions to install the required dependencies.
git clone git@github.com:openproblems-bio/task_grn_inference.git
cd task_grn_inference
# download resources
scripts/download_resources.sh
viash run src/methods/dummy/config.vsh.yaml -- --multiomics_rna resources/grn-benchmark/multiomics_rna.h5ad --multiomics_atac resources/grn-benchmark/multiomics_atac.h5ad --prediction output/dummy.csv
Similarly, run the command for other methods.
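To run several methods in a row, the command above can be assembled programmatically. The following is a minimal sketch, assuming that every method lives at `src/methods/<name>/config.vsh.yaml` like the dummy method does; the helper name `build_viash_command` and its parameters are hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath
from typing import List


def build_viash_command(method: str,
                        resources: str = "resources/grn-benchmark",
                        output_dir: str = "output") -> List[str]:
    """Return the argv list for running one method through `viash run`.

    Assumes the method's config sits at src/methods/<method>/config.vsh.yaml.
    """
    config = PurePosixPath("src/methods") / method / "config.vsh.yaml"
    return [
        "viash", "run", str(config), "--",
        "--multiomics_rna", f"{resources}/multiomics_rna.h5ad",
        "--multiomics_atac", f"{resources}/multiomics_atac.h5ad",
        "--prediction", f"{output_dir}/{method}.csv",
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run(cmd, check=True)
    # to actually execute it.
    print(" ".join(build_viash_command("dummy")))
```

Returning a list instead of a single string avoids shell quoting issues when the command is later passed to `subprocess.run`.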
scripts/benchmark_grn.sh --grn resources/grn-benchmark/models/collectri.csv
Similarly, run the command for other GRN models.
To add a method to the repository, follow the instructions in the scripts/add_a_method.sh script.
Gene regulatory networks (GRNs) are essential for understanding cellular identity and behavior. They are simplified models of gene expression, which is regulated by complex processes involving multiple layers of control, from transcription to post-transcriptional modifications, and incorporating various regulatory elements and non-coding RNAs. Gene transcription is controlled by a regulatory complex that includes transcription factors (TFs), cis-regulatory elements (CREs) such as promoters and enhancers, and essential co-factors. High-throughput datasets, covering thousands of genes, facilitate the use of machine learning approaches to decipher GRNs. The advent of single-cell sequencing technologies, such as scRNA-seq, has made it possible to infer GRNs from a single experiment, owing to the abundance of samples. This allows researchers to infer condition-specific GRNs, such as for different cell types or diseases, and to study potential regulatory factors associated with these conditions. Combining chromatin accessibility data with gene expression measurements has led to the development of enhancer-driven GRN (eGRN) inference pipelines, which offer significantly improved accuracy over single-modality methods.
Here, we present a dynamic benchmark platform for GRN inference. This platform provides curated datasets for GRN inference and evaluation, standardized evaluation protocols and metrics, computational infrastructure, and a dynamically updated leaderboard to track state-of-the-art methods. It runs novel GRNs in the cloud, offers competition scores, and stores them for future comparisons, reflecting new developments over time.
The platform supports the integration of new datasets and protocols. When a new feature is added, previously evaluated GRNs are re-assessed, and the leaderboard is updated accordingly. The aim is to evaluate both the accuracy and completeness of inferred GRNs. It is designed for both single-modality and multi-omics GRN inference. Ultimately, it is a community-driven platform. So far, six eGRN inference methods have been integrated: Scenic+, CellOracle, FigR, scGLUE, GRaNIE, and ANANSE.
Due to its flexible nature, the platform can incorporate various benchmark datasets and evaluation methods, using either prior knowledge or feature-based approaches. In the current version, due to the absence of standardized prior knowledge, we use a feature-based approach to benchmark GRNs. Our evaluation utilizes standardized datasets for GRN inference and evaluation, employing multiple regression analysis approaches to assess both accuracy and comprehensiveness.
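The idea of a feature-based, regression-driven evaluation can be illustrated with a small sketch. The exact protocol is not described in this section, so the code below rests on explicit assumptions: each target gene's expression is regressed on the expression of its predicted regulators with ridge regression, and the fit is summarized by an in-sample R² (no cross-validation). The function names `solve_linear`, `ridge_r2` and `score_grn` are hypothetical, and the code uses only the standard library.

```python
from typing import Dict, List, Sequence, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_r2(features: Sequence[Sequence[float]],
             target: Sequence[float],
             alpha: float = 1.0) -> float:
    """In-sample R^2 of a ridge fit (alpha > 0) on centered data."""
    n, p = len(target), len(features[0])
    x_mean = [sum(row[j] for row in features) / n for j in range(p)]
    y_mean = sum(target) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in features]
    yc = [y - y_mean for y in target]
    ss_tot = sum(y * y for y in yc)
    if ss_tot == 0.0:
        return 0.0  # constant target: nothing to explain
    # Normal equations: (X^T X + alpha * I) beta = X^T y
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(p)]
    beta = solve_linear(gram, rhs)
    ss_res = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, y in zip(xc, yc))
    return 1.0 - ss_res / ss_tot


def score_grn(links: Sequence[Tuple[str, str]],
              expression: Dict[str, List[float]],
              alpha: float = 1.0) -> Dict[str, float]:
    """Map each target gene to the R^2 obtained from its predicted regulators.

    `links` holds (source, target) pairs; `expression` maps a gene name to its
    values across samples. Both layouts are assumptions made for this sketch.
    """
    regulators: Dict[str, List[str]] = {}
    for source, target in links:
        if source in expression and target in expression and source != target:
            known = regulators.setdefault(target, [])
            if source not in known:
                known.append(source)
    scores = {}
    for target, tfs in regulators.items():
        features = list(zip(*(expression[tf] for tf in tfs)))
        scores[target] = ridge_r2(features, expression[target], alpha)
    return scores
```

Under these assumptions, a GRN whose regulators explain more of the variation in their targets receives higher per-gene scores.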
name | roles |
---|---|
Jalil Nourisa | author |
Robrecht Cannoodt | author |
Antoine Passimier | contributor |
Christian Arnold | contributor |
Marco Stock | contributor |
flowchart LR
file_multiomics_rna_h5ad("multiomics rna")
comp_method[/"Method"/]
file_prediction("GRN")
comp_metric[/"Label"/]
file_score("Score")
file_multiomics_atac_h5ad("multiomics atac")
file_perturbation_h5ad("perturbation")
comp_control_method[/"Control Method"/]
comp_method_r[/"Method r"/]
file_multiomics_rna_h5ad---comp_method
comp_method-->file_prediction
file_prediction---comp_metric
comp_metric-->file_score
file_multiomics_atac_h5ad---comp_method
file_perturbation_h5ad---comp_metric
comp_control_method-->file_prediction
comp_method_r-->file_prediction
RNA expression for multiomics data.
Example file: resources_test/grn-benchmark/multiomics_rna.h5ad
Format:
AnnData object
obs: 'cell_type', 'donor_id'
Slot description:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
obs["donor_id"] | string | Donor id. |
Path:
src/methods
A GRN inference method
Arguments:
Name | Type | Description |
---|---|---|
--multiomics_rna | file | (Optional) RNA expression for multiomics data. Default: resources/grn-benchmark/multiomics_rna.h5ad . |
--multiomics_atac | file | (Optional) Peak data for multiomics data. Default: resources/grn-benchmark/multiomics_atac.h5ad . |
--prediction | file | (Optional, Output) GRN prediction. Default: output/prediction.csv . |
--temp_dir | string | (Optional) NA. Default: output/temdir . |
--num_workers | integer | (Optional) NA. Default: 4 . |
--tf_all | file | (Optional) NA. Default: resources/prior/tf_all.csv . |
--max_n_links | integer | (Optional) NA. Default: 50000 . |
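The `--max_n_links` argument has no description here. One plausible reading is that it caps the number of links in a prediction. The sketch below illustrates that reading only: it assumes the cap keeps the links with the largest absolute weight, which is an assumption and not confirmed behavior, and `keep_top_links` is a hypothetical helper.

```python
from typing import List, Tuple

Link = Tuple[str, str, float]  # (source, target, weight)


def keep_top_links(links: List[Link], max_n_links: int = 50000) -> List[Link]:
    """Keep at most `max_n_links` links, preferring larger absolute weights.

    Ranking by |weight| is an assumption made for this illustration.
    """
    ranked = sorted(links, key=lambda link: abs(link[2]), reverse=True)
    return ranked[:max_n_links]
```

Ranking by absolute value treats strong repression and strong activation as equally informative; a method that only outputs non-negative weights is unaffected by that choice.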
GRN prediction
Example file: resources_test/grn_models/collectri.csv
Format:
Tabular data
'source', 'target', 'weight'
Slot description:
Column | Type | Description |
---|---|---|
source | string | Source of regulation. |
target | string | Target of regulation. |
weight | float | Weight of regulation. |
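As an illustration of this format, the following sketch writes and reads a prediction table with the three columns listed above, using only the standard library. The helper names `write_prediction` and `read_prediction` are hypothetical; extra columns in a file (for example an index column) are tolerated by the reader.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

COLUMNS = ["source", "target", "weight"]
Link = Tuple[str, str, float]


def write_prediction(links: Iterable[Link], handle: TextIO) -> None:
    """Write links as CSV with a source,target,weight header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    for source, target, weight in links:
        writer.writerow([source, target, float(weight)])


def read_prediction(handle: TextIO) -> List[Link]:
    """Read a prediction CSV and check that the required columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [(row["source"], row["target"], float(row["weight"]))
            for row in reader]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_prediction([("TF1", "GENE1", 0.5)], buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as recommended by the `csv` module documentation.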
Path:
src/metrics
A metric to evaluate the performance of the inferred GRN
Arguments:
Name | Type | Description |
---|---|---|
--perturbation_data | file | (Optional) Perturbation dataset for benchmarking. Default: resources/grn-benchmark/perturbation_data.h5ad . |
--prediction | file | GRN prediction. |
--score | file | (Optional, Output) File indicating the score of a metric. Default: output/score.h5ad . |
--reg_type | string | (Optional) Name of the regression to use. Default: ridge . |
--subsample | integer | (Optional) Number of samples randomly drawn from the perturbation data. Default: -2 . |
--max_workers | integer | (Optional) NA. Default: 4 . |
--method_id | string | (Optional) NA. |
--tf_all | file | (Optional) NA. Default: resources/prior/tf_all.csv . |
--apply_tf | boolean | (Optional) NA. Default: TRUE . |
--clip_scores | boolean | (Optional) Clips the r2 scores for each gene so that they lie within [0, 1]. Default: TRUE . |
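The effect of `--clip_scores` can be sketched as follows. Clipping each gene's r2 to [0, 1] follows the description above; aggregating the per-gene values with a mean is an assumption made for this illustration, and `mean_r2` is a hypothetical helper.

```python
from typing import Dict


def mean_r2(r2_by_gene: Dict[str, float], clip_scores: bool = True) -> float:
    """Average per-gene r2 values, optionally clipping each to [0, 1]."""
    if not r2_by_gene:
        raise ValueError("no per-gene scores given")
    values = list(r2_by_gene.values())
    if clip_scores:
        values = [min(1.0, max(0.0, v)) for v in values]
    return sum(values) / len(values)
```

Because r2 has no lower bound, a single badly predicted gene can dominate an unclipped average; clipping limits the contribution of each gene to the same range.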
File indicating the score of a metric.
Example file: resources_test/scores/score.h5ad
Format:
AnnData object
uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'
Slot description:
Slot | Type | Description |
---|---|---|
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
uns["metric_ids"] | string | One or more unique metric identifiers. |
uns["metric_values"] | double | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |
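The constraints on the score file can be checked with a few lines of code. This is a minimal sketch that works on a plain mapping with the four `uns` keys listed above; `check_score_uns` is a hypothetical helper, and how the mapping is obtained from the `.h5ad` file is left out.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("dataset_id", "method_id", "metric_ids", "metric_values")


def check_score_uns(uns: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a score `uns` mapping (empty if fine)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in uns]
    if problems:
        return problems
    ids, values = uns["metric_ids"], uns["metric_values"]
    # A single metric may be stored as a bare string or number.
    if isinstance(ids, str):
        ids = [ids]
    if isinstance(values, (int, float)):
        values = [values]
    if len(ids) != len(values):
        problems.append(
            f"metric_ids has {len(ids)} entries but metric_values has {len(values)}"
        )
    return problems
```

With the `anndata` package installed, the mapping could be taken from `anndata.read_h5ad(path).uns`, which is dict-like.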
Peak data for multiomics data.
Example file: resources_test/grn-benchmark/multiomics_atac.h5ad
Format:
AnnData object
obs: 'cell_type', 'donor_id'
Slot description:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
obs["donor_id"] | string | Donor id. |
Perturbation dataset for benchmarking.
Example file: resources_test/grn-benchmark/perturbation_data.h5ad
Format:
AnnData object
obs: 'cell_type', 'sm_name', 'donor_id', 'plate_name', 'row', 'well', 'cell_count'
layers: 'n_counts', 'pearson', 'lognorm'
Slot description:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
obs["sm_name"] | string | The primary name for the (parent) compound (in a standardized representation) as chosen by LINCS. This is provided to map the data in this experiment to the LINCS Connectivity Map data. |
obs["donor_id"] | string | Donor id. |
obs["plate_name"] | string | Plate name (6 levels). |
obs["row"] | string | Row name on the plate. |
obs["well"] | string | Well name on the plate. |
obs["cell_count"] | string | Number of single cells pseudobulked. |
layers["n_counts"] | double | Pseudobulked values using mean approach. |
layers["pearson"] | double | (Optional) Normalized values using Pearson residuals. |
layers["lognorm"] | double | (Optional) Normalized values using shifted logarithm. |
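The `n_counts` and `lognorm` layers can be illustrated with a small sketch. It assumes that pseudobulking by the "mean approach" averages the counts of all cells that share a grouping key (for example a well and cell type), and it simplifies the shifted logarithm to log(1 + x) without any size-factor scaling. Both points are assumptions, and the function names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def pseudobulk_mean(counts: Sequence[Sequence[float]],
                    groups: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Average per-cell count vectors within each group.

    `counts[i]` is the gene count vector of cell i and `groups[i]` its group key.
    """
    sums: Dict[Hashable, List[float]] = {}
    sizes: Dict[Hashable, int] = {}
    for row, group in zip(counts, groups):
        if group not in sums:
            sums[group] = [0.0] * len(row)
            sizes[group] = 0
        for j, value in enumerate(row):
            sums[group][j] += value
        sizes[group] += 1
    return {g: [s / sizes[g] for s in total] for g, total in sums.items()}


def shifted_log(values: Sequence[float]) -> List[float]:
    """Apply log(1 + x) element-wise (simplified shifted logarithm)."""
    return [math.log1p(v) for v in values]
```

The group sizes collected along the way correspond to the number of single cells pseudobulked, as stored in `obs["cell_count"]`.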
Path:
src/control_methods
A control method.
Arguments:
Name | Type | Description |
---|---|---|
--layer | string | (Optional) Which layer of the perturbation data to use to find TF-gene relationships. Default: scgen_pearson . |
--prediction | file | (Optional, Output) GRN prediction. |
--tf_all | file | NA. |
Path:
src/methods_r
A GRN inference method
Arguments:
Name | Type | Description |
---|---|---|
--multiomics_rna_r | file | (Optional) NA. |
--multiomics_atac_r | file | (Optional) NA. |
--prediction | file | (Optional, Output) GRN prediction. |
--temp_dir | string | (Optional) NA. Default: output/temdir . |
--num_workers | integer | (Optional) NA. Default: 4 . |