boostDM pipeline

Aim

boostDM is a method that scores single-base substitutions in cancer genes for their potential to drive tumorigenesis. It is described in the following study:

In silico saturation mutagenesis of cancer genes
Ferran Muiños, Francisco Martinez-Jimenez, Oriol Pich, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41586-021-03771-1

The method relies heavily on the Intogen pipeline, which identifies cancer driver genes and infers the mutational features that signal positive selection. The Intogen pipeline is described in the following study:

A compendium of mutational cancer driver genes
Francisco Martínez-Jiménez, Ferran Muiños, Inés Sentís, Jordi Deu-Pons, Iker Reyes-Salazar, Claudia Arnedo-Pac, Loris Mularoni, Oriol Pich, Jose Bonet, Hanna Kranas, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41568-020-0290-x

Current version

https://github.com/bbglab/boostdm-pipeline/releases/tag/2024.07.15-cancer

Resources

There are several public resources related to the boostDM framework:

boostDM website

Intended for exploration of the predictions and explanations resulting from the boostDM pipeline for a collection of models meeting minimum quality criteria. The website is searchable by cancer gene, tumor type, and mutation coordinates.

URL: https://www.intogen.org/boostdm

Intogen website

Intended for exploration of the landscape of mutations and signals of positive selection in driver genes upon analysis of 33,000+ tumor samples (release v2024.06.21). Intogen is instrumental for boostDM as it provides processed data that is used for the training of boostDM models.

URL: https://www.intogen.org

Cancer Genome Interpreter

Computational framework for interpreting cancer genome variants. It is intended to guide clinicians towards optimal treatment decisions, in particular by resolving the implications of variants of unknown significance.

URL: https://www.cancergenomeinterpreter.org

Other resources

GitHub repo containing a collection of scripts and notebooks to generate analyses and figures of the main paper: https://github.com/bbglab/boostdm-analyses
Zenodo repository providing data items generated and used in the main paper and figures: https://zenodo.org/record/4813082

Content

This repo contains the source code to reproduce the training, prediction and post-hoc analysis steps of the boostDM pipeline, starting from the output of the Intogen pipeline.

Prerequisites

HPC environment

It is strongly recommended to run this pipeline in an HPC environment.

Singularity

Two Singularity containers are needed:

boostdm.simg
ensembl-vep_111.0.sif

which must be specified in the nextflow.config file as in the following example:

singularity {
    enabled = true
    cacheDir = "./singularity_images/"
    runOptions = "-B " + env.PIPELINE + "/containers_build:/boostdm"
}

process {
    cpus = 1
    executor = 'slurm'
    queue = 'normal,bigrun'
    errorStrategy = 'ignore'
    withLabel: boostdm {container = "file:///${singularity.cacheDir}/boostdm.simg"}
    withLabel: vep {container = "file:///${singularity.cacheDir}/ensembl-vep_111.0.sif"}
}

boostdm.simg is built from a recipe provided in the boostDM repo https://github.com/bbglab/boostdm-pipeline/blob/master/containers_build/boostdm/Singularity, using the following command line:

singularity build boostdm.simg Singularity

ensembl-vep_111.0.sif can be pulled from https://hub.docker.com/r/ensemblorg, using the following command line:

singularity pull --name ensembl-vep_111.0.sif docker://ensemblorg/ensembl-vep:release_111.0
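
The example nextflow.config resolves both images by file name inside singularity.cacheDir, so a mismatch between the name produced by the build or pull command and the name in the config only surfaces when a process starts. A minimal sketch of a pre-flight check follows; the helper name missing_images is hypothetical, and the default directory is simply the cacheDir value from the example config.

```python
from pathlib import Path

# File names as referenced by the `withLabel` entries in the example config.
EXPECTED_IMAGES = ("boostdm.simg", "ensembl-vep_111.0.sif")


def missing_images(cache_dir="./singularity_images/"):
    """Return the expected image names that are not regular files in cache_dir."""
    root = Path(cache_dir)
    return [name for name in EXPECTED_IMAGES if not (root / name).is_file()]


if __name__ == "__main__":
    for name in missing_images():
        print(f"missing container image: {name}")
```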

Nextflow

The pipeline runs with Nextflow and has been tested with version 20.07.1, which can be installed with conda using the following command line:

conda install -c bioconda nextflow=20.07.1

Running the pipeline

Input

The current release requires the output of Intogen release v2024.06.21. The Intogen pipeline generates the two main input folders, referred to as INTOGEN_DATASETS and BOOSTDM_DATASETS. Check out the Intogen documentation: https://intogen-plus.readthedocs.io/en/latest/index.html.

Config

To run the pipeline it is necessary to specify the paths of the data dependencies in the config file nextflow.config:

env {
    GENOME_BUILD = "hg38"
    INTOGEN_DATASETS = "./intogen_datasets/"
    BOOSTDM_DATASETS = "./boostdm_datasets/"
    VEP_SATURATION = env.INTOGEN_DATASETS + "/steps/boostDM/saturation/"
    PIPELINE = "./boostdm-pipeline-2024/"
    OUTPUT = "./boostdm-output/"
    MAVE_DATA = "./mave_data/"
}

The only dependency that is not provided by Intogen is the MAVE_DATA folder. Check out the full documentation for more details: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
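
Before launching a run it can help to confirm that the configured input folders exist. The sketch below mirrors the paths from the example env block; the helper name missing_dirs is hypothetical, and OUTPUT is left out on the assumption that the pipeline creates it.

```python
from pathlib import Path

# Paths copied from the example env block; adjust them to your own setup.
ENV = {
    "INTOGEN_DATASETS": "./intogen_datasets/",
    "BOOSTDM_DATASETS": "./boostdm_datasets/",
    "PIPELINE": "./boostdm-pipeline-2024/",
    "MAVE_DATA": "./mave_data/",
}
# VEP_SATURATION is derived from INTOGEN_DATASETS, as in the config.
ENV["VEP_SATURATION"] = ENV["INTOGEN_DATASETS"] + "/steps/boostDM/saturation/"


def missing_dirs(env):
    """Return the names of the entries whose path is not an existing directory."""
    return [name for name, path in env.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    for name in missing_dirs(ENV):
        print(f"{name} does not point to an existing directory: {ENV[name]}")
```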

Nextflow run

The pipeline is run with Nextflow DSL=1 and is divided into six steps, which are run separately with the following Nextflow scripts:

01_training.nf
02_discovery.nf
03_model-selection.nf
04_prediction.nf
05_output_plots.nf
06_benchmarks.nf

To run each Nextflow script, use the following command line:

nextflow run <nextflow_script>.nf -resume -profile <profile>

Check out the full documentation for more details about the steps of the pipeline: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
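
Since the six scripts are launched one by one, a small driver can chain them in order. This is a minimal sketch, not part of the pipeline: the helper names build_commands and run_all are hypothetical, the scripts are assumed to sit in the current directory, and "my_profile" stands in for whatever profile your configuration defines.

```python
import subprocess

# Step scripts, in the order listed in this README.
STEPS = [
    "01_training.nf",
    "02_discovery.nf",
    "03_model-selection.nf",
    "04_prediction.nf",
    "05_output_plots.nf",
    "06_benchmarks.nf",
]


def build_commands(profile):
    """Return one `nextflow run` command (as an argument list) per step."""
    return [
        ["nextflow", "run", step, "-resume", "-profile", profile]
        for step in STEPS
    ]


def run_all(profile, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Execution stops at the first command that exits with a non-zero code.
    """
    for cmd in build_commands(profile):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder profile name; dry_run=True only prints the commands.
    run_all("my_profile", dry_run=True)
```

Note that the example config sets errorStrategy = 'ignore', so a step may exit successfully even if some of its processes failed; checking the Nextflow logs between steps remains advisable.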

Output

The pipeline output is delivered in the following folder tree:

├── benchmarks
│   ├── cv_tables
│   ├── cv_tables_annotated
│   ├── cv_tables_annotated_chasmplus
│   ├── pr_plots
│   ├── saturation_dbNSFP
│   ├── saturation_mave
│   ├── vep_input
│   └── vep_output_dbNSFP
├── create_datasets
│   └── <intogen cohort>.regression_data.tsv
├── discovery
│   └── discovery.tsv.gz
├── evaluation
│   └── <tumor types>
│       └── <gene>.eval.pickle.gz
├── features_group
├── model_selection
├── output_plots
│   ├── blueprints
│   ├── clustered_blueprints
│   └── discovery_bending
├── saturation
│   ├── annotation
│   │   └── <gene>.<tumor type>.annotated.tsv.gz
│   └── prediction
│       └── <gene>.model.<ttype model>.features.<ttype features>.prediction.tsv.gz
├── splitcv
│   └── <intogen cohort>.cvdata.pickle.gz
├── splitcv_meta
│   └── <tumor types>
│        └── <gene>.cvdata.pickle.gz
└── training_meta
    └── <tumor types>
         └── <gene>.models.pickle.gz

Check out the full documentation for a description of the main output formats: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
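
The prediction files encode the gene and the tumor types of the model and features in their names, and the tabular outputs are gzip-compressed TSV files. The sketch below splits a prediction file name according to the pattern shown in the folder tree and previews the first rows of a table. The helper names are hypothetical, the file name in the usage example is illustrative only, and the tables are assumed to start with a header line.

```python
import csv
import gzip
import re

# Pattern taken from the folder tree:
# <gene>.model.<ttype model>.features.<ttype features>.prediction.tsv.gz
PREDICTION_NAME = re.compile(
    r"^(?P<gene>.+?)\.model\.(?P<ttype_model>.+?)"
    r"\.features\.(?P<ttype_features>.+?)\.prediction\.tsv\.gz$"
)


def parse_prediction_name(filename):
    """Split a prediction file name into its gene and tumor-type parts.

    Returns a dict, or None when the name does not follow the pattern.
    """
    match = PREDICTION_NAME.match(filename)
    return match.groupdict() if match else None


def peek_tsv_gz(path, max_rows=5):
    """Return (header, rows) with at most max_rows data rows.

    Assumes a tab-separated, gzip-compressed file with a header line.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # Illustrative file name, not taken from an actual run.
    print(parse_prediction_name("TP53.model.BRCA.features.BRCA.prediction.tsv.gz"))
```

Column names are not assumed here; consult the full documentation for the meaning of each column.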
