Preprocessing of RNA-seq data using salmon and tximport

ATpoint/rnaseq_preprocess

rnaseq_preprocess

Introduction

rnaseq_preprocess is a Nextflow pipeline for RNA-seq quantification with salmon. The pipeline first runs FastQC, then quantifies reads with salmon, aggregates the transcript-level estimates to the gene level with tximport, and produces a small summary report with MultiQC. Multiple fastq files per sample are supported; these technical replicates are merged prior to quantification. Reads can optionally be trimmed to a fixed length. The pipeline is containerized via Docker and Singularity. Outputs, including command lines and software versions, are written to rnaseq_preprocess_results/. The expected Nextflow version is 21.10.6.

Run the test profile to see which outputs are produced. Downloading the Docker image may take a minute or two:

NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile docker,test_with_existing_idx,test_resources

See the misc folder, which contains the software versions used in the pipeline and the exact command lines. When you run the pipeline yourself, this information is written to the pipeline_info folder of the output directory.

Details

Indexing

The pipeline does not cover the indexing step because salmon supports several indexing strategies, for example indexing only the transcriptome without genome decoys, with partial genome decoys, or with full genome decoys.

Please produce an index up front and then provide the output folder to the --idx option.
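As a rough sketch of how the inputs for a full genome decoy index could be prepared, the block below builds a decoy list and a combined "gentrome" from two toy FASTA files. All file names here are hypothetical placeholders and the toy sequences only demonstrate the file handling; a real index needs real transcriptome and genome FASTA files. The salmon index call is shown as a comment, using salmon's -t, -d, -i, -k and -p options, with values that are only examples.

```shell
# Toy inputs (placeholders); replace with real transcriptome and genome FASTA files.
printf '>tx1 gene=g1\nACGTACGT\n>tx2 gene=g1\nTTGGCCAA\n' > transcripts.fa
printf '>chr1 toy\nACGTACGTTTGGCCAA\n>chr2 toy\nGGGGCCCC\n' > genome.fa

# Decoy list: the sequence names of the genome, without ">" and without descriptions.
grep '^>' genome.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# Gentrome: transcripts first, then the genome sequences used as decoys.
cat transcripts.fa genome.fa > gentrome.fa

# With real input files, the index could then be built along these lines:
# salmon index -t gentrome.fa -d decoys.txt -i idx_full_decoy -k 31 -p 6
```

The resulting index folder (idx_full_decoy in this sketch) is what would be passed to --idx. For a transcriptome-only index, the -d option and the genome sequences are simply left out.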

The pipeline has a hardcoded 8GB memory limit for the quantification step, which should be sufficient for transcriptome-only and partial genome decoy indices. For a full genome decoy index, please raise the withLabel:process_quant memory definition in nextflow.config to something like 20GB, depending on the organism.
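Instead of editing nextflow.config in place, the memory setting can probably also be overridden from a small extra config file. The sketch below writes such a file; the file name quant_memory.config is a hypothetical placeholder, and it is an assumption that the process_quant label can be overridden this way without touching other settings of the pipeline.

```shell
# Write a hypothetical override file for the quantification memory limit.
cat > quant_memory.config <<'EOF'
process {
    withLabel: process_quant {
        memory = 20.GB
    }
}
EOF

# Show the file that was written.
cat quant_memory.config
```

The file could then be passed to the run command with Nextflow's -c option (for example -c quant_memory.config), which should take precedence over the value in the pipeline's own nextflow.config.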

Quantification/tximport

The pipeline runs via a samplesheet, which is a CSV file with the columns sample,r1,r2,libtype. The first column is the name of the sample, followed by the paths to the R1 and R2 files and the salmon libtype. If the r2 column is left blank, single-end mode is triggered for that sample. Multiple fastq files per sample (lanes/technical replicates) are supported: rows with the same value in the sample column are merged prior to quantification.

Optionally, a seqtk module trims reads to a fixed length. Trimming is enabled with --trim_reads, and the length is set with --trim_length (default 75bp).

Quantification runs with the salmon options --gcBias --seqBias --posBias (single-end samples run without --gcBias). Transcript abundance estimates from salmon are then summarized to the gene level using tximport with its lengthScaledTPM option. The returned gene-level counts are therefore already corrected for average transcript length and can go into any downstream DEG analysis, for example with limma. Both a matrix of counts and a matrix of effective gene lengths are returned.
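To illustrate the samplesheet layout described above, here is a minimal sketch that writes a samplesheet and summarizes it per sample. Sample names and paths are hypothetical placeholders. The libtype A is salmon's automatic library type detection and is used here only as an example value. sampleA has two rows, which would be merged before quantification, and sampleB has an empty r2 field, which would trigger single-end mode.

```shell
# Write a hypothetical samplesheet; all sample names and paths are placeholders.
cat > samplesheet.csv <<'EOF'
sample,r1,r2,libtype
sampleA,/data/sampleA_L001_R1.fastq.gz,/data/sampleA_L001_R2.fastq.gz,A
sampleA,/data/sampleA_L002_R1.fastq.gz,/data/sampleA_L002_R2.fastq.gz,A
sampleB,/data/sampleB_R1.fastq.gz,,A
EOF

# Per sample: number of rows (lanes/technical replicates) and the implied layout.
awk -F',' 'NR > 1 { n[$1]++; layout[$1] = ($3 == "" ? "single-end" : "paired-end") }
           END { for (s in n) print s, n[s], layout[s] }' samplesheet.csv \
    | sort > samplesheet_summary.txt

cat samplesheet_summary.txt
```

A quick summary like this can help catch typos in sample names, since a misspelled name would show up as an extra sample instead of being merged.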

Options:

--idx: path to the salmon index folder
--tx2gene: path to the tx2gene map matching transcripts to genes
--samplesheet: path to the input samplesheet
--trim_reads: logical, whether to trim reads to a fixed length
--trim_length: numeric, length for trimming
--quant_additional: additional options to salmon quant beyond --gcBias --seqBias --posBias
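To make the effect of --trim_reads and --trim_length concrete, the sketch below trims a toy FASTQ file to a fixed length with awk. This is only an illustration of what fixed-length trimming does to the reads, not the pipeline's seqtk module; the file names and the length of 6 are hypothetical, and a plain four-line-per-record FASTQ is assumed.

```shell
# Toy FASTQ with one 10 bp read and one 4 bp read (placeholder data).
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGT\n+\nIIII\n' > toy.fastq

# Cut sequence and quality lines (every even line) to at most `len` characters.
# Reads shorter than `len` are left unchanged.
awk -v len=6 'NR % 2 == 0 { print substr($0, 1, len); next } { print }' \
    toy.fastq > toy_trimmed.fastq

cat toy_trimmed.fastq
```

After trimming, read1 is cut to its first 6 bases and read2 keeps all 4 of its bases, so trimmed files contain reads of at most the chosen length.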

We hardcoded 8GB RAM and 6 CPUs for the quantification. On our HPC we use:

NXF_VER=21.10.6 nextflow run atpoint/rnaseq_preprocess -r main -profile singularity,slurm \
    --idx path/to/idx --tx2gene path/to/tx2gene.txt --samplesheet path/to/samplesheet.csv \
    -with-report quant_report.html -with-trace quant_report.trace -bg > quant_report.log

Other options

--merge_keep: logical, whether to keep the merged fastq files
--merge_dir: folder inside the output directory to store the merged fastq files
--trim_keep: logical, whether to keep the trimmed fastq files
--trim_dir: folder inside the output directory to store the trimmed fastq files
--skip_fastqc: logical, whether to skip fastqc
--only_fastqc: logical, whether to only run fastqc and skip quantification
--skip_multiqc: logical, whether to skip multiqc
--skip_tximport: logical, whether to skip the tximport process downstream of the quantification
--fastqc_dir: folder inside the output directory to store the fastqc results
--multiqc_dir: folder inside the output directory to store the multiqc results