sc_preprocess is a containerized preprocessing pipeline for 10x scRNA-seq data written in Nextflow. The pipeline basically implements the code suggestions of the salmon/alevin developers in their tutorial. This covers generation of a genome-decoyed and expanded (spliced+unspliced) transcriptome index directly from reference annotations, quantification of reads against this index, generation of spliced- and unspliced count tables and a basic QC summary report. It optionally supports feature barcoding experiments such as CITE-Seq or cell hashtag oligos (HTO). The indexing procedure relies on extraction of spliced and unspliced counts from a genome/GTF using eisaR followed by index building with salmon. The quantification happens via alevin. Generation of count tables from the quantification results is achieved via tximeta. A QC summary report is provided by alevinQC.
Run the test profile after cloning this repo to see produced output, see also the output description.
NXF_VER=22.10.4 nextflow run main.nf -profile docker,test --features_file $(realpath test/hto.tsv) --samplesheet $(realpath test/samplesheet.csv)
For a summary of all current software versions and command lines (based on the test data) see the command line and software version summaries in the misc folder.
The pipeline covers these steps:
-
Generate a genome-decoyed and expanded reference transcriptome containing all spliced (=exonic) and unspliced (intronic) transcript sequences. This expanded transcriptome is required for quantification to allow output of both spliced- and unspliced counts, e.g. for velocity analysis. Also, this expanded reference is decoyed by the entire reference genome to perform selective alignments in order to improve mapping accuracy by capturing reads that better align to the genome than the transcriptome, removing potential gDNA contaminations and other spurious mappings. The indexing of this expanded reference is done by
salmon
. -
Processing of the reads (fastq files read from a CSV samplesheet) against the expanded reference using
alevin
performing cellular barcode (CB) detection, read mapping, Unique Molecular Identifier (UMI) deduplication and gene count estimation, resulting in per-cell gene-level abundance estimations. -
Split the obtained quantifications into spliced and unspliced count tables and save these in mtx format for easy distribution and loading into downstream analysis environments such as R.
-
Create a per-sample summary report using alevinQC which includes relevant QC metrics such as the average number of reads per cell, number of detected genes per cell and others. Also, a table and plot summarizing number of cells per sample is generated using custom scripts.
-
Optionally, quantify matched reads from a feature barcoding experiment such as CITE-seq or HTO cell hashing against a provided set of reference barcode sequences. The obtained counts are filtered for the cellular barcodes detected in the RNA quantification. In feature barcode mode the pipeline will eventually return a count matrix that only contains those cellular barcodes detected in both the RNA and feature barcode experiment. Some feature barcode sets such as totalSeq-B/C require barcode translation. We provide the default translation table from the CellRanger GitHub in the
assets/
folder to perform this translation if--translate_barcodes
is set and--translate_list
points to that table. See also this thread at biostars.org for details.
Typically one would build an index first and then run the quantification separately as shown below. Still, the entire workflow including indexing and quantification can also be run as a single command.
Typical command line for the indexing on a cluster with SLURM and Singularity, using GENCODE annotations. If not using GENCODE then the --idx_args
argument can be omitted.
NXF_VER=22.10.4 nextflow run main.nf -profile singularity,slurm --idx_only --genome path/to/genome.fa.gz --gtf path/to/gtf.gz --idx_args '\--gencode'
This will parse both spliced and unspliced transcript sequences directly from the genome using the GTF as guide and then run the indexing process on these files, using the provided genome as decoy. Six CPUs and 30GB of RAM are hardcoded in the default profile for this step, and this is sufficient for mouse data.
Indexing parameters
These parameters can be used to customize the indexing:
--only_idx
: logical, whether to only run the indexing process without any quantification and then exit
--genome
: path to the genome fasta file (.fa.gz
)
--gtf
: path to the GTF reference file (.gtf.gz
)
--gene_id
: name of the column storing the gene ID in the GTF, default gene_id
--gene_name
: name of the column storing the gene name in the GTF, default gene_name
--gene_type
: name of the column storing the gene biotype in the GTF, default gene_type
--chrM
: name of the mitochondrial chromosome, default chrM
--rrna
: gene biotype for ribosomal RNAs, default rRNA
--read_length
: an integer indicating the read length of R2 (the non-CB/non-UMI read), default 150
--idx_outdir
: name of output folder inside the sc_preprocess_results
directly to store the output
--idx_name
: name of the index in which the actual index is stored inside --idx_outdir
, default idx
--idx_args
: additional arguments passed to salmon index
step, default --sparse
Typical command line for the quantification on a HPC with SLURM and Singularity, assuming presence of feature barcodes that require barcode translation:
NXF_VER=22.10.4 nextflow run main.nf -profile singularity,slurm \
--samplesheet path/to/samplesheet.csv \
--idx path_to_idx_folder \
--tgmap path_to_tgmap --rrnagenes path_to_rrnagenes_file --mtgenes path_to_mtrnagenes_file \
--features path_to_expanded_features_file --gene2type path_to_gene2type_file \
--features_file path/to/hto.tsv --translate_list path/to/assets/translate_list.txt.gz --translate_barcodes
This will quantify the reads against the index, create spliced- and unspliced count tables and create the alevinQC report. As resources we hardcoded 6 CPUs and 30GB of RAM which works well for mouse samples of typical size. See details below.
Samplesheet
The pipeline reads the fastq file pairs from a samplesheet which is a five-column CSV with a header.
sample,r1,r2,libtype,is_fb
sample1,$baseDir/test/sample1_1.fastq.gz,$baseDir/test/sample1_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1a_1.fastq.gz,$baseDir/test/sample1a_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1SF_1.fastq.gz,$baseDir/test/sample1SF_2.fastq.gz,ISR,true
sample2,$baseDir/test/sample2_1.fastq.gz,$baseDir/test/sample2_2.fastq.gz,ISR,false
The first line is the mandatory header followed by the sample to quantify:
- column1 (
sample
) is the per-sample name to be used in the output. This can be any Unix-compatible name, and there is no need to match this name with the fastq file names as other software may require you to do. - column2/3 (
r1/2
) are the paths to the fastq files for that sample. It either must be the full absolute path (don't use~
) or alternative a path relative to the directory that this pipeline is started from using either of the three implicit Nextflow variables$baseDir
,$projectDir
and$launchDir
(see example below). - column4 is the library type so basically the strandedness and layout of the library. That would be
ISR
for 10x Chromium data. - column5 (
is_fb
) is a logical column with eithertrue
orfalse
indicating whether this fastq file pair is a feature barcode experiment. If so then these files will be quantified against the barcode library provided by--features-file
(see below). This column can be empty and then defaults tofalse
. See below for details on feature barcoding experiments.
An example using relative paths (same samples as in the example above) could be:
sample,r1,r2,libtype,is_fb
sample1,$baseDir/test/sample1_1.fastq.gz,$baseDir/test/sample1_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1a_1.fastq.gz,$baseDir/test/sample1a_2.fastq.gz,ISR,false
sample1,$baseDir/test/sample1SF_1.fastq.gz,$baseDir/test/sample1SF_2.fastq.gz,ISR,true
sample2,$baseDir/test/sample2_1.fastq.gz,$baseDir/test/sample2_2.fastq.gz,ISR,false
Technical replicates
...from sequencing the same library on multiple lanes or in individual sequencing runs can be specified in the samplesheet by using the same sample_id
for these files. In the example above sample1
has two technical replicates (sample_1/2.fastq.gz
and sample1a_1/2.fastq.gz
) which will be merged by the pipeline prior to quantification.
Feature barcoding experiments
If feature barcode experiments are to be quantified then the user must provide a --features_file
which is a tab-separated list that in column1 stores the name of the feature barcode and in column2 the sequence, for example:
hto_1 ACCCACCAGTAAGAC
hto_2 GGTCGAGAGCATTCA
hto_3 CTTGCCGCATGTCAT
If this file is provided then an index will be made for these files and the feature barcode fastq files will be quantified against it.
The counts of the feature barcodes will be included to the bottom of the gene expression count tables. If barcode translation must be performed (totalSeq-B/C),
then use --translate_barcodes --translate_list path/to/translate_list.txt.gz
.
Quantification parameters
These parameters can be used to customize the quantification:
--r1_type
: this flag defines structure of read1 in the experiment so the position and length of the CB and UMI. This must consist of the two options--bc-geometry --umi-geometry
fromalevin
. The defaults are:
--bc-geometry 1[1-16] --umi-geometry 1[17-28]
and indicate that the BC is in read1 from position 1-16 and the UMI from position 17-28, so 16bp CB and 12bp UMI as in Chromium V3. For Chromium V2 it would be 10bp UMIs so one would use--r1_type '\--bc-geometry 1[1-16] --umi-geometry 1[17-26]'
.--r2_type
: same as above but defining the structure of read2. For Chromium V3 (the default) it would be:
--r2_type '\--read-geometry 2[1-91]'
meaning that the first 91bp of R2 should be used for quantification, as recommended by 10x. One could also use1-end
here soalevin
would only the entire read, e.g. in case of 150bp reads. We stick with the default of 91bp here fir consistency between runs as not every sequencing run may produce 150bp reads, depending on the machine and run mode.--R2_type_fb
: same as--r2_type
but the read2 structure for feature barcode experiments. The defaults here assume totalSeqB feature barcoding:
--read-geometry 2[11-25]
meaning that the feature barcodes are 15bp long and starting from at position 11. The original CITE-seq protocol would be[1-15]
.--quants_args
: this flag allows to pass further arguments to thealevin
processing. This can be any of the allowedalevin
arguments (see its manual) such as generating inferential replicates. Is must not include any of-o -i -p
as these options are already defined internally. See the Alevin manual or the help viasalmon alevin -h
for details on further options.
The pipeline will produce an output folder sc_preprocess_results
in the location from which the pipeline was launched that contains:
alevin_idx
: folder with the expanded transcriptome index (idx_gentrome
) and the feature barcode index (idx_features
)alevin_quant
: folder with the alevin outputs (one folder per sample)mtx
: folder with the expression matrices asmtx.gz
and the column and row annotations astsv.gz
. If feature barcode (FBs) libraries were present for a particular sample then the returned files will already be filtered for CBs found in both the RNA and FB experiment. The FB counts will be appended to themtx.gz
so will be the last entries in these files. The same goes for thesample_feature.tsv.gz
file, where the feature barcode names will be the last entries.qc_dir
: folder with the alevinQC html reports for each quantified library -- RNA and FB (if present). Note that in case of FBs this QC report will be based on the full RNA/FB experiment that has not been filtered for CBs present in both experiments. It is useful to judge the quality of both experiments independently.
In this folder there will also be a filesummary_cellnumbers.txt
which summarizes the number of detected cells per sample in the RNA experiment and the number of intersecting cells with the FB experiment. If no FBs were present for that sample NAs are returned. The numbers are also presented as a barplot insummary_cellnumbers.pdf
.pipeline_info
: folder containssoftware_versions.txt
listing all software versions used in the pipeline andcommand_lines.txt
listing the exact command lines used in the processes