The rnaseq pipeline provides a set of programs to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. The analysis includes trimming of FASTQ files, contamination removal, alignment, counting, quality control and normalization of sequenced reads and, in most cases, differential expression (DE) analysis across conditions. This pipeline is supported by nf-core.
- cat Merge re-sequenced FastQ files
- FastQC Read QC
- UMI-tools UMI extraction
- Trim Galore! Adapter and quality trimming
- STAR Alignment, with quantification by Salmon
- STAR Alignment, with quantification via RSEM
- HiSAT2 Alignment only, no quantification
- SAMtools Sort and index alignments
- UMI-tools UMI-based deduplication
- picard MarkDuplicates Duplicate read marking
- StringTie Transcript assembly and quantification
- bedGraphToBigWig Create bigWig coverage files
- RSeQC An RNA-seq quality control package
- QualiMap Evaluating sequencing alignment data
- dupRadar A package for the assessment of PCR artifacts in RNA-Seq data
- Preseq A package for predicting and estimating the complexity of a genomic sequencing library
- DESeq2 A package for estimating differential gene expression based on the negative binomial distribution
- MultiQC Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks
Rnaseq is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures. It also uses Docker/Singularity containers, making installation trivial and results highly reproducible. This guide covers the installation and configuration for Ubuntu.
a. Make sure that Java v8+ is installed
java -version
b. Install Nextflow
curl -fsSL get.nextflow.io | bash
c. Move the file to a directory accessible by your $PATH variable
sudo mv nextflow /usr/local/bin/
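The Java requirement in step a can also be checked in a script. The following is a minimal sketch, assuming that `java -version` prints its version as the first quoted string (for example "1.8.0_292" or "11.0.2"); the helper name java_major_version is hypothetical and is not part of Nextflow.

```shell
java_major_version() {
    # $1 = version string, e.g. "1.8.0_292" (legacy scheme), "11.0.2" or "17"
    v=$1
    case "$v" in
        1.*) v=${v#1.} ;;  # legacy scheme: "1.8.0_292" -> "8.0_292"
    esac
    # Keep only the characters before the first '.', '_' or '-'
    echo "${v%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr; take the first quoted string
    raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    major=$(java_major_version "$raw")
    if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
        echo "Java $major found"
    else
        echo "Java 8 or later is required (found: ${raw:-unknown})"
    fi
else
    echo "java not found on PATH"
fi
```

The parsing is kept in its own function so it can be reused for other version strings without calling Java.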
For more information, visit the Docker website.
a. Update the apt package index, and install the latest version of Docker Engine
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
b. List the versions available in your repo
apt-cache madison docker-ce
c. Install a specific version
sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io
d. Verify that Docker is installed correctly by running the hello-world image
sudo docker run hello-world
e. Enable Docker permissions
sudo chmod 666 /var/run/docker.sock
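Changing the socket mode gives every local user access to Docker. A commonly used alternative is membership in the docker group. The sketch below only checks whether the current user is already in that group; the helper name has_group is hypothetical, and the usermod command it prints is a suggestion, not something this guide's scripts run.

```shell
has_group() {
    # $1 = group name, $2 = space-separated group list (e.g. from `id -nG`)
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_group docker "$(id -nG)"; then
    echo "Current user is in the docker group"
else
    # Printed only; requires a new login session to take effect
    echo "Not in the docker group; one option: sudo usermod -aG docker \$USER"
fi
```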
a. Install nf-core tools
sudo pip3 install nf-core
b. List all nf-core pipelines and show available updates
nf-core list
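Before running the pipeline, it can help to confirm that everything installed above is reachable from $PATH. This is a small sketch of such a check; the function name missing_tools is hypothetical.

```shell
missing_tools() {
    # Print the name of every argument that is not found on $PATH
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools java nextflow docker nf-core)
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All required tools were found"
fi
```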
There are two methods to create the samplesheet that is used as input to the rnaseq pipeline:
- From the fetchngs pipeline. It can be found as samplesheet.csv. Nevertheless, the absolute paths in the fastq_1 and fastq_2 columns must be modified so that the rnaseq pipeline can read the FastQ files.
- An executable Python script called fastq_dir_to_samplesheet.py has been provided if you would like to auto-create an input samplesheet based on a directory containing FastQ files before you run the pipeline (requires Python 3 installed locally).
a. Download python script
wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
b. Use as follows
./fastq_dir_to_samplesheet.py <FASTQ_DIR> samplesheet.csv --strandedness reverse
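For the first method, the paths in the fastq_1 and fastq_2 columns of a fetchngs samplesheet can be rewritten with awk. The sketch below is one possible approach, not part of either pipeline: the function name rewrite_fastq_paths and the example directories are hypothetical, and it assumes a comma-separated file whose fields contain no embedded commas (fields may optionally be wrapped in double quotes).

```shell
rewrite_fastq_paths() {
    # $1 = old directory prefix, $2 = new directory prefix
    # Reads a samplesheet on stdin and writes the modified one to stdout
    awk -F',' -v OFS=',' -v oldp="$1" -v newp="$2" '
        NR == 1 {
            # Locate the fastq_1 and fastq_2 columns by header name
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/"/, "", name)
                if (name == "fastq_1" || name == "fastq_2") cols[i] = 1
            }
            print
            next
        }
        {
            for (i in cols) {
                # Preserve a leading double quote if the field has one
                q = (substr($i, 1, 1) == "\"") ? "\"" : ""
                rest = substr($i, length(q) + 1)
                # Replace the prefix only when the path starts with it
                if (index(rest, oldp) == 1)
                    $i = q newp substr(rest, length(oldp) + 1)
            }
            print
        }
    '
}

printf '%s\n' \
    'sample,fastq_1,fastq_2,strandedness' \
    'S1,/old/run/S1_1.fastq.gz,/old/run/S1_2.fastq.gz,reverse' |
    rewrite_fastq_paths /old/run /data/fastq
```

With a real file, the call would look like `rewrite_fastq_paths /old/run /data/fastq < samplesheet.csv > samplesheet.fixed.csv`.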
One of the first choices for retrieving the most common reference genomes of diverse organisms is AWS iGenomes, stored in AWS S3. To obtain the human genome as well as its annotation, this repository contains a script, aws-igenomes.sh, that can synchronize AWS-iGenomes and download these files
curl -fsSL https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh > aws-igenomes.sh
If you manually provide the genome indexes, it is important to keep in mind that they must be in the following path
<{output}>/genome/index/<{idx}>
where output is the folder name containing the results and idx the indexes such as rsem, hisat2 and salmon
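The expected location can be checked before launching the pipeline. The following sketch builds the path from the layout described above and reports whether the directory exists; index_dir and check_index are hypothetical helper names, and "results" is only an example output folder.

```shell
index_dir() {
    # $1 = output folder, $2 = index name (e.g. rsem, hisat2, salmon)
    # A single trailing slash on the output folder is dropped
    printf '%s/genome/index/%s\n' "${1%/}" "$2"
}

check_index() {
    dir=$(index_dir "$1" "$2")
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}

for idx in rsem hisat2 salmon; do
    check_index results "$idx" || true
done
```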
To perform an RNA-seq data analysis, the script scr/rnaseq.sh was implemented to systematically prepare, validate, and generate the results of the pipeline
bash rnaseq.sh -c file.csv -r genome.fa -a genes.gtf -b star_rsem -o results -d rRNA-paths.txt -p 30 -m 230 -x n
- -c: Samplesheet file containing information about the samples in the experiment. An example is available in data
- -r: Reference genome (FASTA)
- -a: Genome annotation (GTF)
- -b: Specifies the alignment algorithm to use. Available options are: star_salmon, star_rsem or hisat2
- -o: The output directory where the results will be saved
- -d: Text file containing paths to FASTA files (one per line) that will be used to create the database for SortMeRNA. An example is available in data
- -e: Specifies the pseudo-aligner to use. The available option is: salmon
- -p: Number of CPUs
- -m: Max memory to be used
- -x: Whether this execution resumes a previous run or is a new run. The options are: y or n
- -t: A custom configuration file to be used in the pipeline. An example is shown in the data folder
- -i: Whether or not to create a new genome index. If not specified, it will be created based on the type of aligner supplied
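To illustrate how flags like these are typically handled, here is a getopts sketch. It is not the code of rnaseq.sh: the function name, the variable names, the default of n for -x, and the assumption that every flag (including -i) takes a value are all hypothetical.

```shell
parse_rnaseq_args() {
    # Defaults (assumptions for this sketch)
    samplesheet='' genome='' gtf='' aligner='' outdir='' rrna_paths=''
    pseudo_aligner='' cpus='' max_memory='' resume='n' config='' index=''
    OPTIND=1
    while getopts "c:r:a:b:o:d:e:p:m:x:t:i:" opt; do
        case "$opt" in
            c) samplesheet=$OPTARG ;;
            r) genome=$OPTARG ;;
            a) gtf=$OPTARG ;;
            b) aligner=$OPTARG ;;
            o) outdir=$OPTARG ;;
            d) rrna_paths=$OPTARG ;;
            e) pseudo_aligner=$OPTARG ;;
            p) cpus=$OPTARG ;;
            m) max_memory=$OPTARG ;;
            x) resume=$OPTARG ;;
            t) config=$OPTARG ;;
            i) index=$OPTARG ;;
            *) return 2 ;;  # unknown flag or missing value
        esac
    done
    # Validate the values that have a fixed set of options
    case "$aligner" in
        ''|star_salmon|star_rsem|hisat2) ;;
        *) echo "unknown aligner: $aligner" >&2; return 1 ;;
    esac
    case "$resume" in
        y|n) ;;
        *) echo "-x must be y or n" >&2; return 1 ;;
    esac
}

parse_rnaseq_args -c file.csv -r genome.fa -a genes.gtf -b star_rsem -o results -p 30 -m 230 -x n
echo "aligner=$aligner outdir=$outdir cpus=$cpus resume=$resume"
```

Validating -b and -x right after parsing means a typo is reported before any long-running step starts.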
The Nextflow -bg flag launches Nextflow in the background. Alternatively, you can use screen, tmux or a similar tool to create a detached session which you can log back into at a later time
The script will create a local directory based on the given output name containing the following items:
- output_name: Contains the results of the RNA-seq analysis
- work: Contains the main pipeline workflows
- 2022-01-31_15:46:17.COMMAND: Contains the commands used for the actual launch. The file name contains the date (%Y-%m-%d) and the time (%H:%M:%S) when the command was last run. Thus, if it is resumed, it will be overwritten
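A file name like the 2022-01-31_15:46:17.COMMAND example can be produced with date. The sketch below assumes the format string %Y-%m-%d_%H:%M:%S, inferred from that example name; the helpers command_file_name and save_command are hypothetical and do not reproduce the script's own code.

```shell
command_file_name() {
    # Timestamp in the same shape as the example file name
    printf '%s.COMMAND\n' "$(date '+%Y-%m-%d_%H:%M:%S')"
}

save_command() {
    # $1 = directory, remaining arguments = the command line to record
    dir=$1
    shift
    file="$dir/$(command_file_name)"
    printf '%s\n' "$*" > "$file"
    echo "$file"
}

log_dir=$(mktemp -d)
saved=$(save_command "$log_dir" bash rnaseq.sh -c file.csv -o results)
echo "Saved launch command to $saved"
cat "$saved"
```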
Please report bugs through the GitHub issues system