title | author | output | editor_options | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mark Dunning |
|
|
Based on the RNA-Seq workshop by Melbourne Bioinformatics written by Mahtab Mirmomeni, Andrew Lonie, Jessica Chung Original
Modified by David Powell (Monash Bioinformatics Platform)
Further Modified by Mark Dunning of Sheffield Bioinformatics Core
web : sbc.shef.ac.uk
twitter: @SheffBioinfCore
email: [email protected]
This tutorial will cover the basics of RNA-seq using Galaxy; a open-source web-based platform for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.
The official Galaxy page has many tutorials on using the service, and examples of other types of analysis that can be performed on the platform.
Those eventually wanted to perform their own RNA-seq analysis (for example in R), should look out for other courses
The data for this tutorial is from the paper, A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae by Nookaew et al. [1] which studies S.cerevisiae strain CEN.PK 113-7D (yeast) under two different metabolic conditions: glucose-excess (batch) or glucose-limited (chemostat).
The RNA-Seq data has been uploaded in NCBI, short read archive (SRA), with accession SRS307298. There are 6 samples in total-- two treatments with three biological replicates each sequenced paired-end. We have selected only the first read, and only two replicates of each condition to keep the data small for this workshop.
We have extracted chromosome I reads from the samples to make the tutorial a suitable length.
For this tutorial, we will assume that the wet-lab stages of the experiment have been performed and we are now in the right-hand branch of the workflow. In this tutorial we will demonstrate the steps of Quality assessment, alignment, quantification and differential expression testing.
Make sure you check your email to activate your account
We can going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores.
You can import the data by:
-
In the tool panel located on the left, under Basic Tools select Get Data > Upload File. Click on the Paste/Fetch data button on the bottom section of the pop-up window.
-
Upload the sequence data by pasting the following links into the text input area. These two files are single-end samples from the batch condition (glucose-excess). Make sure the type is specified as 'fastqsanger' when uploading.
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch1_chrI_1.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch2_chrI_1.fastq
These two files are single-end samples from the chem condition (glucose-limited). Make sure the type is specified as 'fastqsanger' when uploading.
Then, upload this file of gene definitions. You do not need to specify the type for this file as Galaxy will auto-detect the file as a GTF file. -
You should now have these 5 files in your history:
batch1_chrI_1.fastq
batch2_chrI_1.fastq
chem1_chrI_1.fastq
chem2_chrI_1.fastq
genes.gtf
These files can be renamed by clicking the pen icon if you wish.
Note: Low quality reads have already been trimmed.
Each read is described by 4 lines in the file:-
The quality scores are ASCII representations of how confident we are that a particular base has been called correctly. Letters that are further along the alphabet indicate higher confidence. This is important when trying to identify types of genome variation such as single base changes, but is also indicative of the overall quality of the sequencing. Different scales have been employed over time (resulting in a different set of characters appearing in the file). We will need to tell Galaxy which scale has been used in order that we can process the data correctly.
First of all, we convert the base-calling probability (p) into a Q
score using the formula
- Quality scores $$ Q = -10log_{10}p$$
- Q = 30, p=0.001
- Q = 20, p=0.01
- Q = 10, p=0.1
- These numeric quanties are encoded as ASCII code
- At least 33 to get to meaningful characters (https://en.wikipedia.org/wiki/FASTQ_format)
- look-up the ASCII code for each character
- subtract the offset to get the Q score
- convert to a probability using the formula:-
In practice, we don't have to convert the values as we have software that will do this automatically
- How many lines of data are present in the file
batch1_chrI_1.fastq
- use the tool Text Manipulation -> Line/Word/Character count to find out
- How many reads does this correspond to?
FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user's attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.
- Select one of the FASTQ files as input and Execute the tool.
- When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.
- Run Fastqc on the remaing fastq files, but don't examine the results just yet.
**Question: Do the data seem to be of reasonable quality? **
You can use the documentation to help interpret the plots
If poor quality reads towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to trim our data.
How many tools are available for the task of trimming and removing adapter contamination?
If time allows, use the trim galore tool to trim one of the fastq
files and repeat the QC.
N.B. the fastq
files that we are starting from have already undergone some processing, so you may not see much effect.
It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.
The multiqc tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.
Under Which tool was used generate logs? Choose fastqc and select the RawData output from the fastqc run on each of your bam files.
Question: do the fastq files seem to have consistently high-quality?
In this section we map the reads in our FASTQ files to a reference genome. As
these reads originate from mRNA, we expect some of them will cross exon/intron
boundaries when we align them to the reference genome. HISAT2
is a splice-aware
mapper for RNA-seq reads that is based on Bowtie. It uses the mapping results
from Bowtie to identify splice junctions between exons. More information on
HISAT2 can be found here.
NGS: RNA Analysis > HISAT2
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > HISAT2 and set the parameters as follows:
-
Is this single-end or paired-end data? Single-end (as individual datasets)
-
RNA-Seq FASTQ file, forward reads:
(Click on the multiple datasets icon and select all four of the FASTQ files)batch1_chrI_1.fastq
batch2_chrI_1.fastq
chem1_chrI_1.fastq
chem2_chrI_1.fastq
-
Use a built in reference genome or own from your history: Use built-in genome
-
Select a reference genome: S. cerevisiae June 2008 (SGD/SacCer3) (sacCer3)
-
Use defaults for the other fields
-
Execute
Note: This may take a few minutes, depending on how busy the server is.
You should have 5 output files for each of the FASTQ input files:
HISAT2 on data.. aligned reads (BAM)
It will be helpful to rename these to something shorter for the next steps.
batch1.bam
batch2.bam
chem1.bam
chem2.bam
Unlike most of Bioinfomatics, a single standard file format has emerged for aligned reads. Moreoever, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq... data.
The bam
file is a compressed, binary, version of a sam
file.
- Sequence Alignment/Map (sam)
- The output from an aligner such as
bwa
- Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
- May contain un-mapped reads
- Potentially large size on disk; ~100s of Gb
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less....
- Official specification can be obtained online: -
- We normally work on a compressed version called a
.bam
file. See later. - Header lines starting with an
@
character, followed by tab-delimited lines- Header gives information about the alignment and references sequences used
The first part of the header lists the names (SN
) of the sequences (chromosomes) used in alignment, their length (LN
) and a md5sum "digital fingerprint" of the .fasta
file used for alignment (M5
).
@HD VN:1.5 SO:coordinate GO:none
@SQ SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1
.....
.....
Next we can define the read groups present in the file which we can use to identify which sequencing library, sequencing centre, Lane, sample name etc.
@RG ID:SRR077850 CN:bi LB:Solexa-42057 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081675 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR080818 CN:bi LB:Solexa-44770 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR084838 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081730 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
.....
.....
Finally, we have a section where we can record the processing steps used to derive the file
@PG ID:MosaikAligner CL:/share/home/wardag/programs/MOSAIK/bin/MosaikAligner -in /scratch/wardag/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114.mkb -out
....
....
Next is a tab-delimited section that describes the alignment of each sequence in detail.
SRR081708.237649 163 1 10003 6 1S67M = 10041 105 GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q ZA:Z:<&;0;0;;308;68M;68><@;0;0;;27;;>MD:Z:5A11A5A11A5A11A13 RG:Z:SRR081708 NM:i:6 OQ:Z:GEGFFFEGGGDGDGGGDGA?DCDD:GGGDGDCFGFDDFFFCCCBEBFDABDD-D:EEEE=D=DDDDC:
Column | Official Name | Brief |
---|---|---|
1 | QNAME | Sequence ID |
2 | FLAG | Sequence quality expressed as a bitwise flag |
3 | RNAME | Chromosome |
4 | POS | Start Position |
5 | MAPQ | Mapping Quality |
6 | CIGAR | Describes positions of matches, insertions, deletions w.r.t reference |
7 | RNEXT | Ref. name of mate / next read |
8 | PNEXT | Postion of mate / next read |
9 | TLEN | Observed Template length |
10 | SEQ | Sequence |
11 | QUAL | Base Qualities |
There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the RG
tag which refers back to the read groups in the header.
The "flags" in the sam file can represent useful QC information
- Read is unmapped
- Read is paired / unpaired
- Read failed QC
- Read is a PCR duplicate (see later)
The combination of any of these properties is used to derive a numeric value
For instance, a particular read has a flag of 163
There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.
ReadHasProperty Binary MultiplyBy
isPaired TRUE 1 1
isProperPair TRUE 1 2
isUnmappedQuery FALSE 0 4
hasUnmappedMate FALSE 0 8
isMinusStrand FALSE 0 16
isMateMinusStrand TRUE 1 32
isFirstMateRead FALSE 0 64
isSecondMateRead TRUE 1 128
isSecondaryAlignment FALSE 0 256
isNotPassingQualityControls FALSE 0 512
isDuplicate FALSE 0 1024
Value of flag is given by
1x1 + 1x2 + 0x4 + 0x8 + 0x16 + 1x32 + 0x64 + 1x128 + 0x256 + 0x512 + 0x1024 = 163
See also
The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.
Code | Description |
---|---|
M | alignment match |
I | insertion |
D | deletion |
N | skipped |
S | soft-clipping |
H | hard-clipping |
e.g.
68M
- 68 bases matching the reference
1S67M
- 1 soft-clipped read followed by 67 matches
15M87N70M90N16M
- 15 matches following by 87 bases skipped followed by 70 matches etc.
Rather than dealing with .sam
files, we usually analyse a .bam
file.
flagstat will calculate the flag for each read in the bam file and tabulate the results.
idxstats will report the number of reads mapping to each reference sequence (i.e. chromosome)
Download the bam files you have created in the previous step by clicking the disk icon on the right-hand panel. Make sure to click both the Download dataset and Download index buttons. We will now visualise the alignments using the Integrative Genomics Viewer (IGV).
- http://software.broadinstitute.org/software/igv/
- Go to Downloads
- Launch with 1.2Gb
- Click on igv24_mm.jnlp file that is downloaded
Whilst tools like R are very powerful and allow you to perform statistical analyses and test hypotheses, there is no substitute for looking at the data. A trained-eye can quite quickly get a sense of the data quality before any computational analyses have been run. Futhermore, as the person requesting the sequencing, you probably know a lot about the biological context of the samples and what to expect.
- IGV has been developed by the Broad Institute and is able to display most kinds of genomic data
- expression
- ChIP
- whole-genome resequencing
- shRNA
- It is a Java desktop application and can be run either locally of from the Broad website
- You have a link to IGV on the taskbar on the left the screen in-front of you
- To run IGV yourself you will need to agree to the license and download the version for your OS
For more details
- Full set of slides from MRC Clinical Sciences Centre
- IGV tutorial from the Griffth lab - WashU
-
Sample information panel
- Information about samples you have loaded
- e.g. Sample ID, Gender, Age, Tumour / Normal
-
Genome Navigation panel
- Jump to a genomic region in
Chr:Start-End
format - Jump to a gene symbol of interest
- Jump to a genomic region in
-
Data panel
- Your sequencing reads will be displayed here
- Or whatever data you have loaded
- see information on accepted file formats
-
Attribute panel
- Gene locations
- Genome sequence (if zoomed-in at appropriate level)
- Proteins
Go to File -> Load from file and select the aligned bam
files from HISAT2
. Note that the index files .bai
need to be present in the same directory. However, you only need to click on the .bam
- The black dotted vertical lines indicates the centre of the view
- Each of the grey pointed rectangles represents a sequencing reads
- whether the pointed bit is on the left or right indicates if the read is forward or reverse.
- A coverage track is also generated
- You should see the read that we described in detail in the previous section by hovering over the reads to display the information from the
.bam
file
The view in IGV is not static and we can scroll-along the genome by holding-down the left mouse in the data panel and dragging left and right
It's worth noting that the display settings may be showing fewer reads than you have (downsampling) in order to conserve memory. Also, some QC-fail or PCR duplicates may be filtered.
We also have some options on how to display the reads themselves, which we can acccess by right-clicking on the bam track
Sorting alignments by:-
- start
- strand
- base
- mapping quality
- insert size
The reads themselves can also be coloured according to
- insert size
- read strand
- sample
Additional data tracks are also available on the IGV server. These include useful genome annotation tracks, such as:-
- GC percentage
- CpG islands
- Repeat regions
- dbSNP calls
In order to test for differential expression, we need to count up how many times each "feature" is observed is each sample. The goal of such operations is to produce a count table such as that shown below. We can then apply statistical tests to these data
HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features in the genes.gtf. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.
Various rules can be used to assign counts to features
-
Use HTSeq-count to count the number of reads for each feature.
In the left tool panel menu, under NGS Analysis, select NGS Analysis > htseq-count* and set the parameters as follows:- Gene model (GFF) file to count reads over from your current history: genes.gtf
- bam/sam file from your history:
(Select one of six bam files)batch1-accepted_hits.bam
- Use defaults for the other fields
- Execute
-
Repeat for the remaining bam files
DESeq2 is an R package, that is used for analysing differential expression of RNA-Seq data and can either use exact statistical methods or generalised linear models.
In the Galaxy tool panel, under NGS Analysis, select NGS: RNA Analysis > DESeq2 and set the parameters as follows:
- 1. Factor level Batch
- Count files
batch1-htseq
batch2-htseq
- 2. Factor level: Chem
- Select columns containing control:
chem1-htseq
chem2-htseq
- Use defaults for the other fields
- Execute
- Examine the
DeSeq2 result file
by clicking on the eye icon. This file is a list of genes sorted by p-value from using DESeq2 to perform differential expression analysis. - Examine the
DeSeq2 plots
file. This file has some plots from running DESeq2, including PCA and clustering showing relationships between samples
Under Basic Tools, click on Filter and Sort > Filter:
- Filter:
DESeq2 results file
- With following condition:
c7 < 0.05 and (c3 > 1.0 or c3 < -1.0)
- Number of header lines to skip: 1
- Execute
This will keep the genes that have an adjusted p-value (column 7 in the table) of less
or equal to 0.05 and have a fold change of greater than 1 or less than -1. There should be 22 genes in this file.
Rename this file by clicking on the pencil icon of and change the name
from "Filter on data x" to DESeq2_Significant_DE_Genes
Exercise:
Why do you think it is important to use the adjusted p-value to select which genes are differentially-expressed. Why might you also want to specify a fold-change cutoff? Discuss with your neighbours
The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools which require the counts to be displayed in a data matrix where each row is a gene and each column is a particular sample in the dataset.tmp
- Select the count files from your history batch1.htseq, batch2.htseq, etc...
- Keep Column containing gene IDs and Column containing gene counts to 1 and 2 respectively.
- Rename the output to
raw counts
and Download to your computer
[1] Nookaew I, Papini M, Pornputtpong N, Scalcinati G, Fagerberg L, Uhlén M, Nielsen J: A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res 2012, 40 (20):10084 – 10097. doi:10.1093/nar/gks804. Epub 2012 Sep 10
[2] Guirguis A, Slape C, Failla L, Saw J, Tremblay C, Powell D, Rossello F, Wei A, Strasser A, Curtis D: PUMA promotes apoptosis of hematopoietic progenitors driving leukemic progression in a mouse model of myelodysplasia. Cell Death Differ. 2016 Jun;23(6)