Skip to content

Latest commit

 

History

History
675 lines (441 loc) · 25.4 KB

rna_seq_advanced_tutorial.md

File metadata and controls

675 lines (441 loc) · 25.4 KB
title author output editor_options
Mark Dunning
html_notebook html_document
toc toc_float
true
true
df_print toc
paged
true
chunk_output_type
inline

Hands-on RNA-seq Analysis in Galaxy

Acknowledgement

Based on the RNA-Seq workshop by Melbourne Bioinformatics written by Mahtab Mirmomeni, Andrew Lonie, Jessica Chung Original

Modified by David Powell (Monash Bioinformatics Platform)

Further Modified by Mark Dunning of Sheffield Bioinformatics Core

Sheffield Bioinformatics Core

web : sbc.shef.ac.uk
twitter: @SheffBioinfCore
email: [email protected]


Tutorial Overview

This tutorial will cover the basics of RNA-seq using Galaxy; a open-source web-based platform for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.

More on Galaxy

The official Galaxy page has many tutorials on using the service, and examples of other types of analysis that can be performed on the platform.

Those eventually wanted to perform their own RNA-seq analysis (for example in R), should look out for other courses

Courses on analysing RNA-seq data in R


Background

Where do the data in this tutorial come from?

The data for this tutorial is from the paper, A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae by Nookaew et al. [1] which studies S.cerevisiae strain CEN.PK 113-7D (yeast) under two different metabolic conditions: glucose-excess (batch) or glucose-limited (chemostat).

The RNA-Seq data has been uploaded in NCBI, short read archive (SRA), with accession SRS307298. There are 6 samples in total-- two treatments with three biological replicates each sequenced paired-end. We have selected only the first read, and only two replicates of each condition to keep the data small for this workshop.

We have extracted chromosome I reads from the samples to make the tutorial a suitable length.

For this tutorial, we will assume that the wet-lab stages of the experiment have been performed and we are now in the right-hand branch of the workflow. In this tutorial we will demonstrate the steps of Quality assessment, alignment, quantification and differential expression testing.

Section 1: Preparation

1. Register as a new user on one of the public Galaxy servers

Make sure you check your email to activate your account

2. Import the RNA-seq data for the workshop.

We can going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores.

**Get Data -> Upload File **

You can import the data by:

  1. In the tool panel located on the left, under Basic Tools select Get Data > Upload File. Click on the Paste/Fetch data button on the bottom section of the pop-up window.

  2. Upload the sequence data by pasting the following links into the text input area. These two files are single-end samples from the batch condition (glucose-excess). Make sure the type is specified as 'fastqsanger' when uploading.

    These two files are single-end samples from the chem condition (glucose-limited). Make sure the type is specified as 'fastqsanger' when uploading.

    Then, upload this file of gene definitions. You do not need to specify the type for this file as Galaxy will auto-detect the file as a GTF file.
  3. You should now have these 5 files in your history:

    • batch1_chrI_1.fastq
    • batch2_chrI_1.fastq
    • chem1_chrI_1.fastq
    • chem2_chrI_1.fastq
    • genes.gtf

    These files can be renamed by clicking the pen icon if you wish.

    Note: Low quality reads have already been trimmed.

Each read is described by 4 lines in the file:-

The quality scores are ASCII representations of how confident we are that a particular base has been called correctly. Letters that are further along the alphabet indicate higher confidence. This is important when trying to identify types of genome variation such as single base changes, but is also indicative of the overall quality of the sequencing. Different scales have been employed over time (resulting in a different set of characters appearing in the file). We will need to tell Galaxy which scale has been used in order that we can process the data correctly.

Deriving the Quality Score

First of all, we convert the base-calling probability (p) into a Q score using the formula

Quality Scores to probabilities

  • look-up the ASCII code for each character
  • subtract the offset to get the Q score
  • convert to a probability using the formula:-

$$ p = 10^{-Q/10} $$

In practice, we don't have to convert the values as we have software that will do this automatically

Question:
  • How many lines of data are present in the file batch1_chrI_1.fastq
    • use the tool Text Manipulation -> Line/Word/Character count to find out
  • How many reads does this correspond to?
-----

Section 2: Quality assessment with FastQC

FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user's attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.

NGS ANALYSIS > NGS: QC and manipulation -> FastQC

  • Select one of the FASTQ files as input and Execute the tool.
  • When the tool finishes running, you should have an HTML file in your History. Click on the eye icon to view the various quality metrics.
  • Run Fastqc on the remaing fastq files, but don't examine the results just yet.

**Question: Do the data seem to be of reasonable quality? **

You can use the documentation to help interpret the plots

If poor quality reads towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to trim our data.

Search for the text *trim* and *adapter* in the Galaxy tool search box (top-right)

How many tools are available for the task of trimming and removing adapter contamination?

If time allows, use the trim galore tool to trim one of the fastq files and repeat the QC.

N.B. the fastq files that we are starting from have already undergone some processing, so you may not see much effect.

Combining QC reports

It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.

The multiqc tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.

NGS ANALYSIS > NGS: QC and manipulation -> Multiqc

Under Which tool was used generate logs? Choose fastqc and select the RawData output from the fastqc run on each of your bam files.

Question: do the fastq files seem to have consistently high-quality?

Section 3: Alignment

In this section we map the reads in our FASTQ files to a reference genome. As these reads originate from mRNA, we expect some of them will cross exon/intron boundaries when we align them to the reference genome. HISAT2 is a splice-aware mapper for RNA-seq reads that is based on Bowtie. It uses the mapping results from Bowtie to identify splice junctions between exons. More information on HISAT2 can be found here.

NGS: RNA Analysis > HISAT2

1. Map/align the reads with HISAT2 to the S. cerevisiae reference

In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > HISAT2 and set the parameters as follows:

  • Is this single-end or paired-end data? Single-end (as individual datasets)

  • RNA-Seq FASTQ file, forward reads:
    (Click on the multiple datasets icon and select all four of the FASTQ files)

    • batch1_chrI_1.fastq
    • batch2_chrI_1.fastq
    • chem1_chrI_1.fastq
    • chem2_chrI_1.fastq
  • Use a built in reference genome or own from your history: Use built-in genome

  • Select a reference genome: S. cerevisiae June 2008 (SGD/SacCer3) (sacCer3)

  • Use defaults for the other fields

  • Execute

Note: This may take a few minutes, depending on how busy the server is.

2. Rename the output files

You should have 5 output files for each of the FASTQ input files:

HISAT2 on data.. aligned reads (BAM)

It will be helpful to rename these to something shorter for the next steps.

  • batch1.bam
  • batch2.bam
  • chem1.bam
  • chem2.bam

About the bam file format

Unlike most of Bioinfomatics, a single standard file format has emerged for aligned reads. Moreoever, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq... data.

The bam file is a compressed, binary, version of a sam file.

The .sam file

  • Sequence Alignment/Map (sam)
  • The output from an aligner such as bwa
  • Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
  • May contain un-mapped reads
  • Potentially large size on disk; ~100s of Gb
    • Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less....
  • Official specification can be obtained online: -
  • We normally work on a compressed version called a .bam file. See later.
  • Header lines starting with an @ character, followed by tab-delimited lines
    • Header gives information about the alignment and references sequences used

The first part of the header lists the names (SN) of the sequences (chromosomes) used in alignment, their length (LN) and a md5sum "digital fingerprint" of the .fasta file used for alignment (M5).

@HD	VN:1.5	SO:coordinate	GO:none
@SQ	SN:1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ	SN:2	LN:243199373	M5:a0d9851da00400dec1098a9255ac712e
@SQ	SN:3	LN:198022430	M5:fdfd811849cc2fadebc929bb925902e5
@SQ	SN:4	LN:191154276	M5:23dccd106897542ad87d2765d28a19a1
.....
.....

Next we can define the read groups present in the file which we can use to identify which sequencing library, sequencing centre, Lane, sample name etc.

@RG	ID:SRR077850	CN:bi	LB:Solexa-42057	PL:illumina	PU:ILLUMINA	SM:NA06984
@RG	ID:SRR081675	CN:bi	LB:Solexa-42316	PL:illumina	PU:ILLUMINA	SM:NA06984
@RG	ID:SRR080818	CN:bi	LB:Solexa-44770	PL:illumina	PU:ILLUMINA	SM:NA06984
@RG	ID:SRR084838	CN:bi	LB:Solexa-42316	PL:illumina	PU:ILLUMINA	SM:NA06984
@RG	ID:SRR081730	CN:bi	LB:Solexa-42316	PL:illumina	PU:ILLUMINA	SM:NA06984
.....
.....

Finally, we have a section where we can record the processing steps used to derive the file

@PG	ID:MosaikAligner	CL:/share/home/wardag/programs/MOSAIK/bin/MosaikAligner -in /scratch/wardag/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114.mkb -out
....
....

Next is a tab-delimited section that describes the alignment of each sequence in detail.

SRR081708.237649	163	1	10003	6	1S67M	=	10041	105	GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA	S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q	ZA:Z:<&;0;0;;308;68M;68><@;0;0;;27;;>MD:Z:5A11A5A11A5A11A13	RG:Z:SRR081708	NM:i:6	OQ:Z:GEGFFFEGGGDGDGGGDGA?DCDD:GGGDGDCFGFDDFFFCCCBEBFDABDD-D:EEEE=D=DDDDC:

Column Official Name Brief
1 QNAME Sequence ID
2 FLAG Sequence quality expressed as a bitwise flag
3 RNAME Chromosome
4 POS Start Position
5 MAPQ Mapping Quality
6 CIGAR Describes positions of matches, insertions, deletions w.r.t reference
7 RNEXT Ref. name of mate / next read
8 PNEXT Postion of mate / next read
9 TLEN Observed Template length
10 SEQ Sequence
11 QUAL Base Qualities

There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the RG tag which refers back to the read groups in the header.

Dr Mark Dunning presents....Fun with flags!

The "flags" in the sam file can represent useful QC information

  • Read is unmapped
  • Read is paired / unpaired
  • Read failed QC
  • Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value

For instance, a particular read has a flag of 163

Derivation

There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.

 	ReadHasProperty 	Binary 	MultiplyBy
isPaired 	TRUE 	1 	1
isProperPair 	TRUE 	1 	2
isUnmappedQuery 	FALSE 	0 	4
hasUnmappedMate 	FALSE 	0 	8
isMinusStrand 	FALSE 	0 	16
isMateMinusStrand 	TRUE 	1 	32
isFirstMateRead 	FALSE 	0 	64
isSecondMateRead 	TRUE 	1 	128
isSecondaryAlignment 	FALSE 	0 	256
isNotPassingQualityControls 	FALSE 	0 	512
isDuplicate 	FALSE 	0 	1024

Value of flag is given by

1x1 + 1x2 + 0x4 + 0x8 + 0x16 + 1x32 + 0x64 + 1x128 + 0x256 + 0x512 + 0x1024 = 163

See also

Have a CIGAR!

The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.

Code Description
M alignment match
I insertion
D deletion
N skipped
S soft-clipping
H hard-clipping

e.g.

  • 68M
    • 68 bases matching the reference
  • 1S67M
    • 1 soft-clipped read followed by 67 matches
  • 15M87N70M90N16M
    • 15 matches following by 87 bases skipped followed by 70 matches etc.

Rather than dealing with .sam files, we usually analyse a .bam file.


Quality assessment of the aligned reads

flagstat will calculate the flag for each read in the bam file and tabulate the results.

*Sam Tools -> Flagstat*

idxstats will report the number of reads mapping to each reference sequence (i.e. chromosome)

*Sam Tools -> IdxStats*

Section 4. Visualise the aligned reads with IGV

Download the bam files you have created in the previous step by clicking the disk icon on the right-hand panel. Make sure to click both the Download dataset and Download index buttons. We will now visualise the alignments using the Integrative Genomics Viewer (IGV).

Introducing the IGV Browser

Whilst tools like R are very powerful and allow you to perform statistical analyses and test hypotheses, there is no substitute for looking at the data. A trained-eye can quite quickly get a sense of the data quality before any computational analyses have been run. Futhermore, as the person requesting the sequencing, you probably know a lot about the biological context of the samples and what to expect.

  • IGV has been developed by the Broad Institute and is able to display most kinds of genomic data
    • expression
    • ChIP
    • whole-genome resequencing
    • shRNA
  • It is a Java desktop application and can be run either locally of from the Broad website
    • You have a link to IGV on the taskbar on the left the screen in-front of you
  • To run IGV yourself you will need to agree to the license and download the version for your OS

A quick tour of IGV

For more details

  1. Sample information panel

    • Information about samples you have loaded
    • e.g. Sample ID, Gender, Age, Tumour / Normal
  2. Genome Navigation panel

    • Jump to a genomic region in Chr:Start-End format
    • Jump to a gene symbol of interest
  3. Data panel

    • Your sequencing reads will be displayed here
    • Or whatever data you have loaded
  4. Attribute panel

    • Gene locations
    • Genome sequence (if zoomed-in at appropriate level)
    • Proteins

Example

Go to File -> Load from file and select the aligned bam files from HISAT2. Note that the index files .bai need to be present in the same directory. However, you only need to click on the .bam

  • The black dotted vertical lines indicates the centre of the view
  • Each of the grey pointed rectangles represents a sequencing reads
    • whether the pointed bit is on the left or right indicates if the read is forward or reverse.
  • A coverage track is also generated
  • You should see the read that we described in detail in the previous section by hovering over the reads to display the information from the .bam file

The view in IGV is not static and we can scroll-along the genome by holding-down the left mouse in the data panel and dragging left and right

It's worth noting that the display settings may be showing fewer reads than you have (downsampling) in order to conserve memory. Also, some QC-fail or PCR duplicates may be filtered.

We also have some options on how to display the reads themselves, which we can acccess by right-clicking on the bam track

Sorting alignments by:-

  • start
  • strand
  • base
  • mapping quality
  • insert size

The reads themselves can also be coloured according to

  • insert size
  • read strand
  • sample

Additional data tracks

Additional data tracks are also available on the IGV server. These include useful genome annotation tracks, such as:-

  • GC percentage
  • CpG islands
  • Repeat regions
  • dbSNP calls

Section 5. Quantification (Counting reads in features)

In order to test for differential expression, we need to count up how many times each "feature" is observed is each sample. The goal of such operations is to produce a count table such as that shown below. We can then apply statistical tests to these data

HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features in the genes.gtf. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.

Various rules can be used to assign counts to features

  1. Use HTSeq-count to count the number of reads for each feature.
    In the left tool panel menu, under NGS Analysis, select NGS Analysis > htseq-count* and set the parameters as follows:

    • Gene model (GFF) file to count reads over from your current history: genes.gtf
    • bam/sam file from your history:
      (Select one of six bam files)
      • batch1-accepted_hits.bam
    • Use defaults for the other fields
    • Execute
  2. Repeat for the remaining bam files

Section 6: DEseq2

DESeq2 is an R package, that is used for analysing differential expression of RNA-Seq data and can either use exact statistical methods or generalised linear models.

In the Galaxy tool panel, under NGS Analysis, select NGS: RNA Analysis > DESeq2 and set the parameters as follows:

  • 1. Factor level Batch
  • Count files
    • batch1-htseq
    • batch2-htseq
  • 2. Factor level: Chem
  • Select columns containing control:
    • chem1-htseq
    • chem2-htseq
  • Use defaults for the other fields
  • Execute

2. Examine the outputs from the previous step

  1. Examine the DeSeq2 result fileby clicking on the eye icon. This file is a list of genes sorted by p-value from using DESeq2 to perform differential expression analysis.
  2. Examine the DeSeq2 plots file. This file has some plots from running DESeq2, including PCA and clustering showing relationships between samples

3. Extract the significant differentially expressed genes.

Under Basic Tools, click on Filter and Sort > Filter:

  • Filter: DESeq2 results file
  • With following condition: c7 < 0.05 and (c3 > 1.0 or c3 < -1.0)
  • Number of header lines to skip: 1
  • Execute

This will keep the genes that have an adjusted p-value (column 7 in the table) of less or equal to 0.05 and have a fold change of greater than 1 or less than -1. There should be 22 genes in this file. Rename this file by clicking on the pencil icon of and change the name from "Filter on data x" to DESeq2_Significant_DE_Genes

Exercise:

Why do you think it is important to use the adjusted p-value to select which genes are differentially-expressed. Why might you also want to specify a fold-change cutoff? Discuss with your neighbours

Create a count matrix

The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools which require the counts to be displayed in a data matrix where each row is a gene and each column is a particular sample in the dataset.tmp

*RNA Analysis -> Generate count matrix*
  • Select the count files from your history batch1.htseq, batch2.htseq, etc...
  • Keep Column containing gene IDs and Column containing gene counts to 1 and 2 respectively.
  • Rename the output to raw counts and Download to your computer

Interactive exploration of the results with DEGUST

References

[1] Nookaew I, Papini M, Pornputtpong N, Scalcinati G, Fagerberg L, Uhlén M, Nielsen J: A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res 2012, 40 (20):10084 – 10097. doi:10.1093/nar/gks804. Epub 2012 Sep 10

[2] Guirguis A, Slape C, Failla L, Saw J, Tremblay C, Powell D, Rossello F, Wei A, Strasser A, Curtis D: PUMA promotes apoptosis of hematopoietic progenitors driving leukemic progression in a mouse model of myelodysplasia. Cell Death Differ. 2016 Jun;23(6)