count.Rmd

---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    toc: yes
    toc_float: yes
minutes: 300
layout: page
subtitle: Counting
bibliography: ref.bib
---


## Introduction and data import

For the purpose of this workshop, we are going to be working with a small part of the mouse reference genome (chromosome 1) to demonstrate how to do read counting using `Subread`.

The raw reads, in fastq files, have been aligned using Bowtie2. The bam files containing the aligned reads can be found in the **`RNAseq/data directory`** under the **`Course_Materials`**.

## Counting

Once our reads have been aligned against the genome, we need to summarise the information across genes or exons. The alignment process produces a set of BAM files, where each file contains the read alignments for each library. In the BAM file, there is a chromosomal location for every read that mapped uniquely. We can determine if the region each read is aligned to corresponds to a particular gene or exon and then summarise across the entire BAM file to get total read counts for each gene or exon. 

We will use **`featureCounts`** programme from the [subRead package](http://subread.sourceforge.net/) to do the counting. In addition to the BAM files, we also need to provide **`featureCounts`** with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF  format  can  be  found  at  [UCSC  website](http://genome.ucsc.edu/FAQ/FAQformat.html#format4)). **`featureCounts`** can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.

GTF/GFF files define genomic regions covered by different types of genomic features, e.g. genes, transcripts, exons, or UTRs. When using a GTF/GFF file we need to tell **`featureCounts`** what feature type to use to count reads, and what attribute type to summarise the results at. For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level. 
Lets have a quick look at the top of a GTF file so we can see what data it contains and what **feature type** and **attribute type** mean:

```
cd RNASeq/data
  
head mmu.mm10.gtf
```

The code below uses **`featureCounts`** to count reads in a bam file against a GTF for the GRCh38 genome assembly.

```
  featureCounts \
      --primary \
      -C \
      -t exon \
      -g gene_id \
      -a mmu.mm10.gtf \
      -o MCL1.DJ.featureCounts \
      MCL1.DJ.bam
```
* **`--primary`** - only count primary alignment
* **`-C`** - do not count reads where the pairs are mapped to different chromosomes
* **`-t exon`** - the **feature** type to count reads against, in this case exons
* **`-g gene_id`** - the **attribute** type to summarise counts by, in this case the gene ID

**`featureCounts`** has many additional options that can be used to alter the ways in which it does the counting.

```
featureCounts --help
```

Running featureCounts generates two output file. A summary statistics table (**`MCL1.DJ.featureCounts.summary`**) and a full table of counts (**`MCL1.DJ.featureCounts`**) for each feature (gene in this case). Let take a look at each file.

```
cat MCL1.DJ.featureCounts.summary
```

The summary table reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. See [subread documentation](http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf) ('Program output' section).

```
head MCL1.DJ.featureCounts
```

The full results table begins with a line containing the command used to generate the counts. It then has a table of 7 columns. The first column is the gene identifier, this will vary depending on the GTF file used, in our case this is a UCSC gene id. The second to fifth columns describe the genes location, and the sixth column is the length of the gene. The final column contains the number of reads assigned to the gene. Note that **`featureCounts`** outputs a row for every gene in the GTF, even the ones with no reads assigned, and the row order is determined by the order in the GTF. This means that if featureCounts is used on mutliple samples with same GTF file, the separate files can be combined easily as the rows always refer to the same gene.


> ## Challenge {.challenge}
>
> 1. Redo the counting over the exons, rather than the genes. Use `featureCounts --help` to find the option you need to use. Make sure featureCounts outputs the results to a new file.
> 1. Redo the counting over genes, allowing for multimapping reads. Compare the results to our intial counts.
>

Notes

* If you are sequencing your own data, the sequencing facility will almost always provide fastq files.  
* For publicly available sequence data from GEO/SRA, the files are usually in the Sequence Read Archive
(SRA) format. Prior to read alignment, these files need to be converted into the
FASTQ format using the fastq-dump utility from the SRA Toolkit. See http:
//www.ncbi.nlm.nih.gov/books/NBK158900 for how to download and use the
SRA Toolkit.