Skip to content

RNASeq II practical Count reads and differential expression analysis

Nourolah edited this page Nov 20, 2018 · 13 revisions

Get sorted bam files

Option 1: Use your own sorted bam file from last lab (STAR mapping).

Option 2: Copy the file to your directory

cd /lustre/haven/courses/EPP622-2018Fa/$USER
mkdir RNASeq_lab_II && cd RNASeq_lab_II
mkdir htseq && cd htseq

Count reads using htseq

Get an interactive session first:

qsub -I -A ACF-UTK0085 -q debug -l nodes=1:ppn=1

Required and optional arguments in htseq

  • Required: htseq-count [options] alignment_file gff_file

  • Options:

    • --format: type of alignment_file data
    • --order: sorting order of alignment_file
    • --stranded: whether the data is from a strand-specific assay
    • --type: feature type (3rd column in gff_file) to be used
    • --idattr: GFF attribute to be used as feature ID

To extract counts from one sorted bam file

module load htseq

htseq-count --format=bam \
--order=pos \
--stranded=no \
--type=gene \
--idattr=gene_id \
../../RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam \
../../RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf \
> DRR016140.counts.txt

To extract counts from one sorted bam file through job submission to ACF

nano htseq_count.qsh
#PBS -A ACF-UTK0085
#PBS -l nodes=1,walltime=1:00:00

cd $PBS_O_WORKDIR

module load htseq

htseq-count --format=bam \
--order=pos \
--stranded=no \
--type=gene \
--idattr=gene_id \
../../RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam \
../../RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf \
> DRR016140.counts.txt
chmod x+u htseq_count.qsh
qsub htseq_count.qsh

To extract counts from all 16 sorted bam files

nano count_reads.sh
for f in alignment_bam_output/*.bam

do

 names=$(echo $f | grep -o "DRR0161[0-9]*") 
 echo "The target bam file is: "$names
 echo "==================================================="
 filename=$(basename $f)
 base=${filename%%.bam*}
 /lustre/haven/courses/EPP622-2018Fa/software/htseq-0.9.1/bin/htseq-count --format=bam --order=pos --stranded=no \
 --type=gene --idattr=gene_id $f\
 Arabidopsis_thaliana.TAIR10.28.gtf \
 > $base.counts.txt 2> $base.out

 echo "The count data has been done for "$names
 echo "==================================================="

done

Change the file mode to make it executable, then run the command

chmod u+x count_reads.sh
./count_reads.sh

Since, for each bam file a range of 30-60 min is required to process the count reads, first copy all 16 results I obtained a few days ago to your directory.

cp -r /lustre/haven/courses/EPP622-2018Fa/nsoltan1/rna_seq/2nd_sec/htseq_counts/ .

Then, transfer all .txt files to desktop of your local computer

scp -r [email protected]:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_II/htseq/htseq_counts/ .

Differential gene expression through R envionment

Install DESeq2 and other required packages

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
install.packages("colorspace")
install.packages("grid")
library("colorspace")
library("grid")
library("DESeq2")

Load data into R

# Change working directory to the folder of read counts
setwd ("~/Desktop/htseq_counts")     

# Get file names
fileName <- list.files(path = ".", pattern = "*.txt")

# Get sample names
sampleName <- sapply(strsplit(fileName, "[.]"), "[", 1)

# Treatment
treatment <- factor(c("control","ABA","saline","dehydration","control","ABA","saline","dehydration","control","ABA","saline","dehydration","control","ABA","saline","dehydration"))

# Phenotype
phenotype <- factor(c("wildtype","wildtype","wildtype","wildtype","ros1-3","ros1-3","ros1-3","ros1-3","dml2-3","dml2-3","dml2-3","dml2-3","ros1dml2-3","ros1dml2-3","ros1dml2-3","ros1dml2-3"))

# Build the sample table
SampleTable <- data.frame(sampleName = sampleName, fileName = fileName, phenotype= phenotype, treatment = treatment)

Differential expression analysis

dds <- DESeqDataSetFromHTSeqCount(sampleTable = SampleTable, directory = ".", design = ~ phenotype + treatment)

# Use control as the reference level or basis of the comparison
dds$treatment <- relevel (dds$treatment, "control")

# Pre-filtering: discard rows that have 0 for all treatments
dim(dds)  ## before filtering
dds = dds[rowSums(counts(dds))>1, ]
dim(dds)  ## after filtering

# Creating deseq2 object
dds <- DESeq(dds)

Results

res <- results(dds)
res
log2 fold change (MLE): treatment saline vs control 
Wald test p-value: treatment saline vs control 
DataFrame with 25861 rows and 6 columns
              baseMean log2FoldChange     lfcSE       stat       pvalue         padj
             <numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric>
AT1G01010     88.11433     1.21489045 0.2229059  5.4502382 5.030239e-08 1.915651e-07
AT1G01020    104.77475    -0.04511151 0.1720379 -0.2622185 7.931530e-01 8.374940e-01
AT1G01030     53.02130     1.68283424 0.2594165  6.4869983 8.756333e-11 4.246439e-10
AT1G01040    316.03974    -0.76679911 0.1069656 -7.1686523 7.573964e-13 4.330702e-12
AT1G01050    785.11896     0.34728986 0.1074551  3.2319521 1.229477e-03 2.678463e-03
...                ...            ...       ...        ...          ...          ...
ATMG01370 7.947239e+00     -2.5244264 0.7245580 -3.4840916 0.0004938104  0.001145231
ATMG01380 1.553363e+01      0.6046080 0.4544989  1.3302737 0.1834280940  0.248830093
ATMG01390 1.198452e+04      0.3714974 0.2194473  1.6928778 0.0904787304  0.134277100
ATMG01400 4.436177e-01      1.1208890 2.6229040  0.4273466 0.6691269140           NA
ATMG01410 1.694939e+00     -1.6663480 1.2916832 -1.2900593 0.1970300634  0.264871917

Results explanation:

We choose gene ATMG01410 as an example

  • baseMean: the average of normalized counts across all samples. This represents the intercept of your GLM.

  • log2Foldchange: (stress saline vs control): log2(treated/untreated). Here it is log2(saline/control)

  • lfcSE: standard error of the log2FoldChange estimate

  • stat: statistic for the hypothesis test. For Wald test, it is log2FoldChange/lfcSE.

  • pvalue: the corresponding p-value from Wald test (or likelihood ratio test)

  • padj: adjusted p value due to multiple comparisons.

  • reasons for NA values:

    • This gene has 0 count for all samples
    • This row has an extreme count outlier
    • Filtered by automatic independent filtering for having low mean count.

Get a summary of the results:

summary(res)

We can extract results of comparisons between other levels of treatment:

results(dds, contrast=c("treatment", "control", "ABA"))

We can also extract results of comparisons from the phenotype factor

results(dds, contrast=c("phenotype", "wildtype", "ros1-3"))

MA-plot of the results

plotMA(res)

PCA plot

rld <- rlog(dds)
plotPCA(rld, intgroup = c("treatment", "phenotype"))

Plot counts

We select the gene which has the smallest padj value to plot its count at different levels

mostSigGene <- rownames(dds)[which.min(res$padj)]
mostSigGene
par(mfcol=c(1,2))
plotCounts(dds, gene=mostSigGene, intgroup="treatment")
plotCounts(dds, gene=mostSigGene, intgroup="phenotype")

Exporting results into CSV files

# Reorder significant gene names based on the adjusted p-value

resOrdered <- res[order(res$padj),]

# Cut-off all genes with p < 0.05

sigOrderedRes <- subset(resOrdered, padj < 0.05)

# Extract differential regulated genes and save them to directory

write.csv(as.data.frame(sigOrderedRes), "TREATMENT_saline_vs_control.csv")
Clone this wiki locally