-
Notifications
You must be signed in to change notification settings - Fork 12
RNASeq II practical Count reads and differential expression analysis
Option 1: Use your own sorted bam file from last lab (STAR mapping).
Option 2: Copy the file to your directory
cd /lustre/haven/courses/EPP622-2018Fa/$USER
mkdir RNASeq_lab_II && cd RNASeq_lab_II
mkdir htseq && cd htseq
Get an interactive session first:
qsub -I -A ACF-UTK0085 -q debug -l nodes=1:ppn=1
-
Required:
htseq-count [options] alignment_file gff_file
-
Options:
-
--format
: type of alignment_file data -
--order
: sorting order of alignment_file -
--stranded
: whether the data is from a strand-specific assay -
--type
: feature type (3rd column in gff_file) to be used -
--idattr
: GFF attribute to be used as feature ID
-
module load htseq
htseq-count --format=bam \
--order=pos \
--stranded=no \
--type=gene \
--idattr=gene_id \
../../RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam \
../../RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf \
> DRR016140.counts.txt
nano htseq_count.qsh
#PBS -A ACF-UTK0085
#PBS -l nodes=1,walltime=1:00:00
cd $PBS_O_WORKDIR
module load htseq
htseq-count --format=bam \
--order=pos \
--stranded=no \
--type=gene \
--idattr=gene_id \
../../RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam \
../../RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf \
> DRR016140.counts.txt
chmod x+u htseq_count.qsh
qsub htseq_count.qsh
nano count_reads.sh
for f in alignment_bam_output/*.bam
do
names=$(echo $f | grep -o "DRR0161[0-9]*")
echo "The target bam file is: "$names
echo "==================================================="
filename=$(basename $f)
base=${filename%%.bam*}
/lustre/haven/courses/EPP622-2018Fa/software/htseq-0.9.1/bin/htseq-count --format=bam --order=pos --stranded=no \
--type=gene --idattr=gene_id $f\
Arabidopsis_thaliana.TAIR10.28.gtf \
> $base.counts.txt 2> $base.out
echo "The count data has been done for "$names
echo "==================================================="
done
Change the file mode to make it executable, then run the command
chmod u+x count_reads.sh
./count_reads.sh
Since, for each bam file a range of 30-60 min is required to process the count reads, first copy all 16 results I obtained a few days ago to your directory.
cp -r /lustre/haven/courses/EPP622-2018Fa/nsoltan1/rna_seq/2nd_sec/htseq_counts/ .
Then, transfer all .txt
files to desktop of your local computer
scp -r [email protected]:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_II/htseq/htseq_counts/ .
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
install.packages("colorspace")
install.packages("grid")
library("colorspace")
library("grid")
library("DESeq2")
# Change working directory to the folder of read counts
setwd ("~/Desktop/htseq_counts")
# Get file names
fileName <- list.files(path = ".", pattern = "*.txt")
# Get sample names
sampleName <- sapply(strsplit(fileName, "[.]"), "[", 1)
# Treatment
treatment <- factor(c("control","ABA","saline","dehydration","control","ABA","saline","dehydration","control","ABA","saline","dehydration","control","ABA","saline","dehydration"))
# Phenotype
phenotype <- factor(c("wildtype","wildtype","wildtype","wildtype","ros1-3","ros1-3","ros1-3","ros1-3","dml2-3","dml2-3","dml2-3","dml2-3","ros1dml2-3","ros1dml2-3","ros1dml2-3","ros1dml2-3"))
# Build the sample table
SampleTable <- data.frame(sampleName = sampleName, fileName = fileName, phenotype= phenotype, treatment = treatment)
dds <- DESeqDataSetFromHTSeqCount(sampleTable = SampleTable, directory = ".", design = ~ phenotype + treatment)
# Use control as the reference level or basis of the comparison
dds$treatment <- relevel (dds$treatment, "control")
# Pre-filtering: discard rows that have 0 for all treatments
dim(dds) ## before filtering
dds = dds[rowSums(counts(dds))>1, ]
dim(dds) ## after filtering
# Creating deseq2 object
dds <- DESeq(dds)
res <- results(dds)
res
log2 fold change (MLE): treatment saline vs control
Wald test p-value: treatment saline vs control
DataFrame with 25861 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AT1G01010 88.11433 1.21489045 0.2229059 5.4502382 5.030239e-08 1.915651e-07
AT1G01020 104.77475 -0.04511151 0.1720379 -0.2622185 7.931530e-01 8.374940e-01
AT1G01030 53.02130 1.68283424 0.2594165 6.4869983 8.756333e-11 4.246439e-10
AT1G01040 316.03974 -0.76679911 0.1069656 -7.1686523 7.573964e-13 4.330702e-12
AT1G01050 785.11896 0.34728986 0.1074551 3.2319521 1.229477e-03 2.678463e-03
... ... ... ... ... ... ...
ATMG01370 7.947239e+00 -2.5244264 0.7245580 -3.4840916 0.0004938104 0.001145231
ATMG01380 1.553363e+01 0.6046080 0.4544989 1.3302737 0.1834280940 0.248830093
ATMG01390 1.198452e+04 0.3714974 0.2194473 1.6928778 0.0904787304 0.134277100
ATMG01400 4.436177e-01 1.1208890 2.6229040 0.4273466 0.6691269140 NA
ATMG01410 1.694939e+00 -1.6663480 1.2916832 -1.2900593 0.1970300634 0.264871917
Results explanation:
We choose gene ATMG01410 as an example
-
baseMean: the average of normalized counts across all samples. This represents the intercept of your GLM.
-
log2Foldchange: (stress saline vs control): log2(treated/untreated). Here it is log2(saline/control)
-
lfcSE: standard error of the log2FoldChange estimate
-
stat: statistic for the hypothesis test. For Wald test, it is log2FoldChange/lfcSE.
-
pvalue: the corresponding p-value from Wald test (or likelihood ratio test)
-
padj: adjusted p value due to multiple comparisons.
-
reasons for NA values:
- This gene has 0 count for all samples
- This row has an extreme count outlier
- Filtered by automatic independent filtering for having low mean count.
Get a summary of the results:
summary(res)
We can extract results of comparisons between other levels of treatment:
results(dds, contrast=c("treatment", "control", "ABA"))
We can also extract results of comparisons from the phenotype factor
results(dds, contrast=c("phenotype", "wildtype", "ros1-3"))
MA-plot of the results
plotMA(res)
PCA plot
rld <- rlog(dds)
plotPCA(rld, intgroup = c("treatment", "phenotype"))
Plot counts
We select the gene which has the smallest padj value to plot its count at different levels
mostSigGene <- rownames(dds)[which.min(res$padj)]
mostSigGene
par(mfcol=c(1,2))
plotCounts(dds, gene=mostSigGene, intgroup="treatment")
plotCounts(dds, gene=mostSigGene, intgroup="phenotype")
# Reorder significant gene names based on the adjusted p-value
resOrdered <- res[order(res$padj),]
# Cut-off all genes with p < 0.05
sigOrderedRes <- subset(resOrdered, padj < 0.05)
# Extract differential regulated genes and save them to directory
write.csv(as.data.frame(sigOrderedRes), "TREATMENT_saline_vs_control.csv")