-
Notifications
You must be signed in to change notification settings - Fork 10
/
50-rnaseq.Rmd
1371 lines (1011 loc) · 48.8 KB
/
50-rnaseq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Differential expression analysis {#sec-rnaseq}
**Learning Objectives**
The goal of this chapter is
- to understand the different theorical concepts behind a differential expression analysis
- to provide a real-life example of DE analysis analysis running DESeq2 step-by-step
## Theory behind DESeq2
A large number of computational methods have been developed for differential expression analysis
[@Seyednasrollah:2015], differing slightly in their methodology. Here we will present DESeq2,
a widely used bioconductor package dedicated to this type of analysis. For more information read
the original paper [@Huber:2014] and the
[DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
The starting point of the analysis is a count matrix, and the goal is to identify
genes that are differentially expressed between samples.
The different steps of the analysis are illustrated in the figure below. Briefly,
DESeq2 starts by estimating scaling factors. Then, it estimates the gene-wise
dispersions and shrinks these estimates to generate more accurate estimates of dispersion
to model the counts. Finally, DESeq2 fits a generalized linear model, performs hypothesis
testing and generates a list of differentially expressed genes.
```{r, echo=FALSE, fig.align='center', out.width='60%', purl=TRUE}
knitr::include_graphics("./figs/deseq2_steps.png")
```
### Size factor estimation
DESeq2 expects as an input a matrix of raw counts (un-normalised counts).
These counts are supposed to reflect gene abundance (what we are interested in),
but they are also dependent on other less interesting factors such as gene length,
sequencing biases, sequencing depth or library composition. DESeq2 will estimate
scaling factors that will be used internally to account for the “uninteresting”
factors rendering the expression levels more comparable between samples.
- **Gene length**
As illustrated in the example below, gene 1 and gene 2 have similar levels of expression,
but many more reads map to gene 2 than to gene 1. This might not be related to biology
but it could just reflect the fact that gene 2 is longer. Taking into account gene lengths
in the normalisation procedure is important when the purpose is to compare gene expression
levels within the same sample. However, for differential expression analysis, as gene
expression levels are compared between samples, it is not necessary to use gene lengths to
estimate the scaling factors (it is even not recommended). It was just mentioned here for
information because many RNAseq common normalisation methods such as TPM (transcripts per
million), FPKM (fragments per million), or RPKM (reads per million) take into account gene
lengths.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/read_length.png")
```
- **Sequencing depth**
Each sequencing experiment will produce a certain number of reads
expected to be typically around tens of millions. A fraction of the
raw sequencing reads will be discarded during the quality control, the
alignment and the counting processes, which implies that the total
number of reads for each sample will be different.
As shown in the following example, all genes seem to be expressed at
higher levels in sample 1 than in sample 2, but this is likely because
sample 1 has twice more reads than sample 2. Accounting for sequencing
depth is necessary for differential expression analysis as samples are
compared with each other.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/Sequencing_depth.png")
```
- **Library composition**
The library composition might also be different across samples. To
illustrate this, let's imagine a basic cell expressing only 2 genes
(genes 1 and 2) and assume that a drug treatment induces a strong
expression of gene 3. If the normalisation was done using total number
of reads only, then the counts of gene 1 would be divided by 15 in
control cells, while it would be divided by 165 in treated cells. This
would lead to the misleading conclusion that the treatment has reduced
11 times the expression of gene 1. In this case, the library
composition has changed but not the expression level of gene 1.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/library_composition.png")
```
In a real dataset, a few highly differentially expressed genes,
differences in the number of genes expressed between samples, or
presence of contaminations can skew library composition. Accounting
for it is highly recommended for accurate comparison of expression
between samples [@Anders:2010].
- **DESeq2 normalisation method**
DESeq2 will estimate size factors in a way that takes into account both
library size and library composition, using the median of ratios method:
```{r, echo=FALSE, fig.align='center', out.width='60%', purl=TRUE}
knitr::include_graphics("./figs/SF_formula.png")
```
Let's try to understand what is behind this formula.
**Step 1**: DESeq2 creates a pseudo-reference sample by calculating a
row-wise geometric mean (for each gene). Geometric mean is used instead
of classical mean because it uses log values. It is hence more robust
as it is less influenced by extreme values.
**Step 2**: For every gene in every sample, ratios of
counts/pseudo-reference sample are calculated.
**Step 3**: The median value of all ratios for a given sample is taken
as the scale factor for that sample. Importantly, the method is based
on the assumption that the majority of genes are not differentially
expressed, which implies that rare genes that are really up-regulated
or down-regulated should not influence the median. Furthermore, the
median is calculated skipping genes with a geometric mean of
zero. This will hence automatically eliminate genes expressed in some
samples but not in others and will help to focus the scaling factor on
housekeeping genes.
**Step 4**: Normalised counts can be obtained by dividing each raw count
values in a given sample by that sample’s scaling factor.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/SF_steps.png")
```
### Count modeling
Let's first have a look at the counts distribution for a typical
RNAseq sample:
```{r, echo=FALSE, fig.align='center', out.width='80%', purl=TRUE}
knitr::include_graphics("./figs/distribution_of_counts.png")
```
It is obvious that the count data is not normally distributed. Counts
are integer values, always positive, and we observe a large number of
genes with low counts (or counts about zero), and a few number of
genes with a very high count level.
As seen in the [WSBIM1322 course](http://bit.ly/WSBIM1322) with the
example of the coin toss, count data are often modelised by a binomial
distribution with parameters n and p where p is the discrete
probability distribution of the number of successes in a sequence of n
independent experiments. In an RNAseq experiment, p would be the
probability of getting a read associated to a particular gene given
that n total number of reads were sequenced in the
experiment. However, when n is large and p is low, Poisson
distribution is used instead of binomial. It describes typically the
distribution of a rare event in a large population, which fits better
to RNAseq. Indeed, for each sample, the total number of reads tends to
be in millions, while the counts per gene can vary considerably but
tend to be in tens, hundreds or thousands. Therefore, the chance of a
given read to be mapped to any specific gene is extremely small.
The Poisson distribution has only one parameter indicating its
expected mean. Its variance and all other properties follow from
it. In particular, one key assumption of the Poisson distribution is
that the variance equals the mean.
$$K_{ij} \sim P(µ_{ij} = \sigma^2_{ij})$$
Applying a Poisson distribution to
Rnaseq counts holds true when comparing technical replicates from a
same sample, where the variance only reflects the counting noise. But
when comparing biological replicates, counting noise is not the only
source of variance. The observed count values for each gene within
biological replicates fluctuate more importantly, due to the
combination of biological and technical factors: inter-individual
variations in gene expression, sample purity, cell reponses to
environment (e.g. heat-shock)... Due to this overdispersion, the
Poisson distribution doesn't fit that well to RNAseq counts.
Actually, RNAseq counts are better modelised by an alternative
distribution, the negative-binomial. It is derived from the Poisson
distribution but the negative-binomial distribution has, in addition
to the mean parameter, an extra parameter $α$ called the “dispersion”
parameter to model this “extra” variance that is empirically observed
in RNA-Seq experiments.
$$K_{ij} \sim NB(mean = µ_{ij}, dispersion = \alpha_i)$$
The dispersion parameter $\alpha_i$ defines the relationship between
the variance of the observed count and its mean value[^small_alpha].
[^small_alpha]: Note that as dispersion parameter gets lower and lower,
the variance converges to the same value as the mean, and the negative
binomial turns into a Poisson distribution.
$$VAR(K_{ij}) = µ_{ij} + \alpha_i.µ_{ij}^2$$
Having modelised counts by a negative-binomial distribution, next step
is to estimate, for each gene, the two parameters of the distribution
(mean and dispersion). The mean will be estimated easily from the
observed normalised counts in both conditions, but the dispersion is
not that trivial to estimate accurately.
### Dispersion estimation
Dispersion is a measure of variability in the data ($α = CV^2$).
A gene with a dispersion value of 0.04 means 20% variation
around the expected mean. Estimate the dispersion for each gene would
be quite straightforward if we had for each condition, hundreds of
replicates. Of course, this is not feasible for economic reasons, and
experiments are often done on only 3 replicates. But how to
estimate dispersion reliably based on such a little number of samples?
To overcome this problem, DESeq2 makes the assumption that genes of
similar expression levels have similar dispersions and it will use
information coming from other genes expressed at similar level.
**Step1: Dispersion for each gene is estimated separately.** An
initial estimation of dispersion for each gene is first estimated
using maximum likelihood estimation. In other words, given the count
values of the replicates, the most likely estimate of dispersion is
calculated. For each gene, the dispersion estimate is plotted in
function of the mean expression level (mean counts of replicates).
This produce the so-called "dispersion plot" where each gene is
represented by a black dot.
```{r, echo=FALSE, fig.align='center', out.width='80%', purl=TRUE}
knitr::include_graphics("./figs/dispersion_plot.png")
```
Note that the dispersion plot highlights an intrinsic feature of
RNAseq data: genes with low read counts show substantially higher
dispersion than highly expressed genes.
**Step 2: A curve is fitted to gene-wise dispersion estimates**. A
curve is fitted (displayed as a red line in the dispersion plot),
which represents the estimate for the expected dispersion value for
genes of a given expression strength. The idea behind fitting a curve
to the data is that different genes will have different scales of
biological variability, but, over all genes, there will be a
distribution of reasonable estimates of dispersion.
**Step 3: Shrinkage of gene-wise dispersion estimates toward the
values predicted by the curve**. Initial gene-wise dispersion
estimates will be shrinked (by an empirical Bayes approach) towards
this fitted curve to obtain the final dispersion estimates. The strength
of the shrinkage for each gene will depend on how close the dispersion
values are from the curve, and on the number of samples available (the
more sample (as the sample size increases, the shrinkage decreases in
strength, and eventually becomes negligible).
adjusted dispersion values are represented by the blue dots in the
dispersion plot. For a certain number of genes, the adjusted
dispersion will be significantly increased and this will limit the
number of false-positive that could arise from an underestimated
dispersion. Dispersion estimates that are slightly above the curve are
also shrunk toward the curve. However, genes with extremely high
dispersion[^outlierSD] values are not. In fact DESeq2 assumes that
these genes might not follow the modeling assumptions and could have
higher variability than others for biological or technical
reasons. For these genes, shrinking the values toward the curve could
result in false positives. These genes are shown surrounded by blue
circles in the dispersion plot.
[^outlierSD]: genes with extremely high dispersion are those for which the adjusted
dispersion is more than 2 residual standard deviations above the curve.
Dispersion shrinkage is particularly important to reduce false positives in the
differential expression analysis.
```{r, echo=FALSE, fig.align='center', out.width='80%', purl=TRUE}
knitr::include_graphics("./figs/disp_shrinkage.png")
```
### Generalized linear model
DESeq2 fits a generalized linear model of the form: $$log2(q_{ij}) = \Sigma x_j.β_i$$
where the parameter $q_{ij}$ is proportional to the expected true concentration of
gene i for sample j:
$$ q_{ij} = \frac{µ_{ij}}{SizeFactor_{j}} $$
and $β_i$ represents the log2FC between conditions. $β_i$ coefficients are computed from the data.
In the case of a simple design with one condition (a treatment effect for example), the model
can be written
$$log2(q_{ij}) = \beta_0 + \beta_1.x_j + \epsilon$$
$\beta_0$ is the log2 expression level in the reference (control samples)
$\beta_1$ is the log2FC between treated and control cells
$x_j$ = 0 if sample j is the control sample
$x_j$ = 1 if sample j is the treated sample
### Hypothesis testing
The logFC are computed from the data using the GLM, and these are
associated to standard errors that depend on the variance of the
counts.
```{r, echo=FALSE, fig.align='center', out.width='80%', purl=TRUE}
knitr::include_graphics("./figs/z_stat.png")
```
The ultimate goal of a test for differential expression is to decide
whether, for a given gene, an observed difference in read counts is
significant, that is, whether it is greater than what would be
expected just due to natural random variation.
The null hypothesis $H_0$ is that there is no differential expression across
the sample groups, which is the same as saying that the log2FC = 0.
A statistical test, the **Wald test**, will determine whether the data
provides sufficient evidence to conclude that this value is really
different from zero.
For the Wald test, the log2 fold-change is divided by its standard
error, resulting in a z-statistic. The z-statistic is compared to a
standard normal distribution, and a p-value is computed reporting the
probability that a z-statistic at least as extreme as the observed
value would be selected at random. In principle, if this p-value is
small (below a certain cutoff) the null hypothesis is rejected.
### Multiple testing correction
Recall that a pvalue of 0.05 means that there is only 5% chance of
getting this data if no real difference existed (if $H_0$ was really
true). In other words, choosing a cut off of 0.05 means there is 5%
chance that the wrong decision is made (resulting in a false
positive). But remember the problematic of multiple testing seen in
chapter 7 from [WSBIM1322 course](http://bit.ly/WSBIM1322).
In a typical RNAseq differential expression analysis, we might have
about 20,000 genes to test and usually only a fraction of genes is
really differentially expressed. Imagine a drug treatment that
modifies the expression of about 1000 genes, but that has no impact on
the other ones. The first histogram shows how the distribution of
pvalues for truly modified genes ($H_0$ is false) would look like:
most of the pvalues would be very small. Using a pvalue cutoff of 0.05
should permit to identify most of these differentially expressed
genes. The second histogram shows the distribution of pvalues for
unmodified genes ($H_0$ is true). Here the p-values are uniformly
distributed between 0 and 1, and we can see that 5% of these genes
appear to be significant even though this is only by chance as the
drug had no real effect on them. But 5% of 19000 genes means ... 950
false positive genes! Hence, pvalues obtained from the Wald test must
be corrected for multiple testing to avoid excess false positives.
By default DESeq2 uses Benjamini-Hochberg method to adjust pvalues.
The third histogram bellow illustrates the principle behind this
**False discovery rate (FDR)** adjustment. As differential expression
analysis is done on the whole set of genes, the resulting pvalues will
have a distribution corresponding to the combination of both
histograms. Most of the p-values are uniformly distributed between 0
and 1 but there is a spike to the left close to zero, due to those
p-values for which $H_0$ is false. The correction approach helps to
estimate how many of the significant values are actually false
positives. It tries to find the height where the p-value distribution
flattens out (corresponding to the red line) and incorporates this
height value into the calculation of FDR adjusted p-values.
Choosing a cut off of 0.05 for padjusted values now implies that 5% of
significant tests (but not 5% of all tests as before) will result in false
positives.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/pval_histograms.png")
```
### Independent filtering
Multiple testing adjustment tends to be associated with a loss of power.
To counteract this effect, one possibility is to filter out those tests from
the procedure that have no, or little chance of showing significant evidence,
without even looking at their test statistic. Genes with very low counts are
typically not likely to be significant due to high dispersion. However, these
genes have an influence on the multiple testing adjustment, whose performance
improves if such genes are removed. By removing the weakly-expressed genes from
the input to the FDR procedure, more significant genes can be found among those
that are kept, and this improves the power of the test. This approach is known
as independent filtering.
DESeq2 uses as filtering criterion the mean of normalised counts. Genes with a
mean expression value under a certain threshold are removed. Such filtering is
permissible only if the filter criterion is independent of the actual test statistic,
otherwise, the filtering would invalidate the test and consequently the assumptions
of the FDR procedure. This is why filtering is done on the average expression over
all samples, irrespective of biological condition: this filter is blind to the assignment
of samples to the treatment and control group and hence independent.
The mean expression threshold used by DESeq2 for independentfiltering is defined
automatically by the software. It is chosen in a way that maximizes the number of
genes which will have a significant p-adjusted value.
The threshold chosen (the vertical line) is the lowest quantile for which the number of
rejections is within 1 residual standard deviation to the peak of the curve.
```{r, echo=FALSE, fig.align='center', out.width='100%', purl=TRUE}
knitr::include_graphics("./figs/IF_threshold.png")
```
## Running DESeq2
Let's start by installing the
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) package.
```{r, echo = FALSE}
suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("tidyverse"))
```
```{r}
if (!require("DESeq2"))
BiocManager::install("DESeq2")
library(DESeq2)
library(tidyverse)
```
We will run a DESeq2 analysis using real data.
### Construct DESeqDataSet object
Let's first load the count matrix and the sample metadata.
This dataset corresponds to RNAseq data from a cell line
treated or not by a siRNA.
```{r}
load("wsbim2122_data/deseq2/counts.rda")
load("wsbim2122_data/deseq2/coldata.rda")
coldata
head(counts)
dim(counts)
```
Using these data, we will start by creating a
[DESeqDataSet](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class),
which is a subclass of
[RangedSummarizedExperiment](https://www.rdocumentation.org/packages/SummarizedExperiment/versions/1.2.3/topics/RangedSummarizedExperiment-class)
used by the DESeq2 package to store the read counts and the
intermediate estimated quantities during statistical analysis. The
DESeqDataSet class enforces non-negative integer values in the count
matrix stored as the first element in the assay list. In addition, a
formula which specifies the design of the experiment (the variables
that will be used in modeling) must be provided.
```{r}
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ Condition)
dds
```
As for SummarizedExperiments (see chapter 3 from [WSBIM1322 course](http://bit.ly/WSBIM1322)):
- The Quantitative data can be accessed with `assay()`.
- The sample (columns) metadata can be access with the `colData()`
function.
- The features (rows) metadata can be access with the `rowData()`
column.
- Additional metadata describing the overall experiment can be
accessed with `metadata()`.
`r msmbstyle::question_begin()`
Access the count data from the dds object and plot the count distributions
of each sample.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
as_tibble(assay(dds), rownames = "gene") %>%
pivot_longer(names_to = "sample", values_to = "counts", cols = 2:7) %>%
ggplot(aes(x = log2(counts + 1), fill = sample)) +
geom_histogram(bins = 20) +
facet_wrap(~ sample)
```
`r msmbstyle::solution_end()`
The count matrix contains many rows with only zeros. We can start by removing
these rows of the dds object as anyway no statistical test can be done on
these genes that have no variation in their counts (equal to 0 in all samples).
It is not necessary to pre-filter low count genes before running
the DESeq2 functions, but by removing rows containing only zeros, we reduce the
memory size of the `dds` object and increase the speed of the analysis.
Additional filtering to improve power will be applied later.
```{r}
dds <- dds[rowSums(assay(dds)) > 0, ]
dim(dds)
```
### Run DESeq2
The standard differential expression analysis steps are wrapped into a single
function `DESeq()`. This function will automatically run the following other functions:
- `estimateSizeFactors()` (estimation of size factors)
- `estimateDispersions()` (estimation of dispersion)
- `nbinomWaldTest()` (Negative Binomial GLM fitting and Wald statistics)
```{r}
dds <- DESeq(dds)
```
### PCA
Before anything else, a good practice is to explore the data
and perform quality controls checks.
It is highly recommended to starts by a PCA to assess overall similarity between
the samples:
- Which samples are similar/different to each other?
- Does this fit to the expectation from the experiment’s design?
- Are they any sample outliers which may need to be explored further?
`r msmbstyle::question_begin()`
Here are 3 examples of PCAs that correspond to different experimental designs.
How would you interprete these PCAs and what impact could they
have on the analysis?
```{r, echo=FALSE, fig.align='center', out.width='80%', purl=TRUE}
knitr::include_graphics("./figs/PCA_paired_design.png")
knitr::include_graphics("./figs/PCA_batch_effect.png")
knitr::include_graphics("./figs/PCA_outlier.png")
```
`r msmbstyle::question_end()`
Remember that if one performs PCA directly on a matrix of normalised read counts,
the result typically depends only on the few most strongly expressed genes
because they show the largest absolute differences between samples.
A simple and often used strategy to avoid this is to take the logarithm of the
normalised count values plus a small pseudocount; however, now the genes with
low counts tend to dominate the results because, due to the strong Poisson noise
inherent to small count values, they show the strongest relative differences
between samples. As a solution, DESeq2 offers the regularized-logarithm
transformation `rlog()`.
For genes with high counts, the rlog transformation
differs not much from an ordinary log2 transformation. However for genes with
lower counts, the transformation moderates the variance across the mean,
shrunking the values towards the genes’ averages across all samples.
See `?rlog` for more details about the function.
```{r rlog}
rld <- rlogTransformation(dds)
```
`r msmbstyle::question_begin()`
1. ENSG00000198804 and ENSG00000248671 are two genes expressed respectively at a
very high and a very low level. Inspect the effect of the rlog transformation
on the counts of these genes. Compare to counts that would be transformed by a
classical log transformation (You can use the `normTransform()` function to
normalise and log-transform the dds counts)
2. Draw a scatter plot representing the log counts of sample1 versus the
log counts of sample2 (sample1 and sample2 are 2 control replicates).
Draw another scatter plot representing the "rlog transformed" counts of sample1
versus the "rlog transformed" counts of sample2. Compare the two plots.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Question 1
```{r}
assay(dds)[c("ENSG00000198804", "ENSG00000248671"),]
# normalises and gives log2(n+1)
log_dds <- normTransform(dds)
assay(log_dds)[c("ENSG00000198804", "ENSG00000248671"), ]
assay(rld)[c("ENSG00000198804", "ENSG00000248671"), ]
```
Question 2
```{r}
fig1 <- as_tibble(assay(log_dds[, 1:2])) %>%
ggplot(aes(x = sample1, y = sample2)) +
geom_point(alpha = 0.2) +
geom_abline(slope = 1, color = "red") +
ggtitle("log2 transformation")
fig2 <- as_tibble(assay(rld[,1:2])) %>%
ggplot(aes(x = sample1, y = sample2))+
geom_point(alpha = 0.2) +
geom_abline(slope = 1, color = "red") +
ggtitle("rlogTransformation")
library("patchwork")
fig1 | fig2
```
`r msmbstyle::solution_end()`
The `plotPCA()` function can then be used on the transformed counts to generate a PCA.
```{r}
plotPCA(rld, intgroup = "Condition")
```
Note that the argument `returnData = TRUE` can be used to obtain a dataframe of PC1 and PC2
for custom plotting.
```{r}
pca_data <- plotPCA(rld, intgroup = "Condition", returnData = TRUE)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Condition, label = name)) +
geom_point() +
geom_text()
```
### Inspecting size factors
It is also advisable to investigate any systematic bias in the sequencing data, such as
whether one sample has been sequenced more deeply than others. One can extract size factors
using the `sizeFactors()` function.
Usually these size factors should vary around 1, indicating comparable sequencing depth.
```{r}
sizeFactors(dds)
```
`r msmbstyle::question_begin()`
Compare Size Factors to sequencing depth.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r size_factor}
SF <- tibble(sample = colnames(dds), SF = sizeFactors(dds)) %>%
ggplot(aes(x = sample, y = SF)) +
geom_col()
SD <- tibble(sample = colnames(dds), SeqDepth = colSums(assay(dds))) %>%
ggplot(aes(x = sample, y = SeqDepth)) +
geom_col()
SF / SD
```
`r msmbstyle::solution_end()`
### Dispersion plot
Plotting the dispersion estimates is a useful diagnostic.
This dispersion plot is typical, with the final dispersion, estimates shrunk from the
gene-wise estimates towards the fitted estimates.
```{r dispersion_plot}
plotDispEsts(dds)
```
`r msmbstyle::question_begin()`
1. Create a new fictive DESeqDataSet object containing only 2 replicates of each
condition instead of the 3 replicates that we have in our `dds` object.
Inspect its dispersion plot and compare it with the dispersion plot of `dds`.
2. Using the `dispersions()` function, you can acceed to the dispersion values.
Compare on a scatter plot the dispersions values obtained with 2 replicates
and with 3 replicates.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Question 1.
```{r disp_exercice, message = FALSE, warning = FALSE}
dds_2_replicates <- DESeqDataSetFromMatrix(countData = counts[, c(1,2,4,6)],
colData = coldata[c(1,2,4,6), ],
design = ~ Condition)
# Keep only genes present in dds to be able to compare the 2 objects
dds_2_replicates <- dds_2_replicates[rownames(dds), ]
dds_2_replicates <- DESeq(dds_2_replicates)
plotDispEsts(dds_2_replicates)
plotDispEsts(dds)
```
Question 2.
```{r, message = FALSE, warning = FALSE}
tibble(dispersions_3_replicates = dispersions(dds),
dispersions_2_replicates = dispersions(dds_2_replicates)) %>%
ggplot(aes(x = dispersions_3_replicates, y = dispersions_2_replicates)) +
geom_point() +
geom_abline(slope = 1, color = "red")
```
`r msmbstyle::solution_end()`
### DESeq2 results
The `resultsNames()` function gives the names of results that can be extracted.
Note that by default, R will choose a reference level for factors based on alphabetical
order. Here KD condition was hence set arbitrarily as the reference level. This means
that the fold changes evaluated for every gene will correspond to their expression level
in mock cells versus their expression level in KD cells.
```{r}
resultsNames(dds)
dds$Condition
```
In this case, it seems more logical to use mock cells as reference.
Reference levels can be changed using the `relevel()` function. But in this
case don't forget to re-run `DESeq()` function after the re-leveling operation.
```{r, message = FALSE}
dds$Condition <- relevel(dds$Condition, ref = "mock")
dds <- DESeq(dds)
resultsNames(dds)
```
Results can then be extracted using the `results()` function.
The `name` parameter must be an element of resultsNames(object) specifying the
samples to compare.
Note that the `contrast` argument of `results()` can also be used to
extract results of interest. In this case, the two following commands will give
the same results[^contrast].
[^contrast]: Using the `name` argument is however necessary to extract specific
coefficients from more complex designs.
```{r}
res <- results(dds,
name = "Condition_KD_vs_mock")
```
```{r}
res <- results(dds,
contrast = c("Condition", "KD", "mock"))
```
Let's inspect the results and their signification.
```{r}
res_tbl <- as_tibble(res, rownames = "ENSEMBL")
res_tbl
```
**baseMean**: The average of the normalised count values taken over all samples.
**log2FoldChange**: Fold-change between the comparison and control groups,
reported on a logarithmic scale to base 2. In this case, as we are testing
the "Condition_KD_vs_mock" coefficient, positive log2FC indicates a gene up-regulated
in the KD condition compared to the mock condition, while a negative log2FC indicate
a down-regulation.
**lfcSE**: The standard error estimate for the log2FoldChange estimate
**stat**: The value of the test statistic for the gene
**pvalue**: The pvalue of the test for the gene
**padj**: pvalue adjusted for multiple testing
`r msmbstyle::question_begin()`
1. Inspect the results table and identify the 5 "best genes" showing
the lowest padjusted value.
2. Calculate the mean expression level of these 5 "best genes" using
the function `counts()`. Compare with baseMean values.
3. Extract the ß coefficient of these 5 "best genes" from the GLM
using the function `coefficients()`. Compare with log2FoldChange values.
4. Using the function `counts()`, evaluate the mean expression levels
of these 5 "best genes" in mock cells. Compare with ß coefficients.
5. Evaluate the mean expression levels of these 5 "best genes"
in KD cells. Compare with ß coefficients.
6. How many genes have no padjusted value? Why?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
1. Inspect the results table and identify the 5 "best genes" showing
the lowest padjusted value.
```{r}
res_tbl %>%
arrange(padj) %>%
head(5)
```
2. Calculate the mean expression level of these 5 "best genes" using
the function `counts()`. Compare with baseMean values.
```{r}
best_genes <- res_tbl %>%
arrange(padj) %>%
head(5) %>% pull(ENSEMBL)
rowMeans(counts(dds[best_genes], normalize = TRUE))
```
3. Extract the ß coefficient of these 5 "best genes" from the GLM
using the function `coefficients()`. Compare with log2FoldChange values.
```{r}
coefficients(dds[best_genes])
```
4. Using the function `counts()`, evaluate the mean expression levels
of these 5 "best genes" in mock cells. Compare with ß coefficients.
```{r}
# log2 expression levels in mock cells
log2(rowMeans(counts(dds[best_genes, 1:3], normalize = TRUE)))
```
5. Evaluate the mean expression levels of these 5 "best genes"
in KD cells. Compare with ß coefficients.
```{r}
# log2 expression levels in KD cells
log2(rowMeans(counts(dds[best_genes, 4:6], normalize = TRUE)))
```
```{r}
# log2 expression levels in KD cells evaluated from ß coefficients
coefficients(dds[best_genes])[,1] + coefficients(dds[best_genes])[,2]
```
6. How many genes have no padjusted value? Why?
```{r}
summary(res$padj)
```
Let's inspect genes with no padjusted values
```{r}
res_tbl %>%
filter(is.na(padj))
```
These genes all have a very low basemean and are the one that have been filtered
out by the independent filtering procedure.
`r msmbstyle::solution_end()`
### Independent filtering exploration
The filtering threshold that has been used to filter low count genes can be extracted
from the results metadata.
```{r}
metadata(res)$filterThreshold
```
This means that genes whith basemean < `r metadata(res)$filterThreshold` have been
filtered. This represents `r names(metadata(res)$filterThreshold)` of all tested genes!
Remember that the filtering threshold has been fixed in a way that maximizes the
number of genes which will have a significant padjusted value.
We can plot the number of rejections over the basemean quantiles.
The threshold chosen (red vertical line) is the lowest quantile for which the number of
rejections is within 1 residual standard deviation to the peak of the curve.
```{r}
as_tibble(metadata(res)$filterNumRej) %>%
ggplot(aes(x = theta, y = numRej)) +
geom_point() +
geom_vline(xintercept = 0.484,
color = 'red')
```
`r msmbstyle::question_begin()`
1. Re-run the results() function on the same dds object,
but set the independent filtering parameter to FALSE.
Check how many genes have no padj?
2. Compare the number of significant genes obtained with and without
independent filtering
3. Imagine another way of filtering genes with very low counts
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
1. Re-run the results() function on the same dds object,
but set the independent filtering parameter to FALSE.
Check how many genes have no padj?
```{r}
res_no_IF <- results(dds,
name = "Condition_KD_vs_mock",
independentFiltering = FALSE)
res_no_IF_tbl <- as_tibble(res_no_IF, rownames = "ENSEMBL")
res_no_IF_tbl %>%
filter(is.na(padj)) %>% nrow()
```
2. Compare the number of significant genes obtained with and without
independent filtering
```{r}
res_tbl %>% filter(padj < .05) %>% nrow()
res_no_IF_tbl %>% filter(padj < .05) %>% nrow()
```
3. Imagine another way of filtering genes with very low counts
```{r, message = FALSE}
# filter the data to remove genes with few counts
filtering_thr <- 5
# keep genes with counts > 5 in 3 or more samples
keep <- rowSums(counts(dds, normalized = TRUE) >= filtering_thr) >=3
dds_bis <- DESeq(dds[keep, ])
dim(dds_bis)
res_bis <- results(dds_bis,
name = "Condition_KD_vs_mock",
independentFiltering = FALSE)
res_bis_tbl <- as_tibble(res_bis, rownames = "ENSEMBL")
# Compare with previous analysis
# (Independant filtering threshold fixed automatically by DESeq2)
res_tbl %>% filter(padj < 0.05)
res_bis_tbl %>% filter(padj < 0.05)
```
`r msmbstyle::solution_end()`
### p-values distribution
Another useful diagnostic plot is the histogram of pvalues.
It can give you an immediate idea of the proportion of genes differentially expressed,
(the taller the is the left peak, the more p-values are close to 0 and therefore
significant). Furthermore, it gives an idea of how the test behaved across all your
hypotheses, and immediately diagnoses some potential problems.
```{r}
hist(res_tbl$pvalue)
```
`r msmbstyle::question_begin()`
What do you think about the pvalues histogram?
`r msmbstyle::question_end()`