forked from microbiome/course_2021_radboud
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-beta.Rmd
336 lines (239 loc) · 11.5 KB
/
07-beta.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
```{r, echo=FALSE}
library(mia)
library(miaViz)
se <- readRDS("data/se.rds")
tse <- readRDS("data/tse.rds")
tse <- transformCounts(tse, method = "relabundance")
tse_phylum <- agglomerateByRank(tse, rank = "Phylum")
# Indices to be calculated.
# Every index is calculated by default if we don't specify indices.
indices <- c("shannon")
# Indices are stored in colData (i.e., sample metadata). We can specify the name
# of column, or we can use the default name which is the name of index
# (i.e., "shannon" and "faith").
names <- c("Shannon_diversity")
# Calculates indices
tse <- estimateDiversity(tse, index = indices, name = names)
```
# Beta diversity
Beta diversity is another name for sample dissimilarity. It quantifies
differences in the overall taxonomic composition between two samples.
Common indices include Bray-Curtis, Unifrac, Jaccard index, and the
Aitchison distance. Each of these (dis)similarity measures emphasizes
different aspects. For example, UniFrac incorporates phylogenetic
information, and Jaccard index ignores exact abundances and considers
only presence/absence values. For more background information
and examples, you can check the dedicated section in [online
book](https://microbiome.github.io/OMA/microbiome-diversity.html#beta-diversity).
## Examples of PCoA with different settings
Beta diversity estimation generates a (dis)similarity matrix that
contains for each sample (rows) the dissimilarity to any other sample
(columns).
This complex set of pairwise relations can be visualized in
informative ways, and even coupled with other explanatory
variables. As a first step, we compress the information to a lower
dimensionality, or fewer principal components, and then visualize
sample similarity based on that using ordination techniques, such as
Principal Coordinate Analysis (PCoA). PCoA is a non-linear dimension
reduction technique, and with Euclidean distances it is is identical
to the linear PCA (except for potential scaling).
We typically retain just the two (or three) most informative top
components, and ignore the other information. Each sample has a score
on each of these components, and each component measures the variation
across a set of correlated taxa. The top components are then easily
visualized on a two (or three) dimensional display.
Let us next look at some concrete examples.
### PCoA for ASV-level data with Bray-Curtis
Let us start with PCoA based on a Bray-Curtis dissimilarity matrix
calculated at Genus level abundances.
```{r pcoa_asv_bc, message=FALSE}
# Pick the relative abundance table
rel_abund_assay <- assays(tse)$relabundance
# Calculates Bray-Curtis distances between samples. Because taxa is in
# columns, it is used to compare different samples. We transpose the
# assay to get taxa to columns
bray_curtis_dist <- vegan::vegdist(t(rel_abund_assay), method = "bray")
# PCoA
bray_curtis_pcoa <- ecodist::pco(bray_curtis_dist)
# All components could be found here:
# bray_curtis_pcoa$vectors
# But we only need the first two to demonstrate what we can do:
bray_curtis_pcoa_df <- data.frame(pcoa1 = bray_curtis_pcoa$vectors[,1],
pcoa2 = bray_curtis_pcoa$vectors[,2])
# Create a plot
bray_curtis_plot <- ggplot(data = bray_curtis_pcoa_df, aes(x=pcoa1, y=pcoa2)) +
geom_point() +
labs(x = "PC1",
y = "PC2",
title = "Bray-Curtis PCoA") +
theme(title = element_text(size = 10)) # makes titles smaller
bray_curtis_plot
```
### PCoA for ASV-level data with Aitchison distance
Now the same using Aitchison distance. This metric corresponds to
Euclidean distances between CLR transformed sample abundance vectors.
```{r pcoa_asv_aitchison}
# Does clr transformation. Pseudocount is added, because data contains zeros.
tse <- transformCounts(tse, method = "clr", pseudocount = 1)
# Gets clr table
clr_assay <- assays(tse)$clr
# Transposes it to get taxa to columns
clr_assay <- t(clr_assay)
# Calculates Euclidean distances between samples. Because taxa is in columns,
# it is used to compare different samples.
euclidean_dist <- vegan::vegdist(clr_assay, method = "euclidean")
# Does principal coordinate analysis
euclidean_pcoa <- ecodist::pco(euclidean_dist)
# Creates a data frame from principal coordinates
euclidean_pcoa_df <- data.frame(pcoa1 = euclidean_pcoa$vectors[,1],
pcoa2 = euclidean_pcoa$vectors[,2])
# Creates the plot
euclidean_plot <- ggplot(data = euclidean_pcoa_df, aes(x=pcoa1, y=pcoa2)) +
geom_point() +
labs(x = "PC1",
y = "PC2",
title = "Euclidean PCoA with CLR transformation") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_plot
```
### PCoA aggregated to Phylum level
We use again the Aitchison distances in this example but this time applied to the phylum level.
```{r pcoa_phylum_aitchison}
# Does clr transformation. Psuedocount is added, because data contains zeros.
tse_phylum <- transformCounts(tse_phylum, method = "clr", pseudocount = 1)
# Gets clr table
clr_phylum_assay <- assays(tse_phylum)$clr
# Transposes it to get taxa to columns
clr_phylum_assay <- t(clr_phylum_assay)
# Calculates Euclidean distances between samples. Because taxa is in columns,
# it is used to compare different samples.
euclidean_phylum_dist <- vegan::vegdist(clr_assay, method = "euclidean")
# Does principal coordinate analysis
euclidean_phylum_pcoa <- ecodist::pco(euclidean_phylum_dist)
# Creates a data frame from principal coordinates
euclidean_phylum_pcoa_df <- data.frame(
pcoa1 = euclidean_phylum_pcoa$vectors[,1],
pcoa2 = euclidean_phylum_pcoa$vectors[,2])
# Creates a plot
euclidean_phylum_plot <- ggplot(data = euclidean_phylum_pcoa_df,
aes(x=pcoa1, y=pcoa2)) +
geom_point() +
labs(x = "PC1",
y = "PC2",
title = "Aitchison distances at Phylum level") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_phylum_plot
```
## Highlighting external variables
We can map other variables on the same plot for example by coloring
the points accordingly.
### Discrete grouping variable shown with colors
```{r pcoa_genus}
# Adds the variable we later use for coloring to the data frame
euclidean_patient_status_pcoa_df <- cbind(euclidean_pcoa_df,
patient_status = colData(tse)$patient_status)
# Creates a plot
euclidean_patient_status_plot <- ggplot(data = euclidean_patient_status_pcoa_df,
aes(x=pcoa1, y=pcoa2,
color = patient_status)) +
geom_point() +
labs(x = "PC1",
y = "PC2",
title = "PCoA with Aitchison distances") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_patient_status_plot
```
### PCoA plot with continuous variable
We can do the same as above using any continuous variable. E.g. let us
see how the plotted samples differ in their alpha diversity (as an
example of a continuous variable):
```{r pcoa_coloring}
# Adds coloring information to the data frame, creates new column
euclidean_shannon_pcoa_df <- cbind(euclidean_pcoa_df,
shannon = colData(tse)$Shannon_diversity)
# Creates a plot
euclidean_shannon_plot <- ggplot(data = euclidean_shannon_pcoa_df,
aes(x=pcoa1, y=pcoa2,
color = shannon)) +
geom_point() +
labs(x = "PC1",
y = "PC2",
title = "PCoA with Aitchison distances") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_shannon_plot
```
## Estimating associations with an external variable
Next to visualizing whether any variable is associated with
differences between samples, we can also quantify the strength of the
association between community composition (beta diversity) and
external factors.
The standard way to do this is to perform a so-called permutational
multivariate analysis of variance (PERMANOVA). This method takes as
input the abundance table, which measure of distance you want to base
the test on and a formula that tells the model how you think the
variables are associated with each other.
```{r permanova}
# First we get the relative abundance table
rel_abund_assay <- assays(tse)$relabundance
# again transpose it to get taxa to columns
rel_abund_assay <- t(rel_abund_assay)
# then we can perform the method
permanova_cohort <- vegan::adonis(rel_abund_assay ~ cohort,
data = colData(tse),
permutations = 9999)
# we can obtain a the p value for our predictor:
print(paste0("Different different cohorts and variance of abundance ",
"between samples, p-value: ",
as.data.frame(permanova_cohort$aov.tab)["cohort", "Pr(>F)"]))
```
The cohort variable is not significantly associated with
microbiota composition (p-value is over 0.05).
We can, however, visualize those taxa whose abundances drive the
differences between cohorts. We first need to extract the model
coefficients of taxa:
```{r permanova_coefs}
# Gets the coefficients
coef <- coefficients(permanova_cohort)["cohort1",]
# Gets the highest coefficients
top.coef <- sort(head(coef[rev(order(abs(coef)))],20))
# Plots the coefficients
top_taxa_coeffient_plot <- ggplot(data.frame(x = top.coef,
y = factor(names(top.coef),
unique(names(top.coef)))),
aes(x = x, y = y)) +
geom_bar(stat="identity") +
labs(x="", y="", title="Top Taxa") +
theme_bw()
top_taxa_coeffient_plot
```
The above plot shows taxa as code names, and it is hard to tell which
bacterial groups they represent. However, it is easy to add human readable
names. We can fetch those from our rowData. Here we use Genus level names:
```{r}
# Gets corresponding Genus level names and stores them to top.coef
names <- rowData(tse)[names(top.coef), ][,"Genus"]
# Adds new labels to the plot
top_taxa_coeffient_plot <- top_taxa_coeffient_plot +
scale_y_discrete(labels = names) # Adds new labels
top_taxa_coeffient_plot
```
There are many alternative and complementary methods for analysing
community composition. For more examples, see a dedicated section on
beta diversity in the [online
book](https://microbiome.github.io/OMA/microbiome-diversity.html#beta-diversity).
## Community typing
A dedicated section presenting examples on community typing is in the
[online book](https://microbiome.github.io/OMA/microbiome-community.html#community-typing).
## Exercises
* Visualize community variation with different methods (PCA, MDS, t-SNE...) by using the options in the alternative method, plotReducedDim [OMA](https://microbiome.github.io/OMA/microbiome-diversity.html#estimating-beta-diversity). Compare results obtained with different dissimilarities (Euclidean, Bray-Curtis, Unifrac..) and transformations (CLR, compositional..) of your own choice."
* Investigate the influence of the data transformations on
statistical analysis: Visualize community variation with PCoA with
the following options: 1) Bray-Curtis distances for compositional
data; 2) Euclidean distances for CLR-transformed data.
* Community-level comparisons: Use PERMANOVA to investigate whether
the community composition differs between two groups of individuals
(e.g. males and females, or some other grouping of your
choice). You can also include covariates such as age and gender,
and see how this affects the results.
* Perform community typing for the data using the DMM method [OMA](https://microbiome.github.io/OMA/microbiome-community.html#community-typing)
* Example [Solutions](08-5-ex-sol-ADHD.html)