forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path22_communitytyping.Rmd
231 lines (167 loc) · 6.48 KB
/
22_communitytyping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
# Community typing {#community-typing}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
```{r load-pkg-data}
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
```
## Community typing
### Dirichlet Multinomial Mixtures (DMM)
This section focus on DMM analysis.
One technique that allows to search for groups of samples that are
similar to each other is the [Dirichlet-Multinomial Mixture
Model](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126). In
DMM, we first determine the number of clusters (k) that best fit the
data (model evidence) using Laplace approximation. After fitting the
model with k clusters, we obtain for each sample k probabilities that
reflect the probability that a sample belongs to the given cluster.
Let's cluster the data with DMM clustering.
```{r dmm}
# Runs model and calculates the most likely number of clusters from 1 to 7.
# Since this is a large dataset it takes long computational time.
# For this reason we use only a subset of the data; agglomerated by Phylum as a rank.
tse <- GlobalPatterns
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
tse_dmn <- runDMN(tse, name = "DMN", k = 1:7)
```
```{r}
# It is stored in metadata
tse_dmn
```
Return information on metadata that the object contains.
```{r}
names(metadata(tse_dmn))
```
This returns a list of DMN objects for a closer investigation.
```{r}
getDMN(tse_dmn)
```
Show Laplace approximation (model evidence) for each model of the k models.
```{r}
library(miaViz)
plotDMNFit(tse_dmn, type = "laplace")
```
Return the model that has the best fit (see also `BestDMNFit`, which returns the index of the best fit).
```{r}
getBestDMNFit(tse_dmn, type = "laplace")
```
### PCoA for ASV-level data with Bray-Curtis; with DMM clusters shown with colors
Group samples and return DMNGroup object that contains a summary (see also `performDMNgroupCV`, which returns an equivalent data.frame).
Patient status is used for grouping.
```{r}
dmn_group <- calculateDMNgroup(tse_dmn, variable = "SampleType", exprs_values = "counts",
k = 2, seed=.Machine$integer.max)
dmn_group
```
Mixture weights (rough measure of the cluster size).
```{r}
DirichletMultinomial::mixturewt(getBestDMNFit(tse_dmn))
```
Samples-cluster assignment probabilities / how probable it is that sample belongs
to each cluster
```{r}
head(DirichletMultinomial::mixture(getBestDMNFit(tse_dmn)))
```
Contribution of each taxa to each component
```{r}
head(DirichletMultinomial::fitted(getBestDMNFit(tse_dmn)))
```
Get the assignment probabilities
```{r}
prob <- DirichletMultinomial::mixture(getBestDMNFit(tse_dmn))
# Add column names
colnames(prob) <- c("comp1", "comp2")
# For each row, finds column that has the highest value. Then extract the column
# names of highest values.
vec <- colnames(prob)[max.col(prob,ties.method = "first")]
# Add info to colData
colData(tse)$dmm_component <- vec
```
Computing the euclidean PCoA and storing it as a data frame
```{r}
# Does clr transformation. Pseudocount is added, because data contains zeros.
tse <- transformSamples(tse, method = "relabundance", pseudocount = 1)
tse <- transformCounts(tse, assay_name = "relabundance", method = "clr")
library(scater)
# Calculate PCoA
tse <- runMDS(tse, exprs_values = "clr", method = "euclidean")
```
```{r}
# Create a plot
euclidean_dmm_plot <- plotReducedDim(tse, "MDS", colour_by = "dmm_component") +
labs(x = "Coordinate 1",
y = "Coordinate 2",
title = "PCoA with Aitchison distances") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_dmm_plot
```
## Community Detection
Another approach for discovering communities within the samples of the
data, is to run community detection algorithms after building a
graph. The following demonstration builds a graph based on the k
nearest-neighbors and performs the community detection on the fly.
_`bluster`_ [@R-bluster] package offers several clustering methods,
among which graph-based are present, enabling the community detection
task.
Installing package:
```{r}
if(!require(bluster)){
BiocManager::install("bluster")
}
```
The algorithm used is "short random walks" [@Pons2006]. Graph is
constructed using different k values (the number of nearest neighbors
to consider during graph construction) using the robust centered log
ratio (rclr) assay data. Then plotting the communities using UMAP
[@McInnes2018] ordination as a visual exploration aid. In the
following demonstration we use the `enterotype` dataset from the
[@R-mia] package.
```{r, message=FALSE, warning=FALSE}
library(bluster)
library(patchwork) # For arranging several plots as a grid
library(scater)
data("enterotype", package="mia")
tse <- enterotype
tse <- transformSamples(tse, method = "relabundance")
tse <- transformCounts(tse, method = "rclr")
# Performing and storing UMAP
tse <- runUMAP(tse, name="UMAP", exprs_values="rclr")
k <- c(2,3,5,10)
ClustAndPlot <- function(x) {
# Creating the graph and running the short random walks algorithm
graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
# Results of the clustering as a color for each sample
plotUMAP(tse, colour_by = I(graph_clusters)) +
labs(title = paste0("k = ", x))
}
# Applying the function for different k values
plots <- lapply(k,ClustAndPlot)
# Displaying plots in a grid
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]])
```
Similarly, the _`bluster`_ [@R-bluster] package offers clustering
diagnostics that can be used for judging the clustering quality (see
[Assorted clustering
diagnostics](http://bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html)).
In the following, Silhouette width as a diagnostic tool is computed
and results are visualized for each case presented earlier. For more
about Silhouettes read [@Rousseeuw1987].
```{r, message=FALSE, warning=FALSE}
ClustDiagPlot <- function(x) {
# Getting the clustering results
graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
# Computing the diagnostic info
sil <- approxSilhouette(t(assays(tse)$rclr), graph_clusters)
# Plotting as a boxlpot to observe cluster separation
boxplot(split(sil$width, graph_clusters), main=paste0("k = ", x))
}
# Applying the function for different k values
res <- lapply(k,ClustDiagPlot)
```
## Additional Community Typing
For more community typing techniques applied to the 'SprockettTHData' data set, see the attached .Rmd file.
Link:
* [Rmd](add-comm-typing.Rmd)