forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path24_clustering.Rmd
750 lines (571 loc) · 22.9 KB
/
24_clustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
# Community typing (clustering) {#clustering}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
```{r load-pkg-data1}
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
```
Clustering is an unsupervised machine learning technique. The idea of it is to find
clusters from the data. A cluster is a group of features/samples that share pattern.
For example, with clustering, we can find group of samples that share similar
community composition. There are multiple clustering algorithms available.
## Hiearchical clustering
Hiearchical clustering is a clustering method that aims to find hiearchy between
samples/features. There are to approaches: agglomerative ("bottom-up")
and divisive ("top-down").
In agglomerative approach, each observation is first unique cluster.
Algorithm continues by agglomerating similar clusters. Divisive approach starts
with one cluster that contains all the observations. Clusters are splitted recursively
to clusters that differ the most. Clustering ends when each cluster contains only one
observation.
Hiearchical clustering can be visualized with dendrogram tree. In each splitting
point, the three is divided into two clusters leading to hierarchy.
Let's load data from mia package.
```{r hclust1}
library(mia)
library(vegan)
# Load experimental data
data(peerj13075)
(tse <- peerj13075)
```
Hierarchical clustering requires 2 steps. In the fist step, dissimilarities are
calculated. In prior to that, data transformation is applied if needed. Since
sequencing data is compositional, relative transformation is applied.
In the second step, clustering is performed based on dissimilarities.
```{r hclust2}
# Apply transformation
tse <- transformSamples(tse, method = "relabundance")
# Get the assay
assay <- assay(tse, "relabundance")
# Transpose assay --> samples are now in rows --> we are clustering samples
assay <- t(assay)
# Calculate distances
diss <- vegdist(assay, method = "bray")
# Perform hierarchical clustering
hc <- hclust(diss, method = "complete")
# To visualize, convert hclust object into dendrogram object
dendro <- as.dendrogram(hc)
# Plot dendrogram
plot(dendro)
```
We can use dendrogram to determine the number of clusters. Usually, the tree is splitted
where the length of branches are the longest.
However, as we can see from the dendrogram, clusters are no clear. Let's use an algorithm to
solve the best number of clusters.
```{r hclust3}
if( !require(NbClust) ){
install.packages("NbClust")
library(NbClust)
}
# Determine the optimal number of clusters
res <- NbClust(diss = diss, distance = NULL, method = "ward.D2",
index = "silhouette")
res$Best.nc
```
Based on the result, let's divide observations into 15 clusters.
```{r hclust4}
if( !require(dendextend) ){
install.packages("dendextend")
library(dendextend)
}
# Find clusters
cutree(hc, k = 15)
# Making colors for 6 clusters
col_val_map <- randomcoloR::distinctColorPalette(15) %>%
as.list() %>% setNames(paste0("clust_",seq(15)))
dend <- color_branches(dendro, k=15, col=unlist(col_val_map))
labels(dend) <- NULL
plot(dend)
```
## K-means clustering
Because, we were not able to find clusters with hierarchical clustering, let's try
k-means clustering. In k-means clustering, observations are divided into clusters
so that the mean distances between observations and cluster centers are minimized;
an observation belongs to cluster whose center is the nearest.
The algorithm starts by dividing observation to random clusters whose number is
defined by user. The centroids of clusters are then calculated. After that, observations'
allocation to clusters are updated so that the means are minimized. Again, centroid
are calculated, and algorithm continues iteratively until the assignments do not change.
The number of clusters can be determined based on algorithm. Here we utilize silhouette
analysis.
```{r kmeans1}
if( !require(factoextra) ){
install.packages("factoextra")
library(factoextra)
}
# Convert dist object into matrix
diss <- as.matrix(diss)
# Perform silhouette analysis and plot the result
fviz_nbclust(diss, kmeans, method = "silhouette")
```
Based on the result of silhouette analysis, we choose 3 to be the number of clusters
in k-means clustering.
```{r kmeans2}
library(scater)
# The first step is random, add seed for reproducibility
set.seed(15463)
# Perform k-means clustering with 3 clusters
km <- kmeans(diss, 3, nstart = 25)
# Add the result to colData
colData(tse)$clusters <- as.factor(km$cluster)
# Perform PCoA so that we can visualize clusters
tse <- runMDS(tse, exprs_values = "relabundance", FUN = vegan::vegdist, method = "bray")
# Plot PCoA and color clusters
plotReducedDim(tse, "MDS", colour_by = "clusters")
```
## Dirichlet Multinomial Mixtures (DMM)
This section focus on DMM analysis.
One technique that allows to search for groups of samples that are
similar to each other is the [Dirichlet-Multinomial Mixture
Model](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126). In
DMM, we first determine the number of clusters (k) that best fit the
data (model evidence) using Laplace approximation. After fitting the
model with k clusters, we obtain for each sample k probabilities that
reflect the probability that a sample belongs to the given cluster.
Let's cluster the data with DMM clustering.
```{r dmm1}
# Runs model and calculates the most likely number of clusters from 1 to 7.
# Since this is a large dataset it takes long computational time.
# For this reason we use only a subset of the data; agglomerated by Phylum as a rank.
tse <- GlobalPatterns
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
```
```{r dmm2}
tse_dmn <- mia::runDMN(tse, name = "DMN", k = 1:7)
```
```{r}
# It is stored in metadata
tse_dmn
```
Return information on metadata that the object contains.
```{r}
names(metadata(tse_dmn))
```
This returns a list of DMN objects for a closer investigation.
```{r}
getDMN(tse_dmn)
```
Show Laplace approximation (model evidence) for each model of the k models.
```{r}
library(miaViz)
plotDMNFit(tse_dmn, type = "laplace")
```
Return the model that has the best fit.
```{r}
getBestDMNFit(tse_dmn, type = "laplace")
```
### PCoA for ASV-level data with Bray-Curtis; with DMM clusters shown with colors
Group samples and return DMNGroup object that contains a summary.
Patient status is used for grouping.
```{r}
dmn_group <- calculateDMNgroup(tse_dmn, variable = "SampleType", exprs_values = "counts",
k = 2, seed=.Machine$integer.max)
dmn_group
```
Mixture weights (rough measure of the cluster size).
```{r}
DirichletMultinomial::mixturewt(getBestDMNFit(tse_dmn))
```
Samples-cluster assignment probabilities / how probable it is that sample belongs
to each cluster
```{r}
head(DirichletMultinomial::mixture(getBestDMNFit(tse_dmn)))
```
Contribution of each taxa to each component
```{r}
head(DirichletMultinomial::fitted(getBestDMNFit(tse_dmn)))
```
Get the assignment probabilities
```{r}
prob <- DirichletMultinomial::mixture(getBestDMNFit(tse_dmn))
# Add column names
colnames(prob) <- c("comp1", "comp2")
# For each row, finds column that has the highest value. Then extract the column
# names of highest values.
vec <- colnames(prob)[max.col(prob,ties.method = "first")]
```
Computing the euclidean PCoA and storing it as a data frame
```{r}
# Does clr transformation. Pseudocount is added, because data contains zeros.
tse <- transformCounts(tse, method = "relabundance", pseudocount = 1)
tse <- transformCounts(tse, "relabundance", method = "clr")
library(scater)
# Does principal coordinate analysis
df <- calculateMDS(tse, exprs_values = "clr", method = "euclidean")
# Creates a data frame from principal coordinates
euclidean_pcoa_df <- data.frame(pcoa1 = df[,1],
pcoa2 = df[,2])
```
```{r}
# Creates a data frame that contains principal coordinates and DMM information
euclidean_dmm_pcoa_df <- cbind(euclidean_pcoa_df,
dmm_component = vec)
# Creates a plot
euclidean_dmm_plot <- ggplot(data = euclidean_dmm_pcoa_df,
aes(x=pcoa1, y=pcoa2,
color = dmm_component)) +
geom_point() +
labs(x = "Coordinate 1",
y = "Coordinate 2",
title = "PCoA with Aitchison distances") +
theme(title = element_text(size = 12)) # makes titles smaller
euclidean_dmm_plot
```
## Community Detection
Another approach for discovering communities within the samples of the
data, is to run community detection algorithms after building a
graph. The following demonstration builds a graph based on the k
nearest-neighbors and performs the community detection on the fly.
_`bluster`_ [@R-bluster] package offers several clustering methods,
among which graph-based are present, enabling the community detection
task.
Installing package:
```{r}
if(!require(bluster)){
BiocManager::install("bluster")
}
```
The algorithm used is "short random walks" [@Pons2006]. Graph is
constructed using different k values (the number of nearest neighbors
to consider during graph construction) using the robust centered log
ratio (rclr) assay data. Then plotting the communities using UMAP
[@McInnes2018] ordination as a visual exploration aid. In the
following demonstration we use the `enterotype` dataset from the
[@R-mia] package.
```{r, message=FALSE, warning=FALSE}
library(bluster)
library(patchwork) # For arranging several plots as a grid
library(scater)
data("enterotype", package="mia")
tse <- enterotype
tse <- transformCounts(tse, method = "rclr")
# Performing and storing UMAP
tse <- runUMAP(tse, name="UMAP", exprs_values="rclr")
k <- c(2,3,5,10)
ClustAndPlot <- function(x) {
# Creating the graph and running the short random walks algorithm
graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
# Results of the clustering as a color for each sample
plotUMAP(tse, colour_by = I(graph_clusters)) +
labs(title = paste0("k = ", x))
}
# Applying the function for different k values
plots <- lapply(k,ClustAndPlot)
# Displaying plots in a grid
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]])
```
Similarly, the _`bluster`_ [@R-bluster] package offers clustering
diagnostics that can be used for judging the clustering quality (see
[Assorted clustering
diagnostics](http://bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html)).
In the following, Silhouette width as a diagnostic tool is computed
and results are visualized for each case presented earlier. For more
about Silhouettes read [@Rousseeuw1987].
```{r, message=FALSE, warning=FALSE}
ClustDiagPlot <- function(x) {
# Getting the clustering results
graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
# Computing the diagnostic info
sil <- approxSilhouette(t(assays(tse)$rclr), graph_clusters)
# Plotting as a boxlpot to observe cluster separation
boxplot(split(sil$width, graph_clusters), main=paste0("k = ", x))
}
# Applying the function for different k values
res <- lapply(k,ClustDiagPlot)
```
## Biclustering
Biclustering methods cluster rows and columns simultaneously in order
to find subsets of correlated features/samples.
Here, we use following packages:
- [biclust](https://cran.r-project.org/web/packages/biclust/index.html)
- [cobiclust](https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13582)
_cobiclust_ is especially developed for microbiome data whereas _biclust_ is more
general method. In this section, we show three different cases and example
solutions to apply biclustering to them.
1. Taxa vs samples
2. Taxa vs biomolecule/biomarker
3. Taxa vs taxa
Biclusters can be visualized using heatmap or boxplot, for instance. For checking purposes,
also scatter plot might be valid choice.
Check more ideas for heatmaps from chapters \@ref(viz-chapter) and \@ref(microbiome-community.
### Taxa vs samples
When you have microbial abundance matrices, we suggest to use _cobiclust_ which is
designed for microbial data.
Load example data
```{r load-pkg-data2}
library(mia)
data("HintikkaXOData")
mae <- HintikkaXOData
```
Only the most prevalent taxa are included in analysis.
```{r cobiclust_1}
# Subset data in the first experiment
mae[[1]] <- subsetByPrevalentTaxa(mae[[1]], rank = "Genus", prevalence = 0.2, detection = 0.001)
# clr-transform in the first experiment
mae[[1]] <- transformSamples(mae[[1]], method = "relabundance", pseudocount = 1)
mae[[1]] <- transformSamples(mae[[1]], "relabundance", method = "clr")
```
_cobiclust_ takes counts table as an input and gives _cobiclust_ object as an output.
It includes clusters for taxa and samples.
```{r cobiclust_2}
if(!require(cobiclust)){
install.packages("cobiclust")
library(cobiclust)
}
# Do clustering; use counts table´
clusters <- cobiclust(assay(mae[[1]], "counts"))
# Get clusters
row_clusters <- clusters$classification$rowclass
col_clusters <- clusters$classification$colclass
# Add clusters to rowdata and coldata
rowData(mae[[1]])$clusters <- factor(row_clusters)
colData(mae[[1]])$clusters <- factor(col_clusters)
# Order data based on clusters
mae[[1]] <- mae[[1]][order(rowData(mae[[1]])$clusters), order(colData(mae[[1]])$clusters)]
# Print clusters
clusters$classification
```
Next we can plot clusters. Commonly used plot is heatmap with annotations.
```{r cobiclust_3, fig.width=14, fig.height=12}
if(!require(pheatmap)){
install.packages("pheatmap")
library(pheatmap)
}
# z-transform for heatmap
mae[[1]] <- transformFeatures(mae[[1]], assay_name = "clr", method = "z", name = "clr_z")
# Create annotations. When column names are equal, they should share levels.
# Here samples include 3 clusters, and taxa 2. That is why we have to make
# column names unique.
annotation_col <- data.frame(colData(mae[[1]])[, "clusters", drop = F])
colnames(annotation_col) <- "col_clusters"
annotation_row <- data.frame(rowData(mae[[1]])[, "clusters", drop = F])
colnames(annotation_row) <- "row_clusters"
# Create a heatmap
pheatmap(assay(mae[[1]], "clr_z"), cluster_rows = F, cluster_cols = F,
annotation_col = annotation_col,
annotation_row = annotation_row)
```
Boxplot is commonly used to summarize the results:
```{r cobiclust_4}
if(!require(ggplot2)){
install.packages("ggplot2")
library(ggplot2)
}
if(!require(patchwork)){
install.packages("patchwork")
library(patchwork)
}
# ggplot requires data in melted format
melt_assay <- meltAssay(mae[[1]], assay_name = "clr", add_col_data = T, add_row_data = T)
# patchwork two plots side-by-side
p1 <- ggplot(melt_assay) +
geom_boxplot(aes(x = clusters.x, y = clr)) +
labs(x = "Taxa clusters")
p2 <- ggplot(melt_assay) +
geom_boxplot(aes(x = clusters.y, y = clr)) +
labs(x = "Sample clusters")
p1 + p2
```
### Taxa vs biomolecules
Here, we analyze cross-correlation between taxa and metabolites. This is a case, where
we use _biclust_ method which is suitable for numeric matrices in general.
```{r biclust_1}
# Samples must be in equal order
# (Only 1st experiment was ordered in cobiclust step leading to unequal order)
mae[[1]] <- mae[[1]][ , colnames(mae[[2]]) ]
# Make rownames unique since it is require by other steps
rownames(mae[[1]]) <- make.unique(rownames(mae[[1]]))
# Calculate correlations
corr <- getExperimentCrossCorrelation(mae, 1, 2,
assay_name1 = "clr",
assay_name2 = "nmr",
mode = "matrix",
cor_threshold = 0.2)
```
_biclust_ takes matrix as an input and returns _biclust_ object.
```{r biclust_2}
# Load package
if(!require(biclust)){
install.packages("biclust")
library(biclust)
}
# Set seed for reproducibility
set.seed(3973)
# Find biclusters
bc <- biclust(corr, method=BCPlaid(), fit.model = y ~ m,
background = TRUE, shuffle = 100, back.fit = 0, max.layers = 10,
iter.startup = 10, iter.layer = 100, verbose = FALSE)
bc
```
The object includes cluster information. However compared to _cobiclust_,
_biclust_ object includes only information about clusters that were found, not general cluster.
Meaning that if one cluster size of 5 features was found out of 20 features,
those 15 features do not belong to any cluster. That is why we have to create an
additional cluster for features/samples that are not assigned into any cluster.
```{r biclust_3}
# Functions for obtaining biclust information
# Get clusters for rows and columns
.get_biclusters_from_biclust <- function(bc, assay){
# Get cluster information for columns and rows
bc_columns <- t(bc@NumberxCol)
bc_columns <- data.frame(bc_columns)
bc_rows <- bc@RowxNumber
bc_rows <- data.frame(bc_rows)
# Get data into right format
bc_columns <- .manipulate_bc_data(bc_columns, assay, "col")
bc_rows <- .manipulate_bc_data(bc_rows, assay, "row")
return(list(bc_columns = bc_columns, bc_rows = bc_rows))
}
# Input clusters, and how many observations there should be, i.e., the number of samples or features
.manipulate_bc_data <- function(bc_clusters, assay, row_col){
# Get right dimension
dim <- ifelse(row_col == "col", ncol(assay), nrow(assay))
# Get column/row names
if( row_col == "col" ){
names <- colnames(assay)
} else{
names <- rownames(assay)
}
# If no clusters were found, create one. Otherwise create additional cluster which
# contain those samples that are not included in clusters that were found.
if( nrow(bc_clusters) != dim ){
bc_clusters <- data.frame(cluster = rep(TRUE, dim))
} else {
# Create additional cluster that includes those samples/features that
# are not included in other clusters.
vec <- ifelse(rowSums(bc_clusters) > 0, FALSE, TRUE)
# If additional cluster contains samples, then add it
if ( any(vec) ){
bc_clusters <- cbind(bc_clusters, vec)
}
}
# Adjust row and column names
rownames(bc_clusters) <- names
colnames(bc_clusters) <- paste0("cluster_", 1:ncol(bc_clusters))
return(bc_clusters)
}
```
```{r biclust_4}
# Get biclusters
bcs <- .get_biclusters_from_biclust(bc, corr)
bicluster_rows <- bcs$bc_rows
bicluster_columns <- bcs$bc_columns
# Print biclusters for rows
head(bicluster_rows)
```
Let's collect information for the scatter plot.
```{r biclust_5}
# Function for obtaining sample-wise sum, mean, median, and mean variance for each cluster
.sum_mean_median_var <- function(tse1, tse2, assay_name1, assay_name2, clusters1, clusters2){
list <- list()
# Create a data frame that includes all the information
for(i in 1:ncol(clusters1) ){
# Subset data based on cluster
tse_subset1 <- tse1[clusters1[,i], ]
tse_subset2 <- tse2[clusters2[,i], ]
# Get assay
assay1 <- assay(tse_subset1, assay_name1)
assay2 <- assay(tse_subset2, assay_name2)
# Calculate sum, mean, median, and mean variance
sum1 <- colSums2(assay1, na.rm = T)
mean1 <- colMeans2(assay1, na.rm = T)
median1 <- colMedians(assay1, na.rm = T)
var1 <- colVars(assay1, na.rm = T)
sum2 <- colSums2(assay2, na.rm = T)
mean2 <- colMeans2(assay2, na.rm = T)
median2 <- colMedians(assay2, na.rm = T)
var2 <- colVars(assay2, na.rm = T)
list[[i]] <- data.frame(sample = colnames(tse1), sum1, sum2, mean1, mean2,
median1, median2, var1, var2)
}
return(list)
}
# Calculate info
df <- .sum_mean_median_var(mae[[1]], mae[[2]], "clr", "nmr", bicluster_rows, bicluster_columns)
```
Now we can create a scatter plot. X-axis includes median clr abundance of microbiome
and y-axis median absolute concentration of each metabolite. Each data point represents
a single sample.
From the plots, we can see that there is low negative correlation in both cluster 1 and 3.
This means that when abundance of bacteria belonging to cluster 1 or 3 is higher,
the concentration of metabolites of cluster 1 or 3 is lower, and vice versa.
```{r biclust_6, fig.width=14, fig.height=6, fig.show="keep", out.width="33%"}
pics <- list()
for(i in seq_along(df)){
pics[[i]] <- ggplot(df[[i]]) +
geom_point(aes(x = median1, y = median2)) +
labs(title = paste0("Cluster ", i),
x = "Taxa (clr median)",
y = "Metabolites (abs. median)")
print(pics[[i]])
}
# pics[[1]] + pics[[2]] + pics[[3]]
```
_pheatmap_ does not allow boolean values, so they must be converted into factors.
```{r biclust_7}
bicluster_columns <- data.frame(apply(bicluster_columns, 2, as.factor))
bicluster_rows <- data.frame(apply(bicluster_rows, 2, as.factor))
```
Again, we can plot clusters with heatmap.
```{r biclust_8, fig.width=10, fig.height=10}
# Adjust colors for all clusters
if( ncol(bicluster_rows) > ncol(bicluster_columns) ){
cluster_names <- colnames(bicluster_rows)
} else {
cluster_names <- colnames(bicluster_columns)
}
annotation_colors <- list()
for(name in cluster_names){
annotation_colors[[name]] <- c("TRUE" = "red", "FALSE" = "white")
}
# Create a heatmap
pheatmap(corr, cluster_cols = F, cluster_rows = F,
annotation_col = bicluster_columns,
annotation_row = bicluster_rows,
annotation_colors = annotation_colors)
```
### Taxa vs taxa
Third and final example deals with situation where we want to analyze correlation
between taxa. _biclust_ is suitable for this.
```{r biclust_9}
# Calculate cross-correlation
corr <- getExperimentCrossCorrelation(mae, 1, 1,
assay_name1 = "clr", assay_name2 = "clr",
mode = "matrix",
cor_threshold = 0.2, verbose = F, show_warning = F)
# Find biclusters
bc <- biclust(corr, method=BCPlaid(), fit.model = y ~ m,
background = TRUE, shuffle = 100, back.fit = 0, max.layers = 10,
iter.startup = 10, iter.layer = 100, verbose = FALSE)
```
```{r biclust_10}
# Get biclusters
bcs <- .get_biclusters_from_biclust(bc, corr)
bicluster_rows <- bcs$bc_rows
bicluster_columns <- bcs$bc_columns
```
```{r biclust_11}
# Create a column that combines information
# If row/column includes in multiple clusters, cluster numbers are separated with "_&_"
bicluster_columns$clusters <- apply(bicluster_columns, 1,
function(x){paste(paste(which(x)), collapse = "_&_") })
bicluster_columns <- bicluster_columns[, "clusters", drop = FALSE]
bicluster_rows$clusters <- apply(bicluster_rows, 1,
function(x){paste(paste(which(x)), collapse = "_&_") })
bicluster_rows <- bicluster_rows[, "clusters", drop = FALSE]
```
```{r biclust_12, fig.width=14, fig.height=12}
# Convert boolean values into factor
bicluster_columns <- data.frame(apply(bicluster_columns, 2, as.factor))
bicluster_rows <- data.frame(apply(bicluster_rows, 2, as.factor))
pheatmap(corr, cluster_cols = F, cluster_rows = F,
annotation_col = bicluster_columns,
annotation_row = bicluster_rows)
```
## Additional Community Typing
For more community typing techniques applied to the 'SprockettTHData' data set, see the attached .Rmd file.
Link:
* [Rmd](add-comm-typing.Rmd)