22_communitytyping.Rmd

# Community typing {#community-typing}

```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```

```{r load-pkg-data}
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
```


## Community typing

### Dirichlet Multinomial Mixtures (DMM)

This section focus on DMM analysis. 

One technique that allows to search for groups of samples that are
similar to each other is the [Dirichlet-Multinomial Mixture
Model](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126). In
DMM, we first determine the number of clusters (k) that best fit the
data (model evidence) using Laplace approximation. After fitting the
model with k clusters, we obtain for each sample k probabilities that
reflect the probability that a sample belongs to the given cluster.

Let's cluster the data with DMM clustering. 

```{r dmm}
# Runs model and calculates the most likely number of clusters from 1 to 7.
# Since this is a large dataset it takes long computational time.
# For this reason we use only a subset of the data; agglomerated by Phylum as a rank.
tse <- GlobalPatterns
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
tse_dmn <- runDMN(tse, name = "DMN", k = 1:7)
```

```{r}
# It is stored in metadata
tse_dmn
```

Return information on metadata that the object contains.

```{r}
names(metadata(tse_dmn))
```

This returns a list of DMN objects for a closer investigation.

```{r}
getDMN(tse_dmn)
```


Show Laplace approximation (model evidence) for each model of the k models.

```{r}
library(miaViz)
plotDMNFit(tse_dmn, type = "laplace")
```

Return the model that has the best fit (see also `BestDMNFit`, which returns the index of the best fit).

```{r}
getBestDMNFit(tse_dmn, type = "laplace")
```
### PCoA for ASV-level data with Bray-Curtis; with DMM clusters shown with colors

Group samples and return DMNGroup object that contains a summary (see also `performDMNgroupCV`, which returns an equivalent data.frame).
Patient status is used for grouping.

```{r}
dmn_group <- calculateDMNgroup(tse_dmn, variable = "SampleType",  exprs_values = "counts",
                               k = 2, seed=.Machine$integer.max)

dmn_group
```

Mixture weights  (rough measure of the cluster size).


```{r}
DirichletMultinomial::mixturewt(getBestDMNFit(tse_dmn))
```


Samples-cluster assignment probabilities / how probable it is that sample belongs
to each cluster

```{r}
head(DirichletMultinomial::mixture(getBestDMNFit(tse_dmn)))
```

Contribution of each taxa to each component

```{r}
head(DirichletMultinomial::fitted(getBestDMNFit(tse_dmn)))
```
Get the assignment probabilities


```{r}
prob <- DirichletMultinomial::mixture(getBestDMNFit(tse_dmn))
# Add column names
colnames(prob) <- c("comp1", "comp2")

# For each row, finds column that has the highest value. Then extract the column 
# names of highest values.
vec <- colnames(prob)[max.col(prob,ties.method = "first")]

# Add info to colData
colData(tse)$dmm_component <- vec 
```

Computing the euclidean PCoA and storing it as a data frame

```{r}
# Does clr transformation. Pseudocount is added, because data contains zeros.
tse <- transformSamples(tse, method = "relabundance", pseudocount = 1)
tse <- transformCounts(tse, assay_name = "relabundance", method = "clr")

library(scater)
# Calculate PCoA
tse <- runMDS(tse, exprs_values = "clr", method = "euclidean")

```

```{r}
# Create a plot
euclidean_dmm_plot <- plotReducedDim(tse, "MDS", colour_by = "dmm_component") +
    labs(x = "Coordinate 1",
       y = "Coordinate 2",
       title = "PCoA with Aitchison distances") +  
  theme(title = element_text(size = 12)) # makes titles smaller

euclidean_dmm_plot
```

## Community Detection

Another approach for discovering communities within the samples of the
data, is to run community detection algorithms after building a
graph. The following demonstration builds a graph based on the k
nearest-neighbors and performs the community detection on the fly.

_`bluster`_ [@R-bluster] package offers several clustering methods,
among which graph-based are present, enabling the community detection
task.

Installing package:

```{r}
if(!require(bluster)){
  BiocManager::install("bluster")
}
```

The algorithm used is "short random walks" [@Pons2006]. Graph is
constructed using different k values (the number of nearest neighbors
to consider during graph construction) using the robust centered log
ratio (rclr) assay data. Then plotting the communities using UMAP
[@McInnes2018] ordination as a visual exploration aid.  In the
following demonstration we use the `enterotype` dataset from the
[@R-mia] package.

```{r, message=FALSE, warning=FALSE}
library(bluster)
library(patchwork) # For arranging several plots as a grid
library(scater)

data("enterotype", package="mia")
tse <- enterotype
tse <- transformSamples(tse, method = "relabundance")
tse <- transformCounts(tse, method = "rclr")

# Performing and storing UMAP
tse <- runUMAP(tse, name="UMAP", exprs_values="rclr")

k <- c(2,3,5,10)
ClustAndPlot <- function(x) {
  # Creating the graph and running the short random walks algorithm  
  graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
  
  # Results of the clustering as a color for each sample
  plotUMAP(tse, colour_by = I(graph_clusters)) +
    labs(title = paste0("k = ", x))
}

# Applying the function for different k values
plots <- lapply(k,ClustAndPlot)

# Displaying plots in a grid
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]])
```

Similarly, the _`bluster`_ [@R-bluster] package offers clustering
diagnostics that can be used for judging the clustering quality (see
[Assorted clustering
diagnostics](http://bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html)).
In the following, Silhouette width as a diagnostic tool is computed
and results are visualized for each case presented earlier. For more
about Silhouettes read [@Rousseeuw1987].

```{r, message=FALSE, warning=FALSE}

ClustDiagPlot <- function(x) {
  # Getting the clustering results
  graph_clusters <- clusterRows(t(assays(tse)$rclr), NNGraphParam(k=x))
  
  # Computing the diagnostic info
  sil <- approxSilhouette(t(assays(tse)$rclr), graph_clusters)
  
  # Plotting as a boxlpot to observe cluster separation
  boxplot(split(sil$width, graph_clusters), main=paste0("k = ", x))
  
}
# Applying the function for different k values
res <- lapply(k,ClustDiagPlot)
```

## Additional Community Typing

For more community typing techniques applied to the 'SprockettTHData' data set, see the attached .Rmd file.

Link:

   * [Rmd](add-comm-typing.Rmd)