12_quality_control.Rmd

# Exploration and quality Control {#quality-control}


```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```

This chapter focuses on the quality control and exploration of
microbiome data and establishes commonly used descriptive
summaries. Familiarizing with the peculiarities of a given data set is
the essential basis for any data analysis and model building.

The dataset should not suffer from severe technical biases, and you
should at least be aware of potential challenges, such as outliers,
biases, unexpected patterns and so forth. Standard summaries and
visualizations can help, and the rest comes with experience. The
exploration and quality control can be iterative processes.


```{r, message=FALSE}
library(mia)
```


## Abundance

Abundance visualization is an important data exploration
approach. `miaViz` offers the function `plotAbundanceDensity` to plot
the most abundant taxa with several options.

In the following few demonstrations are shown, using the [@Lahti2014]
dataset. A Jitter plot based on relative abundance data, similar to
the one presented at [@Salosensaari2021] supplementary figure 1, could
be visualized as follows:

```{r, warning=FALSE, message=FALSE}
# Loading example data
#library(microbiomeDataSets)
#tse <- atlas1006()
library(miaTime)
data(hitchip1006)
tse <- hitchip1006

# Add relative abundances
tse <- transformSamples(tse, method = "relabundance")

library(miaViz)
plotAbundanceDensity(tse, layout = "jitter", assay_name = "relabundance",
                     n = 40, point_size=1, point_shape=19, point_alpha=0.1) + 
    scale_x_log10(label=scales::percent)
```

The relative abundance values for the top-5 taxonomic features can be
visualized as a density plot over a log scaled axis, with
"nationality" indicated by colors:

```{r, warning=FALSE, message=FALSE}
plotAbundanceDensity(tse, layout = "density", assay_name = "relabundance",
                     n = 5, colour_by="nationality", point_alpha=1/10) +
    scale_x_log10()
```


## Prevalence

Prevalence quantifies the frequency of samples where certain microbes
were detected (above a given detection threshold). The prevalence can
be given as sample size (N) or percentage (unit interval).

Investigating prevalence allows you either to focus on changes which
pertain to the majority of the samples, or identify rare microbes,
which may be _conditionally abundant_ in a small number of samples,
however.

The population prevalence (frequency) at a 1% relative abundance
threshold (`detection = 1/100` and `as_relative = TRUE`), can look
like this. 

```{r exploration-prevalence}
head(getPrevalence(tse, detection = 1/100, sort = TRUE, as_relative = TRUE))
```

The function arguments `detection` and `as_relative` can also be used
to access, how many samples do pass a threshold for raw counts. Here
the population prevalence (frequency) at the absolute abundance
threshold (`as_relative = FALSE`) at read count 1 (`detection = 1`) is
accessed.

```{r concepts_prevalence2}
head(getPrevalence(tse, detection = 1, sort = TRUE, assay_name = "counts",
                   as_relative = FALSE))
```

If the output should be used for subsetting or storing the data in the
`rowData`, set `sort = FALSE`.


### Prevalence analysis

To investigate microbiome prevalence at a selected taxonomic level, two 
approaches are available.

First the data can be agglomerated to the taxonomic level and `getPrevalence` 
applied on the resulting object.

```{r}
# Agglomerate taxa abundances to Phylum level, and add the new table to the altExp slot
altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table from the altExp slot
head(getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
                   assay_name = "counts", as_relative = TRUE))
```


Alternatively, the `rank` argument could be set to perform the agglomeration on
the fly.

```{r}
head(getPrevalence(tse, rank = "Phylum", detection = 1/100, sort = TRUE,
                   assay_name = "counts", as_relative = TRUE))
```

Note that, by default, `na.rm = TRUE` is used for agglomeration in
`getPrevalence`, whereas the default for `agglomerateByRank` is
`FALSE` to prevent accidental data loss.

If you only need the names of the prevalent taxa, `getPrevalentTaxa`
is available. This returns the taxa that exceed the given prevalence
and detection thresholds.

```{r core-members, message=FALSE, warning=FALSE, eval = FALSE}
getPrevalentTaxa(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalentTaxa(tse, detection = 0, prevalence = 50/100,
                         rank = "Phylum", sort = TRUE)
prev
```

Note that the `detection` and `prevalence` thresholds are not the same, since
`detection` can be applied to relative counts or absolute counts depending on 
whether `as_relative` is set `TRUE` or `FALSE`


The function ‘getPrevalentAbundance’ can be used to check the total
relative abundance of the prevalent taxa (between 0 and 1).


### Rare taxa

Related functions are available for the analysis of rare taxa
(`rareMembers`; `rareAbundance`; `lowAbundance`, `getRareTaxa`,
`subsetByRareTaxa`).


### Plotting prevalence

To plot the prevalence, add the prevalence of each taxa to the
`rowData`. Here, we are analysing the Phylum level abundances, which
are stored in the altExp slot.

```{r}
rowData(altExp(tse,"Phylum"))$prevalence <- 
    getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = FALSE,
                  assay_name = "counts", as_relative = TRUE)
```

The prevalences can be then plotted via the plotting functions from
the `scater` package.
 
```{r, message=FALSE, warning=FALSE}
library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")
```

The prevalence can be also visualized on the taxonomic tree with the
`miaViz` package.

```{r}
altExps(tse) <- splitByRanks(tse)
altExps(tse) <-
   lapply(altExps(tse),
          function(y){
              rowData(y)$prevalence <- 
                  getPrevalence(y, detection = 1/100, sort = FALSE,
                                assay_name = "counts", as_relative = TRUE)
              y
          })
top_phyla <- getTopTaxa(altExp(tse,"Phylum"),
                        method="prevalence",
                        top=5L,
                        assay_name="counts")
top_phyla_mean <- getTopTaxa(altExp(tse,"Phylum"),
                             method="mean",
                             top=5L,
                             assay_name="counts")
x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addTaxonomyTree(x)
```
 
After some preparation the data is assembled and can be plotted via 
`plotRowTree`.

```{r plot-prev-prev, message=FALSE, fig.cap="Prevalence of top phyla as judged by prevalence"}
library(miaViz)
plotRowTree(x[rowData(x)$Phylum %in% top_phyla,],
            edge_colour_by = "Phylum",
            tip_colour_by = "prevalence",
            node_colour_by = "prevalence")
```
```{r plot-prev-mean, message=FALSE, fig.cap="Prevalence of top phyla as judged by mean abundance"}
plotRowTree(x[rowData(x)$Phylum %in% top_phyla_mean,],
            edge_colour_by = "Phylum",
            tip_colour_by = "prevalence",
            node_colour_by = "prevalence")
```

## Quality control {#qc}

Next, let us load the `GlobalPatterns` data set to illustrate standard
microbiome data summaries.

```{r, message=FALSE}
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns 
```


### Top taxa  

The `getTopTaxa` identifies top taxa in the data.   

```{r top-feature-taxo}
# Pick the top taxa
top_features <- getTopTaxa(tse, method="median", top=10)

# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)]
```


### Library size / read count  

The total counts/sample can be calculated using the
`perCellQCMetrics`/`addPerCellQC` from the `scater` package. The former one 
just calculates the values, whereas the latter one directly adds them to the
`colData`.

```{r lib-size}
library(scater)
perCellQCMetrics(tse)
tse <- addPerCellQC(tse)
colData(tse)
```

The distribution of calculated library sizes can be visualized as a
histogram (left), or by sorting the samples by library size (right).

```{r plot-viz-lib-size-1, fig.width=8, fig.height=4, fig.cap="Library size distribution."}
library(ggplot2)

p1 <- ggplot(as.data.frame(colData(tse))) +
        geom_histogram(aes(x = sum), color = "black", fill = "gray", bins = 30) +
        labs(x = "Library size", y = "Frequency (n)") + 
        # scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x), 
        # labels = scales::trans_format("log10", scales::math_format(10^.x))) +
        theme_bw() +
        theme(panel.grid.major = element_blank(), # Removes the grid
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black")) # Adds y-axis

library(dplyr)
df <- as.data.frame(colData(tse)) %>%
        arrange(sum) %>%
        mutate(index = 1:n())
p2 <- ggplot(df, aes(y = index, x = sum/1e6)) +
        geom_point() +	
        labs(x = "Library size (million reads)", y = "Sample index") +	
        theme_bw() +
        theme(panel.grid.major = element_blank(), # Removes the grid
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black")) # Adds y-axis

library(patchwork)
p1 + p2
```

Library sizes - and other variables from `colData` - can be also visualized by using 
specified function called `plotColData`.

```{r plot-viz-lib-size-2, fig.width=8, fig.height=4, fig.cap="Library sizes per sample."}
library(ggplot2)
# Sort samples by read count, order the factor levels, and store back to tse as DataFrame
# TODO: plotColData could include an option for sorting samples based on colData variables
colData(tse) <- as.data.frame(colData(tse)) %>%
                 arrange(X.SampleID) %>%
        	 mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
		 DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    labs(y = "Library size (N)", x = "Sample ID") 	    
```

```{r plot-viz-lib-size-3, fig.width=8, fig.height=4, fig.cap="Library sizes per sample type."}
plotColData(tse,"sum","SampleType", colour_by = "SampleType") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1))
```

In addition, data can be rarefied with
[subsampleCounts](https://microbiome.github.io/mia/reference/subsampleCounts.html),
which normalises the samples to an equal number of reads. However,
this practice has been discouraged for the analysis of differentially
abundant microorganisms (see [@mcmurdie2014waste]).
  

### Contaminant sequences

Samples might be contaminated with exogenous sequences. The impact of
each contaminant can be estimated based on their frequencies and
concentrations across the samples.

The following [decontam functions](https://microbiome.github.io/mia/reference/isContaminant.html)
are based on the [@davis2018simple] and support such functionality:

* `isContaminant`, `isNotContaminant`
* `addContaminantQC`, `addNotContaminantQC`


## Session Info {-}

```{r sessionInfo, echo=FALSE, results='asis'}
prettySessionInfo()
```