scater_template.Rmd

---
title: "`r id`"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This document shows the quality control of the data set `r id`, using the
package
[scater](https://www.bioconductor.org/packages/release/bioc/html/scater.html).

## Load the necessary packages

```{r}
suppressPackageStartupMessages(library(MultiAssayExperiment))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(scater))
```

## Dataset summary

```{r, echo = FALSE}
knitr::kable(data.frame(n_samples = nrow(colData(maex)), 
                        n_genes = nrow(experiments(maex)[["gene"]]),
                        n_transcripts = nrow(experiments(maex)[["tx"]]),
                        organism = metadata(maex)$organism,
                        genome = metadata(maex)$genome),
             align = "c",
             caption = "Data set summary")
```


The data is stored as a
[MultiAssayExperiment](https://bioconductor.org/packages/devel/bioc/html/MultiAssayExperiment.html)
object. This object contains both gene-level and transcript-level data. For the
transcript-level data, it contains both counts and TPMs. There are two different
gene-level summaries. The one named `gene` is obtained by summing the TPMs and
counts, respectively, across all transcripts for a given gene. The one named
`genelstpm` contains the same TPMs as `gene`, but the counts slot is obtained by
taking the summed TPMs and scaling them to the count scale. The latter procedure
avoids problems related to differential isoform usage and the fact that regular
counts depend on the length of the feature as well as on its abundance.

```{r}
print(maex)
```

```{r}
if ("salmon_summary" %in% names(metadata(maex)))
  DT::datatable(metadata(maex)$salmon_summary, options = list(scrollX = TRUE))
if ("rapmap_summary" %in% names(metadata(maex)))
  DT::datatable(metadata(maex)$rapmap_summary, options = list(scrollX = TRUE))
```

In this report, we use the gene-level TPMs and the counts obtained by scaling 
the TPMs (if available). For UMI data sets, we use the gene-level tag counts.
The data object also contains a slot with phenotypical data for the cells.

```{r}
if (all(c("count_lstpm", "TPM") %in% names(assays(experiments(maex)[["gene"]])))) {
  cts <- assays(experiments(maex)[["gene"]])[["count_lstpm"]]
  tpms <- assays(experiments(maex)[["gene"]])[["TPM"]]
} else {
  cts <- assays(experiments(maex)[["gene"]])[["count"]]
  tpms <- NULL
}
phn <- colData(maex)

phn[, paste(phenoid, collapse = ".")] <- as.character(interaction(as.data.frame(phn[, phenoid])))
```

Although no statistical testing will be performed in this report, we use the 
sample annotation denoted ``r paste(phenoid, collapse = ".")`` to stratify and
color the samples for some of the plots.

```{r}
table(phn[, paste(phenoid, collapse = ".")])
```


### Create a SingleCellExperiment object

The functions in the
[scater](https://www.bioconductor.org/packages/release/bioc/html/scater.html)
package operate on objects of the class `SingleCellExperiment`. Here, we generate such an
object, containing the counts, TPMs and phenotypic data.

```{r}
sce <- SingleCellExperiment(assays = list(counts = cts, 
                                          tpm = tpms), 
                            colData = as.data.frame(phn))
logcounts(sce) <- log2(calculateCPM(sce, use.size.factors = FALSE) + 1)
```

## Plot overview for each cell

First we plot the relative proportion of each library that is accounted for by
the most highly expressed genes. The cells are stratified by the ``r phenoid``
annotation.

```{r}
tryCatch({
  plotScater(sce, block1 = paste(phenoid, collapse = "."))
}, error = function(e) print(e))
```

## Calculate quality metrics and filter

Next we calculate various quality control metrics for the data set. If there are
at least two ERCC spike-in transcripts that are expressed (count > 0) anywhere in
the data set, we will consider these as the control features. Otherwise, no
control features are used.

```{r}
if (sum(rowSums(counts(sce)[grep("^ERCC-", rownames(sce)), ] > 0) > 
        min(table(colData(sce)[, paste(phenoid, collapse = ".")]))) > 1) {
  withcontrols <- TRUE
  sce <- calculateQCMetrics(sce, feature_controls = list(ERCC = grep("^ERCC-", rownames(sce), 
                                                                     value = TRUE)))
} else {
  sce <- calculateQCMetrics(sce)
  withcontrols <- FALSE
}
```

We filter the data set to keep only those features that are observed in at least 1 cell. 

```{r}
keep_features <- rowSums(counts(sce) > 0) >= 1
sce <- sce[keep_features, ]
```

Next, we exclude the cells with less than 5 detected features

```{r}
sce <- sce[, sce$total_features > 5]
```


## Plot QC metrics

Next, we plot some of the QC metrics calculated above. First, we show the 50
features with the highest expression in the data set.

```{r}
plotQC(sce, type = "highest-expression")
```

Next, we find the principal components that are most highly correlated with the 
number of observed features. These are plotted both to show the association to
the number of observed features, and the correlation between each other.

```{r, warning = FALSE}
plotQC(sce, type = "find-pcs", variable = "total_features")
plotQC(sce, type = "find-pcs", variable = "total_features", plot_type = "pairs-pcs")
```

Next, we visualize a subset of the sample annotations in terms of the amount of
marginal variance that they explain. These are also visualized in a pairs plot
to show the associations among them.

```{r}
expl_vars <- c(phenoid, "log10_total_counts", "log10_total_features", 
               "pct_counts_top_200_features", ifelse(withcontrols, "log10_counts_feature_controls", NA),
               ifelse(withcontrols, "pct_counts_feature_controls", NA))
plotQC(sce, type = "explanatory-variables", variables = expl_vars[!is.na(expl_vars)])
plotQC(sce, type = "explanatory-variables", variables = expl_vars[!is.na(expl_vars)], method = "pairs")
```

We also construct a couple of plots comparing various features of the cells. The
cells are colored by the ``r paste(phenoid, collapse = ".")`` annotation.

```{r}
plotPhenoData(sce, aes_string(x = "log10_total_counts", y = "total_features",
                              colour = paste(phenoid, collapse = "."))) + 
  guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
plotPhenoData(sce, aes_string(x = "pct_counts_top_200_features", y = "total_features", 
                              colour = paste(phenoid, collapse = "."))) + 
  guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
if (withcontrols) {
  plotPhenoData(sce, aes_string(x = "total_features", y = "pct_counts_ERCC", 
                                color = paste(phenoid, collapse = "."))) + 
    guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
}
```

Finally, we show the frequency of expression (the number of cells in which a
gene is observed) against the average expression of the gene.

```{r}
tryCatch({
  plotQC(sce, type = "exprs-freq-vs-mean")
}, error = function(e) NULL)
```


## Plot low-dimensional representations

Finally, we use principal component analysis (PCA) as well as t-SNE to generate
two-dimensional representations of the cells in the data set. The cells are
colored by the ``r phenoid`` annotation.

```{r, fig.width = 8}
print(scater::plotPCA(sce, colour_by = paste(phenoid, collapse = ".")) + 
        guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
if (length(phenoid) > 1) {
  for (pid in phenoid) {
    print(scater::plotPCA(sce, colour_by = pid) + 
            guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
  }
}
if (!is.null(tpms)) {
  if (any(duplicated(tpm(sce))) || any(duplicated(t(tpm(sce))))) {
    tpm(sce) <- tpm(sce) + matrix(rnorm(length(tpm(sce)), mean = 0, sd = 1e-5), 
                                        nrow = nrow(tpm(sce)))
  }
  tsne <- tryCatch(
    scater::plotTSNE(sce, exprs_values = "tpm", return_SCE = TRUE),
    error = function(e) NULL)
} else {
  if (any(duplicated(exprs(sce))) || any(duplicated(t(exprs(sce))))) {
    exprs(sce) <- exprs(sce) + matrix(rnorm(length(exprs(sce)), mean = 0, sd = 1e-5), 
                                            nrow = nrow(exprs(sce)))
  }
  tsne <- tryCatch(
    scater::plotTSNE(sce, exprs_values = "exprs", return_SCE = TRUE),
    error = function(e) NULL)
}
if (!is.null(tsne)) {
  print(scater::plotReducedDim(tsne, colour_by = paste(phenoid, collapse = "."),
                               use_dimred = "TSNE") + 
          guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
  if (length(phenoid) > 1) {
    for (pid in phenoid) {
      print(scater::plotReducedDim(tsne, colour_by = pid, use_dimred = "TSNE") + 
              guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
    }
  }
}
```

## Session info

```{r}
sessionInfo()
```