-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathscater_template.Rmd
executable file
·251 lines (203 loc) · 8.96 KB
/
scater_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
title: "`r id`"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This document shows the quality control of the data set `r id`, using the
package
[scater](https://www.bioconductor.org/packages/release/bioc/html/scater.html).
## Load the necessary packages
```{r}
suppressPackageStartupMessages(library(MultiAssayExperiment))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(scater))
```
## Dataset summary
```{r, echo = FALSE}
knitr::kable(data.frame(n_samples = nrow(colData(maex)),
n_genes = nrow(experiments(maex)[["gene"]]),
n_transcripts = nrow(experiments(maex)[["tx"]]),
organism = metadata(maex)$organism,
genome = metadata(maex)$genome),
align = "c",
caption = "Data set summary")
```
The data is stored as a
[MultiAssayExperiment](https://bioconductor.org/packages/devel/bioc/html/MultiAssayExperiment.html)
object. This object contains both gene-level and transcript-level data. For the
transcript-level data, it contains both counts and TPMs. There are two different
gene-level summaries. The one named `gene` is obtained by summing the TPMs and
counts, respectively, across all transcripts for a given gene. The one named
`genelstpm` contains the same TPMs as `gene`, but the counts slot is obtained by
taking the summed TPMs and scaling them to the count scale. The latter procedure
avoids problems related to differential isoform usage and the fact that regular
counts depend on the length of the feature as well as on its abundance.
```{r}
print(maex)
```
```{r}
if ("salmon_summary" %in% names(metadata(maex)))
DT::datatable(metadata(maex)$salmon_summary, options = list(scrollX = TRUE))
if ("rapmap_summary" %in% names(metadata(maex)))
DT::datatable(metadata(maex)$rapmap_summary, options = list(scrollX = TRUE))
```
In this report, we use the gene-level TPMs and the counts obtained by scaling
the TPMs (if available). For UMI data sets, we use the gene-level tag counts.
The data object also contains a slot with phenotypical data for the cells.
```{r}
if (all(c("count_lstpm", "TPM") %in% names(assays(experiments(maex)[["gene"]])))) {
cts <- assays(experiments(maex)[["gene"]])[["count_lstpm"]]
tpms <- assays(experiments(maex)[["gene"]])[["TPM"]]
} else {
cts <- assays(experiments(maex)[["gene"]])[["count"]]
tpms <- NULL
}
phn <- colData(maex)
phn[, paste(phenoid, collapse = ".")] <- as.character(interaction(as.data.frame(phn[, phenoid])))
```
Although no statistical testing will be performed in this report, we use the
sample annotation denoted ``r paste(phenoid, collapse = ".")`` to stratify and
color the samples for some of the plots.
```{r}
table(phn[, paste(phenoid, collapse = ".")])
```
### Create a SingleCellExperiment object
The functions in the
[scater](https://www.bioconductor.org/packages/release/bioc/html/scater.html)
package operate on objects of the class `SingleCellExperiment`. Here, we generate such an
object, containing the counts, TPMs and phenotypic data.
```{r}
sce <- SingleCellExperiment(assays = list(counts = cts,
tpm = tpms),
colData = as.data.frame(phn))
logcounts(sce) <- log2(calculateCPM(sce, use.size.factors = FALSE) + 1)
```
## Plot overview for each cell
First we plot the relative proportion of each library that is accounted for by
the most highly expressed genes. The cells are stratified by the ``r phenoid``
annotation.
```{r}
tryCatch({
plotScater(sce, block1 = paste(phenoid, collapse = "."))
}, error = function(e) print(e))
```
## Calculate quality metrics and filter
Next we calculate various quality control metrics for the data set. If there are
at least two ERCC spike-in transcripts that are expressed (count > 0) anywhere in
the data set, we will consider these as the control features. Otherwise, no
control features are used.
```{r}
if (sum(rowSums(counts(sce)[grep("^ERCC-", rownames(sce)), ] > 0) >
min(table(colData(sce)[, paste(phenoid, collapse = ".")]))) > 1) {
withcontrols <- TRUE
sce <- calculateQCMetrics(sce, feature_controls = list(ERCC = grep("^ERCC-", rownames(sce),
value = TRUE)))
} else {
sce <- calculateQCMetrics(sce)
withcontrols <- FALSE
}
```
We filter the data set to keep only those features that are observed in at least 1 cell.
```{r}
keep_features <- rowSums(counts(sce) > 0) >= 1
sce <- sce[keep_features, ]
```
Next, we exclude the cells with less than 5 detected features
```{r}
sce <- sce[, sce$total_features > 5]
```
## Plot QC metrics
Next, we plot some of the QC metrics calculated above. First, we show the 50
features with the highest expression in the data set.
```{r}
plotQC(sce, type = "highest-expression")
```
Next, we find the principal components that are most highly correlated with the
number of observed features. These are plotted both to show the association to
the number of observed features, and the correlation between each other.
```{r, warning = FALSE}
plotQC(sce, type = "find-pcs", variable = "total_features")
plotQC(sce, type = "find-pcs", variable = "total_features", plot_type = "pairs-pcs")
```
Next, we visualize a subset of the sample annotations in terms of the amount of
marginal variance that they explain. These are also visualized in a pairs plot
to show the associations among them.
```{r}
expl_vars <- c(phenoid, "log10_total_counts", "log10_total_features",
"pct_counts_top_200_features", ifelse(withcontrols, "log10_counts_feature_controls", NA),
ifelse(withcontrols, "pct_counts_feature_controls", NA))
plotQC(sce, type = "explanatory-variables", variables = expl_vars[!is.na(expl_vars)])
plotQC(sce, type = "explanatory-variables", variables = expl_vars[!is.na(expl_vars)], method = "pairs")
```
We also construct a couple of plots comparing various features of the cells. The
cells are colored by the ``r paste(phenoid, collapse = ".")`` annotation.
```{r}
plotPhenoData(sce, aes_string(x = "log10_total_counts", y = "total_features",
colour = paste(phenoid, collapse = "."))) +
guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
plotPhenoData(sce, aes_string(x = "pct_counts_top_200_features", y = "total_features",
colour = paste(phenoid, collapse = "."))) +
guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
if (withcontrols) {
plotPhenoData(sce, aes_string(x = "total_features", y = "pct_counts_ERCC",
color = paste(phenoid, collapse = "."))) +
guides(colour = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps)
}
```
Finally, we show the frequency of expression (the number of cells in which a
gene is observed) against the average expression of the gene.
```{r}
tryCatch({
plotQC(sce, type = "exprs-freq-vs-mean")
}, error = function(e) NULL)
```
## Plot low-dimensional representations
Finally, we use principal component analysis (PCA) as well as t-SNE to generate
two-dimensional representations of the cells in the data set. The cells are
colored by the ``r phenoid`` annotation.
```{r, fig.width = 8}
print(scater::plotPCA(sce, colour_by = paste(phenoid, collapse = ".")) +
guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
if (length(phenoid) > 1) {
for (pid in phenoid) {
print(scater::plotPCA(sce, colour_by = pid) +
guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
}
}
if (!is.null(tpms)) {
if (any(duplicated(tpm(sce))) || any(duplicated(t(tpm(sce))))) {
tpm(sce) <- tpm(sce) + matrix(rnorm(length(tpm(sce)), mean = 0, sd = 1e-5),
nrow = nrow(tpm(sce)))
}
tsne <- tryCatch(
scater::plotTSNE(sce, exprs_values = "tpm", return_SCE = TRUE),
error = function(e) NULL)
} else {
if (any(duplicated(exprs(sce))) || any(duplicated(t(exprs(sce))))) {
exprs(sce) <- exprs(sce) + matrix(rnorm(length(exprs(sce)), mean = 0, sd = 1e-5),
nrow = nrow(exprs(sce)))
}
tsne <- tryCatch(
scater::plotTSNE(sce, exprs_values = "exprs", return_SCE = TRUE),
error = function(e) NULL)
}
if (!is.null(tsne)) {
print(scater::plotReducedDim(tsne, colour_by = paste(phenoid, collapse = "."),
use_dimred = "TSNE") +
guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
if (length(phenoid) > 1) {
for (pid in phenoid) {
print(scater::plotReducedDim(tsne, colour_by = pid, use_dimred = "TSNE") +
guides(fill = guide_legend(nrow = nrw, byrow = TRUE)) + theme(legend.position = lps))
}
}
}
```
## Session info
```{r}
sessionInfo()
```