forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path12_quality_control.Rmd
339 lines (257 loc) · 11.2 KB
/
12_quality_control.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
# Exploration and quality Control {#quality-control}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
This chapter focuses on the quality control and exploration of
microbiome data and establishes commonly used descriptive
summaries. Familiarizing with the peculiarities of a given data set is
the essential basis for any data analysis and model building.
The dataset should not suffer from severe technical biases, and you
should at least be aware of potential challenges, such as outliers,
biases, unexpected patterns and so forth. Standard summaries and
visualizations can help, and the rest comes with experience. The
exploration and quality control can be iterative processes.
```{r, message=FALSE}
library(mia)
```
## Abundance
Abundance visualization is an important data exploration
approach. `miaViz` offers the function `plotAbundanceDensity` to plot
the most abundant taxa with several options.
In the following few demonstrations are shown, using the [@Lahti2014]
dataset. A Jitter plot based on relative abundance data, similar to
the one presented at [@Salosensaari2021] supplementary figure 1, could
be visualized as follows:
```{r, warning=FALSE, message=FALSE}
# Loading example data
#library(microbiomeDataSets)
#tse <- atlas1006()
library(miaTime)
data(hitchip1006)
tse <- hitchip1006
# Add relative abundances
tse <- transformSamples(tse, method = "relabundance")
library(miaViz)
plotAbundanceDensity(tse, layout = "jitter", assay_name = "relabundance",
n = 40, point_size=1, point_shape=19, point_alpha=0.1) +
scale_x_log10(label=scales::percent)
```
The relative abundance values for the top-5 taxonomic features can be
visualized as a density plot over a log scaled axis, with
"nationality" indicated by colors:
```{r, warning=FALSE, message=FALSE}
plotAbundanceDensity(tse, layout = "density", assay_name = "relabundance",
n = 5, colour_by="nationality", point_alpha=1/10) +
scale_x_log10()
```
## Prevalence
Prevalence quantifies the frequency of samples where certain microbes
were detected (above a given detection threshold). The prevalence can
be given as sample size (N) or percentage (unit interval).
Investigating prevalence allows you either to focus on changes which
pertain to the majority of the samples, or identify rare microbes,
which may be _conditionally abundant_ in a small number of samples,
however.
The population prevalence (frequency) at a 1% relative abundance
threshold (`detection = 1/100` and `as_relative = TRUE`), can look
like this.
```{r exploration-prevalence}
head(getPrevalence(tse, detection = 1/100, sort = TRUE, as_relative = TRUE))
```
The function arguments `detection` and `as_relative` can also be used
to access, how many samples do pass a threshold for raw counts. Here
the population prevalence (frequency) at the absolute abundance
threshold (`as_relative = FALSE`) at read count 1 (`detection = 1`) is
accessed.
```{r concepts_prevalence2}
head(getPrevalence(tse, detection = 1, sort = TRUE, assay_name = "counts",
as_relative = FALSE))
```
If the output should be used for subsetting or storing the data in the
`rowData`, set `sort = FALSE`.
### Prevalence analysis
To investigate microbiome prevalence at a selected taxonomic level, two
approaches are available.
First the data can be agglomerated to the taxonomic level and `getPrevalence`
applied on the resulting object.
```{r}
# Agglomerate taxa abundances to Phylum level, and add the new table to the altExp slot
altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table from the altExp slot
head(getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
assay_name = "counts", as_relative = TRUE))
```
Alternatively, the `rank` argument could be set to perform the agglomeration on
the fly.
```{r}
head(getPrevalence(tse, rank = "Phylum", detection = 1/100, sort = TRUE,
assay_name = "counts", as_relative = TRUE))
```
Note that, by default, `na.rm = TRUE` is used for agglomeration in
`getPrevalence`, whereas the default for `agglomerateByRank` is
`FALSE` to prevent accidental data loss.
If you only need the names of the prevalent taxa, `getPrevalentTaxa`
is available. This returns the taxa that exceed the given prevalence
and detection thresholds.
```{r core-members, message=FALSE, warning=FALSE, eval = FALSE}
getPrevalentTaxa(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalentTaxa(tse, detection = 0, prevalence = 50/100,
rank = "Phylum", sort = TRUE)
prev
```
Note that the `detection` and `prevalence` thresholds are not the same, since
`detection` can be applied to relative counts or absolute counts depending on
whether `as_relative` is set `TRUE` or `FALSE`
The function ‘getPrevalentAbundance’ can be used to check the total
relative abundance of the prevalent taxa (between 0 and 1).
### Rare taxa
Related functions are available for the analysis of rare taxa
(`rareMembers`; `rareAbundance`; `lowAbundance`, `getRareTaxa`,
`subsetByRareTaxa`).
### Plotting prevalence
To plot the prevalence, add the prevalence of each taxa to the
`rowData`. Here, we are analysing the Phylum level abundances, which
are stored in the altExp slot.
```{r}
rowData(altExp(tse,"Phylum"))$prevalence <-
getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = FALSE,
assay_name = "counts", as_relative = TRUE)
```
The prevalences can be then plotted via the plotting functions from
the `scater` package.
```{r, message=FALSE, warning=FALSE}
library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")
```
The prevalence can be also visualized on the taxonomic tree with the
`miaViz` package.
```{r}
altExps(tse) <- splitByRanks(tse)
altExps(tse) <-
lapply(altExps(tse),
function(y){
rowData(y)$prevalence <-
getPrevalence(y, detection = 1/100, sort = FALSE,
assay_name = "counts", as_relative = TRUE)
y
})
top_phyla <- getTopTaxa(altExp(tse,"Phylum"),
method="prevalence",
top=5L,
assay_name="counts")
top_phyla_mean <- getTopTaxa(altExp(tse,"Phylum"),
method="mean",
top=5L,
assay_name="counts")
x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addTaxonomyTree(x)
```
After some preparation the data is assembled and can be plotted via
`plotRowTree`.
```{r plot-prev-prev, message=FALSE, fig.cap="Prevalence of top phyla as judged by prevalence"}
library(miaViz)
plotRowTree(x[rowData(x)$Phylum %in% top_phyla,],
edge_colour_by = "Phylum",
tip_colour_by = "prevalence",
node_colour_by = "prevalence")
```
```{r plot-prev-mean, message=FALSE, fig.cap="Prevalence of top phyla as judged by mean abundance"}
plotRowTree(x[rowData(x)$Phylum %in% top_phyla_mean,],
edge_colour_by = "Phylum",
tip_colour_by = "prevalence",
node_colour_by = "prevalence")
```
## Quality control {#qc}
Next, let us load the `GlobalPatterns` data set to illustrate standard
microbiome data summaries.
```{r, message=FALSE}
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
```
### Top taxa
The `getTopTaxa` identifies top taxa in the data.
```{r top-feature-taxo}
# Pick the top taxa
top_features <- getTopTaxa(tse, method="median", top=10)
# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)]
```
### Library size / read count
The total counts/sample can be calculated using the
`perCellQCMetrics`/`addPerCellQC` from the `scater` package. The former one
just calculates the values, whereas the latter one directly adds them to the
`colData`.
```{r lib-size}
library(scater)
perCellQCMetrics(tse)
tse <- addPerCellQC(tse)
colData(tse)
```
The distribution of calculated library sizes can be visualized as a
histogram (left), or by sorting the samples by library size (right).
```{r plot-viz-lib-size-1, fig.width=8, fig.height=4, fig.cap="Library size distribution."}
library(ggplot2)
p1 <- ggplot(as.data.frame(colData(tse))) +
geom_histogram(aes(x = sum), color = "black", fill = "gray", bins = 30) +
labs(x = "Library size", y = "Frequency (n)") +
# scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
# labels = scales::trans_format("log10", scales::math_format(10^.x))) +
theme_bw() +
theme(panel.grid.major = element_blank(), # Removes the grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) # Adds y-axis
library(dplyr)
df <- as.data.frame(colData(tse)) %>%
arrange(sum) %>%
mutate(index = 1:n())
p2 <- ggplot(df, aes(y = index, x = sum/1e6)) +
geom_point() +
labs(x = "Library size (million reads)", y = "Sample index") +
theme_bw() +
theme(panel.grid.major = element_blank(), # Removes the grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) # Adds y-axis
library(patchwork)
p1 + p2
```
Library sizes - and other variables from `colData` - can be also visualized by using
specified function called `plotColData`.
```{r plot-viz-lib-size-2, fig.width=8, fig.height=4, fig.cap="Library sizes per sample."}
library(ggplot2)
# Sort samples by read count, order the factor levels, and store back to tse as DataFrame
# TODO: plotColData could include an option for sorting samples based on colData variables
colData(tse) <- as.data.frame(colData(tse)) %>%
arrange(X.SampleID) %>%
mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(y = "Library size (N)", x = "Sample ID")
```
```{r plot-viz-lib-size-3, fig.width=8, fig.height=4, fig.cap="Library sizes per sample type."}
plotColData(tse,"sum","SampleType", colour_by = "SampleType") +
theme(axis.text.x = element_text(angle = 45, hjust=1))
```
In addition, data can be rarefied with
[subsampleCounts](https://microbiome.github.io/mia/reference/subsampleCounts.html),
which normalises the samples to an equal number of reads. However,
this practice has been discouraged for the analysis of differentially
abundant microorganisms (see [@mcmurdie2014waste]).
### Contaminant sequences
Samples might be contaminated with exogenous sequences. The impact of
each contaminant can be estimated based on their frequencies and
concentrations across the samples.
The following [decontam functions](https://microbiome.github.io/mia/reference/isContaminant.html)
are based on the [@davis2018simple] and support such functionality:
* `isContaminant`, `isNotContaminant`
* `addContaminantQC`, `addNotContaminantQC`
## Session Info {-}
```{r sessionInfo, echo=FALSE, results='asis'}
prettySessionInfo()
```