-
Notifications
You must be signed in to change notification settings - Fork 3
/
brainspan_part1.Rmd
460 lines (360 loc) · 20 KB
/
brainspan_part1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
---
title: "Smoller Brainspan Part I"
output:
html_document:
theme: cosmo
toc: true
toc_depth: 4
fig_caption: true
fig_width: 8
fig_height: 6
author: "Meeta Mistry"
---
```{r setup, echo=FALSE}
# Setup report details
clientname="Erin Dunn"
clientemail="[email protected]"
lablocation="MGH"
analystname="Meeta Mistry"
analystemail="[email protected]"
```
Array analysis for `r clientname` (`r clientemail`) at `r lablocation`. Contact `r analystname` (`r analystemail`) for additional details. Request from client was:
> We want to look at developmental changes in the postmortem human brain using gene expression data from publicly available from microarray studies. We hope to identify gene groups that share a similar trajectory of expression change, and use these groups to then overlay genetic association data. It doesn't look like BrainCloud is an ideal dataset because of the issues encountered previously, related to the PMI ([link to report](https://www.dropbox.com/s/ovpaggapjmuvhln/brainCloud.html?dl=0)). Given this, I think we should probably focus on BrainSpan and run a parallel set of analyses to see if things look cleaner.
## Workflow:
* grab the BrainSpan data from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25219) and associated metadata from publication [Kang et al., Nature 2011](http://www.nature.com/nature/journal/v478/n7370/full/nature10523.html)
* explore the metadata across all brain regions
* focus on OFC to look at correlations between demographic factors and Stage (our primary variable of interest)
* use raw data to assess QC pre-normalization
* differential expression analysis
## Setup
### Bioconductor and R libraries used
```{r libraries, echo=TRUE}
loadlibs <- function(){
library(ggplot2)
library(gtable)
library(scales)
library(RColorBrewer)
library(GEOquery)
library(affy)
library(arrayQualityMetrics)
library(reshape)
library(xtable)
library(ruv)
library(limma)
library(Biobase)
library(gridExtra)
library(stringr)
library(knitr)
library(png)
library(sva)
library(dplyr)
library(CHBUtils)
}
suppressPackageStartupMessages(loadlibs())
```
### Get variables
* get base directory for analyses
* specify data and results directories
* specify column headers used in metadata file
```{r variables, echo=TRUE}
# Setup directory variables
baseDir <- '.'
dataDir <- file.path(baseDir, "data")
metaDir <- file.path(dataDir, "meta")
resultsDir <- file.path(baseDir, "results")
```
## Load the expression data
We are using data that has been pre-processed using the RMA (Robust Multi-array Average) algorithm. RMA is a tool commonly used to correct Affymetrix expression data. The algorithm uses only the PM (perfect match) probes to compute background adjusted observed intensities based on a model that assumes noise is normally distributed and signal is exponentially distributed. Corrected data is then quantile normalized and log2 transformed. The median of all probe sets within one gene (transcript cluster) was used as the estimate of gene expression for a total of 17,565 mainly protein coding genes.
```{r dataimport GEO, echo=TRUE, warning=FALSE, message=FALSE}
# Load GEO data
gse_gene <- getGEO(filename=file.path(dataDir, 'geo/GSE25219-GPL5175_series_matrix.txt.gz'))
# gse_probe <- getGEO(filename=file.path(dataDir, 'geo/GSE25219-GPL5188_series_matrix.txt.gz'))
```
## Extract metadata and relevant categories
Metadata was loaded in from two files 1) donor-level 2) sample-level. Below we have summarized the data for each unique brain region (right hemisphere only). We included a plot to illustrate that the brain regions have varying number of donors. Brain regions with greater than 20 donors are those which we will focus on (a total of 16 regions).
```{r metadata extract, eval=FALSE, echo=FALSE}
### This code is NOT RUN in the report. It provides the script for parsing through the metadata provided in the GEO object. The relevant information was written to file and for ease that file is simply uploaded in the next code block.
names(pData(gse_gene))
pheno_gene <- pData(gse_gene)[,c(8,10:18)]
for (c in 2:ncol(pheno_gene)){
var <- as.character(pheno_gene[,c])
var.split <- strsplit(var, ":")
getlist<- sapply(var.split, "[[", 2)
getlist <- str_trim(getlist)
pheno_gene[,c] <- getlist
}
pheno_gene <- cbind(sapply(pheno_gene[1:7], factor), pheno_gene[,8:10])
colnames(pheno_gene) <-c ("SampleName", "BrainCode", "BrainRegion", "Hemisphere", "Sex", "Age",
"Stage", "PMI", "pH", "RIN")
write.table(pheno_gene, file=file.path(metaDir, 'brainspan_samples_metadata.txt'), sep="\t", quote=F)
```
```{r metadata from file, echo=TRUE, warning=FALSE, message=FALSE}
pheno_gene <- read.delim(file.path(metaDir, 'brainspan_samples_metadata.txt'), row.names=1)
pheno_gene$PMI <- as.numeric(as.character(pheno_gene$PMI))
pheno_gene <- pheno_gene[which(pheno_gene$Hemisphere == "R"),]
meta_donor <- read.delim(file.path(metaDir, 'brainspan_donor_metadata.txt'), row.names=1)
region <- group_by(pheno_gene, BrainRegion)
donors <- summarise(region, donors=n_distinct(BrainCode), meanPMI = mean(PMI, na.rm=T), minPMI = min(PMI, na.rm=T),
maxPMI = max(PMI, na.rm=T), meanPH = mean(pH, na.rm=T), minPH = min(pH, na.rm=T), maxPH = max(pH, na.rm=T),
meanRIN = mean(RIN, na.rm=T), minRIN = min(RIN, na.rm=T), maxRIN = max(RIN, na.rm=T),
Male=length(which(Sex == 'M')), Female=length(which(Sex == 'F')))
```
## Demographic data per brain region
```{r donortable-plot, warning=FALSE, message=FALSE, results='asis', fig.align='center', echo=FALSE}
kable(donors, row.names=F, format='markdown')
ggplot(donors, aes(x=BrainRegion, y=donors)) +
geom_bar() +
ggtitle('Samples per Brain Region') +
xlab('Brain Region') +
ylab('Number of Samples') +
theme(axis.text.x = element_text(colour="grey20",size=15,angle=45,hjust=.5,vjust=.5,face="plain"),
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.25)))
```
## Developmental stages
Based on the metadata, there are a total of 15 stages; these are described below in more detail. **Note: Not all Stages have been sampled for every brain region**
```{r dev-stages, echo=FALSE, fig.align='center', results='asis'}
# Age table
age.table <-as.character(unique(meta_donor$Period))
age.table <- strsplit(age.table, ",")
age.table <- do.call("rbind", age.table)
row.names(age.table) <- rep("", nrow(age.table))
colnames(age.table) <- c("Stage", "Description")
kable(data.frame(age.table), format="markdown", row.names=FALSE)
```
## Metadata exploration: focus OFC
In total there are `r length(unique(pheno_gene$BrainCode))` donors, from which multiple brain regions were sampled. As an example we have focused below on the Orbitofrontal cortex to explore how the different demographic factors vary with Stage.
```{r testMeta, echo=TRUE, results='asis', fig.align='center'}
# Subset data by region
meta_ofc <- droplevels(pheno_gene[which(pheno_gene$BrainRegion == "OFC"),])
meta_ofc$Stage <- factor(meta_ofc$Stage)
# Age distribution
ggplot(meta_ofc, aes(Stage)) +
geom_bar() +
ggtitle('Orbitofrontal Cortex: Age distribution') +
xlab('Developmental Stage') +
ylab('Number of Samples') +
theme(axis.text.x = element_text(colour="grey20",size=15,angle=0,hjust=.5,vjust=.5,face="plain"),
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.25)))
# pH measurements
ggplot(na.omit(meta_ofc), aes(x=Stage, y=pH, fill=Stage)) +
geom_boxplot() +
ggtitle('Orbitofrontal Cortex: pH levels') +
xlab('Stage') +
guides(fill=FALSE) +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
# Time between death and sample collection; remove non-numeric values
ggplot(na.omit(meta_ofc), aes(x=Stage, y=PMI, fill=Stage)) +
geom_boxplot() +
ggtitle('Orbitofrontal Cortex: Postmortem Intervals') +
xlab('Stage') +
ylab('Postmortem Interval (hours)') +
guides(fill=FALSE) +
theme(legend.position="none",
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
# RNA Integrity
ggplot(na.omit(meta_ofc), aes(x=Stage, y=RIN, fill=Stage)) +
geom_boxplot() +
ggtitle('Orbitofrontal Cortex: RIN') +
xlab('Stage') +
ylab('RIN') +
guides(fill=FALSE) +
theme(legend.position="none",
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
```
As we previously observed with the [Braincloud](http://braincloud.jhmi.edu/downloads.htm) data, in this dataset there is also a positive correlation observed with developmental stage and PMI and a negative correlations with RIN. This is not surprising, as fetal samples are more likely to have been obtained faster (with less chance of RNA degradation). The authors in [Kang et al.](http://www.nature.com/nature/journal/v478/n7370/full/nature10523.html) acknowledge the problem and account for it by incorporating both PMI and RIN as covariates in the model.
## Quality Control
We used ArrayQualityMetrics to assess the quality of the array data we obtained from [GEO](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25219). The full QC report can be found at this [link](./results/report_OFC_march2014/index.html), where all figures and analysis is described in detail. Below we have plotted some of the QC figures and provide some insight into the interpretation.
```{r organize eset, echo=FALSE}
# Add data
data_ofc <- exprs(gse_gene)
data_ofc <-data_ofc[,which(colnames(data_ofc) %in% rownames(meta_ofc))]
eset.ofc <- new("ExpressionSet", exprs=data_ofc)
# Add metadata
fetal <- rep("NA", nrow(meta_ofc))
stage.num <- as.numeric(as.character(meta_ofc$Stage))
fetal[which(stage.num <= 7)] <- "Fetal"
fetal[which(stage.num > 7)] <- "Postnatal"
fetal <- factor(fetal)
meta_new <-cbind(meta_ofc, fetal)
meta_new$Stage <- stage.num
pData(eset.ofc) <- meta_new
fData (eset.ofc) <- fData(gse_gene)
```
```{r QC_report, echo=TRUE, eval=FALSE}
arrayQualityMetrics(expressionset=eset.ofc, intgroup=c('fetal'),
outdir='./results/report_OFC', force=TRUE, do.logtransform=FALSE)
```
### Expression data clusters into 'fetal' and 'postnatal' sample groups
To measure the degree of (dis) similarity between samples, a distance matrix was computed. Each sample is represented as a vector of expression with vector length equating to the total number of probes. The euclidian distance was used to measure the similarity between sample vectors and then hierarchical clustering was applied to evaluate how samples group together based on similarities. The dendrogram below demonstrates a high degree of similarity among fetal samples and likewise with postnatal samples. There are two fetal samples that cluster with postnatal samples. These same samples also appear as outliers in the heatmap provided in the [QC report](./results/report_OFC_march2014/index.html#S1).
```{r clusteringDendro, echo=FALSE, fig.align='center', fig.height=12, fig.width=10, warning=FALSE, message=FALSE}
require(ggdendro)
meta_new$Stage_Name <- sapply(meta_new$Stage, function(x)
age.table[which(age.table[,1] == x), 2],
USE.NAMES=FALSE)
pData(eset.ofc) <- meta_new
x <-eset.ofc
meta.x <- pData(x)
myDist <- dist(t(exprs(x)))
myTree <-hclust(myDist)
dhc <- as.dendrogram(myTree)
ddata <- dendro_data(dhc, type="rectangle")
ddata$labels <- merge(ddata$labels, meta.x, by.x="label", by.y="row.names")
ggplot(segment(ddata)) +
geom_segment(aes(x=x, y=y, xend=xend, yend=yend)) +
theme_dendro() +
geom_text(data=label(ddata), aes(x=x, y=y, label=Stage_Name, color=fetal, hjust=-0.1), size=4) +
coord_flip() + scale_y_reverse(expand=c(0.2, 50)) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank())
```
### Density plots
Another means of outlier detection is to assess the [distribution of expression](./results/report_OFC_march2014/index.html#S2) for each array. Both the boxplots and density distribution illustrate consistency in shape and range across arrays. Density plots are smoothed histograms, and for microrarray data we typically observe a peak on the low end representing background signal, with frequency tapering off at the higher expression values. A [small bump](http://nebc.nerc.ac.uk/nebc_website_frozen/nebc.nerc.ac.uk//tools/bioinformatics-docs/other-bioinf/microarray-quality.html#dens) in expression may appear at higher values representative of 'true' signal. For these denisty plots we see a typical curve with the bump followed by a fairly heavy right tail, suggesting unusually high signal.
```{r image2 , fig.align='center', echo=FALSE}
img2 <- readPNG("./results/report_OFC_march2014//dens.png")
grid.raster(img2)
```
```{r rawQC, eval=FALSE, echo=FALSE}
## This code is NOT RUN. Below is an evaluation of a random sampling of raw data. This analysis was undertaken to investigate whether the unusually high signal was a consequnce of normalization procedures. Since it did not result in anything conclusive it is not included in the final report.
## Quick check with raw data
# Load libraries
require(oligo)
require(pd.huex.1.0.st.v2)
# Get data
celFiles <- list.files(file.path(dataDir, 'geo_march2014/CEL'), full.names=TRUE)
affyRaw <- read.celfiles(celFiles, verbose=FALSE)
# Get metadata
samples <- sapply(celFiles, function(x){
s <- strsplit(x, "/")[[1]][5]
strsplit(s, "_")[[1]][1]}, USE.NAMES=FALSE)
covars <- pheno_gene[which(rownames(pheno_gene) %in% samples),]
covars[,"BrainCode"] <- factor(as.character(covars[,"BrainCode"]))
colnames(affyRaw) <- rownames(covars)
pData(affyRaw) <- covars
kable(covars, format="markdown")
arrayQualityMetrics(expressionset=affyRaw,
outdir=file.path(resultsDir, 'report_raw_CEL'),
force=TRUE,
do.logtransform=TRUE,
intgroup=c("BrainCode"))
## QC raw data: pre-normalization
# The raw data seems better than what we obtained from GEO, with higher signal intensities. Two samples show a slightly wider distribution and another is skewed to the left. Will check how this changes after normalization.
img3 <- readPNG("./results/report_raw_CEL/dens.png")
grid.raster(img3)
geneSummaries <- rma(affyRaw, target="core", background=T, normalize=T)
## QC raw data: post-normalization
# Repeat the previous QC using the normalized data, and we see that the distributions are similar to what we had found originally pulled from GEO. One option is removing those wide distribution samples.
arrayQualityMetrics(expressionset=geneSummaries,
outdir=file.path(resultsDir, 'report_rma.core'),
force=TRUE,
do.logtransform=FALSE,
intgroup=c("BrainCode"))
img4 <- readPNG("./results/report_rma.core/dens.png")
grid.raster(img4)
```
## DE Analysis: linear modeling including RIN and PMI as covariates
In the [Kang et al](http://www.nature.com/nature/journal/v478/n7370/full/nature10523.html) study, an ANOVA was used to evaluate genes that were temporally regulated. Age (Stage) was modeled as a continuous variable and both RIN and PMI were included as covariates. To stay consistent, we performed the same analysis using only the Orbitofrontal cortex data.
```{r anova}
# Remove NA values
meta_new <- meta_new[which(!is.na(meta_new$PMI)),]
data_new <- exprs(eset.ofc)[,rownames(meta_new)]
# Update expression set
exprs(eset.ofc) <-data_new
pData(eset.ofc) <- meta_new
# Model fit
mod<-model.matrix(~Stage + PMI + RIN, pData(eset.ofc))
fit<-lmFit(eset.ofc, mod)
fit<-eBayes(fit)
# Get results
topStage <-topTable(fit,coef=2,number=nrow(exprs(eset.ofc)), adjust.method="BH")
topStage$threshold <- as.logical(topStage$adj.P.Val < 0.01)
topPMI<-topTable(fit,coef=3,number=nrow(exprs(eset.ofc)), adjust.method="BH")
topRIN<-topTable(fit,coef=4,number=nrow(exprs(eset.ofc)), adjust.method="BH")
```
### Significant genes
For each probe, the model fit results in a p-value which is generated based on the t-statistic for the coefficent. P-values are corrected for multiple testing and the FDR is reported for each probe. At a very liberal FDR of 0.05, there are **alot of differentially expressed genes with Age/Developemntal Stage** (`r length(which(topStage$adj.P.Val < 0.05))` genes). There are fewer genes associated with RIN (`r length(which(topRIN$adj.P.Val < 0.05))` genes) and none associated with postmortem interval at that same threshold. The p-value distributions are plotted below for each of the three factors in the model (Stage, PMI and RIN). For Stage and RIN there are a _large proportion_ of genes with low p-values, suggesting that a large majority of genes are differentially expressed.
```{r pval-hist, echo=FALSE, fig.align='center'}
par(mfrow=c(2,2))
hist(topStage$P.Value, col="grey", border=F, main="Stage", xlab="P-value")
hist(topPMI$P.Value, col="grey", border=F, main="PMI", xlab="P-value")
hist(topRIN$P.Value, col="grey", border=F, main="RIN", xlab="P-value")
```
### Developmental expression changes also identify with PMI and RIN
Since the two factors are correlated, there is a possibility that they are confounded making it hard to distinguish whether expression changes are due to developmental stage or RIN. Taking the top 6 probes that are affected by Stage, we plotted expression values against age, PMI and RIN. Although the changes are not identical, we still see a similarity in the trend. This indicates that factors need to be better corrected for to ensure we are identifying expression changes truly associated only with Devlopmental Stage.
```{r topgenes, echo=FALSE}
ordered <- topStage[order(topStage$adj.P.Val),]
# Subset expression data to genes of interest
exp.sub <- data_new[row.names(ordered)[1:6], ]
meta.sub <- meta_new[order(meta_new$Stage),]
exp.sub <- exp.sub[,rownames(meta.sub)]
# Merge with phenotype information
df <- melt(exp.sub)
df <- merge(df, meta.sub, by.x='X2', by.y='row.names')
```
```{r topPlot, echo=FALSE}
p1 <- ggplot(df, aes(x=Stage, y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title.x = element_blank(),
plot.margin = unit(c(1, 0, 1, 1), "lines")) +
scale_y_continuous(limits = c(3, 13), oob=rescale_none) +
ggtitle('Age') +
ylab('Expression values')
p2 <- ggplot(df, aes(x=PMI, y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title = element_blank(),
axis.text.y = element_blank(),
plot.background = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = unit(c(1, 1, 1, 0), "lines")) +
scale_y_continuous(limits = c(3,13), oob=rescale_none) +
ggtitle('Postmortem Interval')
# Set side-by-side
gt1 <- ggplot_gtable(ggplot_build(p1))
gt2 <- ggplot_gtable(ggplot_build(p2))
newWidth = unit.pmax(gt1$widths[2:3], gt2$widths[2:3])
# Set new size
gt1$widths[2:3] = as.list(newWidth)
gt2$widths[2:3] = as.list(newWidth)
# Arrange
grid.arrange(gt1, gt2, ncol=2)
```
```{r topPlot2, echo=FALSE}
p1 <- ggplot(df, aes(x=Stage, y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title.x = element_blank(),
plot.margin = unit(c(1, 0, 1, 1), "lines")) +
scale_y_continuous(limits = c(3, 13), oob=rescale_none) +
ggtitle('Age') +
ylab('Expression values')
p2 <- ggplot(df, aes(x=RIN, y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title = element_blank(),
axis.text.y = element_blank(),
plot.background = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = unit(c(1, 1, 1, 0), "lines")) +
scale_y_continuous(limits = c(3,13), oob=rescale_none) +
ggtitle('RIN')
# Set side-by-side
gt1 <- ggplot_gtable(ggplot_build(p1))
gt2 <- ggplot_gtable(ggplot_build(p2))
newWidth = unit.pmax(gt1$widths[2:3], gt2$widths[2:3])
# Set new size
gt1$widths[2:3] = as.list(newWidth)
gt2$widths[2:3] = as.list(newWidth)
# Arrange
grid.arrange(gt1, gt2, ncol=2)
```