-
Notifications
You must be signed in to change notification settings - Fork 3
/
brainspan_part2.Rmd
440 lines (341 loc) · 24.3 KB
/
brainspan_part2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
---
title: "Smoller Brainspan Part II"
output:
html_document:
theme: cosmo
toc: true
toc_depth: 4
fig_caption: true
fig_width: 8
fig_height: 6
author: "Meeta Mistry"
---
```{r setup, echo=FALSE}
# Setup report details
clientname="Erin Dunn"
clientemail="[email protected]"
lablocation="MGH"
analystname="Meeta Mistry"
analystemail="[email protected]"
```
Array analysis for `r clientname` (`r clientemail`) at `r lablocation`. Contact `r analystname` (`r analystemail`) for additional details.
> In Part I, we verified that the problem of confounding variables exists within the [Kang et al](http://www.nature.com/nature/journal/v478/n7370/full/nature10523.html) developmental [expression data](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25219) that we are using. In the next stages we aim to remove the effects of confounders by applying various methods which include _isva_ best described in [Teschendorff A.E. et al., 2011](http://bioinformatics.oxfordjournals.org/content/27/11/1496.long) and _RUV_ (remove unwanted variation) a method described in more detail [Gagnon-Bartsch J.A., 2011](http://biostatistics.oxfordjournals.org/content/13/3/539.short).
**Note: ISVA is not included in the final HTML report but code remains in the markdown file.**
## Workflow
* Download the most up-to-date version of the GEO data
* Identify a set of negative control genes
* RUV results and interpretations
* Considerations
## Setup
### Bioconductor and R libraries used
```{r libraries, echo=TRUE}
loadlibs <- function(){
library(ggplot2)
library(gtable)
library(scales)
library(RColorBrewer)
library(GEOquery)
library(affy)
library(arrayQualityMetrics)
library(reshape)
library(xtable)
library(isva)
library(limma)
library(Biobase)
library(gridExtra)
library(CHBUtils)
library(png)
library(stringr)
library(dplyr)
library(ruv)
}
suppressPackageStartupMessages(loadlibs())
```
### Set variables
```{r variables, echo=TRUE}
# Setup directory variables
baseDir <- '.'
dataDir <- file.path(baseDir, "data")
metaDir <- file.path(dataDir, "meta")
resultsDir <- file.path(baseDir, "results")
#covarsfilename <- 'covdesc.txt'
```
## Load the expression data
There are two sources of Brainspan data:
1. Link on [Brainspan](http://www.brainspan.org/static/download.html) website
2. GEO submission which is associated with the publication [Kang et al, Nature 2011](http://www.ncbi.nlm.nih.gov/pubmed/22031440)
The data available for download on the **Brainspan website is missing a large number of samples** compared to the GEO dataset. Correspondence with Ying Zhu ([email protected]) from Nenad Sestan's lab provided us with instruction to use the full dataset on GEO as opposed to the data on Brainspan. The GEO entry GSE25219 was updated since the Part I report and the most recent version of data can be downloaded from GEO [GSE25219](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25219). **Data used in this report was downloaded from GEO on 10/09/2014.**
The expression data obtained from GEO has been pre-processed with RMA, quantile normalized and log2 transformed (as in [Part I](./brainspan_part1.Rmd)). The dataset contains samples from 16 different brain regions, but we focused on the orbitofrontal cortex samples for the remaining analyses in this report.
```{r dataimport GEO, warning=FALSE, message=FALSE}
# Load GEO data
gse_gene <- getGEO(filename=file.path(dataDir, 'geo/GSE25219-GPL5175_series_matrix.txt.gz'))
```
## Load the metadata and extract OFC expression data
```{r metadata from file, echo=TRUE, warning=FALSE, message=FALSE}
pheno_gene <- read.delim(file.path(metaDir, 'brainspan_samples_metadata.txt'), row.names=1)
meta_donor <- read.delim(file.path(metaDir, 'brainspan_donor_metadata.txt'), row.names=1)
# Subset data by region and hemisphere
meta_ofc <- pheno_gene[which(pheno_gene$BrainRegion == "OFC" & pheno_gene$Hemisphere == "R"),]
meta_ofc$PMI <- as.numeric(as.character(meta_ofc$PMI))
meta_ofc$Stage <- factor(meta_ofc$Stage)
# Add data
data_ofc <- exprs(gse_gene)
data_ofc <-data_ofc[,which(colnames(data_ofc) %in% rownames(meta_ofc))]
eset.ofc <- new("ExpressionSet", exprs=data_ofc)
# Add metadata
fetal <- rep("NA", nrow(meta_ofc))
stage.num <- as.numeric(as.character(meta_ofc$Stage))
fetal[which(stage.num <= 7)] <- "Fetal"
fetal[which(stage.num > 7)] <- "Postnatal"
fetal <- factor(fetal)
meta_new <-cbind(meta_ofc, fetal)
meta_new$Stage <- stage.num
meta_new <- droplevels(meta_new)
pData(eset.ofc) <- droplevels(meta_new)
fData (eset.ofc) <- fData(gse_gene)
```
```{r, echo=FALSE, fig.align='center'}
# Age distribution
ggplot(meta_new, aes(factor(Stage), fill=fetal)) +
geom_bar() +
ggtitle('Orbitofrontal Cortex: Age distribution') +
xlab('Developmental Stage') +
ylab('Number of Samples') +
theme(axis.text.x = element_text(colour="grey20",size=15,angle=0,hjust=.5,vjust=.5,face="plain"),
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.25)))
# RIN distribution
ggplot(meta_new, aes(x=factor(Stage), y=RIN, fill=fetal)) +
geom_boxplot() +
ggtitle('Orbitofrontal Cortex: RIN') +
xlab('Stage') +
ylab('RIN ') +
theme(axis.text.x = element_text(colour="grey20",size=15,angle=0,hjust=.5,vjust=.5,face="plain"),
plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.25)))
```
## QC: Comparing GEO versions
We are using a new version of expression data from GEO and we are unsure of what the difference is. The total number of samples is the same between versions, but perhaps there is some thing about the expression that has changed. To gain a better understanding of how the expression data may have changed from the previous version we performed a quick check of the data quality. The new data (OFC only, to stay consistent with Part I) was run through [arrayQualityMetrics](http://www.bioconductor.org/packages/release/bioc/vignettes/arrayQualityMetrics/inst/doc/arrayQualityMetrics.pdf) and diagnostic plots were compared to those generated from Part I GEO data.
Below we plotted the false color heatmaps generated from the two different GEO versions, as this is where we observed a small discrepancy. The left hand plot is the heatmap of the sample-to-sample distances between arrays in the old version. Smaller distances (in blue) indicate the samples are more similar to one another, where distances closer to 1 (yellow) suggest divergence. Patterns in the left hand plot indicate clustering of the arrays by fetal and postnatal age with the _exception of two samples_. The right hand plot shows a false color heatmap using the current (most up to date) version of the data, where the outliers no longer exist. There is definitely some change in the data as samples cluster somewhat differently, although a bit unsure of the source of change. As such, we will continue using the current version.
```{r qc-compare, fig.align='center', echo=FALSE, fig.width=11}
img2 <- readPNG("./results/report_OFC_march2014/hm.png")
img3 <- readPNG("./results/report_OFC/hm.png")
grid.arrange(rasterGrob(img2), rasterGrob(img3), ncol=2)
```
```{r isva, fig.align='center', warning=FALSE, message=FALSE, eval=FALSE, echo=FALSE}
## This code was NOT RUN for this report, although results have been generated
## ISVA: Differential expression analysis
# It makes sense to model the effect of confounding factors on the data as statistically independent random variables, as it better reflects the way the confounding noise is generated. This requires them to be uncorrelated with the primary variable of interest in a non-linear fashion, a stronger condition than the _linear_ uncorrelatedness imposed by an SVD. In order to model CFs as statistically independent variables, isva uses independent component analysis (ICA). Since we are using Stage as an ordinal factor (a special case of categorical), it is going to be problematic to have Stages with only one sample. Therefore we will remove samples from Stages 9 and 10. Alot of differentially expressed genes!
# Remove single sample stages
remove <- which(meta_new$Stage == 9 | meta_new$Stage == 10)
meta.isva <- meta_new[-remove,]
data.isva <- data_ofc[,-remove]
# Idenitify confounding variables
cf.m <- meta.isva[,c('PMI', 'RIN')]
factor.log <- as.logical(rep("FALSE",2))
diseaseStage <- ordered(meta.isva$Stage, levels=c(2:8, 12:15))
# Run ISVA
isva.res <- DoISVA(data.isva, diseaseStage, cf.m = cf.m, factor.log, th=0.001)
hist(isva.res$spv, main="P-value distribution from isva", xlab="P-value", col="grey", border=F)
```
```{r topPlot, echo=FALSE, fig.align='center', echo=FALSE, eval=FALSE}
### Comparing the top expression changes with age to evaluate effects on PMI
# The isava method generates a very large number of significant genes (6118 genes; FDR=0.001). The next step is to evaluate how effectively the isva has removed confounding effects, leaving us with only age-related changes. Below we have taken the top nine genes from the isva result and plotted expression against Age. In the second panel the same genes are plotted to evaluate expression change with PMI. The expression change is almost identical - ordinal can be a special case of continuous, _if_ the categories are equally spaced or perfectly ordered. But in our case, the categories or stages are not equally spaced (i.e., there's not a one unit or some other consistent unit difference between stage 1 and stage 2) so this is not modeled correctly (some top genes don't even show change).
# Get top genes from sorted p-value list
top <- which(isva.res$rk <= 9)
topSub <- data_ofc[top, ]
# Merge with phenotype information
df <- melt(topSub)
df <- merge(df, meta_ofc, by.x='X2', by.y='row.names')
p1 <- ggplot(df, aes(x=as.numeric(Stage), y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title.x = element_blank(),
plot.margin = unit(c(1, 0, 1, 1), "lines")) +
ggtitle('Age') +
ylab('Expression values')
p2 <- ggplot(na.omit(df), aes(x=PMI, y=value)) +
geom_smooth(method=loess) +
facet_wrap(~X1) +
theme(axis.title = element_blank(),
axis.text.y = element_blank(),
plot.background = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = unit(c(1, 1, 1, 0), "lines")) +
ggtitle('Postmortem Interval')
# Set side-by-side
gt1 <- ggplot_gtable(ggplot_build(p1))
gt2 <- ggplot_gtable(ggplot_build(p2))
newWidth = unit.pmax(gt1$widths[2:3], gt2$widths[2:3])
# Set new size
gt1$widths[2:3] = as.list(newWidth)
gt2$widths[2:3] = as.list(newWidth)
# Arrange
grid.arrange(gt1, gt2, ncol=2)
```
## RUV: remove unwanted variation
We next applied [RUV](http://www.stat.berkeley.edu/~johann/ruv/) to remove unwanted variation in the data - namely to see if we could remove the effects of confounding variables RIN, PMI, and other possible non-measured confounders. The strategy underlying RUV is to use negative control genes and identify a set of K surrogate variables based on the expression of those genes. These surrogate variables are incorporated into the linear model as covariates to correct for unwanted variation. The negative control genes are genes whose expression levels are known a priori to be truly unassociated with the biological factor of interest. In our case, genes that we know a priori do not change in their expression over the course of development.
Our original strategy was to use _housekeeping genes_ as provided in [Eisenberg E., and Levanon E.Y.](http://www.ncbi.nlm.nih.gov/pubmed/12850439) as our list of _negative control genes_. The RUV starter analysis generates two panels of diagnostic plots, 1) for all genes that are being assayed and 2) for only the nagtive control gene set (please see the [how-to manual](http://www.stat.berkeley.edu/~johann/ruv/howto/howto.html) for step-by-step description and interpretation). These panels are generated for all levels of K specified. Using these diagnostic plots we can then determine what K is best to use (i.e how many surrogate variables to include)
However, the diganostic plots from the RUV starter analysis demonstrated that the **housekeeping genes** behave similar to the rest of the genes assayed. Thus, this was **not a good set of negative controls** and they are picking up some biology along with the unwanted factors (RUV report can be found [here](./ruv_hk_full/index.html) ).
### Finding a better set of negative control genes
Two factors that are adding unwanted variation in our data are RIN and PMI. A good set of negative controls would therefore be to find a set of genes that are **associated with these variables (RIN and PMI) yet unaffected by age**. In our current cohort, RIN and PMI are highly correlated with age (our primary variable of interest), so it is almost impossible to extract changes that are unique to the confounders. However, if we use only postantal samples this correlation disappears. Below we have plotted Stage on the x-axis and PMI and RIN on the y-axis for each sample in the _postnatal only cohort_.
```{r rin-pmi, echo=TRUE, eval=TRUE, fig.align='center'}
# Keep only samples >= Stage 8
samples <- row.names(pData(eset.ofc))[which(pData(eset.ofc)$Stage >= 8)]
eset.nc <- eset.ofc[,samples]
# Plot relationship with Age
p1 <- ggplot(pData(eset.nc), aes(x=Stage, y=PMI)) +
geom_point(shape=1) +
geom_smooth(method=lm) + # Add linear regression line
ggtitle('PMI') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
p2 <- ggplot(pData(eset.nc), aes(x=Stage, y=RIN)) +
geom_point(shape=1) +
geom_smooth(method=lm) + # Add linear regression line
ggtitle('RIN') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
grid.arrange(p1, p2, ncol=2)
```
### Using 'postnatal only' cohort to find negative control genes
Since there is no correlation between our unwanted factors (RIN and PMI) and our primary variable of interest (Stage) in the postnatal only cohort, the gene set we identify is less likely to be age associated. We cannot be 100% sure as there still may be age/Stage related changes that across the postnatal span of life that overlap with changes with RIN and PMI.
We ran two linear models using the 'postnatal only' cohort, one for RIN and one for PMI. Since both are continuous variables we are essentially performing a linear regression for each probe. The resulting p-value for each probe tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value indicates that you can reject the null hypothesis and changes in the predictor's value are related to changes in the response variable (gene expression). The p-values generated for each association test (~17,000 probes) are plotted using the histogram below. The p-value is on the x-axis, and the y-axis represents the frequency of each p-value.
Results from these analyses suggest that there are no genes that appear to differ substantially in expression as a function of PMI (plot on left). While the plot on the right suggests there is a small subset of genes that change as a function of RIN.
```{r negctl, fig.align='center', fig.width=12, echo=FALSE, warning=FALSE, message=FALSE}
# model PMI
mod <- model.matrix(~PMI, pData(eset.nc))
fit<-lmFit(eset.nc, mod)
fit<-eBayes(fit)
topPMI<-topTable(fit,coef=2,number=nrow(exprs(eset.ofc)), adjust.method="BH")
p1 <- ggplot(topPMI, aes(x=P.Value)) +
geom_histogram(color="black", fill="grey") +
ggtitle('PMI') +
xlab('P-value') +
ylab('') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
# model RIN
mod <- model.matrix(~RIN, pData(eset.nc))
fit<-lmFit(eset.nc, mod)
fit<-eBayes(fit)
topRIN<-topTable(fit,coef=2,number=nrow(exprs(eset.ofc)), adjust.method="BH")
p2 <- ggplot(topRIN, aes(x=P.Value)) +
geom_histogram(color="black", fill="grey") +
ggtitle('RIN') +
xlab('P-value') +
ylab('') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
grid.arrange(p1, p2, ncol=2)
```
To complement the associations (or lack thereof) found via linear modeling, we also computed correlations. For each probe we measure the correlation between 1) RIN and expression and 2) PMI and expression. A histogram of correlation values are plotted below. Correlations with PMI form a narrow distribution centered around zero, indicating that most genes have little or no correlation with PMI. With RIN we see a much broader distribution of correlation values, indicating genes show a higher magnitude of association with RIN.
```{r correlations, fig.align='center', fig.width=15, echo=FALSE, warning=FALSE, message=FALSE}
cor.PMI <- apply(exprs(eset.nc), 1, function(x){cor(x, pData(eset.nc)$PMI)})
cor.RIN <- apply(exprs(eset.nc), 1, function(x){cor(x, pData(eset.nc)$RIN)})
df <- data.frame(cor.PMI, cor.RIN, row.names=names(cor.PMI))
p1 <- ggplot(df, aes(x=cor.PMI)) +
geom_histogram(color="black", fill="grey") +
ggtitle('PMI') +
xlab('Correlation') +
ylab('') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
p2 <- ggplot(df, aes(x=cor.RIN)) +
geom_histogram(color="black", fill="grey") +
ggtitle('RIN') +
xlab('Correlation') +
ylab('') +
theme(plot.title = element_text(size = rel(2.0)),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
grid.arrange(p1, p2, ncol=2)
```
**Our new negative control set of genes is generated by selecting the top 50 genes ranked by p-values obtained from linear modeling. We will refer to this set as 'BrainSpan negative controls'**
### Validate Brainspan negative controls
To evaluate if this new set of genes serve as 'better' negative controls, we need to see if and how much they are driving age/Stage related changes. To do this revert back to the expression data from the full cohort (fetal + postnatal samples) and extract only rows corresponding to our top genes. Using this subset of expression data we perform a PCA as we are interested in finding the directions (components) that maximize the variance in our dataset (right plot). Using all genes on the array (left plot) we see that there is a clear demarkation between fetal and postnatal samples (i.e the first PC is expalained by fetal versus postnatal). Using only the top RIN associated genes there is still separation of samples, but to a lesser extent.
_Note: It appears that many genes change in expression between fetal and postnatal stage of development, and finding the perfect set of negative control genes will be difficult. We may want to consider using only postnatal samples._
```{r pcacheck, echo=FALSE, fig.align='center', fig.width=12}
## All genes PCA
myPca <- prcomp(t(exprs(eset.ofc)))
# Plot first factor of interest
tmpPCAData <- as.data.frame(myPca$x[,1:5])
par(mfrow=c(1,2))
plot(PC2 ~ PC1, data=tmpPCAData, col=c("black", "red")[pData(eset.ofc)[,"fetal"]],
pch=19, main="All genes")
legend("top", inset=.05, title="", legend=levels(pData(eset.ofc)[,"fetal"]), fill=c('black', 'red'), horiz=FALSE)
## Negative control genes PCA
topranked <- which(row.names(exprs(eset.ofc)) %in% row.names(topRIN)[1:50])
myPca <- prcomp(t(exprs(eset.ofc)[topranked,]))
# Plot first factor of interest
tmpPCAData <- as.data.frame(myPca$x[,1:5])
plot(PC2 ~ PC1, data=tmpPCAData, col=c("black", "red")[pData(eset.ofc)[,"fetal"]],
pch=19, main="Negative control genes")
legend("top", inset=.05, title="", legend=levels(pData(eset.ofc)[,"fetal"]), fill=c('black', 'red'), horiz=FALSE)
```
### RUV with Brainspan negative controls
We ran the RUV starter analysis as we had done previously, this time using the Brainspan negative controls rather than the housekeeping genes. Since we don't really have a healthy mix of PMI and RIN among the patients within each stage, we also changed our primary variable from continous (Stage) to a two-group comparison (Fetal vs. Postnatal).
```{r setup-for-RUV, echo=TRUE}
# Get X and Y
mod <- model.matrix(~fetal, pData(eset.ofc))
X <- as.matrix(mod[,2])
Y <- t(exprs(eset.ofc))
# Assign negative control genes
ctl<-rep("FALSE", nrow(exprs(eset.ofc)))
ctl[topranked]<-"TRUE"
ctl<-as.logical(ctl)
# Extract gene information into columns
geneCol <- as.character(fData(eset.ofc)[,'gene_assignment'])
geneNames <- sapply(geneCol, function(x){strsplit(as.character(x), "//")[[1]][2]})
geneNames <- str_trim(geneNames)
geneDesc <- sapply(geneCol, function(x){strsplit(as.character(x), "//")[[1]][3]})
geneChr <- sapply(geneCol, function(x){strsplit(as.character(x), "//")[[1]][4]})
geneChr <- str_trim(geneChr)
geneinfo <- data.frame(geneNames, geneDesc, geneChr, row.names=row.names(fData(eset.ofc)))
```
```{r ruvstarter, eval=FALSE}
# A quick first look at the data
ruv_starter_analysis(Y, X, ctl, geneinfo = t(geneinfo))
# Retry and use kset to indicate specific K
ruv_starter_analysis(Y, X, ctl, geneinfo = t(geneinfo), kset=c(10:15), do_ruv4 = F, do_ruvinv = F, do_ruvrinv = F)
```
### Results and interpretation
The RUV starter analysis produces diagnostic plots to determine how well our negative control genes worked. We will reference a few key figures below but the full report can be found [here](./ruv_negctl_full/index.html). The plots look better than those produced with the housekeeping genes, but still not ideal. The relative log expression (RLE) plots, are boxplots of expression within each sample. Below you see that while the plots for the whole expression data stay fairly constant (left plot), the expression for control genes is quite variable (right plot. This demonstrates that the gene set is not just a random sample from the array.
```{r rle-plot, fig.align='center', echo=FALSE}
imgrle_Y <- readPNG("./ruv_negctl_full/general/rle/rleY.png")
imgrle_Yc <- readPNG("./ruv_negctl_full/general/rle/rleYc.png")
grid.arrange(rasterGrob(imgrle_Y), rasterGrob(imgrle_Yc), ncol=2)
```
The cancor plots are mainly useful for giving you a sense of what's going on in the data, and giving you some circumstantial evidence. This plot shows, for each value of K (number of surrogate variables to consider), the square of the first canonical correlation between X (Stage; our primary variable of interest) and the first K left singular values of Y (black) and Ynegctl (green). It seems at lower K ( < 10 ) is where we see the green curve stay low while the black line jumps up. This indicates that at lower K the negative controls are _less correlated_ with X. Whether this means that they are relatively _uninfluenced by X is not conclusive_.
The cancor^2 of about 0.7 is, by absolute standards, very high. So while our procedure for selecting negative controls helped a bit, they are in fact still quite influenced by X. There's another possible explanation, which is that the negative controls are good, and aren't influenced by X, and the only reason they are correlated with X is because the unwanted variation is itself highly correlated with X.
```{r cancor-plot, fig.align='center', echo=FALSE}
imgcancor <- readPNG("./ruv_negctl_full/general/cancor/cancor.png")
grid.raster(imgcancor)
```
### Considerations
* RUV may not be able to remove the confounding effects if the unwanted variation is highy correlated with our primary variable of interest. RIN values correlate with X at 0.5. How high is high?
* Another method for finding negative control genes is to use genes correlated with RIN but _not_ correlated with Stage. Is there anything _not_ correlated with Stage?
* Incorporate positive control genes
* Try RUV-4. RUV-2 is pretty sensitive to "bad" negative controls. As for RUV-4, I'd say stick to K=1 or 2, at least for now.
```{r ruv-2, echo=FALSE, eval=FALSE}
## This code can be run once an appropriate K has been determined. From the RUV-2 run we can remove the effects of surrogate variables to get 'cleaned' data for use in downstream analysis
# RUV-2
ruv.10 <- RUV2(Y, X, ctl, k=10, Z = 1, v = NULL, fullW = NULL, inputcheck = TRUE)
# Histograms
hist (ruv.10$p, col="grey", border=F, xlab="P-value", main="RUV2 with Stage as continuous factor")
# Get SVA object
svaobj<-as.matrix(ruvfit$W)
# Use cleaning function
regressClean<-function(y,mod, svaobj,P=ncol(mod)) {
X=cbind(mod,svaobj)
Hat=solve(t(X)%*%X)%*%t(X)
beta=(Hat%*%t(y))
cleany=y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
return(cleany)
}
```