Using the t.test
function:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
We will focus on two sample unpaired t-test, assuming unequal variances, as this is the most common scenario in proteomics. Using a paired test when appropriate is essential, as it will substantially increase your test power.
We are going to use the rnorm
function in this an the next section
to quickly generate normally distributed data. Its inputs are
n
: the number of data points to be generated;mean
: the mean of the normal distribution to draw the data from (default is 0);sd
: the standard deviation of the normal distribution to draw the data from (default is 1).
-
Generate 200 numbers drawn from a normal distribution of mean 0 and standard deviation 1. Verify that the parameters of the randomly data are correct. What figure would you use to visualise such data?
-
Same as above for a normal distribution of mean 2 and standard deviation 0.5.
-
Compare your values with your neighbour's. Are they identical?
Let's now apply a t-test on two sets of values drawn from identical and different distributions:
t1 <- t.test(rnorm(5), rnorm(5))
t1
##
## Welch Two Sample t-test
##
## data: rnorm(5) and rnorm(5)
## t = -0.20081, df = 7.9673, p-value = 0.8459
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.969972 1.654570
## sample estimates:
## mean of x mean of y
## 0.2301373 0.3878384
t2 <- t.test(rnorm(5), rnorm(5, mean = 4))
t2
##
## Welch Two Sample t-test
##
## data: rnorm(5) and rnorm(5, mean = 4)
## t = -11.721, df = 6.0162, p-value = 2.283e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.144158 -3.368255
## sample estimates:
## mean of x mean of y
## -0.5893708 3.6668359
What we see above is a pretty output that is convenient to visualise
interactively. The output of the t.test
is an object of class
htest, which contains the following values:
## [1] "statistic, parameter, p.value, conf.int, estimate, null.value, alternative, method, data.name"
We can extract any of these with the $
accessor
t2$p.value
## [1] 2.282681e-05
When using ratio data (as in SILAC or 15N), one would use a one sample t-test.
logsilac <- rnorm(3)
t.test(logsilac, mu = 0)
##
## One Sample t-test
##
## data: logsilac
## t = 2.2358, df = 2, p-value = 0.1549
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.4051798 1.2817941
## sample estimates:
## mean of x
## 0.4383072
Let's use the mulvey2015
dataset, introduced previously, and focus
on time points 1 and 6.
- How can we conveniently use the sample metadata to create the relevant subset?
library("pRolocdata")
data(mulvey2015)
time16 <- mulvey2015[, mulvey2015$times %in% c(1, 6)]
head(exprs(time16))
## rep1_0hr rep1_XEN rep2_0hr rep2_XEN rep3_0hr rep3_XEN
## P48432 0.486 0.058 0.446 0.017 0.525 0.033
## Q62315-2 0.388 0.068 0.345 0.065 0.452 0.038
## P55821 0.349 0.055 0.314 0.034 0.436 0.055
## P17809 0.321 0.101 0.303 0.093 0.414 0.052
## Q8K3F7 0.363 0.047 0.297 0.045 0.395 0.019
## Q60953-2 0.371 0.039 0.314 0.046 0.386 0.057
- Use the
t.test
function to test P48432 for differences in time points 1 and 6.
t.test(exprs(time16)[1, time16$time == 1],
exprs(time16)[1, time16$time != 1])
##
## Welch Two Sample t-test
##
## data: exprs(time16)[1, time16$time == 1] and exprs(time16)[1, time16$time != 1]
## t = 17.471, df = 3.0184, p-value = 0.0003943
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3680382 0.5312951
## sample estimates:
## mean of x mean of y
## 0.4856667 0.0360000
In high throughput biology, we have to repeat our tests over every feature (transcript, protein, ...). As we are using a programming language, this is something easy to do.
Below, we use the apply function, that will iterate a function over all elements of its input.
time1 <- time16$time == 1
time6 <- time16$time != 1
## first attempt
pv <- apply(exprs(time16), 1,
function(x) t.test(x[time1], x[time6]))
## second attempt
pv <- apply(exprs(time16), 1,
function(x) t.test(x[time1], x[time6])$p.value)
We now have calculated a p-value for each of the 2337 proteins in the data; let's add them to the feature metadata slot.
fData(time16)$p.value <-
apply(exprs(time16), 1,
function(x) t.test(x[time1], x[time6])$p.value)
fData(time16)$fc <-
apply(exprs(time16), 1,
function(x) mean(x[time1])/mean(x[time6]))
fData(time16)$lfc <- log2(fData(time16)$fc)
We could also use the rowttests
function from the genefilter
package to run a t-test on all rows at once.
There are 3 important factors to consider when assessing the results of a test for differential expression:
- The significance of the test, i.e. the p-values
- The magnitude of the change, i.e. the fold-change
- The (average) intensity of the measurements
MAplot(time16)
plot(fData(time16)$lfc, -log10(fData(time16)$p.value))
See this section for details.
Applying this to our data, we obtain
hist(fData(time16)$p.value)
library("qvalue")
fData(time16)$q.value <- qvalue(fData(time16)$p.value)$qvalue
plot(fData(time16)$lfc, -log10(fData(time16)$q.value))
summary(fData(time16)$q.value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.320e-06 1.203e-03 2.535e-03 1.064e-02 9.559e-03 8.990e-02
Two values are used when computing a t statistics: the effect size (i.e. difference in means) and the sample standard deviations. Any uncertainty on these values will have a negative impact on the test results.
For example, in case of low quantitation values, the difference in means can fluctuate quite substantially due to random variations and favour misleading large effect sizes and false positive.
Another example is small sample size, which can bias the estimation of the sample standard deviations. To address this, moderated t statistics compute variance estimates that are a compromise between global variance estimate and individual gene-wise variance estimates.
From the limma reference:
This procedure is implemented in the limma software package (Ritchie et al., 2015) and the resulting EB [Empirical Bayes] tests have been shown to offer improved statistical power and false discovery rate (FDR) control relative to the ordinary gene-wise t-tests, especially when the sample sizes are small (Kooperberg et al., 2005; Murie et al., 2009; Ji and Liu, 2010; Jeanmougin et al., 2010). The limma software has been used successfully in thousands of published biological studies using data from a variety of genomic technologies, especially studies using expression microarrays and RNA-seq.
The limma comes with extensive documentation, available
with the limmaUsersGuide()
function. While is doesn't explicitly
mention proteomics, its methodology is applicable to proteomics data.
As discussed in the lecture, count data cannot be handled using a test for continuous data. One could log-transform the data (adding one to the data to keep 0 counts). Alternatively, using a dedicated count distribution has proved successful. Methods originally developed for high throughput sequencing data, have benefited from tremendous development within the Bioconductor project, and can be readily applied to proteomics count data.
The msmsTests package applies various such count-based
tests on MSnSet
objects containing spectral counting data. The
package provides a test data msms.spk
, described as follows in the
manual page:
A MSnSet with a spectral counts in the expression matrix and a
treatment factor in the phenoData slot.
The spectral counts matrix has samples in the columns, and
proteins in the rows. Each sample consists in 500ng of standard
yeast lisate spiked with 100, 200, 400 and 600fm of a mix of 48
equimolar human proteins (UPS1, Sigma-Aldrich). The dataset
contains a different number of technical replicates of each
sample.
library("msmsTests")
data(msms.spk)
- Familiarise yourself with the experimental design of this dataset. Hint: look at the phenoData slot.
pData(msms.spk)
## treat
## Y500U100_001 U100
## Y500U100_002 U100
## Y500U100_003 U100
## Y500U100_004 U100
## Y500U200_001 U200
## Y500U200_002 U200
## Y500U200_003 U200
## Y500U200_004 U200
## Y500U200_010 U200
## Y500U200_011 U200
## Y500U400_002 U400
## Y500U400_003 U400
## Y500U400_004 U400
## Y500U600_001 U600
## Y500U600_002 U600
## Y500U600_003 U600
## Y500U600_004 U600
## Y500U600_005 U600
## Y500U600_006 U600
table(msms.spk$treat)
##
## U100 U200 U400 U600
## 4 6 3 6
- How many samples and proteins are there in the data
dim(msms.spk)
## [1] 685 19
- Look at the distribution of all proteins and compare it to the spike
in proteins. The spikes all contain the suffix
"HUMAN"
that can be extracted with the grep function.
boxplot(exprs(msms.spk))
spks <- grep("HUMAN", featureNames(msms.spk))
boxplot(exprs(msms.spk[spks, ]))
We are going to model the data according to the negative-binomial
distribution, using the implementation of the edgeR
package, which uses an Empirical Bayes method to share information
across features and is this particularly relevant with a restricted
number of replicates. We will focus on the U200
and U600
conditions.
e <- msms.spk[, msms.spk$treat %in% c("U200", "U600")]
table(e$treat)
##
## U100 U200 U400 U600
## 0 6 0 6
We now also need to remove proteins that are left with only 0.
e <- filterZero(e, pNA = 0.99)
We can run the NB spectral counts differential test using the
msms.edgeR
, providing
- an
MSnSet
, heree
- an alternative and null hypothesis,
H1
(there is a treatment effect) andH0
(there is no effect, the expression is essentially constant) - the groups,
e$treat
- a column-wise scaling offset (optional)
H0 <- "y ~ 1"
H1 <- "y ~ treat"
## normalising condition
div <- colSums(exprs(e))
res <- msms.edgeR(e, H1, H0,
fnm = "treat",
div = div)
head(res)
## LogFC LR p.value
## YKL060C 0.12787365 1.313108448 2.518326e-01
## YDR155C -0.18013963 8.799648268 3.012886e-03
## YOL086C 0.42632911 6.941099054 8.423734e-03
## YJR104C 0.01211616 0.006652729 9.349932e-01
## YGR192C -0.37239199 27.241017293 1.796076e-07
## YLR150W -0.26921988 12.941054872 3.214538e-04
- Inspect the p-values distribution and, if relevant, adjust as demonstrated above.
hist(res$p.value, breaks = 50)
library("multtest")
adj <- mt.rawp2adjp(res$p.value)
res$BH <- adj$adjp[order(adj$index), "BH"]
## with(res, plot(LogFC, -log10(BH)))
- Visualise the results on a volcano plot
sig <- res$BH < 0.01
plot(res$LogFC, -log10(res$BH),
col = ifelse(sig, "red", "black"),
pch = ifelse(grepl("HUMAN", featureNames(e)),
19, 1))
- Estimate the number true/false positives and negatives and an alpha level of 0.01.
sig <- res$BH < 0.01
TP <- length(grep("HUMAN", rownames(res)[sig]))
FP <- sum(sig) - TP
FN <- length(grep("HUMAN", rownames(res)[!sig]))
TN <- sum(!sig) - FN
stopifnot(TN + FP + FN + TP == nrow(e))
tab <- data.frame(TP, FP, TN, FN)