-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path2_03_comparisons_p-values.Rmd
executable file
·233 lines (144 loc) · 20.5 KB
/
2_03_comparisons_p-values.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
# Statistical comparisons {#statistical-comparisons}
## Making comparisons
Scientific inquiry often requires us to evaluate predictions about differences between populations. The simplest such case involves just two populations, e.g.
‘Does enzyme activity differ among control and drug-treated cell lines?’
‘Do maize plants photosynthesise at different rates at 25°C and 20°C?’
'Do purple and green plants differ in their biomass?'
In this setting we're evaluating whether or not two *statistical populations* are different in some way. To do this, we have to step through exactly the same kind of process discussed in the last few chapters.
This chapter demonstrates how to compare two populations by employing ideas like null hypotheses and *p*-values. The goal is not really to learn how to compare populations---instead, we want to continue developing our understanding of significance tests and evaluate predictions.
As we do this we're going to introduce a new component of the frequentist machine called a 'test statistic'. This is an important idea---it will crop up every time we introduce a new type of statistical test.
### A new example {#morph-weights-eg}
Let's continue with the familiar purple and green plants example. This time, instead of working with morph frequencies, we'll consider plant size. We introduce this by stepping through the process introduced in the [learning from data](#learning-from-data) chapter:
```{r, include=FALSE}
morph_data <- readr::read_csv(file = "./data_csv/MORPH_DATA.CSV")
sum_stat <-
morph_data %>%
group_by(Colour) %>%
summarise(mean = mean(Weight), sd = sd(Weight), n = n())
```
**Question, hypothesis and prediction:** We think the purple and green morph plants might be different in some way. Specifically, we want to address the question: Why have the purple morphs increased in frequency in the new population? Our working hypotheses is that purple plants are fitter than green plants. Since fitness is closely related to seed production, and seed production is typically positively correlated with plant size, we predict that purple morphs should be larger in the new population.
**Statistical populations:** What statistical populations are we interested in? This time, we will conceive of each morph as its own statistical population. That measn there are now two distinct statistical populations in play: a purple plant population and a green plant population. This change of focus from our earlier analysis where we considered all the plants as one population is perfectly valid. Statistical populations are not really fixed things---they are defined by the problem we're working on.
**Variables:** Which variable should we study? An obvious way to address a prediction of size differences of plants is to measure the weight of individuals of each morph in our samples. To be more precise, 'dry weight biomass' is the standard, reliable measure of how 'big' a plant is. Dry weight mass is a numeric variable measured on a ratio scale (zero really does mean 'nothing').
**Population parameters:** Our prediction is that purple morphs are larger than green morphs. What do we actually mean by that? We probably don't mean that every purple plant is bigger than every green plant. Instead, we need to know if purple plants are *generally* bigger than green ones. This is a statement about 'central tendency'---we want to evaluate whether purple plants are larger than green plants, *on average*. The population parameters of interest are therefore the *mean* dry weights of each morph.
**Gather samples:** The next step is to gather appropriate samples. Since this is a made-up example we'll cut to the chase. We have already seen the samples we're going to use. When we read in the 'MORPH_DATA.CSV' in the previous chapter we found a numeric variable called `Weight`. This contains our dry weight biomass information. The categorical `Colour` variable analysed in the previous chapter tells us which kind of colour morph each observation corresponds to.
OK, now we're ready to tackle the remaining steps in the 'learning from data' process: estimating the population parameter(s), estimate uncertainty and answer the question. We work through each of these steps in detail next.
```{block, type='do-something'}
Once again, best way to understand what follows is to actually work through the example. And once again, you are strongly encouraged to do this... You can carry on developing the script you began in the last chapter or start a new one in the template project for the book examples. Either way, for the code below to work you will need to have downloaded the 'MORPH_DATA.CSV' data file and read this into R, assigning it the name `morph_data`.
```
### Examine the data
The first step is to calculate point estimates of the mean dry weight of each morph. These are our 'best guess' of the population means. It can also be useful to know something about the sample size and variability of dry weight in each sample. We can summarise variability using the standard deviation (`sd`) of each sample. Here is how to do this with **dplyr**:
```{r, message=FALSE}
# using morph data...
morph_data %>%
# ...group the data by colour morph category
group_by(Colour) %>%
# ... calculate the mean, sd and sample size of weight in each category
summarise(mean = mean(Weight),
sd = sd(Weight),
samp_size = n())
```
This shows the mean dry weight of the purple morph is indeed greater than that of green morph. Maybe our hypothesis is true. The standard deviation estimates indicate that the dry weight of purple morphs is also a bit more variable than the green morphs.
These numbers are point estimates taken from limited samples. If we sampled the populations again sampling variation would lead to different estimates. We're not yet in a position to conclude that purple morphs are bigger than green morphs because we don't know if the difference is just down to 'luck'.
We should also visualise the sample distributions of colour morph weights. We could do this in a variety of ways but since we only have two samples we may as well use a fairly information-rich summary. Histograms are a good choice when we have a reasonable amount of data:
```{r two-morph-dist, echo = TRUE, out.width='75%', fig.asp=1, fig.align='center', fig.cap='Size distributions of purple and green morph samples'}
ggplot(morph_data, aes(x = Weight)) +
geom_histogram(binwidth = 50) +
facet_wrap(~Colour, ncol = 1) # <- separate histogram for each colour morph!
```
What does this plot tell us? We're interested in the degree of similarity of the two samples. It looks like purple morph individuals tend to have higher dry weights than green morphs. We already knew this of course. What the plot confirms is that the difference is probably not due to a few extreme cases ('outliers').
So... the histograms confirm there might be a general difference in the size of the two morphs. However, there is also a lot of overlap between the two dry weight distributions. Maybe the difference between the sample means is simply a consequence of sampling variation? We need to construct a statistical test to address this question.
## Constructing a test
```{r, include=FALSE}
set.seed(27081975)
nperm <- 2500
perm.out <- numeric(nperm)
perm.eg <- list()
data.i <- morph_data
ids <- morph_data$Colour
for (i in 1:nperm) {
morph.labels <- sample(ids, replace = FALSE)
perm.out[i] <-
mutate(data.i, Colour = morph.labels) %>%
group_by(Colour) %>% summarise(mean = mean(Weight)) %$% diff(mean)
if (i <= 3) {
perm.eg[[i]] <- morph_data$Weight
names(perm.eg[[i]]) <- morph.labels
}
}
names(perm.eg) <- paste("Sample", 1:3)
```
We're going to develop a frequentist test to evaluate whether the two population means are likely to be different. Regardless of the precise details, this kind of problem involves a four-step process:
1. First, we *assume* there is no difference between the population means. That is, we hypothesize that all the data are sampled from a pair of populations that have the same population mean. To put it another way, we pretend there is really only one population. We know about this trick. It's a statement of the **null hypothesis**.
2. Next, we use information in the real samples to work out what would happen if we were to repeatedly take many new samples in this hypothetical situation of 'no difference'. We summarise this by calculating the null distribution of some kind of **test statistic** (<-- something new).
3. We then ask, "if there were no difference between the two groups, what is the probability that we would observe a difference that is the same as, or more extreme than, the one we observed in the true sample?" We know about this number too. It's a __*p*-value__.
4. If the observed difference is sufficiently improbable, then we conclude that we have found a **statistically significant** result. A statistically significant result is one that is inconsistent with the hypothesis of no difference.
This chain of logic is very similar to that used to construct the statistical test in the previous chapter. The main elaboration here is the introduction of the test statistic. When dealing with comparisons involving more than one parameter we need to work with a single numeric summary of 'the effect of interest', however defined. Its purpose is to summarise the effect as a single number so that we can construct a simple test. That's why it is called the 'test statistic'.
There are many different kinds of test statistic and many different ways to go about realising this process above. We're going to use something called a **permutation test** here. We're not using this because we want you to learn how to apply it. Instead, just as with the bootstrap, the permutation tests offers a simple way to see how frequentist tests work without getting lost in mathematical detail.
### Permutation tests
A hypothesis of 'no difference' between the mean dry weights of purple and green morphs implies the following: the labels 'purple' and 'green' are meaningless because the two morphs are effectively sampled from the same population. These labels don't carry any real information so they may as well have been randomly assigned to each individual.
This suggests we can evaluate the statistical significance of the observed difference as follows:
1. Make a copy of the original sample of purple and green dry weights, but do so by randomly assigning the labels 'purple' and 'green' to this new copy of the data. Do this in such a way that the original sample sizes are preserved. The process of assigning random labels is called **permutation**.
(We have to preserve the original sample sizes because we want to mimic the sampling process that we actually used, i.e. we want to hold everything constant apart from the labelling of individuals.)
2. Repeat the permutation scheme many times until we have a large number of artificial samples; 1000-10000 randomly permuted samples may be sufficient.
3. For each permuted sample, calculate whatever test statistic captures the relevant information. In our example, this is the *difference* between the mean dry weight of purple and green morphs in each permuted sample.
(It doesn't matter which way round we calculate the difference.)
4. Compare the observed test statistic from the real sample to the distribution of sample statistics from the randomly permuted samples.
This scheme is called a permutation test because it involves random permutation of the group labels. Why is this useful? *Each unique random permutation yields an observation from the null distribution of the difference among sample means, under the assumption that this difference is really zero in the population.* We can use this to assess whether or not the observed difference is consistent with the hypothesis of no difference by looking at where it lies relative to this null distribution.
### Carrying out a permutation test
It's ease to implement a permutation test in R. We won't show the code this time because it uses quite a few tricks that won't be needed again. It is worth having a quick look at the permuted samples though. The first 40 values from two permuted samples are:
*Sample 1-*
```{r, echo=FALSE}
head(perm.eg$'Sample 1', 40)
```
*Sample 2-*
```{r, echo=FALSE}
head(perm.eg$'Sample 2', 40)
```
The data from each permutation are stored as numeric vectors, where each element of the vector is named according to the morph type it corresponds to---these are 'the labels' referred to above. The set of numbers does not vary among permuted samples. The *only* difference between them is the labelling which has been randomly assigned in each permuted sample.
The difference between the mean dry weights in the first permutation is `r perm.out[1]`, while in the second sample, the difference is `r perm.out[2]`. What matters for our test is the distribution of these differences over the complete set of permutations. This is the approximation to the sampling distribution of the difference between means under the null hypothesis---i.e. it's our null distribution.
We can visualise this null distribution by making a histogram that summarises the `r nperm` mean differences from the permuted samples:
```{r permute-dist, echo = FALSE, out.width='80%', fig.asp=0.75, fig.align='center', fig.cap='Difference between means of permuted samples'}
ggplot(data.frame(perm.out), aes(x = perm.out)) +
geom_histogram(fill = grey(0.4), binwidth = 6) +
geom_vline(xintercept = diff(sum_stat$mean), colour = "red") +
xlab("Difference")
```
Notice that the distribution is centred at zero. This makes perfect sense---if we take a set of numbers and randomly allocate them to groups, on average, we expect the difference between the means of the groups to be zero.
The red line shows the estimated value of the difference between the mean purple and green morph dry weights *in the real sample*. This is the test statistic. The key thing to notice here is the location of the test statistic within the null distribution.
In this example, the estimated difference lies at the end of the right 'tail' of the null distribution. What does that tell us? It suggests the observed difference (the red line) is unlikely to arise through pure sampling variation when the population means of the two groups really were identical.
So... it looks like the data are inconsistent with the null hypothesis of no difference. What we need is a number to quantify this inconsistency---the all-important *p*-value.
### Calculating the *p*-value {-}
```{r, echo=FALSE}
nhgher <- sum(perm.out >= diff(sum_stat$mean))
nlower <- sum(perm.out <= -diff(sum_stat$mean))
```
We use the null distribution to quantify the probability of ending up with the observed difference under the null hypothesis. Only `r nhgher` out of the `r nperm` permutations ended up being equal to, or 'more extreme' (more positive) than, the observed difference. This means the probability of finding a difference in the means equal to, or more positive than, the observed difference is about, *p* = `r nhgher`/`r nperm` = `r nhgher/nperm`. This is the *p*-value for our significance test.
Let's run through the interpretation of that *p*-value. Here's the general chain of logic again... The *p*-value is the probability of obtaining a test statistic equal to, or 'more extreme' than, the estimated value, assuming the null hypothesis is true. The null hypothesis is one of no effect, so a small *p*-value can be interpreted as evidence for the effect (e.g. a difference between means) being present. That's a weird way to seek evidence for the presence of an effect but that's how frequentist statistics works.
How low does the *p*-value actually have to be before we decide we have 'enough evidence'? Less than 0.01? Less than 0.001? Any threshold is somewhat arbitrary to be honest, but in biology, a significance threshold of *p* < 0.05 is conventionally used. If we find *p* < 0.05, then we conclude that we found a 'statistically significant' effect at the 5% level.
Here's how this logic applies to our example... The permutation test assumed there was no difference between the purple and green morphs, so the low *p*-value indicates that the estimated difference between the mean dry weight of purple and green morphs was unlikely to have occurred by chance *if* there is really no difference at the population level. We interpret the low *p*-value as evidence for the existence of a difference in mean dry weight among the populations of purple and green morphs. Since *p* = `r nhgher/nperm`, we say we found a statistically significant difference at the 5% level.
Here's how we might summarise our analysis in in a written report:
```{r, echo = FALSE}
nG <- filter(sum_stat, Colour == "Green" )$n
nP <- filter(sum_stat, Colour == "Purple")$n
```
> The mean dry weight biomass of purple plants (`r nP`) was significantly greater than that of green plants (`r nG`) (one-tailed permutation test, p<0.05).
We report the sample sizes used, the type of test employed, and the significance threshold we passed (not the raw *p*-value).
```{block, type='advanced-box'}
**Directional tests**
The test we just did is a 'one-tailed' test. It's called a 'one-tailed' because we only looked at one end of the null distribution. This kind of test is appropriate for evaluating directional predictions (e.g. purple > green). If, instead of testing whether purple plants were larger than green plants, we just want to know if they were different in some way (i.e. in either direction), we should use a 'two-tailed' test. These work by looking at both ends of the null distribution. We won't do this here though---the one- vs two-tailed distinction is discussed in the [one-tailed vs. two-tailed tests](#one-two-tailed-tests) supplementary chapter.
```
<!-- ```{block, type='advanced-box'} -->
<!-- **What do people mean when they 'compare samples'?** -->
<!-- By comparing the central tendency (e.g. the mean) of two or more samples we can evaluate whether something we've measured changes, on average, among populations. We do this using the information in the samples to learn about the populations. -->
<!-- It's common to use the phrase 'comparing samples' when discussing statistical tests. However, when someone uses a statistical test to 'compare samples' what they are really doing is 'using information in the sample to compare population parameters'. This distinction may seem unnecessarily pedantic. However, it's important to be aware of the correct description because this helps us understand what a statistical test is really doing. -->
<!-- That said, saying or writing 'using information in the sample to compare population parameters' all the time is very tedious so we often revert to the phrase 'comparing samples'. We'll do this from time to time but try to keep in mind what is really meant by the 'comparing samples' phrase. -->
<!-- ``` -->
## What have we learned?
Our goal was not really to learn how permutation tests work. Just as with bootstrapping in the previous chapter---we just used it to demonstrate the logic of how frequentist statistics work. In this instance, we wanted to see how to evaluate a difference between two groups. The basic ideas are no different from those introduced in the previous chapter...
1. define what constitutes an 'effect' (e.g. a difference between means), then assume that there is 'no effect'---i.e. define the **null hypothesis**,
2. select an appropriate **test statistic** that can distinguish between the presence of an 'effect' and 'no effect',
(In practice, each kind of statistical test uses a standard test statistic. We don't have to pick these ourselves.)
3. construct the corresponding **null distribution** of the test statistic by working out what would happen if we were to take frequent samples in the 'no effect' situation,
4. and finally, use the null distribution and the observed test statistic to calculate a __*p*-value__that quantifies how frequently the observed difference, or a more extreme difference, would be observed under the hypothesis of no effect.
We only actually introduced one new idea in this chapter. When evaluating differences among populations we need to work with a single number that can distinguish between 'effect' and 'no effect'. This is called the test statistic. Sometimes this can be expressed in terms of familiar quantities like means (we just used a difference between means above). However, this isn't always the case. For example, we can use something called an *F*-ratio to evaluate the statistical significance of differences among more than two means.
We'll get to these kinds of tests later. For now, our task is to explore simple 'parametric tests'. These kinds of tests allow us to ask questions without resorting to things bootstrapping or permutation tests.