-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path2_01_standard_error.Rmd
executable file
·180 lines (115 loc) · 16.4 KB
/
2_01_standard_error.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
# Sampling error {#sampling-error}
```{r, echo=FALSE}
load(file = "./_rda_objects/plant_morphs.rda")
freqs1 <- table(sample1$morph)
samp1_est <- round(100 * freqs1["purple"] / sampsize1)
```
In the previous chapter we introduced the idea that point estimates of a population parameter are always imperfect, in that they won't exactly match its true value. This uncertainty unavoidable. This means it is not enough to have estimated something--we have to also know about the uncertainty of the estimate. We use the machinery of statistics to quantify this uncertainty.
Once we have pinned down the uncertainty we can start to provide meaningful answers to our scientific questions. We will arrive at this 'getting to the answer step' in the next chapter. First we have to develop the uncertainty idea by considering things such as sampling error, sampling distributions and standard errors.
## Sampling error
For now, let's continue with the plant polymorphism example from the previous chapter. We had taken one sample of 20 plants from our hypothetical population and found that the frequency of purple plants in that sample was `r samp1_est`%. This is a point estimate of purple plant frequency based on a random sample of `r sampsize1` plants.
What happens if we repeat that same sampling process to arrive at a new, completely independent sample? This is what the population looks like, along with this new sample highlighted with red circles:
```{r plants-samp2, echo = FALSE, out.width='50%', fig.asp=1, fig.align='center', fig.cap='Plants sampled on the second occasion'}
sample2 <- sample_n(plantdata, size = sampsize1)
baseplt + geom_point(data = sample2, colour = "red", shape = 1, size = 5)
freqs2 <- table(sample2$morph)
samp2_est <- round(100*freqs2["purple"]/sampsize1)
```
This time we ended up sampling `r freqs2["green"]` green plants and `r freqs2["purple"]` purple plants, so our second estimate of the purple morph frequency is `r samp2_est`%. This is quite different from the first estimate. It is actually lower than that seen in the original study population. Our hypothesis that the purple morph is more prevalent in the new population is beginning to look a little shaky in light of this second sample.
Nothing about the study population changed between the first and second sample. What's more, we used a completely reliable sampling scheme to generate these samples---there was nothing biased or 'incorrect' about the way individuals were sampled. The two different estimates of purple morph frequency arise from nothing more than chance variation in the sampling process.
This variation--present whenever we observe a sample---has a special name. It is called **sampling error**. The word 'error' has nothing to do with mistakes or problems. In statistics, it is often used to refer to the variability or 'noise' in some process. Sampling error is the reason why we have to use statistics. Every time we estimate something there's some sampling error lurking in the estimate. If we ignore it, we risk drawing incorrect conclusions about the world.
(By the way, another name for sampling error is 'sampling variation'. We'll use both terms in this book because they are both widely used)
It is important to realise that sampling error is not really a property of any particular sample. It is a consequence of the nature of the variable(s) we're studying and the sampling method used to investigate the population. Let's try to understand what all this means.
## Sampling distributions
We can develop our simple simulation example to explore the consequences of sampling error. This time, rather than taking one sample at a time, we will simulate thousands of independent samples. The number of plants sampled ('n') will always be `r sampsize1`. Every sample will be drawn from the same population, i.e. the population parameter (purple morph frequency) will never change across samples. This means any variation we observe is due to nothing more than sampling error.
Here is a graphical summary of one such repeated simulation exercise:
```{r samp-dist-1, echo = FALSE, out.width='100%', fig.asp=0.5, fig.align='center', fig.cap='Distribution of number of purple morphs sampled (n = 20)'}
x <- rbinom(n = 100000, size = sampsize1, prob = prop.purp)
out <- data.frame(n.purple = factor(x, levels = as.character(0:sampsize1)))
limits <- as.character(0:sampsize1)
ggplot(out, aes(x = n.purple)) + geom_bar() +
scale_x_discrete(limits = limits, drop = FALSE) +
xlab("No. of purple morph individuals (n = 20)") + ylab("Count")
```
This bar plot summarises the results from taking 100000 independent samples of our population. Each time we took `r sampsize1` individuals from our hypothetical population and calculated the number of purple morphs found. The bar plot shows the number of times we found 0, 1, 2, 3, ... purple individuals, all the way up to the maximum possible (`r sampsize1`). We could have converted these numbers to frequencies but we're just summarising the raw distribution of purple morph counts at this point.
This distribution has a special name. It is called a **sampling distribution**. In order to work this out, we have to postulate values for the population parameters and we have to know how the population was sampled. This can often be done using mathematical reasoning. Here, we used brute-force simulation to approximate the sampling distribution of purple morph counts that arises when we sample `r sampsize1` individuals from our hypothetical population.
What does this particular sampling distribution show? It shows us the range of outcomes we can expect to observe when we repeat the same sampling process over and over again. The most common outcome is 8 purple morphs, which would yield an estimate of 8/20 = 40% for the purple morph frequency.
Although there is no way to know this without being told, a 40% purple morph frequency is actually the number used to simulate the original, hypothetical data set. So it turns out that the true value of the parameter we're trying to learn about also happens to be the most common estimate we expect to see under repeated sampling. That's probably not too surprising.
Returning to our question---what is the purple morph frequency---we now have our answer. The purple morph frequency is 40%. Ok, but we have obviously 'cheated' here because we used information from 1000s of samples to arrive at that answer. In the real world we typically have to work with only one sample.
So what use is this sampling distribution idea? It turns out that sampling distributions are key to 'doing frequentist statistics'. To understand why we need to look a bit more carefully at that sampling distribution we made.
Look at the spread of the sampling distribution along the x-axis. The range of outcomes is roughly 2 to 15. This corresponds to estimated purple morph frequencies the in the range of 10-75%. This tells us that when we sample `r sampsize1` individuals, the sampling error for a frequency estimate can be quite large.
The sampling distribution we summarised above is only relevant for the case where `r sampsize1` individuals are sampled, and the frequency of purple plants in the population is 40%. If we change either of those two numbers and repeat the repeated sampling process we will end up with a different sampling distribution. That's what was meant by, "It is a consequence of the nature of the variable(s) we're studying and the sampling method used to investigate the population."
Once we know how to calculate the sampling distribution for a particular problem, we can start to make statements about sampling error, to quantify uncertainty, and begin to address scientific questions. Fortunately, we don't have to work any of the details out for ourselves---statisticians have already done the hard work for us.
## The effect of sample size
One of the most important aspects of a sampling scheme is the **sample size** ---often denoted 'n'. This is the number of observations of objects, items, etc in a sample. What happens when we change the sample size?
To get a sense of why sample size matters, we'll repeat in our sampling exercise using two new sample sizes. First we'll use a sample size of `r sampsize2` individuals, then we'll take a sample of `r sampsize3` individuals. As before, we'll take a total 100000 repeated samples from the population.
Here's a graphical summary of the resulting sampling distributions:
```{r samp-dist-2, echo = FALSE, out.width='100%', fig.asp=0.5, fig.align='center', fig.cap='Distribution of number of purple morphs sampled (n = 40)'}
x <- rbinom(n = 100000, size = sampsize2, prob = prop.purp)
out <- data.frame(n.purple = factor(x, levels = as.character(0:sampsize2)))
ggplot(out, aes(x = n.purple)) + geom_bar() +
scale_x_discrete(limits = as.character(seq(0, sampsize2, 1)),
breaks = as.character(seq(0, sampsize2, 2)),
drop = FALSE) +
xlab("No. of purple morph individuals") + ylab("Count")
```
```{r samp-dist-3, echo = FALSE, out.width='100%', fig.asp=0.5, fig.align='center', fig.cap='Distribution of number of purple morphs sampled (n = 80)'}
x <- rbinom(n = 100000, size = sampsize3, prob = prop.purp)
out <- data.frame(n.purple = factor(x, levels = as.character(0:sampsize3)))
ggplot(out, aes(x = n.purple)) + geom_bar() +
scale_x_discrete(limits = as.character(seq(0, sampsize3, 1)),
breaks = as.character(seq(0, sampsize3, 4)),
drop = FALSE) +
xlab("No. of purple morph individuals") + ylab("Count")
```
We plotted each of them over the full range of possible outcomes, i.e. the x axis runs from 0-`r sampsize2` and 0-`r sampsize3`, respectively, in both plots. This is important because it allows us to compare the spread of each sampling distribution relative to the range of all possible outcomes. What do these two plots tell us about the effect of changing sample size?
The range of outcomes in the first plot (n = `r sampsize2`) is roughly 6 to 26, which corresponds to estimated purple morph frequencies in the range of 15-65%. The range of outcomes in the second plot (n = `r sampsize3`) is roughly 16 to 48, which corresponds to estimated frequencies in the range of 20-60%. The implications of this not-so-rigorous assessment are clear. We reduced sampling error by increasing the sample size. This makes intuitive sense: the composition of a large sample should more closely approximate that of the true population than a small sample.
How much data do we need to collect to accurately estimate a frequency? Here is the approximate sampling distribution of the purple morph frequency estimate when we sample `r (sampsizebig <- 500)` individuals:
```{r samp-dist-big, echo = FALSE, out.width='100%', fig.asp=0.6, fig.align='center', fig.cap='Distribution of number of purple morphs sampled (n = 500)'}
x <- rbinom(n = 100000, size = sampsizebig, prob = prop.purp)
out <- data.frame(n.purple = x)
ggplot(out, aes(x = n.purple)) +
geom_histogram(binwidth = 2) + xlim(0, sampsizebig) +
xlab("No. of purple morph individuals") + ylab("Count")
```
Now the range of outcomes is about 160 to 240, corresponding to purple morph frequencies in the 32-48% range. This is a big improvement over the smaller samples we just considered, but still, even with 500 individuals in a sample, we still expect quite a lot of uncertainty in our estimate. The take home message is that we need a lot of data to reduce sampling error.
## The standard error
We've not been very careful about how we quantify the spread of a sampling distribution so far---we just estimated the range of purple morph counts by looking at histograms. This is fine for an investigation of general patterns, but to make rigorous comparisons, we really need a quantitative measure of sampling error variation. One such quantity is called the **standard error**.
The standard error is quite a simple idea, though its definition is a bit of mouthful: the standard error of an estimate is the standard deviation of the estimate's sampling distribution. Don't worry if that is confusing. Most of us find the definition of standard error confusing when we first encounter it. The key point is that it is a measure of the spread, or dispersion, of a distribution. The distribution in this case is the sampling distribution associated with some kind of estimate.
(It is common to use a shorthand abbreviations such "SE", "S.E.", "se" or "s.e." in place of 'standard error' when referring to the standard error in text.)
It's much easier to make sense of something abstract like a standard error by looking at a concrete example. Let's calculate the standard error of the estimated purple morph frequency. It is possible to do this using mathematical reasoning; however, it is much easier to understand what's happening if we use 'brute force' simulations in R.
We'll show you the code to as we go but keep in mind there's really no need to commit any of it to memory. It's just here for readers who like to see how things work. We start by specifying the number of samples to take (`n_samples`), the sample size to use for each sample (`sample_size`), and the value of the population morph frequency (`purple_prob`). We set up the simulation by creating these variables:
```{r}
# number of independent samples
n_samples <- 100000
# sample size to use for each sample
sample_size <- 80
# probability a plant will be purple
purple_prob <- 0.4
```
This sets up a simulation to calculate the expected standard error when the purple morph frequency is 40% and the sample size is `r sample_size`, using `r n_samples` samples. The next step requires a bit more knowledge of R and probability theory:
```{r}
# repeat the sampling/estimation procedure many times
raw_samples <- rbinom(n = n_samples, size = sample_size, prob = purple_prob)
# convert results to %
percent_samples <- 100 * raw_samples / sample_size
```
The details of the R code are not important here. A minimum of A-level statistics is needed to understand what the `rbinom` function is doing, but in a nutshell, it is the workhorse that runs the simulation. Mostly, we're showing the code to demonstrate that seemingly complex things are often easy to do in R.
What matters here is the result. This is stored the result in a vector called `percent_samples`. Here are its first 50 values:
```{r, echo=FALSE}
head(percent_samples, 50)
```
These numbers are all part of the sampling distribution of morph frequency estimates. Each value is one possible estimate of the purple morph frequency when we samples of `r sample_size` individuals from a population where the true value is `r 100*purple_prob`%. There are `r n_samples` such estimates stored in `percent_samples`.
How do we then calculate the standard error? This is *defined* as the standard deviation of the sampling distribution, which we just approximated with our simulation. So all we need to do now is calculate their standard deviation using the `sd` function:
```{r}
sd(percent_samples)
```
The standard error is about `r round(sd(percent_samples), 1)`. Why is this number useful? It gives us a means to reason about the variability in a sampling distribution. When a sampling distribution is 'well-behaved', then roughly speaking, about 95% of estimates are expected to lie within a range of about four standard errors.
Not convinced? Look at the second bar plot we produced where the sample size was 80 and the purple morph frequency was 40%. What is the approximate range of simulated values? How close is this to $4 \times 5.5$? Quite close!
In summary, the standard error quantifies the variability of a sampling distribution. We said in the previous chapter that a point estimate is useless without some kind of associated measure of uncertainty. The standard error is one such measure.
## What is the point of all this?
Why have we just spent so much time looking at what happens when we take repeated samples from a population with known properties? When we collect data in the real world we only have the luxury of working with a small number of samples---usually just one in fact. We also won't know anything much the population parameter of interest in that situation. This lack of knowledge is the reason for collecting some data in the first place!
The short answer is that before we can use frequentist statistics we need to have a sense of:
- how point estimates behave under repeated sampling (i.e. **sampling distributions**),
- and how **sampling error** and **standard error** relate to sampling distributions.
Once we understand those ideas we're can make sense of how frequentist statistical tests work. That's what we'll do in the next few chapters.