Include a one-paragraph summary of the course here.

Learning Objectives

List course learning objectives here.

These describe concepts the learners should grasp and techniques they should be able to use by the end of the course.

You can think of these as completing the phrase “after this course, the participant should be able to…”

They are not supposed to be as detailed as the learning objectives of each section, but more high-level.

Target Audience

Brief description of target audience here.

Prerequisites

Detail any prerequisite skills needed to attend this course, with links to other relevant materials/courses if possible.
Exercises

Exercises in these materials are labelled according to their level of difficulty:

Level 1: Exercises in level 1 are simpler and designed to get you familiar with the concepts and syntax covered in the course.

Level 2: Exercises in level 2 combine different concepts and apply them to a given task.

Level 3: Exercises in level 3 require going beyond the concepts and syntax introduced to solve new problems.
Authors

About the authors:

Alexia Cardona
Affiliation: Bioinformatics Training Facility, University of Cambridge
Roles: conceptualisation

Hugo Tavares
Affiliation: Bioinformatics Training Facility, University of Cambridge
Roles: writing - original draft; conceptualisation; coding

Martin van Rongen
Affiliation: Bioinformatics Training Facility, University of Cambridge
Roles: writing - review & editing; conceptualisation; coding
Citation

Please cite these materials if:

You adapted or used any of them in your own teaching.

These materials were useful for your research work. For example, you can cite us in the methods section of your paper: “We carried out our analyses based on the recommendations in TODO.”

You can cite these materials as:

TODO

Or in BibTeX format:

@Misc{,
  author = {},
  title = {},
  month = {},
  year = {},
  url = {},
  doi = {}
}
Acknowledgements

List any other sources of materials that were used.

Or other people that may have advised during the material development (but are not authors).

All traditional statistical tests make use of various named distributions, or require certain assumptions to be made about the parent distribution (such as the distribution being symmetric, in the case of non-parametric tests like the Wilcoxon test), in order to work properly. For parametric tests like the t-test or ANOVA, this distribution is the normal distribution. If these assumptions are met then traditional statistical tests are fine, but what can we do when we can’t assume normality, or if the distribution of the data is just weird?

Resampling techniques are the tools that work here. They allow us to test hypotheses about our data using only the data itself, without appealing to any assumptions about the shape or form of the parent distribution. They are in some ways a much simpler approach to statistics, but because they rely on the ability to generate thousands and tens of thousands of random numbers very quickly, they simply weren’t considered practical back in the day. Even now they aren’t widely used, because they require the user (you, in case you’d forgotten what’s going on at this time of day) to do more than click a button on a stats package or even know the name of the test. These techniques require a mix of statistical knowledge and programming; a combination of skills that isn’t all that common! There are three broad areas of resampling methods (although they are all quite closely related):
Permutation Methods

Bootstrapping

Cross-validation

Permutation methods are what we will focus on in this practical; they allow us to carry out hypothesis testing.

Bootstrapping is a technique for estimating confidence intervals for parameter estimates. We effectively treat our dataset as if it were the parent distribution, draw samples from it and calculate the statistic of choice (usually the mean) from these sub-samples. If we repeat this process many times, we will eventually be able to construct a distribution for our sample statistic. This can be used to give us a confidence interval for our statistic.
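To make the bootstrapping idea more concrete, here is a minimal sketch in R (not taken from the course materials; the data are simulated purely for illustration):

# a made-up numeric sample, standing in for a real dataset
set.seed(123)
x <- rnorm(50, mean = 10, sd = 2)

# resample the data with replacement many times,
# calculating the mean of each resample
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# a 95% bootstrap confidence interval for the mean,
# taken from the 2.5% and 97.5% quantiles of the resampled means
quantile(boot_means, probs = c(0.025, 0.975))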
Cross-validation is at the heart of modern machine learning approaches, but existed long before this technique became sexy/fashionable. You divide your data up into two sets: a training set that you use to fit your model and a testing set that you use to evaluate your model. This allows your model accuracy to be empirically measured. There are several variants of this technique (holdout, k-fold cross-validation, leave-one-out cross-validation (LOOCV), leave-p-out cross-validation (LpOCV), etc.), all of which do essentially the same thing; the main difference between them is a trade-off between the amount of time they take to perform and the reliability of the method.
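As an illustration of the holdout variant described above (again not part of this course; it uses R’s built-in mtcars data and a simple linear model purely as an example):

# randomly assign ~70% of rows to the training set, the rest to the test set
set.seed(123)
train_rows <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train_data <- mtcars[train_rows, ]
test_data  <- mtcars[-train_rows, ]

# fit a simple model on the training set only
fit <- lm(mpg ~ wt, data = train_data)

# evaluate the model on the held-out test set,
# here using the root mean squared error of its predictions
predictions <- predict(fit, newdata = test_data)
sqrt(mean((test_data$mpg - predictions)^2))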
We won’t cover bootstrapping or cross-validation in this practical, but feel free to Google them.

# A Python data analysis and manipulation tool
import pandas as pd

# For random number generation (used in the permutation code below)
import numpy as np

# Python equivalent of `ggplot2`
from plotnine import *
4.1.4 Functions
4.2 Purpose and aim

If we wish to test for a difference between two groups in a situation where the assumptions of a two-sample t-test just aren’t met, then a two-sample permutation test procedure is appropriate. It is also appropriate even if the assumptions of a t-test are met, but in that case it would be easier to just do the t-test.

One of the additional benefits of permutation tests is that we aren’t restricted to testing hypotheses about the means of the two groups. We can test hypotheses about absolutely anything we want! So, we could see if the ranges of the two groups differed significantly, for example.
4.3 Data and hypotheses

Let’s consider an experimental data set where we have measured the weights of two groups of 12 female mice (so 24 mice in total). One group of mice was given a perfectly normal diet (control) and the other group was given a high fat diet for several months (highfat).

We want to test whether there is any difference in the mean weight of the two groups. First, we need to specify the hypotheses:

\(H_0\): there is no difference in the means of the two groups

\(H_1\): there is a difference in the means of the two groups

Let’s read in the data and look at it:
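The code chunk that reads and plots the data isn’t reproduced here. As a sketch, assuming the data are stored in a CSV file with a weight and a diet column (the file path below is an assumption, so adjust it to wherever your copy of the data lives):

library(tidyverse)

# read in the mice weight data
mice_weight <- read_csv("data/mice_weight.csv")

# visualise the weights of the two diet groups with a boxplot
ggplot(mice_weight, aes(x = diet, y = weight)) +
  geom_boxplot()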
Looking at the data, it appears that there might be a difference between the mean weights of the two groups. The weights of the mice on the highfat diet appear somewhat higher than those on the control diet, although there is quite some overlap between the data.

The medians (the horizontal lines in the boxes) are shifted, as are the boxes. So, the first thing we probably want to do is calculate the exact difference in means:
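The chunk used to produce the summary below isn’t shown, but a dplyr sketch along these lines (assuming the mice_weight data frame from above) would give it:

mice_weight %>% 
  group_by(diet) %>% 
  summarise(mean_weight = mean(weight))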
# A tibble: 2 × 2
  diet    mean_weight
  <chr>         <dbl>
1 control        23.8
2 highfat        26.8
We’ll want to use this difference later on, so we store it in a variable:

obs_diff_weight <- mice_weight %>% 
  group_by(diet) %>% 
  summarise(mean_weight = mean(weight)) %>% 
  ungroup() %>% 
  # calculate the difference in weight
  # (there are many ways that you can do this)
  pull(mean_weight) %>% 
  diff()

obs_diff_weight
What we want to know is: is this difference unusual/big/statistically significant? Specifically, how likely would it be to get a difference this big if there were no difference between the two groups?

4.4 Permutation theory

The key idea behind permutation techniques is that if the null hypothesis is true, and there is no difference between the two groups, then switching some of the mice from one group to the other shouldn’t change the difference between the groups very much. If, on the other hand, there really is a difference between the groups (with one group having much higher weights than the other), then switching some mice between the groups should average the two groups out, leading to a smaller difference in group means.

So, what we do is shuffle the mice weights around lots and lots of times, calculating the difference between the group means each time. Once we have done this shuffling hundreds or thousands of times, we will have loads of possible values for the difference in the two group means. At this stage we can look at our actual difference (the one we calculated from our original data) and see how it compares to all of the simulated differences.

We can calculate how many of the simulated differences are bigger than our real difference, and this proportion is exactly the p-value that we’re looking for!

We will be randomly shuffling our data around, so we set a seed to aid reproducibility for this example.

seed <- 2602
We need to approach this a bit logically, since we are going to iterate the process multiple times. We can break the steps down as follows:

1. Define the number of permutations.
2. Permute (randomly reassign) the diet labels, without replacement.
3. Calculate the new difference in means between the groups.
4. Store the difference and repeat.

set.seed(seed)

# Set the number of permutations
reps <- 1000

# Create a place to store the values
permuted_stats <- tibble(permuted_diff = numeric(reps))

# Loop through the process
for (i in 1:reps) {
  # Get the data
  permuted_data <- mice_weight

  # Permute (reshuffle) the group labels
  permuted_data$diet <- sample(permuted_data$diet)

  # Calculate the new group differences
  permuted_diff <- permuted_data %>% 
    group_by(diet) %>% 
    summarise(mean_weight = mean(weight)) %>% 
    ungroup() %>% 
    pull(mean_weight) %>% 
    diff()

  # Store the calculated difference
  permuted_stats$permuted_diff[i] <- permuted_diff
}
seed = 2602

np.random.seed(seed)

# Set the number of permutations
reps = 1000

# Create a place to store the values
permuted_stats = pd.DataFrame({'permuted_diff': [0.0] * reps})

# Loop through the process
for i in range(reps):
    # Get a copy of the data, so the original data frame is not modified
    permuted_data_py = mice_weight_py.copy()

    # Permute (reshuffle) the group labels
    permuted_data_py['diet'] = np.random.permutation(permuted_data_py['diet'].values)

    # Calculate the new group difference
    permuted_diff = (permuted_data_py
                     .groupby('diet')['weight']
                     .mean()
                     .diff()
                     .iloc[-1])

    # Store the calculated difference
    permuted_stats.loc[i, 'permuted_diff'] = permuted_diff
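The materials then plot these permuted differences; the figure itself isn’t reproduced here, but a minimal ggplot2 sketch (assuming the permuted_stats and obs_diff_weight objects created above) would look like this:

# histogram of the permuted differences, with the observed difference
# marked as a blue vertical line
ggplot(permuted_stats, aes(x = permuted_diff)) +
  geom_histogram() +
  geom_vline(xintercept = obs_diff_weight, colour = "blue")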
The histogram is centred around zero, as we would expect: under the null hypothesis of this analysis there shouldn’t be any difference between the groups.

The blue vertical line shows us the value of our actual observed difference.

We can see that our observed difference is unlikely (because it’s out in the tails of the distribution rather than in the middle), but we want to be able to say exactly how unlikely. To do that we need to calculate the proportion of simulated differences that were bigger than our observed value. And, because we’re interested in a two-tailed test, we also need to include any simulated values that were less than -3.02 (i.e. the lower tail).
We do this in the following steps (sketched in the code below):

1. Count the number of occurrences where the permuted difference (permuted_diff) is at least as large, in absolute value, as the observed weight difference (obs_diff_weight).
2. Divide this by the number of times we permuted the data (reps).
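The exact chunk isn’t reproduced here; the following is a sketch consistent with the two-tailed counting described above (the name n_extreme is introduced purely for illustration):

# count how many permuted differences are at least as extreme
# (in absolute value) as the observed difference
n_extreme <- permuted_stats %>% 
  filter(abs(permuted_diff) >= abs(obs_diff_weight)) %>% 
  nrow()

# divide by the number of permutations to get the permutation p-value
n_extreme / reps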
If we divide this number by 1000 (the number of permutations), we get a value of 0.051.

In this case we have fixed the random number generator. You might not have done that, and normally you probably don’t want to either; in that case you will get a slightly different value every time you run this. To get more precision on this p-value we would need to run more than 1,000 replicates.

A good rule of thumb is that, for 1,000 replicates, a permutation test will return a p-value to within 1% of the actual value (so in this case the p-value is probably between 0.041 and 0.061). If we go to 10,000 replicates, then the error in the estimated p-value reduces to about 0.1%.
If we divide this number by 1000 (the number of permutations), we get a value of 0.05.

In this case we have fixed the random number generator. You might not have done that, and normally you probably don’t want to either; in that case you will get a slightly different value every time you run this. To get more precision on this p-value we would need to run more than 1,000 replicates.

A good rule of thumb is that, for 1,000 replicates, a permutation test will return a p-value to within 1% of the actual value (so in this case the p-value is probably between 0.04 and 0.06). If we go to 10,000 replicates, then the error in the estimated p-value reduces to about 0.1%.
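For some intuition on this rule of thumb (a back-of-the-envelope argument, not part of the original materials): the estimated p-value is a proportion based on reps permutations, so its standard error is approximately binomial,

\[
\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{\text{reps}}} = \sqrt{\frac{0.05 \times 0.95}{1000}} \approx 0.007,
\]

which puts the quoted ±0.01 range at roughly one to two standard errors of the estimate.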