diff --git a/01-sampling.md b/01-sampling.md index f4dee0b..f6a2963 100644 --- a/01-sampling.md +++ b/01-sampling.md @@ -1,30 +1,31 @@ --- title: " What is sampling" -teaching: 10 -exercises: 2 +teaching: 5 +exercises: 0 --- :::::::::::::::::::::::::::::::::::::: questions -- How do you write a lesson using R Markdown and `{sandpaper}`? +- What is sampling? +- What requirements should a good sample fulfill? + :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives -- Explain how to use markdown with the new lesson template -- Demonstrate how to include pieces of code, figures, and nested challenge blocks +- Introduce the concept of sampling. :::::::::::::::::::::::::::::::::::::::::::::::: -## +

-Let's start with an example, and thereby define some terminology. We have a lake with frogs in it, and there are light and dark green frogs. There’s a sunny side of the lake, and a shadowy area by the trees. Now imagine you want to estimate the fraction of light green frogs in the lake. There are too many frogs to count them all, so you catch a few and count how many of them are light coloured. This is a sample. A sample are randomly independently drawn events from a population of interest. The population of interest, in this case, are all the frogs in that lake. How can we draw randomly and independently? One obvious thing you could randomize in this experiment is the location at which you cath the frogs, because from the above picture you could get the impression that light-coloured frogs gather more in the shadows, while the dark-green frogs like the sun. Therefore, if we caught all the frogs in the same area, like in sample 1, this would probably overrepresent light frogs, thus not representing the population well. When randomizing the locations, this is less likely to be the case (see for example sample 2). You get similar problems if the observations are not independent. One example of dependent observations would be if you start with one frog, then catch the one right next to it, and so on. This is also likely to overrepresent one colour of frogs, and the reason why observations shouldn’t depend on each other. The sample size is the number of frogs in one sample. And the distribution is a set of rules that the random frog catches follow. +Let's start with an example, and thereby define some terminology. We have a lake with frogs in it, and there are light and dark green frogs. There’s a sunny side of the lake, and a shadowy area by the trees. Now imagine you want to estimate the fraction of light green frogs in the lake. There are too many frogs to count them all, so you catch a few and count how many of them are light coloured. This is a sample. 
A **sample** consists of randomly and independently drawn events from a **population of interest**. The population of interest, in this case, is all the frogs in that lake. How can we draw **randomly and independently**? One obvious thing you could randomize in this experiment is the location at which you catch the frogs, because from the above picture you could get the impression that light-coloured frogs gather more in the shadows, while the dark-green frogs like the sun. Therefore, if we caught all the frogs in the same area, like in sample 1, this would probably over-represent light frogs, thus not representing the population well. When randomizing the locations, this is less likely to be the case (see for example sample 2). You get similar problems if the observations are not independent. One example of dependent observations would be if you start with one frog, then catch the one right next to it, and so on. This is also likely to over-represent one colour of frogs, which is why observations shouldn’t depend on each other. The **sample size** is the number of frogs in one sample. :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor diff --git a/02-distributions.md b/02-distributions.md index 4b32a4c..ea25d7d 100644 --- a/02-distributions.md +++ b/02-distributions.md @@ -1,12 +1,12 @@ --- title: "What is a probability distribution?" -teaching: 10 -exercises: 2 +teaching: 8 +exercises: 4 --- :::::::::::::::::::::::::::::::::::::: questions -- What is a probability distribution`? +- What is a probability distribution? :::::::::::::::::::::::::::::::::::::::::::::::: @@ -19,39 +19,57 @@ exercises: 2 ## Overview probability distributions -CONTENT STILL TO COME FROM VIDEO +Most data analyses assume that data comes from some distribution. A probability distribution assigns probabilities to possible outcomes of an experiment. 
In terms of sampling, even if the sampling is supposed to be random, it doesn't mean that there are no rules -- so one could say that the probability distribution defines the rules for randomness. -![](https://vimeo.com/647705308) +Let's get back to our example of the lake of frogs, which is inhabited by light and dark green frogs. Let's say the true fraction of light green frogs is $1/3$, and you decide to catch a sample of $n=10$ frogs from that lake at random. +Then, if you count the number of light-colored frogs in that sample, there are 11 possible outcomes: The number can be between 0 and 10. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor +Below, you see a plot of this, where each of these outcomes has a probability. A suitable distribution for describing this scenario is the *binomial distribution*, which assumes a number of trials (frog catches), each of which can have two outcomes (light, dark). -Inline instructor notes can help inform instructors of timing challenges -associated with the lessons. They appear in the "Instructor View" +

+ +

-:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::: challenge +If the true fraction of light frogs is one third, then the most likely outcome is catching 3 light frogs. Seeing 10 light frogs is rather unlikely: The probability is close to zero, and we would consider this a rare event. + +
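As a rough check of the two statements above, the outcome probabilities can be computed directly in R. This is only a sketch under the assumptions of the example ($n=10$ frogs, true fraction $1/3$); it uses the base R function `dbinom`, and the variable names `outcomes` and `probs` are chosen here for illustration.

```r
# Sketch: probabilities of catching 0, 1, ..., 10 light frogs
# when 10 frogs are caught and the true fraction of light frogs is 1/3.
n <- 10                 # number of frogs in the sample
p <- 1 / 3              # assumed true fraction of light frogs
outcomes <- 0:n         # the 11 possible outcomes
probs <- dbinom(outcomes, size = n, prob = p)

# Most likely number of light frogs in the sample
outcomes[which.max(probs)]

# Probability of catching 10 light frogs: (1/3)^10, very close to zero
probs[outcomes == 10]

# The probabilities of all possible outcomes add up to 1
sum(probs)
```

The first call should return 3, and the probability of 10 light frogs is on the order of $10^{-5}$, which is why we would call it a rare event.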

+ +

-## Challenge 1: Which of the following statements are true? +::::::::::::::::::::::::::::::::::::: challenge +## Challenge: Which of the following statements are true? + 1. A probability distribution assigns probabilities to possible outcomes of an experiment. 2. The probabilities in a statistical distribution sum/integrate up to 1. 3. If the experiments are not randomized, the results don't follow a statistical distribution. - :::::::::::::::::::::::: solution -## Solution - Answers 1 and 2 are correct. To 3: If experiments are not randomized, the results still follow some distribution, but they are likely to not represent reality well. +:::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::: + +

+ +

-## Challenge 2: Discrete distributions +There are two types of probability distributions: discrete and continuous. + +A **discrete distribution** is what we have just seen, in this case the observations can only take integer values, in our example the counts from 0 to 10. In between those values, the probabilities are zero, that’s why it’s called a probability mass function: The probability mass is on defined points. + +In a **continuous distribution**, we have a probability density function. An example is the Gaussian distribution. That would be suitable if we measured the sizes of frogs, and they are well described by a mean size and a variance. On a continuous scale, there are infinitely many values, so that the probability for a specific value, for example a frog size of exactly 9cm is zero. +What makes sense instead is to ask for the probability of an observation to fall into a certain interval, for example between 8 and 10 cm, and we get this probability from integrating over the probability density function. + + +::::::::::::::::::::::::::::::: challenge +## Challenge: Discrete distributions What is the probability of an outcome of X=1.5 in a discrete distribution? @@ -60,9 +78,7 @@ What is the probability of an outcome of X=1.5 in a discrete distribution? - 0.15 :::::::::::::::::::::::: solution - The value $1.5$ is not discrete, and can therefore not occur in a discrete distribution. Its probability is zero. - ::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/03-binomial.md b/03-binomial.md index c57ded0..e888d98 100644 --- a/03-binomial.md +++ b/03-binomial.md @@ -1,68 +1,55 @@ --- title: "The binomial distribution" -teaching: 10 -exercises: 2 +teaching: 5 +exercises: 0 --- :::::::::::::::::::::::::::::::::::::: questions -- What is the binomial distribution and? +- What is the binomial distribution? - What kind of data is it used on? 
:::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives - +- Explain how the binomial distribution describes outcomes of counting. :::::::::::::::::::::::::::::::::::::::::::::::: -## Overview probability distributions - -The binomial distribution is what we have just seen in the example: We use it when we have a fixed sample size and count the number of "successes" in that sample -- for example mutations in a genome, or the number of cells within a sample that show a certain phenotype. - -TRANSLATE VIDEO - - -:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor -Inline instructor notes can help inform instructors of timing challenges -associated with the lessons. They appear in the "Instructor View" +The binomial distribution is what we have just seen in the example: We use it when we have a fixed sample size and count the number of "successes" in that sample. Examples are: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: +- how many locations in the genome carry a mutation +- the number of cells within a sample that show a certain phenotype +- how many patients out of 100 have a certain disease +- how many out of 10 frogs are light green -::::::::::::::::::::::::::::::::::::: challenge +

+ +

-## Challenge 1: Which of the following statements are true? +The binomial model has two parameters, which means the probabilities for the individual outcomes depend on two things: +- $n$ is the number of trials, or frogs, or patients, and it’s fixed. +- $p$ is the success probability. -We are in a diagnostic laboratory that gets blood samples from incoming hospital patients and tests them for some disease. Which of these experiments can be modeled with a binomial distribution? +Then the probability of observing $k$ successes out of $n$ draws (for example $k=4$ light coloured frogs out of $n=10$) can be described by this formula: -1. Counting the total number of samples that get tested over one day. -2. Counting the number of positive samples out of 50 samples that get tested successively. -3. Measuring all the blood sample's volumes (in mL). +$$P(X=k) = {n\choose k}p^k(1-p)^{n-k}$$ +You don't have to remember this piece of math -- it's just to make the point that you can calculate the probability of an event that is modeled with the binomial distribution, if you know the success probability $p$ and the number of trials $n$, i.e. the parameters. -:::::::::::::::::::::::: solution -## Solution - -Counting the number of positive samples out of 50 samples that get tested successively. +::::::::::::::::::: callout +In the binomial we just define a particular outcome as success, for example a light-coloured frog, or a patient with disease, even though that may not be a favourable outcome. +::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::: +Here is what the distribution looks like for a success probability of 0.3 and a sample size of 10. -## Challenge 2: Discrete distributions +
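The probabilities behind such a plot can be computed by hand from the binomial formula ${n\choose k}p^k(1-p)^{n-k}$. The following is a minimal sketch for $n=10$ and $p=0.3$; the names `by_hand` and `built_in` are illustrative, and the comparison with the base R function `dbinom` is only meant as a sanity check.

```r
# Sketch: evaluate the binomial formula by hand and compare it with dbinom().
n <- 10    # number of trials (frogs caught)
p <- 0.3   # success probability (light-coloured frog)
k <- 0:n   # all possible numbers of successes

by_hand  <- choose(n, k) * p^k * (1 - p)^(n - k)
built_in <- dbinom(x = k, size = n, prob = p)

# Both ways of computing the probabilities should agree
all.equal(by_hand, built_in)

# The probabilities of all outcomes sum to 1
sum(by_hand)

# Show the probabilities rounded to three digits
round(setNames(by_hand, k), 3)
```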

+ +

-What is the probability of an outcome of X=1.5 in a discrete distribution? - -- 0 -- 0.5 -- 0.15 - -:::::::::::::::::::::::: solution - -The value $1.5$ is not discrete, and can therefore not occur in a discrete distribution. Its probability is zero. - -::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +The expected value of the binomial is $n \times p$, which is quite intuitive: If we catch 10 frogs and the probability of being light-green is 0.3, then we expect to catch 3 light-green frogs on average. diff --git a/04-distributions-R.md b/04-distributions-R.md index 0f6d3c7..e135ac6 100644 --- a/04-distributions-R.md +++ b/04-distributions-R.md @@ -1,19 +1,19 @@ --- title: "Probability distributions in R" -teaching: 10 -exercises: 2 +teaching: 5 +exercises: 7 --- :::::::::::::::::::::::::::::::::::::: questions -- What is the binomial distribution and? -- What kind of data is it used on? +- How can I calculate probabilities in R? :::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: objectives - +- Demonstrate and practice the use of the base R functions for calculating probabilities. +- Explain the concept of cumulative distribution. :::::::::::::::::::::::::::::::::::::::::::::::: @@ -31,9 +31,15 @@ The first letter specifies if we want to look at the density, probability distri The arguments depend on the distribution we are looking at, but always include the parameters of that function. -**Calculating probabilities:** Let's use the example where we caught 10 frogs and count how many of them are light-colored. +## Calculating probabilities + + +Let's use the example where we caught 10 frogs and count how many of them are light-colored. + -![](../images/binomial_frogs.png) +

+ +

For known parameters, we can calculate the chances of counting exactly 5 light-colored frogs: @@ -47,7 +53,7 @@ dbinom(x=5, size=n, prob=p) [1] 0.1029193 ``` -We can ask for the probability of catching at most (or at least) 5 light frogs. In this case, we need the cumulative probability distribution starting with `p`: +We can ask for the probability of catching at most 5 light frogs. In this case, we need the cumulative probability distribution starting with `p`: ```r @@ -58,6 +64,8 @@ pbinom(q=5, size=n, prob=p) # at most [1] 0.952651 ``` +Similarly, we can ask for the probability of catching more than 5 light frogs: + ```r pbinom(q=5, size=n,prob=p, lower.tail=FALSE) # larger than ``` @@ -66,12 +74,14 @@ pbinom(q=5, size=n,prob=p, lower.tail=FALSE) # larger than [1] 0.04734899 ``` + + Catching more than 5 light frogs is a rare event. ::::::::::::::::::::::::::::::::::::: challenge -## Challenge 1: Disease prevalence +## Challenge: Disease prevalence There is a disease with a known prevalence of 4%. You have a group of 100 randomly selected persons. Use the above functions to calculate @@ -81,8 +91,6 @@ There is a disease with a known prevalence of 4%. You have a group of 100 random :::::::::::::::::::::::: solution -## Solution - 1. Exactly 7 persons: ```r @@ -105,20 +113,7 @@ pbinom(q=6, size=100, prob=0.04, lower.tail=FALSE) ::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::: -## Challenge 2: Discrete distributions - -What is the probability of an outcome of X=1.5 in a discrete distribution? - -- 0 -- 0.5 -- 0.15 - -:::::::::::::::::::::::: solution - -The value $1.5$ is not discrete, and can therefore not occur in a discrete distribution. Its probability is zero. 
- -::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/05-Poisson.md b/05-Poisson.md new file mode 100644 index 0000000..069dbd0 --- /dev/null +++ b/05-Poisson.md @@ -0,0 +1,107 @@ +--- +title: "The Poisson distribution" +teaching: 7 +exercises: 3 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What is the Poisson distribution? +- What kind of data is it used on? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives +- Explain how the Poisson distribution is derived from the binomial. +- Learn to apply the Poisson distribution in R +:::::::::::::::::::::::::::::::::::::::::::::::: + + +## A special case of the binomial distribution + +There is an approximation for the binomial distribution which can often be convenient, specifically if we have + +- many trials (large $n$) and +- a small success probability $p$. + + +

+ +

+An example: + +- We now fill the net for 1 h, that is, for a fixed period of time. +- We catch about 100 frogs per hour, which means after an hour we have about 100 frogs in the net, and $n \approx 100$. +- The fraction of light frogs is low, only 2 % ($p=0.02$). +- After filling the net, we count only the light ones. + +We can now ask how many light frogs we can expect to catch per hour. This expected number is called $\lambda$, and it's given by + $$\lambda = n \cdot p = 100 \cdot 0.02 = 2.$$ + +For the possible outcomes, we again just look at the number of light frogs; we are not interested in the number of dark frogs, or how many frogs we caught in total. +In this scenario **we have reduced the parameters to just one, the rate lambda**, which is the expected number of light frogs per hour. And the probabilities of the outcomes can be approximated with a Poisson distribution. + + +## What is the Poisson used for? + +Even though the Poisson is derived as an approximation of the binomial, we don't necessarily need two categories of events to use it. We can also use it to count events of a single category. + +
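To get a feeling for how close the Poisson approximation is in the net example ($n \approx 100$, $p = 0.02$, so $\lambda = 2$), one can put the binomial and Poisson probabilities side by side. This is a sketch using the base R functions `dbinom` and `dpois`; the data frame `comparison` and its column names are made up for this illustration.

```r
# Sketch: compare binomial probabilities with their Poisson approximation
# for many trials (n = 100) and a small success probability (p = 0.02).
n <- 100
p <- 0.02
lambda <- n * p          # expected number of light frogs per hour: 2

k <- 0:8                 # numbers of light frogs to compare
comparison <- data.frame(
  k        = k,
  binomial = dbinom(k, size = n, prob = p),
  poisson  = dpois(k, lambda = lambda)
)
print(round(comparison, 4))

# Largest absolute difference between the two sets of probabilities
max_diff <- max(abs(comparison$binomial - comparison$poisson))
max_diff
```

For these parameters the two columns should differ only in the third decimal place. With a larger $p$ or a smaller $n$, the approximation becomes worse.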

+ +

+For example, consider two lakes with frogs of only one colour. We might want to compare the density of frogs in these lakes, which can be done by comparing the Poisson rates. For this, we count frogs within $2 m^2$ regions in both lakes. For each individual lake, this counting process could be described by a Poisson, with the rate giving the average number of frogs per $2 m^2$. + + + + +The Poisson describes counting events over a fixed domain, which can be a period of time, or a fixed space. We assume here that events have an underlying rate, called lambda. + +Examples: + +- Counting frogs for an hour, or within a defined area of the lake. +- Counting cells or particles in microscopy images within a fixed volume. +- Counting how many times something happens in the cell within a fixed period of time. +- Mutations in the genome can also be approximated by a Poisson, because the genome has many base pairs, and the fraction of mutated base pairs is low. + +## Properties of the Poisson distribution + +The distribution has only one parameter, the rate. + +The probability for counting $k$ events over whatever fixed domain you have chosen is + +$$\large P(X=k) = \frac{\color{purple}\lambda^k e^{-\color{purple}\lambda}}{k!} .$$ + +In the next plot, Poisson distributions for different rates are shown: + +

+ +
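The Poisson probabilities for different rates can also be computed directly from the formula $\lambda^k e^{-\lambda}/k!$. The sketch below does this for three example rates; the helper `poisson_pmf` and the chosen rates (1, 4 and 10) are assumptions made for this illustration, and the base R function `dpois` is used only as a cross-check.

```r
# Sketch: evaluate the Poisson formula by hand for a few rates.
# poisson_pmf is a hypothetical helper written for this example.
poisson_pmf <- function(k, lambda) {
  lambda^k * exp(-lambda) / factorial(k)
}

k <- 0:20                # counts at which to evaluate the probabilities
rates <- c(1, 4, 10)     # example rates, chosen for illustration

# One column of probabilities per rate
probs <- sapply(rates, function(rate) poisson_pmf(k, rate))
colnames(probs) <- paste0("rate_", rates)
rownames(probs) <- k
print(round(probs, 3))

# The hand-written formula should agree with dpois()
all.equal(poisson_pmf(k, 4), dpois(k, lambda = 4))
```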

+ + + + +If we look at the shape of the distribution, we see that + +- for low rates, it is squeezed towards zero and has a long tail towards larger values. +- for larger rates (for example a rate of $10$ shown here in purple) the shape looks more and more like a Gaussian. Indeed, high-valued counts can often be well described with a Gaussian as well. + + +Another important feature of the Poisson is that its variance and mean are the same: both are equal to lambda. This means in turn that we can estimate lambda from the sample mean. + + +::::::::::::::::::::::::: challenge + +We are in a diagnostic laboratory that gets blood samples from incoming hospital patients and tests them for some disease. Which of these experiments can be modeled with a Poisson distribution? + +1. Counting the number of positive samples out of 50 samples that get tested successively. +2. Counting the number of samples that test positive within an hour. +3. Counting the number of samples that get tested within an hour. + +:::::::::::::::::: solution +All of these scenarios can be modeled with a Poisson. Counting the number of positive samples out of 50 is more suitable for a binomial distribution, but can indeed be *approximated* by Poisson. +:::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::: diff --git a/06-simulations.md b/06-simulations.md new file mode 100644 index 0000000..9cac987 --- /dev/null +++ b/06-simulations.md @@ -0,0 +1,64 @@ +--- +title: "Simulations in R" +teaching: 2 +exercises: 5 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can I make my own data in R? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Learn to use the random number generators in R to simulate data. + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + +Here's a short interlude on random numbers in R, which you can use to simulate your own data. 
This can be very useful to set up toy models and see what the data or certain plots would look like in theory. + +For example, we could simulate frog counts from 100 binomial experiments, that is the counts of light colored frogs from filling a net one hundred times: + + +```r +set.seed(85) +size = 10 # number of frogs per net +prob = 0.3 # true fraction of light colored frogs +n = 100 # number of binomial experiments +binomial_frog_counts <- rbinom(n=n, size=size, prob=prob) +head(binomial_frog_counts) +``` + +```{.output} +[1] 3 1 2 3 3 4 +``` + +Here, we used `set.seed()` for reproducibility: The seed gives an initialization to the process of drawing random numbers. So any time we run the same simulation with the same seed, we will get the same random numbers. If we don't set the seed, the random numbers will look different each time we run the code. + + +::::::::::::::::: challenge +## Poisson random numbers + +1. Simulate 200 random frog counts with a Poisson rate of 4. +2. Calculate the mean of the frog counts. + +::::::::::::::::: solution + +```r +set.seed(81) +# 1. simulation +frog_counts <-rpois(n = 200, lambda = 4) + +# 2. calculate the mean +mean(frog_counts) +``` + +```{.output} +[1] 3.955 +``` +::::::::::::::::::::::::: +:::::::::::::::::::::::::::: + diff --git a/07-gamma-poisson.md b/07-gamma-poisson.md new file mode 100644 index 0000000..ebf39e3 --- /dev/null +++ b/07-gamma-poisson.md @@ -0,0 +1,117 @@ +--- +title: "The Gamma-Poisson distribution" +teaching: 7 +exercises: 7 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What is the Gamma-Poisson distribution? +- What kind of data is it used on? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain the Gamma-Poisson distribution and how it compares to the Poisson distribution. +- Further practice the use of probabilities and random numbers in R. 
+:::::::::::::::::::::::::::::::::::::::::::::::: + + +## The Gamma Poisson distribution + +In biology we often have the problem that the Poisson doesn’t fit very well, because the observed distribution is more spread out than expected for a Poisson; that is, the variance is larger than the mean. This is due to additional variation that we haven’t captured, or that we cannot control for. + + +

+ +

+Let's stay with the frog counts, but this time each individual count was taken from a different lake. +And each lake has slightly different properties that affect the rate at which we can catch frogs in it. There might be more or fewer frogs in them, or better hiding places. + +We can **model** this by saying that for each data point, the rate lambda is drawn from a distribution, in this case a gamma distribution. You don’t have to know anything about the gamma distribution, just that it has a bell shape, and thus gives us some average rate and a variance. +What we then have is individual frog counts that all come from a slightly different Poisson distribution, so that the overall distribution will be more spread out than each single Poisson. + + +We call this a **Gamma-Poisson distribution**. It's a Poisson distribution where the rate $\lambda$ for each data point was drawn from a gamma distribution. It has two parameters: the mean (average rate), and a scale parameter, which indicates how much the lambdas are spread. + +Examples for usage: + +- modelling gene expression data, because there is always biological variation between the samples, and we cannot measure where exactly it comes from +- modelling cell counts that come from slightly different volumes, or from different individuals. + + + + +### The Gamma Poisson distribution in R + +The Gamma Poisson distribution goes by two names: "Gamma Poisson" or "negative binomial". In R, its suffix is `nbinom`. To make things more confusing, the Gamma Poisson can be *parametrized* in different ways. This means, it is possible to describe the same distribution with different combinations of parameters. +Above, I introduced a parametrization with + +- mean $\mu$ (the average Poisson rate) and +- scale $\theta$ (a measure for how much the Poisson rate varies between individual counts), + +because I find it most intuitive. The argument `mu` in `dnbinom` lets you define $\mu$. 
The argument `size` is the inverse of $\theta$, that is for a small `size` you will get a distribution with a large overdispersion (=spread). For very large values of `size`, the distribution will tend towards a Poisson distribution. + + +:::::::::::::::::::::: callout +## Note +Consult the "Details" in the help function (`help(dnbinom)`) if you choose to parametrize in a different way. +:::::::::::::::::::::::: + + +:::::::::::::::::::::: challenge + +## Compare Gamma poisson to the Poisson distribution +To demonstrate how the Gamma Poisson distribution differs from a Poisson, let's compare the means and variances. + +1. Simulate 100 random frog counts with a Poisson rate of 4 then calculate the mean. +2. Simulate 100 random frog counts with a Poisson rate of 4, then calculate the variance using the function `var`. +3. Simulate 100 random frog counts from different lakes with a mean 4 and `size=2`, then calculate the mean. +4. Simulate 100 random frog counts from different lakes with a mean of 4 and `size=2`, then calculate the variance using the function `var`. + +::::::::::::::::::::::: solution + + + +```r +library(tidyverse) +# 1. Calculate the mean of a Poisson distribution +rpois(n=100, lambda = 4) %>% mean() +``` + +```{.output} +[1] 3.93 +``` + +```r +# 2. Calculate the variance +rpois(n=100, lambda = 4) %>% var() +``` + +```{.output} +[1] 3.168081 +``` + +```r +# 3. Gamma poisson mean +rnbinom(n=100, mu=4, size=2) %>% mean() +``` + +```{.output} +[1] 4.22 +``` + +```r +# 4. 
Gamma poisson variance +rnbinom(n=100, mu=4, size=2) %>% + var() +``` + +```{.output} +[1] 14.78222 +``` + +::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::: diff --git a/08-gaussian.md b/08-gaussian.md new file mode 100644 index 0000000..9ede48c --- /dev/null +++ b/08-gaussian.md @@ -0,0 +1,38 @@ +--- +title: "The Gaussian distribution" +teaching: 2 +exercises: 0 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What is the Gaussian distribution and what kind of data is it used on? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Refresh the Gaussian distribution. +- Give examples what it is used on. +:::::::::::::::::::::::::::::::::::::::::::::::: + + +## The Gaussian distribution + +We covered the Gaussian -- also known as *Normal distribution* earlier in this tutorial as an example, and you have most likely come across it before. + +It is applicable to repeated measurements of the same thing, for example + +- frog lengths, +- temperatures or +- pixel intensities. + + +

+ +

+ +The Gaussian distribution is continuous and has two parameters: + +- the *mean*, which is estimated by the sample mean, and +- the *variance*, which is estimated by the sample variance. diff --git a/09-visualizing-distributions.md b/09-visualizing-distributions.md new file mode 100644 index 0000000..9b60c92 --- /dev/null +++ b/09-visualizing-distributions.md @@ -0,0 +1,76 @@ +--- +title: "Visualizing distributions" +teaching: 5 +exercises: 0 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can I visualize the distribution of my data? +- How can I find out whether my data is well described by a certain probability distribution? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Introduce the histogram, ECDF and QQ-plot as tools for comparing an empirical to a theoretical distribution. + +:::::::::::::::::::::::::::::::::::::::::::::::: + +## Tools for visually comparing distributions + +Usually, we have some data and want to find the "right" distribution for it. That means, we want to find a distribution that models the data in a suitable way. The workflow should look something like this: + +1. **Fit your data to a distribution that you consider plausible.** We'll see how to do this in R. The fit will give you the best parameters for this distribution, given your data. +2. **Visually compare the theory to your data points.** The theory in this case is the fit you just obtained. In the visualization, a good fit doesn't show systematic deviations of the data points from theory. +3. **Do the same with other plausible distributions.** +4. **Decide which fit looks best to you.** + + +In this episode, I'll introduce the most common visualizations for empirical distributions. In the following episodes, we'll see how to produce them in R. + +## The histogram + +

+ +

+ +Once we have fitted a distribution to the data, we can visually compare the theory, that is the fitted distribution, to the actual data and decide whether this is a good fit. +One very common thing to look at is the histogram. The more data points you have, the more it resembles the underlying probability distribution. So we could look at this histogram here and check whether it looks similar to the Gaussian shape, shown in blue. +A problem about this plot is that, as you might have noticed already, a lot of distributions look roughly bell-shaped. And often it’s difficult to tell subtle differences from a histogram. + + +## The cumulative distribution + +This is why it is also useful to look at the cumulative distribution. +**For every value $k$, the cumulative distribution gives the percentage of data points that have a smaller value than this $k$.** + +
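In R, the empirical cumulative distribution of a sample can be obtained with the base function `ecdf`. The sketch below uses simulated frog sizes, so the numbers are not those of the figure in this episode; the variable names `sizes` and `F_emp` as well as the mean of 7 cm and standard deviation of 2 cm are assumptions made for this illustration.

```r
# Sketch: empirical vs. theoretical cumulative distribution for
# simulated frog sizes (hypothetical data, Gaussian with mean 7, sd 2).
set.seed(1)
sizes <- rnorm(n = 200, mean = 7, sd = 2)

# ecdf() returns a function: F_emp(x) is the fraction of data points <= x
F_emp <- ecdf(sizes)

# Fraction of sampled frogs that are at most 7 cm long
F_emp(7)

# Theoretical counterpart for the Gaussian we simulated from: exactly 0.5
pnorm(7, mean = 7, sd = 2)

# Step plot of the empirical cumulative distribution
plot(F_emp, main = "Empirical cumulative distribution of frog sizes")
```

The empirical value will be close to, but usually not exactly, the theoretical value of 0.5, because the sample is random.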

+ +

+In this case, you can look up a certain frog size, and the graph will tell you what percentage of frogs is smaller than that. +For example, if we look up the size of 6 cm, the cumulative distribution tells us that roughly 75% of the frogs in the sample are smaller than 6 cm. +The empirical cumulative distribution comes from the data, and has one little step for each data point. The theoretical distribution gives the percentage of data points that you’d expect to be smaller than a certain value, for the distribution that you are comparing to. +In this graph, it’s often easier to see systematic deviations from the expectation. For example, here the blue line is substantially steeper than the black line, so maybe we are comparing to a theoretical distribution with a variance that is too low. + + + +## The QQ-plot + +Another useful plot is the QQ-plot. It’s very often used for checking whether the data is normal, but it can also be used for comparing to other distributions. + + +

+ +

+Here, we plot sample quantiles against theoretical quantiles. The 25th percentile (the 0.25 quantile) is the value $k$ for which 25% of the data points are smaller than $k$. You can determine the quantile for each data point, and compare it to theory, so each point in this plot is a comparison between theory and data. +The nice thing about this plot is that when the theory and data match well, all points end up on a straight line. And it’s easy for our eyes to decide whether something is a straight line or not. + + +:::::::::::::::::::::: callout +### A note on fitting + +When we say we "fit" a distribution, we are usually talking about the *maximum likelihood approach*. This means we find the parameters for which it is most likely to see the given data. This is an optimization problem - and thus can be solved conveniently by the computer. You might also hear people say they *minimize the deviance* when they fit data, and this is just another term for minimizing the distance between the line (theory) and the points (data). The exact mathematical formula for the deviance to be minimized depends on the type of distribution we are fitting to -- for example, when fitting a Gaussian distribution, we minimize the sum of squares. But no worries: The tools for fitting that you'll get to know in this tutorial know their formulae :) +:::::::::::::::::::::::::: diff --git a/10-histogram.md b/10-histogram.md new file mode 100644 index 0000000..20f6d98 --- /dev/null +++ b/10-histogram.md @@ -0,0 +1,154 @@ +--- +title: "Histograms in R" +teaching: 5 +exercises: 10 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can I plot a histogram of my data in R? +- How can I compare my data to a distribution using histograms? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Give examples for and practice plotting histograms with `ggplot` and `goodfit`. +- Learn to interpret the results. 
+ + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + + +## Start with some data + +For demonstration, let's simulate frog counts and sizes with random draws from a Poisson and a Gaussian distribution. This code should by now look familiar to you: + + +```r +set.seed(51) # set a seed for reproducibility +frog_counts <- rpois(n = 200, lambda = 4) +frog_sizes <- rnorm(n = 200, mean = 7, sd = 2) +frog_counts_different_lakes <- rnbinom(n=200, size=2, mu=4) +``` + +## Plotting a histogram +We can then use `ggplot2` to plot histograms from the simulations. The histogram will have a shape that is specific to the distribution: + + +```r +data.frame(frog_counts) %>% + ggplot(aes(x=frog_counts))+ + geom_histogram(binwidth=1) +``` + + + + +::::::::::::::::::: challenge + +## Exercise: Plot your first histogram + +Start with the following set-up: + + +```r +set.seed(51) # set a seed for reproducibility +frog_counts <- rpois(n = 200, lambda = 4) +``` + +Note how I used `binwidth=1` for displaying the count data. What happens if you don't? + +::::::::::::::::::::::::::: solution +By default, the data are divided into 30 bins. Decide for yourself whether this gives you a good overview of your data. It's often a good idea to play around with the `binwidth` parameter. +::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::: + + +### Relation between histogram and distribution function + +The theoretical distribution gives the expected frequency of the random numbers.
+ +For example, for the Poisson frog counts, we can calculate the expected frequency of the counts from 0 to 20: + +```r +counts <- 0:20 +expected_freq <- dpois(counts, lambda = 4) * length(frog_counts) +``` + +Then we can plot the expected counts as a line on top of the histogram: + +```r +data.frame(frog_counts) %>% + ggplot(aes(x=frog_counts))+ + geom_histogram(binwidth=1)+ + geom_line(data=data.frame(counts,expected_freq), aes(counts,expected_freq)) +``` + + + + +### The `goodfit` function + +You might not want to code this plot every time you visually inspect a fit. Luckily, there are convenient functions that do this for you. + +The `goodfit` function from the `vcd` package allows you to fit a sample to a discrete distribution of interest. +Here, we fit the frog counts to a Poisson distribution: + +```r +library(vcd) +my_fit <- goodfit(frog_counts,"poisson") +my_fit$par +``` + +```{.output} +$lambda +[1] 3.83 +``` + +```r +plot(my_fit) +``` + + + +This is what a good fit looks like: The bars all roughly stop at zero, some above and some below, which is due to the sample's randomness. + + +::::::::::::::::::: challenge + +## Exercise: Fit a Poisson distribution + + +Start with the following set-up: + +```r +set.seed(51) # set a seed for reproducibility +frog_counts_different_lakes <- rnbinom(n=200, size=2, mu=4) +``` + +Use the `goodfit` function to fit the `frog_counts_different_lakes` data with a Poisson distribution. + +:::::::::::::::::::::::::::: + + +The histogram in the challenge above should show you that there is a systematic problem: The bars at the periphery hang very low and those around the peak hang high. This indicates that the fit isn't good. + + +:::::::::::::::::::: challenge + +## Exercise: Fit a Gamma-Poisson + +Start with the following set-up: + +```r +set.seed(51) # set a seed for reproducibility +frog_counts_different_lakes <- rnbinom(n=200, size=2, mu=4) +``` + +1.
Fit the frog counts from different lakes with a Gamma-Poisson distribution instead (hint: in the `goodfit` function, it is called `nbinomial`). +2. Can you make out the visual difference between a good and a bad fit? + +:::::::::::::::::::::: diff --git a/11-cdf.md b/11-cdf.md new file mode 100644 index 0000000..350d723 --- /dev/null +++ b/11-cdf.md @@ -0,0 +1,57 @@ +--- +title: 'The cumulative distribution function' +teaching: 3 +exercises: 0 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What is the empirical cumulative distribution function? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Demonstrate the cumulative distribution function with an example. + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + + + + +The cumulative distribution is the integral of the distribution, and thus the empirical cumulative distribution is the integral of the histogram. It gives you the percentage of values that are smaller than a certain value. +We can visualize it with `stat_ecdf` in `ggplot2`. + + +```r +data.frame(frog_sizes) %>% + ggplot(aes(x=frog_sizes))+ + stat_ecdf() +``` + + + +As with the histogram, you could calculate the expected values for the cumulative distribution and add them to the plot: + + +```r +sizes <- seq(0,14,.5) +theoretical_cdf <- pnorm(sizes, mean=7, sd=2) + +data.frame(frog_sizes) %>% + ggplot(aes(frog_sizes)) + + stat_ecdf()+ + geom_line(data=data.frame(sizes,theoretical_cdf), aes(sizes, theoretical_cdf),colour="green") +``` + + + +From this plot, you can read off what fraction of the values is smaller than a given frog size $x$: for example, about 50% of the values are smaller than 7 cm. So this is really just another way to display the information in the histogram. + +:::::::::::::::::::::: callout +## Note +The procedure above was shown for demonstration purposes.
We will not practice overlaying empirical and theoretical distribution functions with `ggplot`, because in the next episode, we'll introduce a tool that does the work for you. +::::::::::::::::::::::::::: + diff --git a/12-qq.md b/12-qq.md new file mode 100644 index 0000000..0586361 --- /dev/null +++ b/12-qq.md @@ -0,0 +1,134 @@ +--- +title: 'The QQ-plot' +teaching: 5 +exercises: 3 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What is the QQ-plot? +- How can I create a QQ-plot of my data? +- Why is it useful? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Demonstrate the calculation of quantiles in R. +- Explain the QQ-plot. +- Introduce functions to make a qq-plot in R. + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + + +## The QQ-plot + +The qq-plot compares the quantiles of two distributions. + +Quantiles are the inverse of the cumulative distribution, i.e., the `qnorm` function is the inverse of `pnorm`: + +You can use `pnorm` to ask for the probability of seeing a value of $-2$ or smaller: + +```r +pnorm(q = -2,mean=0, sd=2) +``` + +```{.output} +[1] 0.1586553 +``` + +Then you can use `qnorm` to ask for the value below which about 16% of the values fall. Here, I demonstrate this by plugging the result of the `pnorm` function into the `qnorm` function: + +```r +qnorm(p= pnorm(q = -2,mean=0, sd=2), mean=0, sd=2) +``` + +```{.output} +[1] -2 +``` + +Let's compare the quantiles of the simulated frog counts to the theoretical quantiles of a normal distribution. +There are specialized functions for qq-plots, so we don't have to calculate the theoretical values by hand: + + +```r +data.frame(frog_counts) %>% + ggplot(aes(sample=frog_counts))+ + geom_qq()+ + geom_qq_line() +``` + + + +By default, the `geom_qq` function assumes that we compare to a standard normal distribution. +This fit doesn't look too bad, although for low values the points stray away from the line.
This shouldn't surprise you. Remember: The normal distribution approximates the Poisson distribution (with which the simulation was generated) well for large values, but has limitations for low-valued counts. + +Let's set up a qq-plot where we compare to a Poisson distribution: + + +```r +data.frame(frog_counts) %>% + ggplot(aes(sample=frog_counts))+ + geom_qq(distribution=qpois, dparams=list(lambda=mean(frog_counts)))+ + geom_abline() +``` + + + +This fit looks better. We can still argue whether a qq-plot is the best representation for a Poisson fit, because due to the distribution's discreteness, many data points end up on the exact same spot in the plot (overplotting). Thus, we lose information in this visualization. + +## One fits all - `fitdistrplus` package + +If you want a quick overview, you can use the `fitdistrplus` package, which produces a series of plots. + +Suppose you fit the frog counts to a normal distribution: + +```r +library(fitdistrplus) +my_fit <- fitdist(frog_counts, dist = "norm") +my_fit # gives you the parameter estimates +``` + +```{.output} +Fitting of the distribution ' norm ' by maximum likelihood +Parameters: + estimate Std. Error +mean 3.830000 0.1495176 +sd 2.114498 0.1057248 +``` + +```r +plot(my_fit) +``` + + + +With some practice, this plot quickly allows you to see that you are comparing discrete data to a continuous distribution. Also, the QQ-plot doesn't really give a straight line and the histogram seems to be skewed to the left, compared to theory. + +::::::::::::::::::::::::: challenge + +What does it look like if you fit the `frog_counts` to a Poisson distribution instead?
+ +Start from this set-up: + +```r +set.seed(51) # set a seed for reproducibility +frog_counts <- rpois(n = 200, lambda = 4) +library(fitdistrplus) +``` + +::::::::::::::::: solution + + +```r +my_fit <- fitdist(frog_counts, dist = "pois") +plot(my_fit) +``` + + + +:::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::: + diff --git a/13-summary.md b/13-summary.md new file mode 100644 index 0000000..f0bc67b --- /dev/null +++ b/13-summary.md @@ -0,0 +1,38 @@ +--- +title: 'Summary' +teaching: 5 +exercises: 0 +--- + + + +### Sampling and distributions + +- In statistics, experiments are viewed as **sampling** processes. Sampling means "drawing events" from a population of interest. If the events are drawn randomly and independently, the results are more likely to give a good representation of reality. +- A **probability distribution** assigns probabilities to the possible outcomes (in continuous distributions: to intervals of outcomes). +- **Discrete** distributions give probabilities for discrete outcomes, for example counts. For **continuous** distributions, every value is assigned a **density**, and probabilities can be calculated by integrating over the density in the interval of interest. + +### Common distributions of biological data + +- The **binomial** distribution models the number of successes in a series of independent trials. +- The **Poisson** distribution is often used for modeling count data. It assumes that counts were taken over a fixed volume or period of time. It is an approximation of the binomial distribution. +- The **negative binomial** distribution is useful for overdispersed count data, that is, count data that are poorly fit by a Poisson distribution because they are not collected over fixed domains, or are subject to some additional variation that isn't captured by the Poisson distribution. +- The **Gaussian** or **normal** distribution models repeated measurements of the same thing, and is a continuous distribution.
Many distributions (also discrete ones) can be approximated by a Gaussian distribution for sufficiently large sample sizes. + +Confused by what to approximate with what? Here is an overview: + + + + + +### What distribution models your data best? + +1. **Fit your data** to a distribution that you consider plausible. The fit gives you the best parameters for this distribution given your data. + +2. **Visually compare** the theory (i.e. the fitted distribution) to your data points. A good fit doesn’t show systematic deviations of the data points from theory. + +3. Do the same with other plausible distributions. + +4. Decide which of the fits looks best to you. + +Don't be discouraged if step 4 isn't always obvious! It just takes a little practice. The following exercises are designed to build some intuition on that. diff --git a/14-exercises.md b/14-exercises.md new file mode 100644 index 0000000..45af3b0 --- /dev/null +++ b/14-exercises.md @@ -0,0 +1,78 @@ +--- +title: 'Exercises' +teaching: 0 +exercises: 90 +--- + +## Exercises + +Here are some exercises for you to try out in R, to practice your knowledge of probability distributions, and to construct and interpret meaningful visualizations. + +Please feel encouraged to work in RStudio on your own computer for this! This way, you will have an installation and some scripts ready once you decide to work on your own data. + +To install all packages necessary for completing the exercises and running the examples in this tutorial, run the following command on your computer: + + +```r +source("https://www.huber.embl.de/users/kaspar/biostat_2021/install_packages_biostat.R") +``` + + +### The `discoveries` data + +Consider the `discoveries` data. This data set is contained in base R and contains the number of "great inventions" for a number of years. These are clearly count data. + + + +1. Look up the example data using the R `help` function. +2.
Compare the fit to a Poisson with a fit to a negative binomial (= Gamma-Poisson) distribution. +3. Which of the two distributions do you think describes the data better? And what could be a reason for that (remember what the data describe)? +4. What could be problematic about fitting a Poisson or negative binomial distribution to these data? Which assumption could be violated? + + +```r +discov <- discoveries[1:100] +library(vcd) +``` + +```{.output} +Loading required package: grid +``` + + +### ELISA example + +This example is modified from [chapter 1](https://www.huber.embl.de/msmb/Chap-Generative.html) in MSMB by Susan Holmes and Wolfgang Huber. + +When testing certain pharmaceutical compounds, it is important to detect proteins that provoke an allergic reaction. The molecular sites that are responsible for such reactions are called epitopes. + +ELISA assays are used to detect specific epitopes at different positions along a protein. The protein is tested at 100 different positions, which are assumed to be independent. For each patient, each position can either be a hit or not. +We’re going to study the data for 50 patients tallied at each of the 100 positions. + +Start with the following lines: + +```r +epitope_data <- data.frame(position=1:100, + count=c(2, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 2, 2, 7, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0)) +``` +In this data frame, the number of hits among the 50 patients is counted at each position. + + +1. Plot the data in a meaningful way. +2. Fit the counts to a Poisson distribution. What is the fitted rate parameter? +3. Does this look like a good fit? +4. ELISAs can give false positives at a certain rate. A false positive means declaring a hit – we think we have an epitope – when there is none.
Assume that most of the positions actually don't contain an epitope. In this case, you can consider the fitted Poisson model as the "background noise" model, with lambda giving the expected number of false positives. Given this model, what are the chances of seeing a value as large as 7, if no epitope is present? + + +### Mice data + +Load the mice data: + +```r +mice_pheno <- read.csv2(file= url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"), sep=",") +mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight) +``` + +1. Plot a histogram, and a normal-QQ plot for female control mice. Would you say the weights follow a normal distribution? +2. For comparing the control and high-fat group (for female mice), show the ECDF of both in the same plot. + diff --git a/config.yaml b/config.yaml deleted file mode 100644 index 8ea4b1e..0000000 --- a/config.yaml +++ /dev/null @@ -1,81 +0,0 @@ -#------------------------------------------------------------ -# Values for this lesson. -#------------------------------------------------------------ - -# Which carpentry is this (swc, dc, lc, or cp)? -# swc: Software Carpentry -# dc: Data Carpentry -# lc: Library Carpentry -# cp: Carpentries (to use for instructor training for instance) -# incubator: The Carpentries Incubator -carpentry: 'incubator' - -# Overall title for pages. 
-title: 'Biostatistical Basics' - -# Date the lesson was created (YYYY-MM-DD, this is empty by default) -created: 2022-12-21 - -# Comma-separated list of keywords for the lesson -keywords: 'statistics, biostatistics, lesson, The Carpentries' - -# Life cycle stage of the lesson -# possible values: pre-alpha, alpha, beta, stable -life_cycle: 'pre-alpha' - -# License of the lesson -license: 'CC-BY 4.0' - -# Link to the source repository for this lesson -source: 'https://github.com/carpentries/workbench-template-rmd' - -# Default branch of your lesson -branch: 'main' - -# Who to contact if there are any issues -contact: 'sarah.kaspar@embl.de' - -# Navigation ------------------------------------------------ -# -# Use the following menu items to specify the order of -# individual pages in each dropdown section. Leave blank to -# include all pages in the folder. -# -# Example ------------- -# -# episodes: -# - introduction.md -# - first-steps.md -# -# learners: -# - setup.md -# -# instructors: -# - instructor-notes.md -# -# profiles: -# - one-learner.md -# - another-learner.md - -# Order of episodes in your lesson -episodes: -- 01-sampling.Rmd -- 02-distributions.Rmd -- 03-binomial.Rmd -- 04-distributions-R.Rmd - -# Information for Learners -learners: - -# Information for Instructors -instructors: - -# Learner Profiles -profiles: - -# Customisation --------------------------------------------- -# -# This space below is where custom yaml items (e.g. 
pinning -# sandpaper and varnish versions) should live - - diff --git a/fig/05-Poisson-rendered-unnamed-chunk-1-1.png b/fig/05-Poisson-rendered-unnamed-chunk-1-1.png new file mode 100644 index 0000000..ec3705b Binary files /dev/null and b/fig/05-Poisson-rendered-unnamed-chunk-1-1.png differ diff --git a/fig/10-histogram-rendered-goodfit-1.png b/fig/10-histogram-rendered-goodfit-1.png new file mode 100644 index 0000000..5b3f759 Binary files /dev/null and b/fig/10-histogram-rendered-goodfit-1.png differ diff --git a/fig/10-histogram-rendered-histogram-expected-freq-1.png b/fig/10-histogram-rendered-histogram-expected-freq-1.png new file mode 100644 index 0000000..5903e0d Binary files /dev/null and b/fig/10-histogram-rendered-histogram-expected-freq-1.png differ diff --git a/fig/10-histogram-rendered-histograms-1.png b/fig/10-histogram-rendered-histograms-1.png new file mode 100644 index 0000000..fb4fb55 Binary files /dev/null and b/fig/10-histogram-rendered-histograms-1.png differ diff --git a/fig/11-cdf-rendered-ecdf-cdf-plot-1.png b/fig/11-cdf-rendered-ecdf-cdf-plot-1.png new file mode 100644 index 0000000..200675c Binary files /dev/null and b/fig/11-cdf-rendered-ecdf-cdf-plot-1.png differ diff --git a/fig/11-cdf-rendered-ecdf-plot-1.png b/fig/11-cdf-rendered-ecdf-plot-1.png new file mode 100644 index 0000000..504d9ad Binary files /dev/null and b/fig/11-cdf-rendered-ecdf-plot-1.png differ diff --git a/fig/12-qq-rendered-fitdistrplus-pois-solution-1.png b/fig/12-qq-rendered-fitdistrplus-pois-solution-1.png new file mode 100644 index 0000000..edcd285 Binary files /dev/null and b/fig/12-qq-rendered-fitdistrplus-pois-solution-1.png differ diff --git a/fig/12-qq-rendered-qq-plot-normal-1.png b/fig/12-qq-rendered-qq-plot-normal-1.png new file mode 100644 index 0000000..87078b5 Binary files /dev/null and b/fig/12-qq-rendered-qq-plot-normal-1.png differ diff --git a/fig/12-qq-rendered-qq-plot-pois-1.png b/fig/12-qq-rendered-qq-plot-pois-1.png new file mode 100644 index 
0000000..13fe539 Binary files /dev/null and b/fig/12-qq-rendered-qq-plot-pois-1.png differ diff --git a/fig/12-qq-rendered-unnamed-chunk-2-1.png b/fig/12-qq-rendered-unnamed-chunk-2-1.png new file mode 100644 index 0000000..272fc02 Binary files /dev/null and b/fig/12-qq-rendered-unnamed-chunk-2-1.png differ diff --git a/fig/binomial.png b/fig/binomial.png new file mode 100644 index 0000000..bb23874 Binary files /dev/null and b/fig/binomial.png differ diff --git a/fig/distributions-1.png b/fig/distributions-1.png new file mode 100644 index 0000000..4be376e Binary files /dev/null and b/fig/distributions-1.png differ diff --git a/fig/frog_sizes.png b/fig/frog_sizes.png new file mode 100644 index 0000000..abe0204 Binary files /dev/null and b/fig/frog_sizes.png differ diff --git a/fig/gamma-poisson.png b/fig/gamma-poisson.png new file mode 100644 index 0000000..d157c0b Binary files /dev/null and b/fig/gamma-poisson.png differ diff --git a/fig/histogram-cumulative-qq.png b/fig/histogram-cumulative-qq.png new file mode 100644 index 0000000..e7548f1 Binary files /dev/null and b/fig/histogram-cumulative-qq.png differ diff --git a/fig/histogram-cumulative.png b/fig/histogram-cumulative.png new file mode 100644 index 0000000..29419a0 Binary files /dev/null and b/fig/histogram-cumulative.png differ diff --git a/fig/histogram.png b/fig/histogram.png new file mode 100644 index 0000000..9892932 Binary files /dev/null and b/fig/histogram.png differ diff --git a/fig/many_lambdas_plot.png b/fig/many_lambdas_plot.png new file mode 100644 index 0000000..a911133 Binary files /dev/null and b/fig/many_lambdas_plot.png differ diff --git a/fig/poisson-derivation.png b/fig/poisson-derivation.png new file mode 100644 index 0000000..03c4c54 Binary files /dev/null and b/fig/poisson-derivation.png differ diff --git a/fig/poisson-lake.png b/fig/poisson-lake.png new file mode 100644 index 0000000..0a0bdaa Binary files /dev/null and b/fig/poisson-lake.png differ diff --git 
a/fig/sampling-frogs-2.png b/fig/sampling-frogs-2.png new file mode 100644 index 0000000..2c5f03f Binary files /dev/null and b/fig/sampling-frogs-2.png differ diff --git a/fig/sampling-frogs-3.png b/fig/sampling-frogs-3.png new file mode 100644 index 0000000..67e9347 Binary files /dev/null and b/fig/sampling-frogs-3.png differ diff --git a/md5sum.txt b/md5sum.txt index 269ee1f..528511b 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -1,14 +1,25 @@ "file" "checksum" "built" "date" -"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2022-12-06" -"LICENSE.md" "afaf427b4223952624dcb6d8ded53ec0" "site/built/LICENSE.md" "2022-12-06" -"config.yaml" "675de058e9a7c31327e9234be83d03e7" "site/built/config.yaml" "2022-12-21" -"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2022-12-06" -"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-12-06" -"episodes/01-sampling.Rmd" "170524857bf66af6c8e58f6a9c84f637" "site/built/01-sampling.md" "2022-12-21" -"episodes/02-distributions.Rmd" "76436c6df768a4a0a8a63b6643d6b1ea" "site/built/02-distributions.md" "2022-12-21" -"episodes/03-binomial.Rmd" "8db481bbe10d7379d3375526bf7472a6" "site/built/03-binomial.md" "2022-12-21" -"episodes/04-distributions-R.Rmd" "06213db23ac19941b60db58449ee6650" "site/built/04-distributions-R.md" "2022-12-21" -"instructors/instructor-notes.md" "60b93493cf1da06dfd63255d73854461" "site/built/instructor-notes.md" "2022-12-06" -"learners/setup.md" "f81f62ee0b5d70f52b205ab65a5d0bbf" "site/built/setup.md" "2022-12-06" -"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2022-12-06" -"renv/profiles/lesson-requirements/renv.lock" "54217c1b4af46a849129dbcf4fde5d5a" "site/built/renv.lock" "2022-12-06" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-02-07" +"LICENSE.md" "afaf427b4223952624dcb6d8ded53ec0" "site/built/LICENSE.md" "2023-02-07" 
+"config.yaml" "b3a6558de58d1043930bfcbabfffd264" "site/built/config.yaml" "2023-02-07" +"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2023-02-07" +"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2023-02-07" +"episodes/01-sampling.Rmd" "342c523ceab962b2295c18ad1e6bb705" "site/built/01-sampling.md" "2023-02-07" +"episodes/02-distributions.Rmd" "90ea7c23a1000b658a8815e64fa35e1b" "site/built/02-distributions.md" "2023-02-07" +"episodes/03-binomial.Rmd" "170e04aa363f9420a164a0a8f14f425c" "site/built/03-binomial.md" "2023-02-07" +"episodes/04-distributions-R.Rmd" "385e313a59c56717e5c6c6c72a33f540" "site/built/04-distributions-R.md" "2023-02-07" +"episodes/05-Poisson.Rmd" "b889478a83b88a6b15a435b7d3709dff" "site/built/05-Poisson.md" "2023-02-07" +"episodes/06-simulations.Rmd" "561cfa655084956def3b0d48ebcd0f56" "site/built/06-simulations.md" "2023-02-07" +"episodes/07-gamma-poisson.Rmd" "dbe19407e4ccb7a2d38ebef3078b441a" "site/built/07-gamma-poisson.md" "2023-02-07" +"episodes/08-gaussian.Rmd" "02ac4ac25481f707a7d2c0f203bc7e3e" "site/built/08-gaussian.md" "2023-02-07" +"episodes/09-visualizing-distributions.Rmd" "240be2dbde9016f96c0b40d92bfd6382" "site/built/09-visualizing-distributions.md" "2023-02-07" +"episodes/10-histogram.Rmd" "da1430dcea994d95b00733bc3c08041d" "site/built/10-histogram.md" "2023-02-07" +"episodes/11-cdf.Rmd" "953c1c0de80726e2b8d00e84c9f864a5" "site/built/11-cdf.md" "2023-02-07" +"episodes/12-qq.Rmd" "08ccd8b1aed4c6e180d57446bc3b402a" "site/built/12-qq.md" "2023-02-07" +"episodes/13-summary.Rmd" "f10ff2bf92d538787244ba8aa01fc0e3" "site/built/13-summary.md" "2023-02-07" +"episodes/14-exercises.Rmd" "024dfc610cd4b29cb0f7cc25ec5b88ca" "site/built/14-exercises.md" "2023-02-07" +"instructors/instructor-notes.md" "60b93493cf1da06dfd63255d73854461" "site/built/instructor-notes.md" "2023-02-07" +"learners/setup.md" "f1875f8d8b1f1f5adfd264b68e2f388f" "site/built/setup.md" "2023-02-07" +"learners/summary.md" 
"0019b0eb4b437228e1c325ede327630d" "site/built/summary.md" "2023-02-07" +"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-02-07" +"renv/profiles/lesson-requirements/renv.lock" "84e12790ec7661570f8a8423890ac8fb" "site/built/renv.lock" "2023-02-07" diff --git a/renv.lock b/renv.lock deleted file mode 100644 index cadc517..0000000 --- a/renv.lock +++ /dev/null @@ -1,318 +0,0 @@ -{ - "R": { - "Version": "4.2.2", - "Repositories": [ - { - "Name": "carpentries", - "URL": "https://carpentries.r-universe.dev" - }, - { - "Name": "carpentries_archive", - "URL": "https://carpentries.github.io/drat" - }, - { - "Name": "CRAN", - "URL": "https://cran.rstudio.com" - } - ] - }, - "Packages": { - "R6": { - "Package": "R6", - "Version": "2.5.1", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "470851b6d5d0ac559e9d01bb352b4021", - "Requirements": [] - }, - "base64enc": { - "Package": "base64enc", - "Version": "0.1-3", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "543776ae6848fde2f48ff3816d0628bc", - "Requirements": [] - }, - "bslib": { - "Package": "bslib", - "Version": "0.4.1", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "89a0cd0c45161e3bd1c1e74a2d65e516", - "Requirements": [ - "cachem", - "htmltools", - "jquerylib", - "jsonlite", - "memoise", - "rlang", - "sass" - ] - }, - "cachem": { - "Package": "cachem", - "Version": "1.0.6", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "648c5b3d71e6a37e3043617489a0a0e9", - "Requirements": [ - "fastmap", - "rlang" - ] - }, - "cli": { - "Package": "cli", - "Version": "3.4.1", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "0d297d01734d2bcea40197bd4971a764", - "Requirements": [] - }, - "digest": { - "Package": "digest", - "Version": "0.6.30", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "bf1cd206a5d170d132ef75c7537b9bdb", - "Requirements": [] - }, - "evaluate": { - "Package": "evaluate", - "Version": "0.18", 
- "Source": "Repository", - "Repository": "RSPM", - "Hash": "6b6c0f8467cd4ce0b500cabbc1bd1763", - "Requirements": [] - }, - "fastmap": { - "Package": "fastmap", - "Version": "1.1.0", - "Source": "Repository", - "Repository": "CRAN", - "Hash": "77bd60a6157420d4ffa93b27cf6a58b8", - "Requirements": [] - }, - "fs": { - "Package": "fs", - "Version": "1.5.2", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "7c89603d81793f0d5486d91ab1fc6f1d", - "Requirements": [] - }, - "glue": { - "Package": "glue", - "Version": "1.6.2", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "4f2596dfb05dac67b9dc558e5c6fba2e", - "Requirements": [] - }, - "highr": { - "Package": "highr", - "Version": "0.9", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "8eb36c8125038e648e5d111c0d7b2ed4", - "Requirements": [ - "xfun" - ] - }, - "htmltools": { - "Package": "htmltools", - "Version": "0.5.3", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "6496090a9e00f8354b811d1a2d47b566", - "Requirements": [ - "base64enc", - "digest", - "fastmap", - "rlang" - ] - }, - "jquerylib": { - "Package": "jquerylib", - "Version": "0.1.4", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "5aab57a3bd297eee1c1d862735972182", - "Requirements": [ - "htmltools" - ] - }, - "jsonlite": { - "Package": "jsonlite", - "Version": "1.8.3", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "8b1bd0be62956f2a6b91ce84fac79a45", - "Requirements": [] - }, - "knitr": { - "Package": "knitr", - "Version": "1.41", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "6d4971f3610e75220534a1befe81bc92", - "Requirements": [ - "evaluate", - "highr", - "stringr", - "xfun", - "yaml" - ] - }, - "lifecycle": { - "Package": "lifecycle", - "Version": "1.0.3", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "001cecbeac1cff9301bdc3775ee46a86", - "Requirements": [ - "cli", - "glue", - "rlang" - ] - }, - "magrittr": { - "Package": "magrittr", - "Version": "2.0.3", - 
"Source": "Repository", - "Repository": "RSPM", - "Hash": "7ce2733a9826b3aeb1775d56fd305472", - "Requirements": [] - }, - "memoise": { - "Package": "memoise", - "Version": "2.0.1", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "e2817ccf4a065c5d9d7f2cfbe7c1d78c", - "Requirements": [ - "cachem", - "rlang" - ] - }, - "rappdirs": { - "Package": "rappdirs", - "Version": "0.3.3", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "5e3c5dc0b071b21fa128676560dbe94d", - "Requirements": [] - }, - "renv": { - "Package": "renv", - "Version": "0.16.0", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "c9e8442ab69bc21c9697ecf856c1e6c7", - "Requirements": [] - }, - "rlang": { - "Package": "rlang", - "Version": "1.0.6", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "4ed1f8336c8d52c3e750adcdc57228a7", - "Requirements": [] - }, - "rmarkdown": { - "Package": "rmarkdown", - "Version": "2.18", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "8063c4e953cefb651e8cd58c82c82d2d", - "Requirements": [ - "bslib", - "evaluate", - "htmltools", - "jquerylib", - "jsonlite", - "knitr", - "stringr", - "tinytex", - "xfun", - "yaml" - ] - }, - "sass": { - "Package": "sass", - "Version": "0.4.4", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "c76cbac7ca04ce82d8c38e29729987a3", - "Requirements": [ - "R6", - "fs", - "htmltools", - "rappdirs", - "rlang" - ] - }, - "stringi": { - "Package": "stringi", - "Version": "1.7.8", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "a68b980681bcbc84c7a67003fa796bfb", - "Requirements": [] - }, - "stringr": { - "Package": "stringr", - "Version": "1.5.0", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "671a4d384ae9d32fc47a14e98bfa3dc8", - "Requirements": [ - "cli", - "glue", - "lifecycle", - "magrittr", - "rlang", - "stringi", - "vctrs" - ] - }, - "tinytex": { - "Package": "tinytex", - "Version": "0.42", - "Source": "Repository", - "Repository": "RSPM", - "Hash": 
"7629c6c1540835d5248e6e7df265fa74", - "Requirements": [ - "xfun" - ] - }, - "vctrs": { - "Package": "vctrs", - "Version": "0.5.1", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "970324f6572b4fd81db507b5d4062cb0", - "Requirements": [ - "cli", - "glue", - "lifecycle", - "rlang" - ] - }, - "xfun": { - "Package": "xfun", - "Version": "0.35", - "Source": "Repository", - "Repository": "RSPM", - "Hash": "f576593107bdf9aa7db48ef75a8c05fb", - "Requirements": [] - }, - "yaml": { - "Package": "yaml", - "Version": "2.3.6", - "Source": "Repository", - "Repository": "CRAN", - "Hash": "9b570515751dcbae610f29885e025b41", - "Requirements": [] - } - } -} diff --git a/setup.md b/setup.md index 7272a33..ed2d128 100644 --- a/setup.md +++ b/setup.md @@ -2,48 +2,39 @@ title: Setup --- -Setup instructions live in this document. Please specify the tools and the data -sets the Learner needs to have installed. -## Data Sets +## Summary -Download the [data zip file](data/data.zip) and unzip it to your Desktop +At the end of this course, you'll able to: -## Software Setup +- describe what sampling and probability distributions are +- list some common distributions of biological data +- visualise the distribution of your data in R +- pick a suitable distribution to model your data with -::::::::::::::::::::::::::::::::::::::: discussion +## Time -### Details +The episodes can be taught in about 2 hours. Your individual reading time may be different. -Setup for different systems can be presented in dropdown menus via a `solution` -tag. They will join to this discussion block, so you can give a general overview -of the software used in this lesson here and fill out the individual operating -systems (and potentially add more, e.g. online setup) in the solutions blocks. +For completing the exercises (episode 13), we typically plan 1.5 hours. 
-::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::: solution -### Windows +## Prerequisites -Use PuTTY +Before starting this course, we recommend you complete [a first tutorial on data handling and visualisation](https://www.ebi.ac.uk/training/online/courses/biostatistics-introduction/data-handling-and-visualisation/) or have basic knowledge of R and the tidyverse. -::::::::::::::::::::::::: +## Setup -:::::::::::::::: solution +You need R and RStudio running on your computer, as we will not fix installations during the course. -### MacOS +Links for installation: -Use Terminal.app +- R (install this first): https://cloud.r-project.org/ +- RStudio: https://www.rstudio.com/products/rstudio/download/ -::::::::::::::::::::::::: +If you have an EMBL account, an alternative to installation can be using [rstudio.embl.de](rstudio.embl.de) - please check that you can log in, in case you want to use this option. +To install all packages necessary for completing the exercises and running the demonstrations, run the following command from the console in RStudio: -:::::::::::::::: solution - -### Linux - -Use Terminal - -::::::::::::::::::::::::: - +source("https://www.huber.embl.de/users/kaspar/biostat_2021/install_packages_biostat.R") \ No newline at end of file diff --git a/summary.md b/summary.md new file mode 100644 index 0000000..19d3c04 --- /dev/null +++ b/summary.md @@ -0,0 +1,13 @@ +--- +title: Summary +--- + +## Summary + +At the end of this course, you'll be able to: + +- describe what sampling and probability distributions are +- list some common distributions of biological data +- visualise the distribution of your data in R +- pick a suitable distribution to model your data with +