search.json
[
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#libraries-and-functions",
"href": "cs1/cs1-practical-one_sample_t_test.html#libraries-and-functions",
"title": "One-sample t-test",
"section": "Libraries and functions",
"text": "Libraries and functions\n\ntidyverseRPython\n\n\n\n\n\n\n\n\n\nLibraries\nDescription\n\n\n\n\nlibrary(tidyverse)\nA collection of R packages designed for data science\n\n\nlibrary(rstatix)\nConverts base R stats functions to a tidyverse-friendly format. Also contains extra functionality that we’ll use.\n\n\n\n\n\n\n\n\n\n\nFunctions\nDescription\n\n\n\n\nrstatix::t_test()\nPerforms a one-sample t-test, Student’s t-test and Welch’s t-test in later sections.\n\n\nrstatix::shapiro_test()\nPerforms a Shapiro-Wilk test for normality.\n\n\nggplot2::stat_qq()\nPlots a Q-Q plot for comparison with a normal distribution.\n\n\nggplot2::stat_qq_line()\nAdds a comparison line to the Q-Q plot.\n\n\n\n\n\n\n\n\n\n\n\n\nFunctions\nDescription\n\n\n\n\nt.test()\nPerforms a one-sample t-test, Student’s t-test and Welch’s t-test in later sections.\n\n\nqqnorm()\nPlots a Q-Q plot for comparison with a normal distribution.\n\n\nqqline()\nAdds a comparison line to the Q-Q plot.\n\n\nshapiro.test()\nPerforms a Shapiro-Wilk test for normality.\n\n\n\n\n\n\n\n\nLibraries\nDescription\n\n\n\n\nplotnine\nThe Python equivalent of ggplot2.\n\n\npandas\nA Python data analysis and manipulation tool.\n\n\nscipy.stats\nA Python module containing statistical functions.\n\n\n\n\n\n\n\n\n\n\nFunctions\nDescription\n\n\n\n\nscipy.stats.shapiro()\nPerform the Shapiro-Wilk test for normality.\n\n\nscipy.stats.ttest_1samp()\nCalculate the T-test for the mean of ONE group of scores.\n\n\nplotnine.stats.stat_qq()\nPlots a Q-Q plot for comparison with a normal distribution.\n\n\nplotnine.stats.stat_qq_line()\nAdds a comparison line to the Q-Q plot."
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#data-and-hypotheses",
"href": "cs1/cs1-practical-one_sample_t_test.html#data-and-hypotheses",
"title": "One-sample t-test",
"section": "Data and hypotheses",
"text": "Data and hypotheses\nFor example, suppose we measure the body lengths of male guppies (in mm) collected from the Guanapo River in Trinidad. We want to test whether the data support the hypothesis that the mean body length is actually 20 mm. We form the following null and alternative hypotheses:\n\n\\(H_0\\): The mean body length is equal to 20 mm (\\(\\mu =\\) 20).\n\\(H_1\\): The mean body length is not equal to 20 mm (\\(\\mu \\neq\\) 20).\n\nWe will use a one-sample, two-tailed t-test to see if we should reject the null hypothesis or not.\n\nWe use a one-sample test because we only have one sample.\nWe use a two-tailed t-test because we want to know if our data suggest that the true (population) mean is different from 20 mm in either direction, rather than just to see if it is greater than or less than 20 mm (in which case we would use a one-tailed test).\nWe’re using a t-test because we don’t know any better yet and because I’m telling you to. We’ll look at what the precise assumptions/requirements need to be in a moment.\n\nMake sure you have downloaded the data (see: Datasets) and placed it within your working directory.\n\ntidyverseRPython\n\n\nFirst we load the relevant libraries:\n\n# load tidyverse\nlibrary(tidyverse)\n\n# load rstatix, a tidyverse-friendly stats package\nlibrary(rstatix)\n\nWe then read in the data and create a table containing the data.\n\n# import the data\nfishlengthDF <- read_csv(\"../data/CS1-onesample.csv\")\n\nfishlengthDF\n\n# A tibble: 29 × 3\n id river length\n <dbl> <chr> <dbl>\n 1 1 Guanapo 19.1\n 2 2 Guanapo 23.3\n 3 3 Guanapo 18.2\n 4 4 Guanapo 16.4\n 5 5 Guanapo 19.7\n 6 6 Guanapo 16.6\n 7 7 Guanapo 17.5\n 8 8 Guanapo 19.9\n 9 9 Guanapo 19.1\n10 10 Guanapo 18.8\n# … with 19 more rows\n\n\nThe first line reads the data into R and creates an object called a tibble, which is a type of data frame. 
This data frame contains 3 columns: a unique id, river encoding the river and length with the measured guppy length.\n\n\nWe then read in the data and create a vector containing the data.\n\n# import the data\nfishlengthDF <- read.csv(\"../data/CS1-onesample.csv\")\n\n# create a vector containing the data\nfishlength_r <- fishlengthDF$length\n\nThe first line reads the data into R and creates an object called a data frame, containing the same three columns (id, river and length). In most situations, and for most statistical analyses, having our data stored in a data frame is exactly what we’d want. However, for one-sample tests we actually need our data to be stored as a vector. So, the second line extracts the values that are in the length column of our fishlengthDF data frame and creates a simple vector of numbers that we have called fishlength_r. This step is only necessary for one-sample tests; when we look at more complex data sets, we won’t need to do this second step at all.\n\n\nWe then read the data in:\n\n# load the data\nfishlength_py = pd.read_csv('../data/CS1-onesample.csv')\n\n# inspect the data\nfishlength_py.head()\n\n id river length\n0 1 Guanapo 19.1\n1 2 Guanapo 23.3\n2 3 Guanapo 18.2\n3 4 Guanapo 16.4\n4 5 Guanapo 19.7"
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#summarise-and-visualise",
"href": "cs1/cs1-practical-one_sample_t_test.html#summarise-and-visualise",
"title": "One-sample t-test",
"section": "Summarise and visualise",
"text": "Summarise and visualise\nSummarise the data and visualise it:\n\ntidyverseRPython\n\n\n\nsummary(fishlengthDF)\n\n id river length \n Min. : 1 Length:29 Min. :11.2 \n 1st Qu.: 8 Class :character 1st Qu.:17.5 \n Median :15 Mode :character Median :18.8 \n Mean :15 Mean :18.3 \n 3rd Qu.:22 3rd Qu.:19.7 \n Max. :29 Max. :23.3 \n\nfishlengthDF %>% \n ggplot(aes(x = river, y = length)) +\n geom_boxplot()\n\n\n\n\n\n\n\nsummary(fishlength_r)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n 11.2 17.5 18.8 18.3 19.7 23.3 \n\nboxplot(fishlength_r, main = \"Male guppies\", ylab = \"Length (mm)\")\n\n\n\n\n\n\nFirst we have a look at a numerical summary of the data:\n\nfishlength_py.describe()\n\n id length\ncount 29.000000 29.000000\nmean 15.000000 18.296552\nstd 8.514693 2.584636\nmin 1.000000 11.200000\n25% 8.000000 17.500000\n50% 15.000000 18.800000\n75% 22.000000 19.700000\nmax 29.000000 23.300000\n\n\n\n(\n ggplot(fishlength_py,\n aes(x = 'river',\n y = 'length'))\n + geom_boxplot()\n)\n\n\n\n\n\n\n\nThe data do not appear to contain any obvious errors, and whilst both the mean and median are less than 20 (18.3 and 18.8 respectively) it is not absolutely certain that the sample mean is sufficiently different from this value to be “statistically significant”, although we may anticipate such a result."
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#assumptions",
"href": "cs1/cs1-practical-one_sample_t_test.html#assumptions",
"title": "One-sample t-test",
"section": "Assumptions",
"text": "Assumptions\nWhen it comes to one-sample tests, we have two options:\n\nt-test\nWilcoxon signed-rank test\n\nFor us to use a t-test for this analysis (and for the results to be valid) we have to make two assumptions:\n\nThe parent distribution from which the sample is taken is normally distributed (and as such the sample data are normally distributed themselves).\n\n\n\n\n\n\n\nNote\n\n\n\nIt is worth noting though that the t-test is actually pretty robust in situations where the sample data are not normal. For sufficiently large sample sizes (your guess is as good as mine, but conventionally this means about 30 data points), you can use a t-test without worrying about whether the underlying population is normally distributed or not.\n\n\n\nEach data point in the sample is independent of the others. This is in general not something that can be tested for and instead has to be considered from the sampling procedure. For example, taking repeated measurements from the same individual would generate data that are not independent.\n\nThe second point we know nothing about and so we ignore it here (this is an issue that needs to be considered from the experimental design), whereas the first assumption can be checked. There are three ways of checking for normality:\nIn increasing order of rigour, we have\n\nHistogram\nQuantile-quantile plot\nShapiro-Wilk test\n\n\nHistogram of the data\nPlot a histogram of the data, which gives:\n\ntidyverseRPython\n\n\n\nfishlengthDF %>% \n ggplot(aes(x = length)) +\n geom_histogram(bins = 15)\n\n\n\n\n\n\n\nhist(fishlength_r, breaks = 15)\n\n\n\n\n\n\n\n(\n ggplot(fishlength_py, aes(x = \"length\"))\n + geom_histogram(bins = 15)\n)\n\n\n\n\n\n\n\nThe distribution appears to be uni-modal and symmetric, and so it isn’t obviously non-normal. However, there are a lot of distributions that have these simple properties but which aren’t normal, so this isn’t exactly rigorous. Thankfully there are other, more rigorous tests.\nNB. 
By even looking at this distribution to assess the assumption of normality we are already going far beyond what anyone else ever does. Nevertheless, we will continue.\n\n\nQ-Q plot of the data\nQ-Q plot is short for quantile-quantile plot. This diagnostic plot (as it is sometimes called) is a way of comparing two distributions. How Q-Q plots work won’t be explained here, but ask a demonstrator if you really want to know what is going on.\nConstruct a Q-Q plot of the quantiles of the data against the quantiles of a normal distribution:\n\ntidyverseRPython\n\n\n\nfishlengthDF %>% \n ggplot(aes(sample = length)) +\n stat_qq() +\n stat_qq_line()\n\n\n\n\n\n\n\n# plot the Q-Q plot\nqqnorm(fishlength_r)\n\n# and add a comparison line\nqqline(fishlength_r)\n\n\n\n\n\n\n\n(\n ggplot(fishlength_py, aes(sample = \"length\"))\n + stat_qq()\n + stat_qq_line()\n)\n\n\n\n\n\n\n\nWhat is important to know is that if the data were normally distributed then all of the points should lie on (or close to) the diagonal line in this graph.\nIn this case, the points lie quite close to the line for the most part but the sample quantiles (points) from either end of the sample distribution are either smaller (below the line on the left) or larger (above the line on the right) than expected if they were supposed to be normally distributed. This suggests that the sample distribution is a bit more spread out than would be expected if it came from a normal distribution.\nIt is important to recognise that there isn’t a simple unambiguous answer when interpreting these types of graph, in terms of whether the assumption of normality has been well met or not and instead it often boils down to a matter of experience.\nIt is a very rare situation indeed where the assumptions necessary for a test will be met unequivocally and a certain degree of personal interpretation is always needed. 
Here you have to ask yourself whether the data are normal “enough” for you to be confident in the validity of the test.\nBelow are four examples of QQ plots for different types of distributions:\n\n\n\n\n\nThese two graphs relate to 200 data points that have been drawn from a normal distribution. Even here you can see that the points do not all lie perfectly on the diagonal line in the QQ plot, and a certain amount of deviation at the top and bottom of the graph can happen just by chance (if I were to draw a different set of points then the graph would look slightly different).\n\n\n\n\n\nThese two graphs relate to 200 data points that have been drawn from a uniform distribution. Uniform distributions are more condensed than normal distributions, and this is reflected in the QQ plot having a very pronounced S-shaped pattern to it (this is colloquially known as snaking).\n\n\n\n\n\nThese two graphs relate to 200 data points that have been drawn from a t distribution. t distributions are more spread out than normal distributions, and this is reflected in the QQ plot again having a very pronounced S-shaped pattern to it, but this time the snaking is a reflection of that observed for the uniform distribution.\n\n\n\n\n\nThese two graphs relate to 200 data points that have been drawn from an exponential distribution. Exponential distributions are not symmetric and are very skewed compared with normal distributions. The significant right-skew in this distribution is reflected in the QQ plot again having points that curve away above the diagonal line at both ends (a left-skew would have the points being below the line at both ends).\nIn all four cases it is worth noting that the deviations are only at the ends of the plot.\n\n\nShapiro-Wilk test\nThis is one of a number of formal statistical tests that assess whether a given sample of numbers comes from a normal distribution. 
It calculates the probability of getting the sample data if the underlying distribution is in fact normal. It is very easy to carry out in R.\nPerform a Shapiro-Wilk test on the data:\n\ntidyverseRPython\n\n\n\nfishlengthDF %>% \n shapiro_test(length)\n\n# A tibble: 1 × 3\n variable statistic p\n <chr> <dbl> <dbl>\n1 length 0.949 0.176\n\n\n\nvariable indicates the variable that was used to perform the test on\nstatistic gives the calculated W-value (0.9493842)\np gives the calculated p-value (0.1764229)\n\n\n\n\nshapiro.test(fishlength_r)\n\n\n Shapiro-Wilk normality test\n\ndata: fishlength_r\nW = 0.94938, p-value = 0.1764\n\n\n\nThe 1st line gives the name of the test and the 2nd line reminds you what the data set was called\nThe 3rd line contains the two key outputs from the test:\nThe calculated W-value is 0.9494 (we don’t need to know this)\nThe p-value is 0.1764\n\n\n\n\nstats.shapiro(fishlength_py.length)\n\nShapiroResult(statistic=0.9493839740753174, pvalue=0.17642046511173248)\n\n\n\n\n\nAs the p-value is bigger than 0.05 (say), we can say that there is insufficient evidence to reject the null hypothesis that the sample came from a normal distribution.\nIt is important to recognise that the Shapiro-Wilk test is not without limitations. It is rather sensitive to the sample size being considered. 
In general, for small sample sizes, the test is very relaxed about normality (and nearly all data sets are considered normal), whereas for large sample sizes the test can be overly strict, and it can fail to recognise data sets that are very nearly normal indeed.\n\n\nAssumptions overview\n\n\n\n\n\n\nImportant\n\n\n\nIn terms of assessing the assumptions of a test it is always worth considering several methods, both graphical and analytic, and not just relying on a single method.\n\n\nIn the fishlength example, the graphical Q-Q plot analysis was not especially conclusive as there was some suggestion of snaking in the plots, but the Shapiro-Wilk test gave a non-significant p-value (0.1764). Putting these two together, along with the original histogram and the recognition that there were only 29 data points in the data set, I personally would be happy that the assumptions of the t-test were met well enough to trust the result of the t-test, but you may not be…\nIn which case we would consider an alternative test that has less stringent assumptions (but is less powerful): the one-sample Wilcoxon signed-rank test."
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#implement-the-test",
"href": "cs1/cs1-practical-one_sample_t_test.html#implement-the-test",
"title": "One-sample t-test",
"section": "Implement the test",
"text": "Implement the test\nPerform a one-sample, two-tailed t-test:\n\ntidyverseRPython\n\n\n\nfishlengthDF %>% \n t_test(length ~ 1,\n mu = 20,\n alternative = \"two.sided\")\n\nThe t_test() function requires three arguments:\n\nthe formula, here we give it length ~ 1 to indicate it is a one-sample test on length\nthe mu is the mean to be tested under the null hypothesis, here it is 20\nthe alternative argument gives the type of alternative hypothesis and must be one of two.sided, greater or less. We have no prior assumptions on whether the alternative fish length would be greater or less than 20, so we choose two.sided.\n\n\n\n\nt.test(fishlength_r,\n mu = 20,\n alternative = \"two.sided\")\n\n\n One Sample t-test\n\ndata: fishlength_r\nt = -3.5492, df = 28, p-value = 0.001387\nalternative hypothesis: true mean is not equal to 20\n95 percent confidence interval:\n 17.31341 19.27969\nsample estimates:\nmean of x \n 18.29655 \n\n\n\nThe first argument must be a numerical vector of data values.\nThe second argument must be a number and is the mean to be tested under the null hypothesis.\nThe third argument gives the type of alternative hypothesis and must be one of two.sided, greater or less. We have no prior assumptions on whether the alternative fish length would be greater or less than 20, so we choose two.sided.\n\n\n\n\nstats.ttest_1samp(fishlength_py.length,\n popmean = 20,\n alternative = \"two-sided\")\n\n\nThe first argument must be a numerical series of data values.\nThe second argument must be a number and is the mean to be tested under the null hypothesis.\nThe third argument gives the type of alternative hypothesis and must be one of two-sided, greater or less (note the hyphen, rather than the full stop used in R). The alternative argument was only added in SciPy 1.6.0; with older versions of SciPy only a two-sided one-sample t-test is possible."
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#interpreting-the-output-and-report-results",
"href": "cs1/cs1-practical-one_sample_t_test.html#interpreting-the-output-and-report-results",
"title": "One-sample t-test",
"section": "Interpreting the output and report results",
"text": "Interpreting the output and report results\nThis is the output that you should now see in the console window:\n\ntidyverseRPython\n\n\n\n\n# A tibble: 1 × 7\n .y. group1 group2 n statistic df p\n* <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>\n1 length 1 null model 29 -3.55 28 0.00139\n\n\n\nthe statistic column gives us the t-statistic of -3.5492 (we’ll need this for reporting)\nthe df column tells us there are 28 degrees of freedom (again we’ll need this for reporting)\nthe p column gives us the p-value of 0.00139\n\n\n\n\n\n\n One Sample t-test\n\ndata: fishlength_r\nt = -3.5492, df = 28, p-value = 0.001387\nalternative hypothesis: true mean is not equal to 20\n95 percent confidence interval:\n 17.31341 19.27969\nsample estimates:\nmean of x \n 18.29655 \n\n\n\nThe 1st line gives the name of the test and the 2nd line reminds you what the dataset was called\nThe 3rd line contains the three key outputs from the test:\n\nThe calculated t-value is -3.5492 (we’ll need this for reporting)\nThere are 28 degrees of freedom (again we’ll need this for reporting)\nThe p-value is 0.001387.\n\nThe 4th line simply states the alternative hypothesis\nThe 5th and 6th lines give the 95% confidence interval (we don’t need to know this)\nThe 7th, 8th and 9th lines give the sample mean again (18.29655).\n\n\n\n\n\nTtest_1sampResult(statistic=-3.5491839564647205, pvalue=0.0013868577835348002)\n\n\nThe output is very minimal. The 1st number in brackets is the t-value and the 2nd number is the p-value\n\n\n\nThe p-value is what we’re mostly interested in. 
It gives the probability of us getting a sample such as ours if the null hypothesis were actually true.\nSo:\n\na high p-value means that there is a high probability of observing a sample such as ours and the null hypothesis is probably true, whereas\na low p-value means that there is a low probability of observing a sample such as ours and the null hypothesis is probably not true.\n\nIt is important to realise that the p-value is just an indication and there is no absolute certainty here in this interpretation.\nPeople, however, like more definite answers and so we pick an artificial probability threshold (called a significance level) in order to be able to say something more decisive. The standard significance level is 0.05 and since our p-value is smaller than this we choose to say that “it is very unlikely that we would have this particular sample if the null hypothesis were true”. Consequently, we can reject our null hypothesis and state that:\n\nA one-sample t-test indicated that the mean body length of male guppies (\\(\\bar{x}\\) = 18.29 mm) differs significantly from 20 mm (t = -3.55, df = 28, p = 0.0014).\n\nThe above sentence is an adequate concluding statement for this test and is what we would write in any paper or report. Note that we have included (in brackets) information on the actual mean value of our group (\\(\\bar{x}\\) = 18.29 mm), the test statistic (t = -3.55), the degrees of freedom (df = 28), and the p-value (p = 0.0014). In some journals you are only required to report whether the p-value is less than the critical value (e.g. p < 0.05) but I would always recommend reporting the actual p-value obtained.\nPlease feel free to ask a demonstrator if any aspect of this section is unclear as this does form the core of classical hypothesis testing and the logic here applies to all of the rest of the tests."
},
{
"objectID": "cs1/cs1-practical-one_sample_t_test.html#exercise-gastric-juices",
"href": "cs1/cs1-practical-one_sample_t_test.html#exercise-gastric-juices",
"title": "One-sample t-test",
"section": "Exercise: gastric juices",
"text": "Exercise: gastric juices\n\nThe following data are the dissolving times (in seconds) of a drug in agitated gastric juice:\n42.7, 43.4, 44.6, 45.1, 45.6, 45.9, 46.8, 47.6\nDo these results provide any evidence to suggest that dissolving time for this drug is different from 45 seconds?\n\nCreate a tidy data frame and save it in .csv format\nWrite down the null and alternative hypotheses.\nSummarise and visualise the data and perform an appropriate one-sample t-test.\n\nWhat can you say about the dissolving time? (what sentence would you use to report this)\n\nCheck the assumptions for the test.\n\nWas the test valid?\n\n\n\n\nAnswer\n\n\nHypotheses\n\\(H_0\\) : mean \\(=\\) 45s\n\\(H_1\\) : mean \\(\\neq\\) 45s\n\n\nData, summarise & visualise\nWe can create a data frame in Excel and save it as a .csv file, for example as CS1-gastric_juices.csv. It contains two columns, an id column and a dissolving_time column with the measured values.\n\ntidyverseRPython\n\n\n\n# load the data\ndissolving <- read_csv(\"../data/CS1-gastric_juices.csv\")\n\n# have a look at the data\ndissolving\n\n# A tibble: 8 × 2\n id dissolving_time\n <dbl> <dbl>\n1 1 42.7\n2 2 43.4\n3 3 44.6\n4 4 45.1\n5 5 45.6\n6 6 45.9\n7 7 46.8\n8 8 47.6\n\n# summarise the data\nsummary(dissolving)\n\n id dissolving_time\n Min. :1.00 Min. :42.70 \n 1st Qu.:2.75 1st Qu.:44.30 \n Median :4.50 Median :45.35 \n Mean :4.50 Mean :45.21 \n 3rd Qu.:6.25 3rd Qu.:46.12 \n Max. :8.00 Max. 
:47.60 \n\n\nWe can look at the histogram and box plot of the data:\n\n# create a histogram\ndissolving %>% \n ggplot(aes(x = dissolving_time)) +\n geom_histogram(bins = 4)\n\n\n\n# create a boxplot\ndissolving %>% \n ggplot(aes(y = dissolving_time)) +\n geom_boxplot()\n\n\n\n\n\n\n\n# load the data\ndissolving_r <- read.csv(\"../data/CS1-gastric_juices.csv\")\n\n# have a look at the data\ndissolving_r\n\n id dissolving_time\n1 1 42.7\n2 2 43.4\n3 3 44.6\n4 4 45.1\n5 5 45.6\n6 6 45.9\n7 7 46.8\n8 8 47.6\n\n# summarise the data\nsummary(dissolving_r)\n\n id dissolving_time\n Min. :1.00 Min. :42.70 \n 1st Qu.:2.75 1st Qu.:44.30 \n Median :4.50 Median :45.35 \n Mean :4.50 Mean :45.21 \n 3rd Qu.:6.25 3rd Qu.:46.12 \n Max. :8.00 Max. :47.60 \n\n\n\nhist(dissolving_r$dissolving_time)\n\n\n\nboxplot(dissolving_r$dissolving_time)\n\n\n\n\n\n\n\n# load the data\ndissolving_py = pd.read_csv(\"../data/CS1-gastric_juices.csv\")\n\n# have a look at the data\ndissolving_py.head()\n\n# summarise the data\n\n id dissolving_time\n0 1 42.7\n1 2 43.4\n2 3 44.6\n3 4 45.1\n4 5 45.6\n\ndissolving_py.describe()\n\n id dissolving_time\ncount 8.00000 8.000000\nmean 4.50000 45.212500\nstd 2.44949 1.640068\nmin 1.00000 42.700000\n25% 2.75000 44.300000\n50% 4.50000 45.350000\n75% 6.25000 46.125000\nmax 8.00000 47.600000\n\n\nWe can look at the histogram and box plot of the data:\n\n# create a histogram\n(\n ggplot(dissolving_py,\n aes(x = \"dissolving_time\"))\n + geom_histogram(bins = 4)\n)\n\n\n\n\n\n# create a box plot\n(\n ggplot(dissolving_py,\n aes(x = 1, y = \"dissolving_time\"))\n + geom_boxplot()\n)\n\n\n\n\nPython (or plotnine in particular) gets a bit cranky if you try to create a geom_boxplot but do not define the x aesthetic. Hence us putting it as 1. The value is meaningless, however.\n\n\n\nThere are only 8 data points, so the histogram is rather uninformative. Thankfully the box plot is a bit more useful here. 
We can see:\n\nThere don’t appear to be any major errors in data entry and there aren’t any huge outliers\nThe median value in the box-plot (the thick black line) is pretty close to 45 and so I wouldn’t be surprised if the mean of the data isn’t significantly different from 45. We can confirm that by looking at the mean and median values that we calculated using the summary command from earlier.\nThe data appear to be symmetric, and so whilst we can’t tell if they’re normal they’re at least not massively skewed.\n\n\nAssumptions\nNormality:\n\ntidyverseRPython\n\n\n\n# perform Shapiro-Wilk test\ndissolving %>% \n shapiro_test(dissolving_time)\n\n# A tibble: 1 × 3\n variable statistic p\n <chr> <dbl> <dbl>\n1 dissolving_time 0.980 0.964\n\n\n\n# create a Q-Q plot\ndissolving %>% \n ggplot(aes(sample = dissolving_time)) +\n stat_qq() +\n stat_qq_line(colour = \"red\")\n\n\n\n\n\n\n\nshapiro.test(dissolving_r$dissolving_time)\n\n\n Shapiro-Wilk normality test\n\ndata: dissolving_r$dissolving_time\nW = 0.98023, p-value = 0.9641\n\n\n\nqqnorm(dissolving_r$dissolving_time)\nqqline(dissolving_r$dissolving_time)\n\n\n\n\n\n\n\n# Perform Shapiro-Wilk test to check normality\nstats.shapiro(dissolving_py.dissolving_time)\n\nShapiroResult(statistic=0.9802345037460327, pvalue=0.9640554785728455)\n\n\n\n# Create a Q-Q plot\n(\n ggplot(dissolving_py,\n aes(sample = \"dissolving_time\"))\n + stat_qq()\n + stat_qq_line()\n)\n\n\n\n\n\n\n\n\nThe Shapiro test has a p-value of 0.964 which (given that it is bigger than 0.05) suggests that the data are normal enough.\nThe Q-Q plot isn’t perfect, with some deviation of the points away from the line, but since the points aren’t accelerating away from the line and we only have 8 points, we can claim, with some slight reservations, that the assumption of normality appears to be adequately well met.\n\nOverall, we are somewhat confident that the assumption of normality is well-enough met for the t-test to be an appropriate method for 
analysing the data. Note the ridiculous number of caveats here and the slightly political/slippery language I’m using. This is intentional and reflects the ambiguous nature of assumption checking. This is an important approach to doing statistics that you need to embrace.\nIn reality, if I found myself in this situation I would also try doing a non-parametric test on the data (Wilcoxon signed-rank test) and see whether I get the same conclusion about whether the median dissolving time differs from 45s. Technically, you don’t know about the Wilcoxon test yet as you haven’t done that section of the materials. Anyway, if I get the same conclusion then my confidence in the result of the test goes up considerably; it doesn’t matter how well an assumption has been met, I get the same result. If on the other hand I get a completely different conclusion from carrying out the non-parametric test then all bets are off; I now have very little confidence in my test result as I don’t know which one to believe (in the case that the assumptions of the test are a bit unclear). In this example a Wilcoxon test also gives us a non-significant result and so all is good.\n\n\nImplement test\n\ntidyverseRPython\n\n\n\n# perform one-sample t-test\ndissolving %>% \n t_test(dissolving_time ~ 1,\n mu = 45,\n alternative = \"two.sided\")\n\n# A tibble: 1 × 7\n .y. 
group1 group2 n statistic df p\n* <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>\n1 dissolving_time 1 null model 8 0.366 7 0.725\n\n\n\n\n\nt.test(dissolving_r$dissolving_time,\n mu = 45,\n alternative = \"two.sided\")\n\n\n One Sample t-test\n\ndata: dissolving_r$dissolving_time\nt = 0.36647, df = 7, p-value = 0.7248\nalternative hypothesis: true mean is not equal to 45\n95 percent confidence interval:\n 43.84137 46.58363\nsample estimates:\nmean of x \n 45.2125 \n\n\n\n\n\nstats.ttest_1samp(dissolving_py.dissolving_time,\n popmean = 45,\n alternative = \"two-sided\")\n\nTtest_1sampResult(statistic=0.36647318560088843, pvalue=0.7248382429835611)\n\n\n\n\n\n\nA one-sample t-test indicated that the mean dissolving time of the drug is not significantly different from 45s (t = 0.366, df = 7, p = 0.725).\n\n\n\nAnd that is that."
},
{
"objectID": "index.html",
"href": "index.html",
"title": "Welcome to Core statistics",
"section": "",
"text": "Welcome to Core statistics!\nThese sessions are intended to enable you to perform core data analysis techniques appropriately and confidently using R.\nThey are not a “how to mindlessly use a stats program” course!"
},
{
"objectID": "index.html#core-aims",
"href": "index.html#core-aims",
"title": "Welcome to Core statistics",
"section": "Core aims",
"text": "Core aims\nThere are several things that we try to achieve during this course.\n\n\n\n\n\n\nCourse aims\n\n\n\nTo know what to do when presented with an arbitrary data set e.g.\n\nKnow what data analysis techniques are available\nKnow which ones are allowable\nBe able to carry these out and understand the results"
},
{
"objectID": "index.html#core-topics",
"href": "index.html#core-topics",
"title": "Welcome to Core statistics",
"section": "Core topics",
"text": "Core topics\n\nSimple hypothesis testing\nCategorical predictor variables\nContinuous predictors\nTwo predictor variables\nMultiple predictor variables\nPower analysis"
},
{
"objectID": "index.html#practicals",
"href": "index.html#practicals",
"title": "Welcome to Core statistics",
"section": "Practicals",
"text": "Practicals\nEach practical document is divided up into various sections. In each section there will be some explanatory text which should help you to understand what is going on and what you’re trying to achieve. There may be a list of commands relevant to that section which will be displayed in boxes like this:\n\n\n\n\n\n\nConditional operators\n\n\n\nTo set filtering conditions, use the following relational operators:\n\n> is greater than\n>= is greater than or equal to\n< is less than\n<= is less than or equal to\n== is equal to\n!= is different from\n%in% is contained in\n\nTo combine conditions, use the following logical operators:\n\n& AND\n| OR"
},
{
"objectID": "index.html#index-datasets",
"href": "index.html#index-datasets",
"title": "Welcome to Core statistics",
"section": "Datasets",
"text": "Datasets\nThis course uses various data sets. The easiest way of accessing these is by creating an R-project in RStudio. Then download the data folder here by right-clicking on the link and Save as…. Next unzip the file and copy it into your working directory. Your data should then be accessible via <working-directory-name>/data/."
},
{
"objectID": "about.html",
"href": "about.html",
"title": "About",
"section": "",
"text": "1 + 1\n\n[1] 2"
}
]