-
Notifications
You must be signed in to change notification settings - Fork 90
/
04-Case-Study.Rmd
186 lines (128 loc) · 4.99 KB
/
04-Case-Study.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
---
title: "Case Study: Friday the 13th Effect"
output: html_notebook
---
<!-- This file by Charlotte Wickham is licensed under a Creative Commons Attribution 4.0 International License. -->
```{r setup}
library(fivethirtyeight)
library(tidyverse)
```
## Task
Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/):
![](resources/bialik-fridaythe13th-2.png)
## Data
In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one, `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014.
## Your Turn 1
With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot.
```{r}
US_births_1994_2003
```
## Some overviews of the data
Whole time series:
```{r}
ggplot(US_births_1994_2003, aes(x = date, y = births)) +
geom_line()
```
There is so much fluctuation it's really hard to see what is going on.
Let's try just looking at one year:
```{r}
US_births_1994_2003 %>%
filter(year == 1994) %>%
ggplot(mapping = aes(x = date, y = births)) +
geom_line()
```
Strong weekly pattern accounts for most variation.
## Strategy
Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like:
---------------------------
day_of_week avg_diff_13
------------- -------------
Mon -2.686
Tues -1.378
Wed -3.274
... ...
---------------------------
## Your Turn 2
Extract just the 6th, 13th and 20th of each month:
```{r}
US_births_1994_2003 %>%
select(-date)
```
## Your Turn 3
Which arrangement is tidy?
**Option 1:**
-----------------------------------------------------
year month date_of_month day_of_week births
------ ------- --------------- ------------- --------
1994 1 6 Thurs 11406
1994 1 13 Thurs 11212
1994 1 20 Thurs 11682
-----------------------------------------------------
**Option 2:**
----------------------------------------------------
year month day_of_week 6 13 20
------ ------- ------------- ------- ------- -------
1994 1 Thurs 11406 11212 11682
----------------------------------------------------
(**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?)
## Your Turn 4
Tidy the filtered data to have the days in columns.
```{r}
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20))
```
## Your Turn 5
Now use `mutate()` to add columns for:
* The average of the births on the 6th and 20th
* The percentage difference between the number of births on the 13th and the average of the 6th and 20th
```{r}
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births)
```
## A little additional exploring
Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable).
To make it a little easier let's assign our current data to a variable
```{r}
births_diff_13 <- US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
)
```
Then take a look
```{r}
births_diff_13 %>%
ggplot(mapping = aes(day_of_week, diff_13)) +
geom_point()
```
Looks like we are on the right path. There's a big outlier one Monday
```{r}
births_diff_13 %>%
filter(day_of_week == "Mon", diff_13 > 10)
```
Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year.
## Your Turn 6
Summarize each day of the week to have mean of diff_13.
Then, recreate the fivethirtyeight plot.
```{r}
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
)
```
## Extra Challenges
* If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`?
* Try not removing the `date` column. At what point in the pipeline does it cause problems? Why?
* Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out!
## Takeaways
The power of the tidyverse comes from being able to easily combine functions that do simple things well.