-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path03a-tidying-data.Rmd
422 lines (295 loc) · 14.7 KB
/
03a-tidying-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
---
knit: bookdown::preview_chapter
---
# Tidying Common Data Formats
## Terminology
- A __variable__ is a quantity, quality, or property that you can measure. For the grad programs, these would be all the column headers.
- An __observation__ is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point. For the grad programs, this is institution, and program, uniquley define the observation.
- A value is the state of a variable when you measure it. The value of a variable typically changes from observation to observation.
## Tidy tabular form
__Tabular data__ is a set of values, each associated with a variable and an observation. Tabular data is __tidy__ if each value is placed in its own `cell`, each variable in its own column, and each observation in its own row.
### Wide format
The grad programs data in tabular form. It also is in **wide** tidy form, because there are multiple columns containing different variables.
```{r readgrad}
library(emo)
library(tidyverse)
grad <- read_csv("data/graduate-programs.csv")
grad %>% top_n(10)
```
What's good about the format?
- Rows contain information about each department of an institution
- Columns contain types of information, like average number of publications, average number of citations, % completion,
It makes it easy to make summaries:
```{r gradsummary}
grad %>% count(subject)
grad %>% filter(subject == "economics") %>%
summarise(m=mean(NumStud), s=sd(NumStud))
grad %>% filter(subject == "economics") %>%
ggplot(aes(x=NumStud, y=MedianTimetoDegree)) +
geom_point() + theme(aspect.ratio=1)
```
### Long format
It can also be useful - for the process of data analysis - to arrange data into **long** tidy form, where each value is uniquely identified.
```{r longformat}
grad %>% gather(variable, value, -subject, -Inst)
```
## Raw data examples
For each of these data examples, **let's try together to identify the variables and the observations** - some are HARD!
### Gene expression experiment
`r ji("thinking")`
```{r readgenes}
genes <- read_csv("data/genes.csv")
genes
```
### Melbourne weather
What are the variables? Observations?
`r ji("scared")`
```{r readmelb}
melbtemp <- read_fwf("data/ASN00086282.dly",
col_positions=fwf_widths(c(11, 4, 2, 4,
rep(c(5, 1, 1, 1), 31))))
melbtemp %>% select(X1, X2, X3, X4, X5, X9,
X13, X17, X21, X25, X29, X33)
```
### TB incidence
This is current tuberculosis data taken from [WHO](http://www.who.int/tb/country/data/download/en/), the case notifications table.
`r set.seed(2018); ji("sick")`
```{r readTB}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
select(country, year, starts_with("new_sp_")) %>%
filter(year>1996, year<2012)
tb %>% top_n(20)
```
### French fries
10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:
```{r loadff}
load("data/french_fries.rda")
french_fries %>% top_n(10)
```
![](images/french_fries.png)
What are the variables? Observations?
### Contingency table
This data is collated from this story: [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/)
```{r readfly}
fly <- read_csv("data/fly_tbl.csv")
fly
```
What are the variables?
## ABS Datapack
The Australian Bureau of Statistics (ABS) collects, maintains and delivers data and official statistics on a wide range of economic, social, population and environmental matters of importance to Australia. There are many different access points for data, but primarily aggregated data is the main type available. Examples at accessing the Census data from the ABS can be found in the `eechidna` package.
1. The individual `csv` files must be held locally. They come from a zip file and can be downloaded from: https://datapacks.censusdata.abs.gov.au/datapacks/
2. Select: 2016 Census Datapacks, General Community Profile, Commonwealth Electoral Divisons
3. Download for all of Australia
4. Unzip the package - its necessary, because the data is delivered in many small csv files. There is also the license information detailing appropriate usage, and detailed information about the formats.
```{r eval=FALSE}
G1_Main <- read_csv(here::here("data/2016 Census GCP Commonwealth Electoral Divisions for AUST/", "2016Census_G01_AUS_CED.csv"))
```
## Messy vs tidy
Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution.
Tidy data can be though of as legos. Once you have this form, you can put it together in so many different ways, to make different analyses and visualizations. <!--Pedantic comment: legos isn't a word. the plural of lego is lego. -->
![](images/lego.png)
## Tidy verbs
- `gather`: specify the **keys** (identifiers) and the **values** (measures) to make long form (used to be called melting)
- `spread`: move a variable from a row into columns (used to be called casting)
- `separate`: split a character vector into columns
## Tidying genes data
```{r wranglegenes}
genes
genes_long <- genes %>%
gather(variable, expr, -id)
genes_long
```
In this example, we identified the columns as containing expression levels of a gene during different experimental conditions (variables). We have _gathered_ these variables into a single column, with the expression levels in a corresponding column. The `-id` argument to `gather()` indicates that we do not want the `id` column to be gathered; this has the effect of repeating the 3 values of gene for each variable.
### Separate columns
The variables in the previous example are a cryptic combination of treatment, timepoint, and replicate. Using `separate()`, we can separate this information out into a tidy (and much more human friendly) representation of the experimental conditions:
```{r wranglegenes1}
genes_long %>%
separate(variable, c("trt", "leftover"), "-")
```
First, we separate the treatment from the rest of the information.
Next, we separate the remainder into timepoint and replicate variables.
```{r wranglegenes2}
genes_long %>%
separate(variable, c("trt", "leftover"), "-") %>%
separate(leftover, c("time", "rep"), "\\.")
```
This data is tidy, in that observations are in rows and variables are in columns, but it is not as easy to read. We can examine the different features of this data set using `spread()`.
### Now spread to examine different aspects
There are several different features of the data which we might want to explore. Each feature requires a slight reconfiguration of the dataset, but these are easily accomplished using `spread()`.
#### Examine treatments against each other
```{r treatments}
genes_long <- genes_long %>%
separate(variable, c("trt", "leftover"), "-") %>%
separate(leftover, c("time", "rep"), "\\.")
genes_long %>%
spread(trt, expr) %>%
ggplot(aes(x=WI, y=WM, colour=id, shape = id)) + geom_point()
```
Generally, some negative association within each gene, WM is low if WI is high. <!--Not really... at best, there's a negative association for Gene 2, but even that is pushing it. If you don't consider each gene separately, there's a positive association. -->
<!-- The color scheme is INCREDIBLY difficult to see... -->
#### Examine replicates against each other
```{r replicates}
genes_long %>%
spread(rep, expr) %>%
ggplot(aes(x=R1, y=R4, colour=id)) +
geom_point() +
coord_equal()
```
Very roughly, replicate 4 is like replicate 1, eg if one is low, the other is low. That's a good thing, that the replicates are fairly similar.
## Try this yourself
Here is a little data set to practice gather, spread and separate on.
```{r practicekoala}
kb <- read_csv("data/koala_bilby.csv")
kb
```
Try to do these by yourself, but the code is in the Rmd file if you need help.
### Exercise 1
Gather the data into long form, naming the two new variables, `label` and `count`
```{r practicekoala2, echo=FALSE}
kb_long <- kb %>% gather(label, count, -ID)
kb_long
```
### Exercise 2
Separate the labels into two new variables, `animal`, `state`
```{r practicekoala3, echo=FALSE}
kb_long <- kb_long %>%
separate(label, c("animal", "state"))
kb_long
```
### Exercise 3
Spread the long form data into wide form, where the columns are the states.
```{r practicekoala4, echo=FALSE}
kb_long %>% spread(state, count)
```
### Exercise 4
Spread the long form data into wide form, where the columns are the animals.
```{r practicekoala5, echo=FALSE}
kb_long %>% spread(animal, count)
```
## Tidying Melbourne weather
This is seriously messy!
Read the data first, just the columns that we need. The data is in fixed column width format, which means a certain number of columns are designated for the particular records, eg columns 1-11 contains the station id, columns 12-15 contain the year, ...
```{r practicemelb}
melbtemp <- read_fwf("data/ASN00086282.dly",
col_positions=fwf_widths(c(11, 4, 2, 4,
rep(c(5, 1, 1, 1), 31)),
col_names = c("station", "year", "month",
"variable", paste0("X", 5:128))))
melbtemp
```
X5, X9, ... contain the temperature and precipitation records. The remaining columns have data quality flags. We want to ignore the data quality flags for now, and select every fourth variable.
### Sometime easier to index select
- `[...]` allows indexing of elements of a vector, or collection of numbers
- `c(1:4, seq(5,128,4))` means collect items 1 through 4, and then every 4th item until the 128'th
```{r practicemelb2}
melbtemp <- melbtemp[,c(1:4, seq(5, 128, 4))]
melbtemp
```
### Make long form
```{r practicemelb3}
melbtemp_long <- melbtemp %>%
gather(day, value, X5:X125)
melbtemp_long
```
This is basically what we need. The day variable doesn't have the right values in it, but we can fix this later.
### Use spread to look at different things
But there are a couple of things that I need to fix to make it possible to do these plots. We will discuss these operations next week. <!-- Next week? -->
```{r practicemelb4}
melbtemp_long <- melbtemp_long %>%
filter(variable %in% c("TMIN", "TMAX", "PRCP")) %>%
mutate(value = ifelse(value == -9999, NA, value))
```
### What are the numbers?
- What is the range of temperature and precipitation?
- What units could this be in?
- Why are some values -9999?
Go to the data source: [https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt)
### Now use spread
#### Max vs Min
```{r practicemelb5}
melbtemp_long %>% spread(variable, value) %>%
ggplot(aes(x=TMIN, y=TMAX)) + geom_point() +
coord_equal() + geom_abline(slope=1, intercept=0)
```
Oh, TMAX is always higher than TMIN. This is a good thing.
#### Winter vs Summer
```{r practicemelb6}
melbtemp_long %>%
filter(variable == "TMAX") %>%
spread(month, value) %>%
ggplot(aes(x=`07`, y=`01`)) + geom_point() +
xlab("July") + ylab("Jan") +
coord_equal() + geom_abline(slope=1, intercept=0)
```
Same day of the month, same year, January has higher maximum temperatures than July. Except two days, the July max exceeded the January max. (This comparison is a bit whacky, though right?)
## Exercises
1. [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/). In the following table, V1 is a response to the question "Is it rude to recline your seat on a plane?", and V2 is the response to the question "Do you ever recline your seat when you fly?". The data is in the form of a contingency table.
```{r practicefly}
fly_tbl <- read_csv("data/fly_tbl.csv")
fly_tbl
```
a. What are the variables and observations in this data?
b. Put the data in tidy long form (using the names `V2` as the key variable, and `count` as the value).
You can get the data from [data/fly_tbl.csv](http://dmac.dicook.org/lectures/data/fly_tbl.csv)
```{r practicefly2, echo=FALSE, eval=FALSE}
library(tidyverse)
fly_tbl <- read_csv("data/fly_tbl.csv")
fly_tbl %>% gather(V2, count, -V1)
```
2. Your job is to tidy the TB incidence data.
- Write down the steps that will be needed go from raw data to tidy
- Write the code piece by piece
You can get the data from [TB_notifications_2018-03-18.csv](http://dmac.dicook.org/lectures/data/TB_notifications_2018-03-18.csv)
```{r practiceTB, echo=FALSE, eval=FALSE}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
select(country, year, starts_with("new_sp_")) %>%
filter(year>1996, year<2012)
tb %>% top_n(20)
tb_tidy <- tb %>%
gather(stuff, count, starts_with("new_sp_")) %>%
separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
separate(genderage, c("gender", "age"), sep=1) %>%
select(-stuff1, -stuff2)
tb_tidy
```
3. For the data set, `rates.csv`,
```{r practicerates}
rates <- read_csv("data/rates.csv")
head(rates)
```
a. What are the variables and observations?
b. Gather the five currencies, AUD, GBP, JPY, CNY, CAD, make it into tidy long form.
c. Make line plots of the currencies, like the plot below, and describe the similarities and differences between the currencies.
You can get the data from [rates.csv](http://dmac.netlify.com/lectures/data/rates.csv)
```{r echo=FALSE, results='hide', fig.show='hide'}
library(tidyverse)
rates <- read_csv("data/rates.csv")
ggplot(rates, aes(x=date, y=AUD)) + geom_line()
```
```{r practicerates2, echo=FALSE, results='hide', fig.width=6, fig.height=10}
rates_long <- rates %>% select(date, AUD, GBP, JPY, CNY, CAD) %>%
gather(currency, rate, -date)
ggplot(rates_long, aes(x=date, y=rate)) +
geom_line() +
facet_wrap(~currency, ncol=1, scales="free_y")
```
4. This is tough! From the web site [Department of Infrastructure, Regional Development and Cities](https://bitre.gov.au/publications/ongoing/airport_traffic_data.aspx) download the latest "Airport Traffic Data 1985–86 to ...". It is reasonably small.
Extract the "Airport Passengers" table into R with this code
```{r practiceair}
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
passengers
```
Tidy the data, to produce a data set with these columns
- airport: all of the airports.
- year
- type_of_flight: DOMESTIC, INTERNATIONAL
- bound: IN or OUT
```{r practiceair2, eval=FALSE, echo=FALSE}
passengers <- passengers %>%
select(-X__1, -X__2, -Rank, -starts_with("TOTAL")) %>%
rename(INBOUND_DOM=INBOUND, INBOUND_INTL=INBOUND__1,
OUTBOUND_DOM=OUTBOUND, OUTBOUND_INTL=OUTBOUND__1,
INBOUND_TOTL=INBOUND__2, OUTBOUND_TOTL=OUTBOUND__2) %>%
gather(where, amount, starts_with("IN"), starts_with("OUT"))
```