-
Notifications
You must be signed in to change notification settings - Fork 16
/
tidyverse_next_steps.qmd
393 lines (286 loc) · 12.1 KB
/
tidyverse_next_steps.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
---
title: "R tidyverse: loops and data tidying"
format:
gfm:
toc: true
toc-depth: 3
editor: source
date: today
author: UQ Library
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Setting up
> If needed, review the [installation instructions](/R/Installation.md#r--rstudio-installation-instructions).
* If you are using your own laptop please open RStudio
+ Make sure you have a working Internet connection
* On the Library's training computers:
+ Log in with your UQ username and password
+ Make sure you have a working Internet connection
+ Open the ZENworks application
+ Look for "RStudio"
+ Double click on RStudio, which will install both R and RStudio
With RStudio open, let's make sure we have the necessary packages installed by running this command (this might take a few minutes):
```{r eval=FALSE}
install.packages("tidyverse")
```
This will install all the Tidyverse packages (and their dependencies).
## What are we going to learn?
tidyr and purrr, just like dplyr and ggplot2, are core to the Tidyverse.
* tidyr can be used to tidy your data
* purrr is useful to apply functions iteratively on lists or vectors
## Create a project and a script
Use the project menu (top right) to create a "New project...". Let's name this one "tidyverse".
We also want to work more comfortably by typing our code in a script. You can use the new file dropdown menu, or <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>N</kbd>, and save your script as "process.R" in the current working directory.
## Load the necessary packages
We can use one single command to load the 8 core Tidyverse packages:
```{r}
library(tidyverse)
```
## Tidy data
Tidy data makes it easy to transform and analyse data in R (and many other tools). Tidy data has observations in rows, and variables in columns. The whole Tidyverse is designed to work with tidy data.
Often, a dataset is organised in a way that makes it easy for humans to read and populate. This is usually called "wide format". Tidy data is _usually_ in "long" format.
The ultimate rules of tidy data are:
* Each row is an observation
* Each column is a variable
* Each cell contains one single value
> To learn more about Tidy Data, you can read [Hadley Wickham's 2014 article on the topic](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf).
### Import data
We are using a [dataset from the World Bank](https://datacatalog.worldbank.org/dataset/climate-change-data), which contains data about energy consumption and greenhouse gas emissions.
Let's download the file:
```{r eval=FALSE}
# download data, save locally
download.file(url = "https://raw.githubusercontent.com/uqlibrary/technology-training/master/R/tidyverse_next_steps/data_wb_climate.csv",
destfile = "data_wb_climate.csv")
```
... and read the data into an object:
```{r}
# read CSV into an object
climate_raw <- read_csv("data_wb_climate.csv",
na = "..")
```
We defined with the `na` argument that, in this dataset, missing data is recorded as "..".
You can use `View()` to explore your dataset. We can see that it doesn't respect the tidy data principles in a couple of ways, the most obvious one being that different years are spread out between different columns.
### Reshaping data
#### Lengthening
To go from wide format to long format, we can use the tidyr function `pivot_longer()`. Here, we want to gather all the columns titled with a year: we store the data in a "value" variable, and the years in a "year" variable.
```{r}
climate_long <- pivot_longer(climate_raw,
`1990`:`2011`,
names_to = "year",
values_to = "value") %>%
mutate(year = as.integer(year))
```
We add a `mutate()` step to convert the type fo the year column from character to integer.
This is better, but there is still an issue: our `value` variable contains many different indicators (i.e. entirely different units).
#### Widening
To do the opposite, going from long to wide format, we can use the `pivot_wider()` function.
We have single observations spread across several rows, so we should spread the "value" column.
First, let's keep a record of the correspondence between long descriptive variable names and their "code", for later reference:
```{r}
codes <- climate_long %>%
select(`Series code`, `Series name`) %>%
unique()
codes
```
This will be our key to variable details, or "code book", for future reference.
Now, let's widen the data (and remove some useless columns with `dplyr::select()`):
```{r}
climate_tidy <- climate_long %>%
select(-`Series name`, -SCALE, -Decimals) %>%
pivot_wider(names_from = `Series code`,
values_from = value)
```
### Challenge 1: Code comprehension
There's one more cleaning step we need to apply.
Have a look at this block of code. What do you think it does?
```{r}
groups <- c("Europe & Central Asia",
"East Asia & Pacific",
"Euro area",
"High income",
"Lower middle income",
"Low income",
"Low & middle income",
"Middle income",
"Middle East & North Africa",
"Latin America & Caribbean",
"South Asia",
"Small island developing states",
"Sub-Saharan Africa",
"Upper middle income",
"World")
climate_tidy <- climate_tidy %>%
filter(!`Country name` %in% groups)
```
Turns out this dataset contains grouped data as well as unique countries. Here, we created a vector of group names, and removed them from the data by using dplyr's `filter()` function (inverting the filter with `!`).
We can now check that we've only got single countries left:
```{r eval=FALSE}
unique(climate_tidy$`Country name`)
```
### Visualising
Now that we have clean, tidy data, we can process and visualise it more comfortably! For example, to visualise the increase in KT of CO<sup>2</sup>-equivalent for each country:
```{r}
climate_tidy %>%
ggplot(aes(x = year,
y = EN.ATM.CO2E.KT,
group = `Country name`)) +
geom_line()
```
Let's have a look at the increase in _global_ greenhouse gas emissions in KT:
```{r}
climate_tidy %>%
group_by(year) %>%
summarise(CO2e = sum(EN.ATM.CO2E.KT, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = CO2e)) +
geom_point()
```
#### Challenge 2
Looks like our data is missing after 2008, so how can we remove that?
We can add this extra step:
```{r}
climate_tidy %>%
group_by(year) %>%
summarise(CO2e = sum(EN.ATM.CO2E.KT, na.rm = TRUE)) %>%
filter(year < 2009) %>%
ggplot(aes(x = year, y = CO2e)) +
geom_point()
```
## Functional programming
Functional programming (as opposed to "imperative programming") makes use of functions rather than loops to iterate over objects.
The functions will allow to simplify our code, by abstracting common building blocks used in different cases of iteration. However, it means that there will usually be a different function for each different pattern.
You can iterate over elements by using:
1. the basic building blocks in R (for loops, while loops...), or
2. the `apply` function family from base R, or
3. the purrr functions.
Imagine we want to find out the median value for each variable in the `mtcars` dataset. Here is an example of a for loop:
```{r}
output <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
output[[i]] <- median(mtcars[[i]])
}
output
```
Better than having the same code repeated 11 times!
We allocate space in the expected **output** first (more efficient). We then specify the **sequence** for the loop, and put what we want to iterate in the loop **body**.
The apply family in base R is useful to replace for loops, but the purrr functions are easier to learn because they are more consistent. This package offers several tools to iterate functions over elements in a vector or a list (e.g. a dataframe).
### The map family
At purrr's core, there is the map family:
* `map()` outputs a list.
* `map_lgl()` outputs a logical vector.
* `map_int()` outputs an integer vector.
* `map_dbl()` outputs a double vector.
* `map_chr()` outputs a character vector.
For example, to do a similar operation to our previous for loop:
```{r}
map_dbl(mtcars, median)
```
A lot leaner, right?
The map functions automatically name the values in the resulting vector, which makes the result easier to read.
Lets try a different type of output. Here, we want to find out which columns in the starwars dataset are of type "character":
```{r}
map_lgl(starwars, is_character)
```
If we don't want to use the default behaviour of the mapped function, we can use extra arguments to pass to it. For example, for a trimmed mean:
```{r}
map_dbl(mtcars, mean, trim = 0.2)
```
Just like most functions in the Tidyverse, the first argument is the data that we want to process (which means we can use the pipe). The second argument is the name of the function we want to apply, but it can also be a custom formula. For example:
```{r}
map_dbl(mtcars, ~ round(mean(.x)))
map_lgl(mtcars, ~ max(.x) > 3 * min(.x))
```
We have to use the tilde `~` to introduce a custom formula, and `.x` to place the element being processed.
#### Challenge 3: custom formula
How can we find out the number of unique values in each variable of the `starwars` data.frame?
```{r}
map_int(starwars, ~ length(unique(.x)))
```
### Splitting
To split a dataset and apply an operation to separate parts, we can use the `group_split()` function:
```{r}
unique(mtcars$cyl)
mtcars %>%
group_split(cyl) %>% # split into three dataframes
map(summary) # applied to each dataframe
```
Using purrr functions with ggplot2 functions allows us to generate several plots in one command:
```{r}
mtcars %>%
group_split(cyl) %>%
map(~ ggplot(.x, aes(wt, mpg)) +
geom_point() +
geom_smooth() +
labs(title = paste(.x$cyl, "cylinders"))) # give a title
```
### Predicate functions
Purrr also contains functions that check for a condition, so we can set up conditions before iterating.
```{r}
str(iris)
iris %>%
map_dbl(mean) # warning, NA for Species
iris %>%
discard(is.factor) %>%
map_dbl(mean) # clean!
starwars %>%
keep(is.character) %>%
map_int(~length(unique(.x)))
```
`is.factor()` and `is.character()` are examples of "predicate functions".
To return everything, but apply a function only if a condition is met, we can use `map_if()`:
```{r}
str(iris)
iris %>%
map_if(is.numeric, round) %>%
str()
```
This results in a list in which the elements are rounded only if they store numeric data.
Now, let's see a more involved example with our climate dataset. In this one, we use functions from dplyr, purrr, stringr, tibble and ggplot2.
```{r}
# cumulative and yearly change in CO2 emissions dataset
climate_cumul <- climate_tidy %>%
arrange(`Country name`, year) %>%
group_by(`Country name`) %>%
mutate(cumul.CO2.KT = cumsum(EN.ATM.CO2E.KT),
dif.CO2.KT = EN.ATM.CO2E.KT - lag(EN.ATM.CO2E.KT)) %>%
map_at(vars(ends_with("KT")), ~ .x / 10^6) %>%
as_tibble() %>% # from list to tibble
rename_with(~ str_replace(.x, "KT", "PG"))
# visualise cumulative change
p <- climate_cumul %>%
ggplot() +
aes(x = year,
y = cumul.CO2.PG,
colour = `Country name`) +
geom_line() +
theme(legend.position = "none")
p
```
If you want to create an interactive visualisation, you can use plotly:
```{r eval=F}
library(plotly)
ggplotly(p)
```
Plot the annual change in PG CO2 by country:
```{r}
pdif <- climate_cumul %>%
ggplot() +
aes(x = year,
y = dif.CO2.PG,
colour = `Country name`) +
geom_line() +
theme(legend.position = "none")
pdif
```
```{r eval=F}
# interactive plot
ggplotly(pdif)
```
## What next
* [Chapter on iteration](https://r4ds.had.co.nz/iteration.html) in the book _R for Data Science_
* Cheatsheets:
* [tidyr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/tidyr.pdf)
* [purrr](https://raw.githubusercontent.com/rstudio/cheatsheets/master/purrr.pdf)
* Explore our [recommended resources, online and around UQ](/R/usefullinks.md#what-next)
* [Tidy Data paper](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf)