-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path0402_tidy-data.Rmd
111 lines (79 loc) · 3.41 KB
/
0402_tidy-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
# Tidy data
A data set is _tidy_ if:
- Each row is an __observation__ appropriate for the analysis;
- Each column is a __variable__;
- Each cell is a __value__.
This means that each value belongs to exactly one variable and one observation.
Why bother? Because doing computations with untidy data can be a nightmare.
Computations become simple with tidy data.
Whether or not a data set is "tidy" depends on the type of analysis you are doing or plot you are making.
It depends on how you define your "observation" and "variables" for the current analysis.
```{r}
haireye <- as_tibble(HairEyeColor) |>
count(Hair, Eye, wt = n) |>
rename(hair = Hair, eye = Eye)
```
As an example, consider this example derived from the `datasets::HairEyeColor` dataset, containing the number of people having a certain hair and eye color.
If one observation is identified by a _hair-eye color combination_, then the tidy dataset is:
```{r}
haireye |>
print()
```
If one observation is identified by a _single person_, then the tidy dataset has one pair of values per person, and one row for each person.
We can use the handy `tidyr::uncount()` function, the opposite of `dplyr::count()`:
```{r}
haireye |>
tidyr::uncount(n) |>
print()
```
## Untidy Examples
The following are examples of untidy data.
They're untidy for either of the cases considered above, but for discussion, let's take a _hair-eye color combination_ to be one observational unit.
Note that untidy does not always mean "bad", just inconvenient for the analysis you want to do.
__Untidy Example 1__:
The following table is untidy because there are multiple observations per row.
It's _too wide_.
Imagine calculating the total number of people with each hair color.
You can't just `group_by()` and `summarize()`, here!
```{r, echo = FALSE}
haireye_untidy <- haireye |>
mutate(eye = str_c(eye, "_eyed")) |>
pivot_wider(id_cols = hair, names_from = eye, values_from = n)
print(haireye_untidy)
```
This sort of table is common when presenting results.
It's easy for humans to read, but hard for computers to work with.
Untidy data is usually that way because it was structured for human, not machine, reading.
__Untidy Example 2__:
The following table is untidy for the same reason as Example 1—multiple observations are contained per row.
It's _too wide_.
```{r, echo = FALSE}
haireye |>
mutate(hair = str_c(hair, "_haired")) |>
pivot_wider(id_cols = eye, names_from = hair, values_from = n) |>
print()
```
__Untidy Example 3__:
This is untidy because each observational unit is spread across multiple columns.
It's _too long_.
In fact, we needed to add an identifier for each observation, otherwise we would have lost which row belongs to which observation!
Does red hair ever occur with blue eyes?
Can't just `filter(hair == "red", eye == "blue")`!
```{r, echo = FALSE}
haireye |>
mutate(.id = 1:n()) |>
pivot_longer(cols = hair:eye, names_to = "body_part", values_to = "color") |>
select(-n, n) |>
print()
```
__Untidy Example 4__:
Just when you thought a data set couldn't get any longer!
Now, each variable has its own row: hair color, eye color, and `n`.
```{r, echo = FALSE}
haireye |>
mutate(obs = 1:n(),
n = as.character(n)) |>
pivot_longer(cols = c(hair, eye, n), names_to = "variable", values_to = "value") |>
print()
```
_This is the sort of format that is common pulling data from the web or other "Big Data" sources._