-
Notifications
You must be signed in to change notification settings - Fork 6
/
distributions.Rmd
556 lines (408 loc) · 21.8 KB
/
distributions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
# Distributions
```{r, message = FALSE}
library(tidyverse)
library(dcldata)
```
## Introduction
Take a look at the following [cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf) sections before reading this chapter.
* Geoms:
* `geom_histogram()`
* `geom_freqpoly()`
* `geom_density()`
* `geom_boxplot()`
* `geom_violin()`
* `geom_vline()`
* `geom_hline()`
A common first step when carrying out exploratory data analysis is examining distributions of continuous variables. For example, in this chapter, we'll explore the ages of Congress members. To get a sense of what distributions can show, here are some possible hypotheses about the ages of Congress members (some more likely than others).
* There is no age bracket more populous than any another.
* The average age is around 60.
* There are a lot of Congress members who are around 30, and a lot who around 70.
Before we dive into the Congress data, however, we'll introduce some mechanics. In the following sections, you'll learn about layering geoms and the basics of `theme()`.
## Mechanics
### Layering
The following plot shows the relationship of `class` to `hwy` using boxplots.
```{r}
mpg %>%
mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_boxplot()
```
We'll explain the specifics of boxplots later on in this chapter.
The black line in the middle of each box represents the median value for that class. It would also be helpful to see the median `hwy` value for all car classes. To display this on our plot, we'll add a _reference line_ using `geom_hline()`. `geom_hline()` creates horizontal lines. `geom_vline()`, which creates vertical lines, and `geom_abline()`, which creates lines with a given slope and intercept, are also useful for creating reference lines.
In `geom_hline()`, the `yintercept` argument controls the position of the horizontal line. In this case, we want a horizontal line indicating the median `hwy` value, so we'll set `yintercept` equal to `median(hwy, na.rm = TRUE)`.
```{r}
mpg %>%
mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_boxplot() +
geom_hline(aes(yintercept = median(hwy, na.rm = TRUE)), color = "red")
```
Notice that the horizontal line is on top of the boxplots. We added `geom_hline()` our plot after `geom_boxplot()`, so the boxplots were drawn first, and then the line drawn on top. If we switch the order, the boxplots will lie on top of the line.
```{r}
mpg %>%
mutate(class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_hline(aes(yintercept = median(hwy, na.rm = TRUE)), color = "red") +
geom_boxplot()
```
Placing the line on top of the boxplots can obscure information, so it's generally better to place the line underneath.
You can add as many layers to a gpplot2 plot as you'd like. The different layers also don't have to use the same data.
```{r}
# Source: https://en.wikipedia.org/wiki/Corporate_average_fuel_economy
cafe_2008 <-
tribble(
~class, ~mpg,
"Passenger", 27.5,
"Truck", 22.4
)
mpg %>%
mutate(class = fct_reorder(class, hwy)) %>%
ggplot() +
geom_hline(aes(yintercept = mpg, color = class), data = cafe_2008) +
geom_boxplot(aes(class, hwy)) +
labs(color = "2008 CAFE\nstandard")
```
### `theme()`
Just like every ggplot2 plot has a coordinate system, every ggplot2 plot also has a _theme_. The theme controls the parts of the plot unrelated to the data, like the plot background color, tick mark length, and the color, size, and angle of the labeling text.
You can edit your plot's theme with the `theme()` function. Themes have many components, and so `theme()` takes many arguments. Take a look at [tidyverse page on themes](https://ggplot2.tidyverse.org/reference/theme.html) to get a sense of what's possible to change.
Two `theme()` arguments you'll commonly use are `axis.text.x` and `axis.text.y`. These arguments control the appearance of the axis tick mark labels. Next, we'll show you how to angle the x-axis tick mark labels to reduce overlap.
The x-axis labels overlap on the following visualization.
```{r}
mpg %>%
mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot()
```
In the _Discrete-continuous relationships_ chapter, you saw how we can flip a plot and put the labels on the y-axis. However, for reasons you'll learn later on in this chapter, in this case it's better to keep the boxplots oriented vertically.
Instead of putting the labels on the y-axis, we'll use `theme()`'s `axis.text.x` argument to angle the labels.
```{r}
mpg %>%
mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45))
```
Take a look at the code. Notice that we had to use the function `element_text()`. `element_text()` is an _element function_. You'll need element functions to change most of `theme()`'s arguments. If you look at [`themes()`'s documentation](https://ggplot2.tidyverse.org/reference/theme.html), you'll see that the arguments are arranged into groups. At the beginning of each group, there's a reference to the appropriate element function. `axis.text.x` controls text, and uses `element_text()`, but `axis.line` controls the appearance of a line, and so uses `element_line()`. Like `theme()`, `element_text()` has many arguments. To change the angle of the text, we only needed one: `angle`.
Angling makes the axis text easier to read, but the text is now far away from the x-axis. The plot would be better if the labels were closer to the boxplots they represent. We can adjust the text justification with `element_text()`'s `hjust` argument.
```{r}
mpg %>%
mutate(manufacturer = fct_reorder(manufacturer, hwy)) %>%
ggplot(aes(manufacturer, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
Setting `hjust` to 0 aligns text to the left, which causes our labels to move closer to x-axis. `hjust = 1` aligns text right.
You'll learn about the other element functions, and other useful `theme()` changes, later on.
## One distribution
In this section, we'll use data on Congress members serving in 2019.
```{r}
congress
```
One of our hypotheses from the introduction was the average Congress members is around 60. "Average" is, however, ambiguous. Do we mean the median or the mean?
For some data, the mean and median are close together, but that won't always be the case. Means are sensitive to outliers, while medians are more robust. As an example, here's a vector `x`.
```{r}
x <- c(1, 2, 10, 15)
```
The mean is the result of summing the elements of `x` and dividing by the length of `x`.
```{r}
mean(x)
```
The median represents the midpoint of `x`. Half of the data is greater than 6, and half is less than 6.
```{r}
median(x)
```
Right now, the mean and median of `x` are close together, but if we add an outlier the mean explodes, but the median only increases by a small amount.
```{r}
x <- c(1, 2, 10, 15, 1e6)
mean(x)
median(x)
```
We'll typically use medians instead of means for averages and references lines because of this behavior.
`summary()` will tell us the mean and median age in Congress.
```{r}
summary(congress$age)
```
The mean and median are close together, both between 58 and 60. Our hypothesis that the average age is 60 was pretty close.
Another interesting finding is that the House is slightly younger than the Senate.
```{r}
congress %>%
group_by(chamber) %>%
summarize(median_age = median(age))
```
There are also some very young Congress members,
```{r}
congress %>%
filter(age < 32) %>%
arrange(age)
```
and the oldest member is 85.
```{r}
congress %>%
slice_max(age)
```
These summary calculations can only tell us so much about the data. We can get a richer understanding of the data by visualizing the age distribution, both for Congress as a whole and for the different chambers, states, parties, etc.
In this section, we'll focus just on visualizing the distribution of ages for all of Congress. First, let's try a histogram.
```{r}
congress %>%
ggplot(aes(age)) +
geom_histogram()
```
There's a message above our plot telling us that `geom_histogram()` is using `bins = 30`, but we should choose a better value of `binwidth`. Both the `bins` and `binwidth` arguments of `geom_histogram()` control the width of the bars you see on the plot. `bins` controls the number of bars. By default, `geom_histogram()` creates 30 bins/bars, and then divides up the data so that each bar spans the same number of whatever units are plotted along the x-axis. Here, those units are years, and each bar represents around 2 years. The `binwidth` argument of `geom_histogram()` allows you to set the width of the bars directly.
If you don't specify either `bins` or `binwidth` when you use `geom_histogram()`, you will always get the message you see above. This message reminds you that it is _always_ a good idea to think about your binwidth, and you will rarely want the default.
Let's try some different binwidths.
```{r}
congress %>%
ggplot(aes(age)) +
geom_histogram(binwidth = .1) +
stamp("Bad")
```
A binwidth of 0.1 years is too small. Notice how it's difficult to see the overall shape of the distribution and the plot appears very "peaky." Also, many of the bars only represent a single person.
If we swing too far in the opposite direction, however, we'll lose information.
```{r}
congress %>%
ggplot(aes(x = age)) +
geom_histogram(binwidth = 20) +
stamp("Bad")
```
20 years is too large of a chunk to be meaningful in this context.
In this case, the original 30 bins wasn't too far off from ideal. We'll make a minor adjustment so that each bin represents 5 years.
```{r}
congress %>%
ggplot(aes(x = age)) +
geom_histogram(binwidth = 5, boundary = 0)
```
The argument `boundary = 0` directs the bin boundaries to be at 0 plus multiples of the `binwidth`; in this case, the bin boundaries are at 25, 30, 35, ..., 90.
We can improve this histogram by adding a reference line at the median.
```{r}
congress %>%
ggplot(aes(age)) +
geom_histogram(binwidth = 5, boundary = 0) +
geom_vline(aes(xintercept = median(age)), color = "red")
```
For simple, one-dimensional distributions, histograms are generally a good bet. They are simple to interpret and can quickly give you a sense of the data.
However, there are other ways to visualize a single distribution. `geom_freqpoly()`, for example, is similar to `geom_histogram()`, but instead of encoding the number of people in each age bin with a bar, it uses points along a line.
```{r}
congress %>%
ggplot(aes(age)) +
geom_freqpoly(binwidth = 5) +
geom_vline(aes(xintercept = median(age)), color = "red")
```
You can think of frequency polygons as "histograms with lines." Just as you have to pick a binwidth for histograms, you also have to pick one for frequency polygons.
If you're just visualizing a single distribution, a histogram is probably better. The bars make it easier to connect counts with ages, as well as see the overall shape of the distribution. However, as you'll see later, frequency polygons can be useful for visualizing multiple distributions at once.
`geom_density()` creates a density plot and is another option for visualizing a single distribution. Density plots are essentially smoothed histograms.
```{r}
congress %>%
ggplot(aes(age)) +
geom_density()
```
Density plots lack the fine-grained detail of frequency polygons. Sometimes, you'll want to lose detail and focus on the overall trend, but, in other situations, density plots will hide important information. You should also avoid using density plots if you don't have very much data. We therefore recommend using density plots only in specific scenarios.
One such scenario is visualizing simulated data.
```{r}
set.seed(916)
sim_1 <- tibble(x = rnorm(n = 1e4))
sim_1 %>%
ggplot(aes(x)) +
geom_density()
```
We'll introduce a second scenario in the next section.
## Multiple distributions
You'll often want to visualize more than one distribution. For example, we might want to compare the age distributions for the different political parties in `congress`.
A first pass at a plot might involve using a histogram with `fill` to represent the different parties.
```{r}
congress %>%
ggplot(aes(age, fill = party)) +
geom_histogram(binwidth = 5) +
scale_fill_discrete(direction = -1) +
stamp("Bad")
```
Like stacked bar plots, stacked histograms aren't a very good idea. Stacking makes it difficult to figure out how many Democrats are in each bin. The line
`scale_fill_discrete(direction = -1)`
changes the direction of the fill colors, making the Democrats blue and the Republicans red.
One option is to facet.
```{r}
congress %>%
ggplot(aes(age)) +
geom_histogram(binwidth = 5) +
facet_grid(cols = vars(party)) +
stamp("Bad")
```
The side-by-side histograms are difficult to compare. Vertical facets are much better.
```{r}
congress %>%
ggplot(aes(age)) +
geom_histogram(binwidth = 5) +
facet_grid(rows = vars(party))
```
Another option is overlaying frequency polygons.
```{r}
congress %>%
ggplot(aes(age, color = party)) +
geom_freqpoly(binwidth = 5) +
scale_color_discrete(direction = -1)
```
The frequency polygons make it even easier to compare the three distributions. Democrats and Republicans are similar, but Democrats are more spread out and are slightly older. The few members in the "Other" category are older than the average Democrat or Republican.
The same pattern works for visualizing the age distributions of the two chambers of Congress.
```{r}
congress %>%
ggplot(aes(age, color = chamber)) +
geom_freqpoly(binwidth = 5)
```
Interestingly, they both have a very similar shape. However, because the House has hundreds more members than the Senate, the distributions are far away from each other and it is difficult to directly compare them.
We can solve this issue by mapping the _density_ of each age group to the y-axis. Like `geom_histogram()` and `geom_bar()`, `geom_freqpoly()` computes y-values from the data, and so you don't need to supply a `y` aesthetic. By default, the `y` aesthetic is set to `stat(count)`. `stat(count)` computes the number of items in each bin. In order to get `geom_freqpoly()` to plot densities instead of counts, we'll set `y = stat(density)`, which computes the density of items in each bin.
```{r}
congress %>%
ggplot(aes(age, y = stat(density), color = chamber)) +
geom_freqpoly(binwidth = 5)
```
Now, instead of mapping the number of people in each age bin to the y-axis, `geom_freqpoly()` uses the density of the people in each bin. It's now clear that, although their shapes are similar, the House is a bit younger and more evenly distributed.
In the previous section, we mentioned a use case for `geom_density()`. One advantage of `geom_density()` is that you can fill in the curves, which makes them easier to see.
```{r}
congress %>%
ggplot(aes(age, y = stat(density), color = chamber, fill = chamber )) +
geom_density() +
stamp("Bad")
```
By default, the fill regions are opaque, so to be useful we need to change the transparency with `alpha`.
```{r}
congress %>%
ggplot(aes(age, y = stat(density), color = chamber, fill = chamber )) +
geom_density(alpha = 0.5)
```
This approach works best when you have sufficient data and the distributions can be distinguished from each other.
```{r}
set.seed(916)
sim_2 <-
bind_rows(
tibble(mean = "0", x = rnorm(n = 1e4, mean = 0, sd = 1)),
tibble(mean = "1", x = rnorm(n = 1e5, mean = 5, sd = 2))
)
sim_2 %>%
ggplot(aes(x, color = mean, fill = mean)) +
geom_density(alpha = 0.5)
```
`geom_density()`, like `geom_freqpoly()`, doesn't require that you specify a `y` aesthetic. By default, the `y` aesthetic for `geom_density()` is `stat(density)`.
In some situations, you'll want to use `stat(count)` instead of `stat(density)`. For example, notice that in `sim_2`, there are more observations in the "5" group than in the "0" group. `stat(density)` doesn't show the number of observations in each group, so you can't see this difference in the above plot. If we use `stat(count)`, however, this difference is obvious.
```{r}
sim_2 %>%
ggplot(aes(x, y = stat(count), color = mean, fill = mean)) +
geom_density(alpha = 0.5)
```
Use `stat(count)` instead of `stat(density)` if you want to highlight the number of observations in each group. Otherwise, you can use the default `stat(density)`.
If we want to visualize more than 2 or 3 distributions at once, we'll need a new strategy. The variable `division` contains the U.S. Census division for each Congress member's state. There are more values in `division` than in `chamber` or `party`, so the `geom_freqpoly()` plus color combination doesn't work well.
```{r}
congress %>%
ggplot(aes(age, stat(density), color = division)) +
geom_freqpoly(binwidth = 8) +
stamp("Bad")
```
There are too many colors and the lines overlap too much for this visualization to be very useful.
Instead of encoding `division` with color, we can encode with position along the x-axis by using boxplots.
Boxplots summarize data.
```{r, echo=FALSE}
set.seed(200)
sim_3 <- tibble(x = rnorm(100, 0, 10))
text_color <- "#B33A3A"
sim_3 %>%
ggplot(aes(x = 1, x)) +
geom_boxplot() +
geom_segment(
aes(y = quantile(x, .25), yend = quantile(x, .75)),
x = 1.42,
xend = 1.42,
color = text_color
) +
geom_text(
aes(y = quantile(x, .75)),
x = 1.45,
color = text_color,
label = "50% of the data",
hjust = 0
) +
geom_text(
aes(y = quantile(x, .75) + IQR(x) * 1.5),
x = 1.05,
label = "Max value, excluding outliers",
hjust = 0,
vjust = 1.5,
color = text_color
) +
geom_text(
aes(y = min(x)),
x = 1.05,
label = "Min value, excluding outliers",
hjust = 0,
vjust = 0,
color = text_color
) +
geom_text(
aes(y = median(x)),
x = 0.5,
label = "Median",
color = text_color
) +
geom_text(
aes(y = max(x)),
x = 0.85,
label = "Outliers",
color = text_color
) +
coord_cartesian(xlim = c(0, 2)) +
theme_void()
```
Visual summaries like boxplots are helpful because they highlight important features of the data. Like any summary, however, they don't provide a full picture of the data. Boxplots can't fully convey the shape of the distribution in the same way that a histogram or density plot does. Sometimes this trade-off will be worth it, and sometimes it won't.
Because there are so many different values in `division`, boxplots are a good strategy. Summarizing the data will allow us to see the differences between the divisions that were hidden in the `geom_freqpoly()` plot.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_boxplot() +
stamp("Bad")
```
By now, you should be used to the overlapping labels problem. We could try putting the labels on the y-axis.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(age, division)) +
geom_boxplot() +
stamp("Bad")
```
However, flipping makes the median lines and box boundaries harder to compare, because it is more difficult to compare vertical line segments than horizontal ones.
Instead putting the labels on the y-axis, we can use `theme()` to angle the labels like you learned earlier.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
The final step is to add a reference line indicating the overall median age.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
From this plot, you can see that the median ages don't vary much across division. The size of the boxes are also similar. If you look at where in each box the median lines fall, however, you'll notice that some are in the middle, some towards the top, and some towards the bottom. This suggests that the individual distributions are different shapes.
If you care about distribution shape, `geom_violin()` is an alternative to `geom_boxplot()`.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_violin() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
Violin plots are mirrored and flipped density plots. The thickness of the "violin" indicates how many values are in that area. With the violin plots, you can now tell that the distribution of ages look slightly different for different divisions. Note that, because violin plots are a form of density plot, they are only a good idea if you have sufficient data.
By default, `geom_violin()` doesn't display the median of each group. You can add quantile lines using the `draw_quantiles` argument. To just draw median lines, set `draw_quantiles` to 0.5.
```{r}
congress %>%
mutate(division = fct_reorder(division, age)) %>%
ggplot(aes(division, age)) +
geom_hline(aes(yintercept = median(age)), color = "red") +
geom_violin(draw_quantiles = 0.5) +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
(`geom_violin()` calculates the quantiles of the density estimate, which is not the same as the quantiles of the data used by `quantile()`, `median()`, and `geom_boxplot()`. This is why the 0.5 quantile lines aren't perfectly ordered.)
Neither boxplots nor violin plots display the amount of data behind each plot, and so you can't tell from the plots that New England has fewer representatives than the Pacific division. There are positives and negatives to every visualization decision you'll make, and so the best geom will depend on what you want your plot to show.