forked from geanders/RProgrammingForResearch
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path08-reportingresults2.Rmd
818 lines (623 loc) · 32.7 KB
/
08-reportingresults2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
# Reporting data results #2
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week8.pdf) a pdf of the lecture slides covering this topic.
```{r caa, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_knit$set(error = TRUE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(knitr)
library(faraway)
data(worldcup)
library(ggthemes)
library(lubridate)
library(stringr)
library(gridExtra)
library(choroplethr)
library(choroplethrMaps)
library(titanic)
data(titanic_train)
```
## Matrices and lists
In this section, we'll talk about the `apply` family of functions, which allow you to apply a function to all values in a vector, matrix, or list. \bigskip
First, you need to know about two more object types in R:
- `matrix`
- `list`
### Matrices
A matrix is like a data frame, but all the values in all columns must be of the same class (e.g., numeric, character).
R uses matrices a lot for its underlying math (e.g., for the linear algebra operations required for fitting regression models). R can do matrix operations quite quickly.
You can create a matrix with the `matrix` function. Input a vector with the values to fill the matrix and `ncol` to set the number of columns:
```{r cab}
foo <- matrix(1:10, ncol = 5)
foo
```
By default, the matrix will fill up by column. You can fill it by row with the `byrow` function:
```{r cac}
foo <- matrix(1:10, ncol = 5, byrow = TRUE)
foo
```
In certain situations, you might want to work with a matrix instead of a data frame (for example, in cases where you were concerned about speed -- a matrix is more memory efficient than the corresponding data frame). If you want to convert a data frame to a matrix, you can use the `as.matrix` function:
```{r cad}
foo <- data.frame(col_1 = 1:2, col_2 = 3:4,
col_3 = 5:6, col_4 = 7:8,
col_5 = 9:10)
(foo <- as.matrix(foo))
```
You can index matrices with square brackets, just like data frames:
```{r cae}
foo[1, 1:2]
```
You cannot, however, use `dplyr` functions with matrices:
```{r caf, eval = FALSE}
foo %>% filter(col_1 == 1)
```
```
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to
an object of class "c('matrix', 'integer',
'numeric')"
```
All elements in a matrix must have the same class. \bigskip
The matrix will default to make all values the most general class of any of the values, in any column. For example, if we replaced one numeric value with the character "a", everything would turn into a character:
```{r cag}
foo[1, 1] <- "a"
foo
```
### Lists
A list has different elements, just like a data frame has different columns. However, the different elements of a list can have different lengths (unlike the columns of a data frame). The different elements can also have different classes.
```{r cah}
bar <- list(some_letters = letters[1:3],
some_numbers = 1:5,
some_logical_values = c(TRUE, FALSE))
bar
```
To index an element from a list, use double square brackets. You can use bracket indexing either with numbers (which element in the list?) or with names. You can also index lists with the `$` operator.
```{r cai}
bar[[1]]
bar[["some_numbers"]]
bar$some_logical_values
```
Lists can be used to contain data with an unusual structure and / or lots of different components. For example, the information from fitting a regression is often stored as a list:
```{r caj}
my_mod <- glm(rnorm(10) ~ c(1:10))
is.list(my_mod)
```
```{r cak}
head(names(my_mod), 3)
my_mod[["coefficients"]]
```
## `apply` functions
There is a whole family of `apply` functions, as part of base R. These include:
- `apply`: Apply a function over all the rows (`MARGIN = 1`) or columns (`MARGIN = 2`) of a matrix
- `lapply`: Apply a function over elements of a list.
- `sapply`: Like `lapply`, but returns a vector instead of a list.
Here is the syntax for `apply`:
```{r cal, eval = FALSE}
## Generic code
apply([matrix], MARGIN = [margin (1: rows, 2: columns)],
FUN = [function])
```
I'll use the `worldcup` data as an example:
```{r cam}
ex <- worldcup[ , c("Shots", "Passes", "Tackles", "Saves")]
head(ex)
```
Take the mean of all columns:
```{r can}
apply(ex, MARGIN = 2, mean)
```
Take the sum of all rows:
```{r cao}
head(apply(ex, MARGIN = 1, sum), 4)
```
You can use your own function with any of the `apply` functions. For example, if you wanted to calculate a value for each player that is a weighted mean of some of their statistics, you could run:
```{r cap}
weighted_mean <- function(soccer_stats,
weights = c(0.40, 0.01,
0.25, 1.5)){
out <- sum(weights * soccer_stats)
return(out)
}
head(apply(ex, MARGIN = 1, weighted_mean), 4)
```
The `lapply()` function will apply a function across a list. The different elements of the list do not have to be the same length (unlike a data frame, where the columns all have to have the same length).
```{r caq}
(ex <- list(a = c(1:5), b = rnorm(3), c = letters[1:4]))
```
This call will calculate the mean of each function:
```{r car, warning = FALSE}
lapply(ex, FUN = mean)
```
You can include arguments for the function that you specify with `FUN`, and they'll be passed to that function. For example, to get the first value of each element, you can run:
```{r cas, warning = FALSE}
lapply(ex, FUN = head, n = 1)
```
The `sapply()` function also applies a function over a list, but it returns a vector rather than a list:
```{r cat}
sapply(ex, FUN = head, n = 1)
```
In practice, I do use `apply()` some, but I can often find a way to do similar things to other `apply` family functions using the tools in `dplyr`. \bigskip
You should know that `apply` family functions take advantage of the matrix structure in R. This can be one of the fastest ways to run code in R. It is usually a lot faster than doing the same things with loops. However, unless you are working with large data sets, you may not notice a difference, and "tidyverse" functions are usually comparable in speed. \bigskip
I would recommend using whichever method makes the most sense to you until you run into an analysis that takes a noticeable amount of time to run, and then you might want to work a bit more to optimize your code. \bigskip
## Point maps
It is very easy now to create point maps in R based on longitude and latitude values of specific locations. You can use the `map_data` function from the `ggplot2` package to pull data for maps at different levels ("usa", "state", "world", "county").
The maps you pull using `map_data` are just data to use to plot polygon shapes for areas like states and counties.
```{r cau, warning = FALSE, message = FALSE}
library(ggplot2)
us_map <- map_data("state")
head(us_map, 3)
```
You can add points to these based on latitude and longitude.
Mapping uses the `long` and `lat` columns from this data for location:
```{r cav, fig.width = 6, fig.height = 2.5, fig.align = "center"}
north_carolina <- us_map %>%
filter(region == "north carolina")
ggplot(north_carolina, aes(x = long, y = lat)) +
geom_point()
```
If you try to plot lines, however, you'll have a problem:
```{r caw, fig.width = 3.5, fig.height = 2.25, fig.align = "center"}
carolinas <- us_map %>%
filter(str_detect(region, "carolina"))
ggplot(carolinas, aes(x = long, y = lat)) +
geom_path()
```
The `group` column fixes this problem. It will plot a separate path or polygon for each separate group. For mapping, this gives separate groupings for mainland versus islands and for different states:
```{r cax}
carolinas %>%
group_by(group) %>%
slice(1)
```
Using `group = group` avoids the extra lines from the earlier map:
```{r cay, fig.width = 3.5, fig.height = 2.25, fig.align = "center"}
ggplot(carolinas, aes(x = long, y = lat,
group = group)) +
geom_path()
```
To plot filled regions, use `geom_polygon` with `fill = region`. Also, the "void" theme is often useful when mapping:
```{r caz, fig.width = 4.5, fig.height = 2, fig.align = "center"}
ggplot(carolinas, aes(x = long, y = lat,
group = group,
fill = region)) +
geom_polygon(color = "black") +
theme_void()
```
Here is an example of plotting all of the US by state:
```{r cba}
map_1 <- ggplot(us_map, aes(x = long, y = lat,
group = group)) +
geom_polygon(fill = "dodgerblue",
color = "white") +
theme_void()
```
```{r cbb, fig.width = 6.5, fig.height = 4, fig.align = "center"}
map_1
```
To add points to these maps, you can use `geom_point`, again using longitude and latitude to define position. \bigskip
Here I'll use an example of data points related to the story told in last year's ["Serial" podcast](http://serialpodcast.org).
```{r cbc}
serial <- read.csv("data/serial_map_data.csv")
head(serial, 3)
```
[David Robinson](https://github.com/dgrtwo/serial-ggvis/blob/master/serial-preprocessing.Rmd) figured out a way to convert the x and y coordinates in this data to latitude and longitude coordinates. I'm also adding a column for whether of not the point is a cell tower.
```{r cbd, message = FALSE, warning = FALSE}
serial <- serial %>%
mutate(long = -76.8854 + 0.00017022 * x,
lat = 39.23822 + 1.371014e-04 * y,
tower = Type == "cell-site")
```
```{r cbe, message = FALSE, warning = FALSE}
serial[c(1:2, (nrow(serial) - 1):nrow(serial)),
c("Type", "Name", "long", "lat", "tower")]
```
Now I can map just Baltimore City and Baltimore County in Maryland and add these points. I used `map_data` to pull the "county" map and specified "region" as "maryland", to limit the map just to Maryland counties.
```{r cbf}
baltimore <- map_data('county', region = 'maryland')
head(baltimore, 3)
```
From that, I subset out rows where the `subregion` column was "baltimore city" or "baltimore".
```{r cbg}
baltimore <- subset(baltimore,
subregion %in% c("baltimore city",
"baltimore"))
head(baltimore, 3)
```
I used `geom_point` to plot the points. `ggplot` uses the `group` column to group together counties, but we don't need that in the points, so I needed to set `group = NA` in the `geom_point` statement. I put `color = tower` inside the `aes` statement so that the points would be one color for cell towers and another color for everything else.
```{r cbh}
balt_plot <- ggplot(baltimore,
aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
geom_point(data = serial, aes(x = long, y = lat,
group = NA,
color = tower)) +
theme_bw()
```
```{r cbi, fig.width = 7, fig.height = 4}
balt_plot
```
## Choropleth maps
There's a fantastic new(-ish) package in R to plot choropleth maps. You could also plot choropleths using `ggplot` and other mapping functions, but I would strongly recommend this new package if you're mapping the US.
You will need to install and load the `choroplethr` package in R to use the functions below.
```{r cbj, message = FALSE, warning = FALSE}
# install.packages("choroplethr")
library(choroplethr)
# install.packages("choroplethrMaps")
library(choroplethrMaps)
```
At the most basic level, you can use this package to plot some data that comes automatically with the package (you'll just need to load the data using the `data` function). For example, if you wanted to plot state-by-state populations as of 2012, you could use:
```{r cbk, warning = FALSE, message=FALSE}
data(df_pop_state)
map_3 <- state_choropleth(df_pop_state)
```
```{r cbl, fig.width=8, fig.height=3.5}
map_3
```
You can find out more about the `df_pop_state` data if you type `?df_pop_state`. Notice that, for the data frame, the location is given in a column called `region` and the population size to plot is in a column called `value`.
```{r cbm}
head(df_pop_state, 3)
```
You could use this function to create any state-level choropleth you wanted, as long as you could create a data frame with a column for states called `region` and a column with the value you want to show called `value`.
You can run similar functions at different spatial resolutions (for example, county or zip code):
```{r cbn}
data(df_pop_county)
head(df_pop_county, 3)
```
You can plot choropleths at this level, as well:
```{r cbo, warning = FALSE, message = FALSE}
map_4 <- county_choropleth(df_pop_county)
```
```{r cbp, fig.width=8, fig.height=4}
map_4
```
You can even do this for countries of the world:
```{r cbq, fig.width = 8, fig.height = 3.5, warning=FALSE, message=FALSE}
data(df_pop_country)
country_choropleth(df_pop_country)
```
You can zoom into states or counties. For example, to plot population by county in Colorado, you could run:
```{r cbr, fig.width = 6, fig.height = 3, message = FALSE, warning = FALSE}
county_choropleth(df_pop_county, state_zoom = "colorado")
```
You can also use this package to map different tables from the US Census' American Community Survey.
The package includes the `choroplethr_acs()` function to do this, with an option for which level of map you want (`map = `, choices are "state", "country", and "zip"). If you want to map at the state level, for example, use `state_choroplethr_acs()` (other options are county level and zip code level).
These functions pull recent Census data directly from the US Census using its API, so they require you to get an API key, which you can get [here](http://api.census.gov/data/key_signup.html). \bigskip
Once you put in your request, they'll email you your key. Once they give you your API key, you'll need to install it on R:
```{r cbs, echo = FALSE, message = FALSE, warning = FALSE, eval = FALSE}
library(acs)
api.key.install(Sys.getenv("acskey"))
```
```{r cbt, eval = FALSE}
library(acs)
api.key.install('[your census api key]');
```
You can pick from a large number of American Community Survey tables-- [see here](http://factfinder.census.gov/faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=dataset&id=dataset.en.ACS_12_5YR) for the list plus ID numbers. If the table has multiple columns, you will be prompted to select which one you want to plot.
For example, table B19301 gives per-capita income, so if you wanted to plot that, you could run:
```{r cbu, fig.width = 3, fig.height = 2.75, message = FALSE, eval = FALSE}
county_choropleth_acs(tableId = "B19301",
state_zoom = c("wyoming",
"colorado"))
```
```{r cbv, fig.width = 10, fig.height = 6, message = FALSE, echo = FALSE, fig.align = "center", eval = FALSE}
ex_choro <- county_choropleth_acs(tableId = "B19301",
state_zoom = c("wyoming",
"colorado"))
save(ex_choro, file = "data/example_choropleth.Rdata")
```
```{r cbw, echo = FALSE, fig.width = 10, fig.height = 6, message = FALSE, fig.align = "center", eval = FALSE}
load("data/example_choropleth.Rdata")
ex_choro
```
## Google Maps API
The `ggmap` package allows you to use tools from Google Maps directly from R.
```{r cbx, message=FALSE, warning = FALSE}
## install.packages("ggmap")
library(ggmap)
```
This package uses the Google Maps API, so you should read their [terms of service](http://developers.google.com/maps/terms) and make sure you follow them. In particular, you are limited to just a certain number of queries per time.
You can use the `get_map` function to get maps for different locations. You can either use the longitude and latitude of the center point of the map, along with the `zoom` option to say how much to zoom in (3: continent to 20: building) or you can use a character string to specify a location.
If you do the second, `get_map` will actually use the Google Maps API to geocode the string to a latitude and longitude and then get the map (you can imagine that this is like searching in Google Maps in the search box for a location).
```{r cby, message = FALSE, warning = FALSE, fig.width = 3.5, fig.height = 3.5, fig.align = "center"}
beijing <- get_map("Beijing", zoom = 12)
ggmap(beijing)
```
With this package, you can get maps from the following different sources:
- Google Maps
- OpenStreetMap
- Stamen Maps
- CloudMade Maps (You may need a separate API key for this)
Here are different examples of Beijing using different map sources. (Also, note that I'm using the option `extent = "device"` to fill up the whole plot are with the map, instead of including axis labels and titles.)
```{r cca, message = FALSE, warning = FALSE}
beijing_a <- get_map("Beijing", zoom = 12,
source = "stamen", maptype = "toner")
a <- ggmap(beijing_a, extent = "device")
beijing_b <- get_map("Beijing", zoom = 12,
source = "stamen", maptype = "watercolor")
b <- ggmap(beijing_b, extent = "device")
beijing_c <- get_map("Beijing", zoom = 12,
source = "google", maptype = "hybrid")
c <- ggmap(beijing_c, extent = "device")
```
```{r ccb, fig.width = 7}
grid.arrange(a, b, c, nrow = 1)
```
As with the maps from `ggplot2`, you can add points to these maps:
```{r cce, warning = FALSE, message = FALSE}
serial_phone <- read.csv("data/serial_phone_data.csv") %>%
mutate(Cell_Site = substring(Cell_Site, 1, 4),
Call_Time = as.POSIXct(Call_Time, format = "%d/%m/%y %H:%M",
tz = "EST")) %>%
left_join(serial, by = c("Cell_Site" = "Name")) %>%
select(Person_Called, Call_Time, Duration, long, lat) %>%
filter(!(Person_Called %in% c("incoming", "# + Adnan cell"))) %>%
arrange(Call_Time)
```
```{r ccf, message = FALSE, warning = FALSE, fig.width = 5}
serial_map <- get_map(c(-76.7, 39.3), zoom = 12,
source = "stamen",
maptype = "toner")
serial_map <- ggmap(serial_map, extent = "device") +
geom_point(data = serial_phone,
aes(x = long, y = lat),
color = "red", size = 3,
alpha = 0.4) +
geom_point(data = subset(serial,
Type != "cell-site"),
aes(x = long, y = lat),
color = "darkgoldenrod1",
size = 2)
```
```{r ccg, warning = FALSE, message = FALSE, fig.width = 4, fig.height = 4, fig.align = "center", echo = FALSE}
serial_map
```
You can also use the Google Maps API, through the `geocode` function, to get the latitude and longitude of specific locations. Basically, if the string would give you the right location if you typed it in Google Maps, `geocode` should be able to geocode it.
For example, you can get the location of CSU:
```{r cch, message = FALSE, warning = FALSE}
geocode("Colorado State University")
```
You can also get a location by address:
```{r cci, message = FALSE, warning = FALSE}
geocode("1 First St NE, Washington, DC")
```
You can get distances, too, using the `mapdist` function with two locations. This will give you distance and also time.
```{r ccj, message = FALSE, warning = FALSE}
mapdist("Fort Collins CO",
"1 First St NE, Washington, DC") %>%
select(from, miles, hours)
```
## String operations
The `str_trim` function from the `stringr` package allows you to trim leading and trailing white space:
```{r cck}
with_spaces <- c(" a ", " bob", " gamma")
with_spaces
str_trim(with_spaces)
```
This is rarer, but if you ever want to, you can add leading and/or trailing white space to elements of a character vector with `str_pad` from the `stringr` package.
There are also functions to change a full character string to uppercase, lowercase, or title case:
```{r ccl}
titanic_train$Name[1]
str_to_upper(titanic_train$Name[1])
str_to_lower(titanic_train$Name[1])
str_to_title(str_to_lower(titanic_train$Name[1]))
```
## In-course exercise
This exercise will continue using the Fatality Analysis Reporting System (FARS) data we started using last week.
```{r ccm, echo = FALSE, message = FALSE, warning = FALSE}
state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt",
delim = "|") %>%
dplyr::rename(state = STATE,
state_name = STATE_NAME) %>%
dplyr::select(state, state_name) %>%
dplyr::mutate(state = as.integer(state))
accident <- read_csv("data/accident.csv") %>%
dplyr::select(STATE, DAY:MINUTE, DRUNK_DR) %>%
dplyr::rename(state = STATE,
drunk_dr = DRUNK_DR) %>%
dplyr::select(-DAY_WEEK) %>%
tidyr::unite(date, DAY:MINUTE, sep = "-") %>%
dplyr::mutate(date = dmy_hm(date),
drunk_dr = drunk_dr >= 1,
daytime = hour(date) %in% c(7:19),
month = month(date)) %>%
dplyr::filter(!is.na(date)) %>%
dplyr::left_join(state_fips, by = "state")
```
### Writing a function to create state-specific plots
- Adapt the code that you wrote for to create a loop in last week's in-course exercise and write a function that will create a state-specific plot when you call it. This function should have inputs of `datafr` and `which_state`, where `datafr` is the data frame with the full data set (`accident` in this case) and `which_state` is the name of the state for which you'd like to create the plot. Name your functions `make_state_plot`. Here are a few examples of running this function:
```{r ccn, echo = FALSE}
make_state_plot <- function(datafr, which_state){
a <- datafr %>%
dplyr::filter_(~ state_name == which_state) %>%
dplyr::mutate_(drunk_dr = ~ factor(drunk_dr,
labels = c("Unrelated to\ndrunk driving",
"Related to\ndrunk driving")),
daytime = ~ factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
dplyr::group_by_(~ daytime, ~ month, ~ drunk_dr) %>%
dplyr::summarize(accidents = n())
a <- ggplot2::ggplot(a, ggplot2::aes(x = daytime, y = accidents,
group = daytime)) +
ggplot2::geom_boxplot() +
ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") +
ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
return(a)
}
```
```{r cco, fig.width = 6, fig.height = 3, fig.align = "center"}
make_state_plot(accident, which_state = "California")
make_state_plot(accident, which_state = "North Carolina")
make_state_plot(accident, which_state = "Illinois")
```
- In general, there is a pattern of more fatal accidents during the daytime being unrelated to drunk driving versus related. The pattern is reversed for nighttime accidents. Use your function to see if you can find any states that don't follow this pattern.
- So far, we have not covered error checking in functions. In the function that you just wrote, we are assuming that the user will never give inputs that would cause errors. Look through the code for your function and see what assumptions we're making about what values a user would input for `datafr` and `which_state`. What could go wrong if a user does not comply with these assumptions?
- The best practice when writing functions, especially for others to use, is to use the package::function syntax to identify which package each function comes from. For example, if you're using `mutate` in a function, you should write the function code as `dplyr::mutate`. Go back through the code for the function you wrote. Every time it uses a function that is not part of base R, change the code to use this package::function syntax.
- If you have time: Can you combine your function, a loop (or `lapply`), and regular expressions to create plots for all states that start with "C"? That have "North" in their names? (*Hint*: One way to get a list of states is from the `accident` data frame. Once you've joined in state names as a column called `state_name`, you can use: `unique(accident$state_name)`.)
#### Example R code
Here is some example R code for writing the function to create state-specific plots:
```{r ccp, eval = FALSE}
make_state_plot <- function(datafr, which_state){
a <- datafr %>%
dplyr::filter(state_name == which_state) %>%
dplyr::mutate(drunk_dr = factor(drunk_dr,
labels = c("Unrelated to\ndrunk driving",
"Related to\ndrunk driving")),
daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
dplyr::group_by(daytime, month, drunk_dr) %>%
dplyr::summarize_(accidents = ~ n()) %>%
ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
group = daytime)) +
ggplot2::geom_boxplot() +
ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") +
ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
return(a)
}
# With package::function notation
make_state_plot <- function(datafr, which_state){
a <- datafr %>%
dplyr::filter(state_name == which_state) %>%
dplyr::mutate(drunk_dr = factor(drunk_dr,
labels = c("Unrelated to\ndrunk driving",
"Related to\ndrunk driving")),
daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
dplyr::group_by(daytime, month, drunk_dr) %>%
dplyr::summarize_(accidents = ~ n()) %>%
ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
group = daytime)) +
ggplot2::geom_boxplot() +
ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") +
ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
return(a)
}
```
Example of using this function to create plots for all states that start with "C":
```{r ccq, fig.width = 6, fig.height = 3, fig.align = "center", results='hide', fig.show='asis'}
states <- unique(accident$state_name)
c_states <- states[str_detect(states, "^C")]
for(state in c_states){
state_plot <- make_state_plot(datafr = accident,
which_state = state)
print(state_plot)
}
```
To use `lapply`, the state name needs to be the first argument in the function. We could re-write the function to comply and then use `lapply`:
```{r ccr, fig.width = 6, fig.height = 3, fig.align = "center", results='hide', fig.show='asis'}
make_state_plot <- function(which_state, datafr){
a <- datafr %>%
dplyr::filter(state_name == which_state) %>%
dplyr::mutate(drunk_dr = factor(drunk_dr,
labels = c("Unrelated to\ndrunk driving",
"Related to\ndrunk driving")),
daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
dplyr::group_by(daytime, month, drunk_dr) %>%
dplyr::summarize_(accidents = ~ n()) %>%
ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
group = daytime)) +
ggplot2::geom_boxplot() +
ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") +
ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
return(a)
}
north_states <- states[str_detect(states, "North")]
lapply(north_states, make_state_plot, datafr = accident)
```
### First steps in mapping
If you have time, you can try some exercises with mapping.
- Read in and clean up the FARS data and save it to an R object called `accident`. The `accident` data frame should include location (longitude and latitude), state, and whether the accident involved drunk driving. Once you're done cleaning, the dataset should look like this:
```{r ccs, message = FALSE, warning = FALSE}
state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt",
delim = "|") %>%
dplyr::rename(state = STATE,
state_name = STATE_NAME) %>%
dplyr::select(state, state_name) %>%
dplyr::mutate(state = as.integer(state))
accident <- read_csv("data/accident.csv") %>%
dplyr::select(STATE, LATITUDE, LONGITUD, DRUNK_DR) %>%
dplyr::rename(state = STATE,
lat = LATITUDE,
long = LONGITUD,
drunk_dr = DRUNK_DR) %>%
dplyr::mutate(drunk_dr = drunk_dr > 0) %>%
dplyr::left_join(state_fips, by = "state") %>%
dplyr::select(-state)
accident %>% slice(1:5)
```
- Next, create another R object with state-specific summaries. This object should be a data frame called `state_summaries` that gives the total number of fatal accidents in each state and the percent of all fatal accidents related to drunk driving in each state. The `state_summaries` data frame should look like this when you are done cleaning:
```{r cct}
state_summaries <- accident %>%
dplyr::group_by(state_name) %>%
dplyr::summarize_(n = ~ n(),
perc_drunk_dr = ~ 100 * mean(drunk_dr)) # Because `drunk_dr` is
# 0 / 1, the mean is the percent of
# values where `drunk_dr` == 1.
state_summaries %>% slice(1:5)
```
- Create the following state-level choropleths of number of fatal accidents in 2015 in each state and of the percent of fatal accidents linked to drunk driving in each state. The quickest way to do this is with the `choroplethr` package.
```{r ccu}
# With `choroplethr` package
library(choroplethr)
state_summaries %>%
dplyr::mutate(state_name = str_to_lower(state_name)) %>%
dplyr::rename(region = state_name,
value = n) %>%
state_choropleth(title = "Fatal accidents by state, 2015",
legend = "# of fatal accidents")
state_summaries %>%
dplyr::mutate(state_name = str_to_lower(state_name)) %>%
dplyr::rename(region = state_name,
value = perc_drunk_dr) %>%
state_choropleth(title = "Percent of fatal accidents linked\nto drunk driving, 2015",
legend = "% of fatal accidents\nlinked to drunk driving")
```
- Now, you will be using the latitude and longitude of each accident to plot accident locations within a state. Check if the columns for latitude and longitude need more cleaning (spoiler alert--- they do). Filter out any values that could not reasonably be within the 50 U.S. states and D.C.
```{r ccv}
# It turns out that values like 77.7777, 88.8888, and 99.9999 are used as
# missing values of latitude (similar for 777.777, etc., for longitude)
accident <- accident %>%
dplyr::filter(lat < 75 & long < 777)
```
- Map the locations of fatal accidents in Texas in 2015 on top of a Google Map base map that shows major roads. An example map is shown below. (*Hint:* Use the `get_map` and `ggmap` functions for this part.)
```{r ccw, message = FALSE, warning = FALSE}
texas <- get_map(location = "texas", zoom = 6, maptype = "roadmap")
ggmap(texas, extent = "device") +
geom_point(data = dplyr::filter(accident, state_name == "Texas"),
aes(x = long, y = lat), alpha = 0.2, size = 0.5,
color = "red")
```
- Create a function that will take three inputs: `datafr`, a data frame like `accident` with fatal accident data; `which_state`, a state name; and `drunk_dr`, a logical value (TRUE / FALSE) specifying whether to create separate maps for accidents that were and were not linked to drunk driving. This function should output a map (or two maps if `drunk_dr` is set to TRUE) with points for all fatal accidents in the state. Use `map_data` and create the map yourself, rather than using `choroplethr` for this question. Examples of calling this function are shown below the function definition.
```{r ccx}
map_fatal_accidents <- function(datafr = accident, which_state,
drunk_dr = FALSE){
state_accidents <- datafr %>%
dplyr::filter(state_name == which_state)
state_map_data <- ggplot2::map_data("county", region = which_state)
accident_map <- ggplot2::ggplot(state_map_data,
ggplot2::aes(x = long, y = lat,
group = subregion)) +
ggplot2::geom_polygon(color = "gray", fill = NA) +
ggplot2::theme_void() +
ggplot2::geom_point(data = state_accidents,
ggplot2::aes(x = long, y = lat, group = NULL),
alpha = 0.5, size = 0.7 )
if(drunk_dr){
accident_map <- accident_map +
ggplot2::facet_wrap(~ drunk_dr, ncol = 2)
}
return(accident_map)
}
```
```{r ccy, message = FALSE, warning = FALSE, fig.width = 4, fig.height = 2.75, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Colorado")
```
```{r ccz, message = FALSE, warning = FALSE, fig.width = 4.3, fig.height = 4.5, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Texas")
```
```{r cda, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 2.75, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Colorado",
drunk_dr = TRUE)
```
```{r cdb, message = FALSE, warning = FALSE, fig.width = 8.6, fig.height = 4.5, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Texas",
drunk_dr = TRUE)
```