08-reportingresults2.Rmd

# Reporting data results #2

[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week8.pdf) a pdf of the lecture slides covering this topic.

```{r caa, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_knit$set(error = TRUE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(knitr)
library(faraway)
data(worldcup)
library(ggthemes)
library(lubridate)
library(stringr)
library(gridExtra)
library(choroplethr)
library(choroplethrMaps)
library(titanic)
data(titanic_train)
```

## Matrices and lists

In this section, we'll talk about the `apply` family of functions, which allow you to apply a function to all values in a vector, matrix, or list. \bigskip

First, you need to know about two more object types in R:

- `matrix`
- `list`

### Matrices

A matrix is like a data frame, but all the values in all columns must be of the same class (e.g., numeric, character). 

R uses matrices a lot for its underlying math (e.g., for the linear algebra operations required for fitting regression models). R can do matrix operations quite quickly.

You can create a matrix with the `matrix` function. Input a vector with the values to fill the matrix and `ncol` to set the number of columns:

```{r cab}
foo <- matrix(1:10, ncol = 5)
foo
```

By default, the matrix will fill up by column. You can fill it by row with the `byrow` function: 

```{r cac}
foo <- matrix(1:10, ncol = 5, byrow = TRUE)
foo
```

In certain situations, you might want to work with a matrix instead of a data frame (for example, in cases where you were concerned about speed -- a matrix is more memory efficient than the corresponding data frame). If you want to convert a data frame to a matrix, you can use the `as.matrix` function: 

```{r cad}
foo <- data.frame(col_1 = 1:2, col_2 = 3:4,
                  col_3 = 5:6, col_4 = 7:8,
                  col_5 = 9:10)
(foo <- as.matrix(foo))
```

You can index matrices with square brackets, just like data frames: 

```{r cae}
foo[1, 1:2]
```

You cannot, however, use `dplyr` functions with matrices: 

```{r caf, eval = FALSE}
foo %>% filter(col_1 == 1)
```
```
Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to
  an object of class "c('matrix', 'integer',
  'numeric')"
```

All elements in a matrix must have the same class. \bigskip

The matrix will default to make all values the most general class of any of the values, in any column. For example, if we replaced one numeric value with the character "a", everything would turn into a character:

```{r cag}
foo[1, 1] <- "a"
foo
```

### Lists

A list has different elements, just like a data frame has different columns. However, the different elements of a list can have different lengths (unlike the columns of a data frame). The different elements can also have different classes.

```{r cah}
bar <- list(some_letters = letters[1:3],
            some_numbers = 1:5, 
            some_logical_values = c(TRUE, FALSE))
bar
```

To index an element from a list, use double square brackets. You can use bracket indexing either with numbers (which element in the list?) or with names. You can also index lists with the `$` operator.

```{r cai}
bar[[1]]
bar[["some_numbers"]]
bar$some_logical_values
```

Lists can be used to contain data with an unusual structure and / or lots of different components. For example, the information from fitting a regression is often stored as a list:

```{r caj}
my_mod <- glm(rnorm(10) ~ c(1:10))
is.list(my_mod)
```

```{r cak}
head(names(my_mod), 3)
my_mod[["coefficients"]]
```

## `apply` functions

There is a whole family of `apply` functions, as part of base R. These include:

- `apply`: Apply a function over all the rows (`MARGIN = 1`) or columns (`MARGIN = 2`) of a matrix
- `lapply`: Apply a function over elements of a list.
- `sapply`: Like `lapply`, but returns a vector instead of a list.

Here is the syntax for `apply`: 

```{r cal, eval = FALSE}
## Generic code
apply([matrix], MARGIN = [margin (1: rows, 2: columns)],
      FUN =  [function])
```

I'll use the `worldcup` data as an example:

```{r cam}
ex <- worldcup[ , c("Shots", "Passes", "Tackles", "Saves")]
head(ex)
```

Take the mean of all columns:

```{r can}
apply(ex, MARGIN = 2, mean)
```

Take the sum of all rows:

```{r cao}
head(apply(ex, MARGIN = 1, sum), 4)
```

You can use your own function with any of the `apply` functions. For example, if you wanted to calculate a value for each player that is a weighted mean of some of their statistics, you could run:

```{r cap}
weighted_mean <- function(soccer_stats,
                          weights = c(0.40, 0.01,
                                      0.25, 1.5)){
  out <- sum(weights * soccer_stats)
  return(out)
}

head(apply(ex, MARGIN = 1, weighted_mean), 4)
```

The `lapply()` function will apply a function across a list. The different elements of the list do not have to be the same length (unlike a data frame, where the columns all have to have the same length).

```{r caq}
(ex <- list(a = c(1:5), b = rnorm(3), c = letters[1:4]))
```

This call will calculate the mean of each function:

```{r car, warning = FALSE}
lapply(ex, FUN = mean)
```

You can include arguments for the function that you specify with `FUN`, and they'll be passed to that function. For example, to get the first value of each element, you can run:

```{r cas, warning = FALSE}
lapply(ex, FUN = head, n = 1)
```

The `sapply()` function also applies a function over a list, but it returns a vector rather than a list:

```{r cat}
sapply(ex, FUN = head, n = 1)
```

In practice, I do use `apply()` some, but I can often find a way to do similar things to other `apply` family functions using the tools in `dplyr`. \bigskip

You should know that `apply` family functions take advantage of the matrix structure in R. This can be one of the fastest ways to run code in R. It is usually a lot faster than doing the same things with loops. However, unless you are working with large data sets, you may not notice a difference, and "tidyverse" functions are usually comparable in speed. \bigskip

I would recommend using whichever method makes the most sense to you until you run into an analysis that takes a noticeable amount of time to run, and then you might want to work a bit more to optimize your code. \bigskip

## Point maps

It is very easy now to create point maps in R based on longitude and latitude values of specific locations. You can use the `map_data` function from the `ggplot2` package to pull data for maps at different levels ("usa", "state", "world", "county").

The maps you pull using `map_data` are just data to use to plot polygon shapes for areas like states and counties. 

```{r cau, warning = FALSE, message = FALSE}
library(ggplot2)
us_map <- map_data("state")
head(us_map, 3)
```

You can add points to these based on latitude and longitude.

Mapping uses the `long` and `lat` columns from this data for location:

```{r cav, fig.width = 6, fig.height = 2.5, fig.align = "center"}
north_carolina <- us_map %>% 
  filter(region == "north carolina")
ggplot(north_carolina, aes(x = long, y = lat)) +
  geom_point()
```

If you try to plot lines, however, you'll have a problem:

```{r caw, fig.width = 3.5, fig.height = 2.25, fig.align = "center"}
carolinas <- us_map %>% 
  filter(str_detect(region, "carolina"))
ggplot(carolinas, aes(x = long, y = lat)) + 
  geom_path()
```

The `group` column fixes this problem. It will plot a separate path or polygon for each separate group. For mapping, this gives separate groupings for mainland versus islands and for different states:

```{r cax}
carolinas %>%
  group_by(group) %>%
  slice(1)
```

Using `group = group` avoids the extra lines from the earlier map:

```{r cay, fig.width = 3.5, fig.height = 2.25, fig.align = "center"}
ggplot(carolinas, aes(x = long, y = lat,
                      group = group)) + 
  geom_path()
```

To plot filled regions, use `geom_polygon` with `fill = region`. Also, the "void" theme is often useful when mapping:

```{r caz, fig.width = 4.5, fig.height = 2, fig.align = "center"}
ggplot(carolinas, aes(x = long, y = lat,
                      group = group,
                      fill = region)) + 
  geom_polygon(color = "black") + 
  theme_void()
```

Here is an example of plotting all of the US by state:

```{r cba}
map_1 <- ggplot(us_map, aes(x = long, y = lat,
                            group = group)) + 
        geom_polygon(fill = "dodgerblue",
                     color = "white") +
        theme_void()
```

```{r cbb, fig.width = 6.5, fig.height = 4, fig.align = "center"}
map_1
```

To add points to these maps, you can use `geom_point`, again using longitude and latitude to define position. \bigskip

Here I'll use an example of data points related to the story told in last year's ["Serial" podcast](http://serialpodcast.org). 

```{r cbc}
serial <- read.csv("data/serial_map_data.csv")
head(serial, 3)
```

[David Robinson](https://github.com/dgrtwo/serial-ggvis/blob/master/serial-preprocessing.Rmd) figured out a way to convert the x and y coordinates in this data to latitude and longitude coordinates. I'm also adding a column for whether of not the point is a cell tower.

```{r cbd, message = FALSE, warning = FALSE}
serial <- serial %>%
    mutate(long = -76.8854 + 0.00017022 * x,
           lat  = 39.23822 + 1.371014e-04 * y,
           tower = Type == "cell-site")
```

```{r cbe, message = FALSE, warning = FALSE}
serial[c(1:2, (nrow(serial) - 1):nrow(serial)), 
       c("Type", "Name", "long", "lat", "tower")]
```

Now I can map just Baltimore City and Baltimore County in Maryland and add these points. I used `map_data` to pull the "county" map and specified "region" as "maryland", to limit the map just to Maryland counties. 

```{r cbf}
baltimore <- map_data('county', region = 'maryland')
head(baltimore, 3)
```

From that, I subset out rows where the `subregion` column was "baltimore city" or "baltimore". 

```{r cbg}
baltimore <- subset(baltimore, 
                    subregion %in% c("baltimore city",
                                     "baltimore"))
head(baltimore, 3)
```

I used `geom_point` to plot the points. `ggplot` uses the `group` column to group together counties, but we don't need that in the points, so I needed to set `group = NA` in the `geom_point` statement. I put `color = tower` inside the `aes` statement so that the points would be one color for cell towers and another color for everything else. 

```{r cbh}
balt_plot <- ggplot(baltimore, 
                    aes(x = long, y = lat, group = group)) + 
        geom_polygon(fill = "lightblue", color = "black") +
        geom_point(data = serial, aes(x = long, y = lat,
                                      group = NA,
                                      color = tower)) + 
        theme_bw()
```

```{r cbi, fig.width = 7, fig.height = 4}
balt_plot
```

## Choropleth maps

There's a fantastic new(-ish) package in R to plot choropleth maps. You could also plot choropleths using `ggplot` and other mapping functions, but I would strongly recommend this new package if you're mapping the US. 

You will need to install and load the `choroplethr` package in R to use the functions below.

```{r cbj, message = FALSE, warning = FALSE}
# install.packages("choroplethr")
library(choroplethr)

# install.packages("choroplethrMaps")
library(choroplethrMaps)
```

At the most basic level, you can use this package to plot some data that comes automatically with the package (you'll just need to load the data using the `data` function). For example, if you wanted to plot state-by-state populations as of 2012, you could use:

```{r cbk, warning = FALSE, message=FALSE}
data(df_pop_state)
map_3 <- state_choropleth(df_pop_state)
```

```{r cbl, fig.width=8, fig.height=3.5}
map_3
```

You can find out more about the `df_pop_state` data if you type `?df_pop_state`. Notice that, for the data frame, the location is given in a column called `region` and the population size to plot is in a column called `value`. 

```{r cbm}
head(df_pop_state, 3)
```

You could use this function to create any state-level choropleth you wanted, as long as you could create a data frame with a column for states called `region` and a column with the value you want to show called `value`.

You can run similar functions at different spatial resolutions (for example, county or zip code):

```{r cbn}
data(df_pop_county)
head(df_pop_county, 3)
```

You can plot choropleths at this level, as well:

```{r cbo, warning = FALSE, message = FALSE}
map_4 <- county_choropleth(df_pop_county)
```

```{r cbp, fig.width=8, fig.height=4}
map_4
```

You can even do this for countries of the world:

```{r cbq, fig.width = 8, fig.height = 3.5, warning=FALSE, message=FALSE}
data(df_pop_country)
country_choropleth(df_pop_country)
```

You can zoom into states or counties. For example, to plot population by county in Colorado, you could run:

```{r cbr, fig.width = 6, fig.height = 3, message = FALSE, warning = FALSE}
county_choropleth(df_pop_county, state_zoom = "colorado")
```

You can also use this package to map different tables from the US Census' American Community Survey. 

The package includes the `choroplethr_acs()` function to do this, with an option for which level of map you want (`map = `, choices are "state", "country", and "zip"). If you want to map at the state level, for example, use `state_choroplethr_acs()` (other options are county level and zip code level). 

These functions pull recent Census data directly from the US Census using its API, so they require you to get an API key, which you can get [here](http://api.census.gov/data/key_signup.html). \bigskip

Once you put in your request, they'll email you your key. Once they give you your API key, you'll need to install it on R:

```{r cbs, echo = FALSE, message = FALSE, warning = FALSE, eval = FALSE}
library(acs)
api.key.install(Sys.getenv("acskey"))
```

```{r cbt, eval = FALSE}
library(acs)
api.key.install('[your census api key]');
```

You can pick from a large number of American Community Survey tables-- [see here](http://factfinder.census.gov/faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=dataset&id=dataset.en.ACS_12_5YR) for the list plus ID numbers. If the table has multiple columns, you will be prompted to select which one you want to plot. 

For example, table B19301 gives per-capita income, so if you wanted to plot that, you could run:

```{r cbu, fig.width = 3, fig.height = 2.75, message = FALSE, eval = FALSE}
county_choropleth_acs(tableId = "B19301",
                       state_zoom = c("wyoming",
                                      "colorado"))

```

```{r cbv, fig.width = 10, fig.height = 6, message = FALSE, echo = FALSE, fig.align = "center", eval = FALSE}
ex_choro <- county_choropleth_acs(tableId = "B19301",
                       state_zoom = c("wyoming",
                                      "colorado"))
save(ex_choro, file = "data/example_choropleth.Rdata")

```

```{r cbw, echo = FALSE, fig.width = 10, fig.height = 6, message = FALSE, fig.align = "center", eval = FALSE}
load("data/example_choropleth.Rdata")
ex_choro
```


## Google Maps API

The `ggmap` package allows you to use tools from Google Maps directly from R.

```{r cbx, message=FALSE, warning = FALSE}
## install.packages("ggmap")
library(ggmap)
```

This package uses the Google Maps API, so you should read their [terms of service](http://developers.google.com/maps/terms) and make sure you follow them. In particular, you are limited to just a certain number of queries per time. 

You can use the `get_map` function to get maps for different locations. You can either use the longitude and latitude of the center point of the map, along with the `zoom` option to say how much to zoom in (3: continent to 20: building) or you can use a character string to specify a location. 

If you do the second, `get_map` will actually use the Google Maps API to geocode the string to a latitude and longitude and then get the map (you can imagine that this is like searching in Google Maps in the search box for a location).

```{r cby, message = FALSE, warning = FALSE, fig.width = 3.5, fig.height = 3.5, fig.align = "center"}
beijing <- get_map("Beijing", zoom = 12)
ggmap(beijing)
```

With this package, you can get maps from the following different sources:

- Google Maps
- OpenStreetMap
- Stamen Maps
- CloudMade Maps (You may need a separate API key for this)

Here are different examples of Beijing using different map sources. (Also, note that I'm using the option `extent = "device"` to fill up the whole plot are with the map, instead of including axis labels and titles.)

```{r cca, message = FALSE, warning = FALSE}
beijing_a <- get_map("Beijing", zoom = 12,
                     source = "stamen", maptype = "toner")
a <- ggmap(beijing_a, extent = "device")

beijing_b <- get_map("Beijing", zoom = 12,
                     source = "stamen", maptype = "watercolor")
b <- ggmap(beijing_b, extent = "device")

beijing_c <- get_map("Beijing", zoom = 12,
                     source = "google", maptype = "hybrid")
c <- ggmap(beijing_c, extent = "device")
```

```{r ccb, fig.width = 7}
grid.arrange(a, b, c, nrow = 1) 
```

As with the maps from `ggplot2`, you can add points to these maps:

```{r cce, warning = FALSE, message = FALSE}
serial_phone <- read.csv("data/serial_phone_data.csv") %>% 
        mutate(Cell_Site = substring(Cell_Site, 1, 4),
               Call_Time = as.POSIXct(Call_Time, format = "%d/%m/%y %H:%M",
                                      tz = "EST")) %>%
        left_join(serial, by = c("Cell_Site" = "Name")) %>%
        select(Person_Called, Call_Time, Duration, long, lat) %>%
        filter(!(Person_Called %in% c("incoming", "# + Adnan cell"))) %>%
        arrange(Call_Time)
```

```{r ccf, message = FALSE, warning = FALSE, fig.width = 5}
serial_map <- get_map(c(-76.7, 39.3), zoom = 12,
                      source = "stamen",
                      maptype = "toner")
serial_map <- ggmap(serial_map, extent = "device") + 
        geom_point(data = serial_phone,
                   aes(x = long, y = lat),
                   color = "red", size = 3,
                   alpha = 0.4) +
        geom_point(data = subset(serial,
                                 Type != "cell-site"),
                   aes(x = long, y = lat),
                   color = "darkgoldenrod1",
                   size = 2)
```

```{r ccg, warning = FALSE, message = FALSE, fig.width = 4, fig.height = 4, fig.align = "center", echo = FALSE}
serial_map
```

You can also use the Google Maps API, through the `geocode` function, to get the latitude and longitude of specific locations. Basically, if the string would give you the right location if you typed it in Google Maps, `geocode` should be able to geocode it.

For example, you can get the location of CSU:

```{r cch, message = FALSE, warning = FALSE}
geocode("Colorado State University")
```

You can also get a location by address:

```{r cci, message = FALSE, warning = FALSE}
geocode("1 First St NE, Washington, DC")
```

You can get distances, too, using the `mapdist` function with two locations. This will give you distance and also time. 

```{r ccj, message = FALSE, warning = FALSE}
mapdist("Fort Collins CO",
        "1 First St NE, Washington, DC") %>%
  select(from, miles, hours)
```

## String operations

The `str_trim` function from the `stringr` package allows you to trim leading and trailing white space: 

```{r cck}
with_spaces <- c("    a ", "  bob", " gamma")
with_spaces
str_trim(with_spaces)
```

This is rarer, but if you ever want to, you can add leading and/or trailing white space to elements of a character vector with `str_pad` from the `stringr` package.

There are also functions to change a full character string to uppercase, lowercase, or title case: 

```{r ccl}
titanic_train$Name[1]
str_to_upper(titanic_train$Name[1])
str_to_lower(titanic_train$Name[1])
str_to_title(str_to_lower(titanic_train$Name[1]))
```

## In-course exercise

This exercise will continue using the Fatality Analysis Reporting System (FARS) data we started using last week. 

```{r ccm, echo = FALSE, message = FALSE, warning = FALSE}
state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt",
                         delim = "|") %>%
  dplyr::rename(state = STATE,
         state_name = STATE_NAME) %>%
  dplyr::select(state, state_name) %>%
  dplyr::mutate(state = as.integer(state))

accident <- read_csv("data/accident.csv") %>%
  dplyr::select(STATE, DAY:MINUTE, DRUNK_DR) %>%
  dplyr::rename(state = STATE, 
                drunk_dr = DRUNK_DR) %>%
  dplyr::select(-DAY_WEEK) %>%
  tidyr::unite(date, DAY:MINUTE, sep = "-") %>%
  dplyr::mutate(date = dmy_hm(date), 
                drunk_dr = drunk_dr >= 1, 
                daytime = hour(date) %in% c(7:19), 
                month = month(date)) %>%
  dplyr::filter(!is.na(date)) %>%
  dplyr::left_join(state_fips, by = "state")
```

### Writing a function to create state-specific plots

- Adapt the code that you wrote for to create a loop in last week's in-course exercise and write a function that will create a state-specific plot when you call it. This function should have inputs of `datafr` and `which_state`, where `datafr` is the data frame with the full data set (`accident` in this case) and `which_state` is the name of the state for which you'd like to create the plot. Name your functions `make_state_plot`. Here are a few examples of running this function:

```{r ccn, echo = FALSE}
make_state_plot <- function(datafr, which_state){
  a <- datafr %>%
    dplyr::filter_(~ state_name == which_state) %>%
    dplyr::mutate_(drunk_dr = ~ factor(drunk_dr,
                             labels = c("Unrelated to\ndrunk driving",
                                        "Related to\ndrunk driving")),
           daytime = ~ factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
    dplyr::group_by_(~ daytime, ~ month, ~ drunk_dr) %>%
    dplyr::summarize(accidents = n()) 
  a <- ggplot2::ggplot(a, ggplot2::aes(x = daytime, y = accidents,
                                 group = daytime)) + 
    ggplot2::geom_boxplot() + 
    ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
    ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") + 
    ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
  return(a)
}
```

```{r cco, fig.width = 6, fig.height = 3, fig.align = "center"}
make_state_plot(accident, which_state = "California")
make_state_plot(accident, which_state = "North Carolina")
make_state_plot(accident, which_state = "Illinois")
```

- In general, there is a pattern of more fatal accidents during the daytime being unrelated to drunk driving versus related. The pattern is reversed for nighttime accidents. Use your function to see if you can find any states that don't follow this pattern.
- So far, we have not covered error checking in functions. In the function that you just wrote, we are assuming that the user will never give inputs that would cause errors. Look through the code for your function and see what assumptions we're making about what values a user would input for `datafr` and `which_state`. What could go wrong if a user does not comply with these assumptions?
- The best practice when writing functions, especially for others to use, is to use the package::function syntax to identify which package each function comes from. For example, if you're using `mutate` in a function, you should write the function code as `dplyr::mutate`. Go back through the code for the function you wrote. Every time it uses a function that is not part of base R, change the code to use this package::function syntax.
- If you have time: Can you combine your function, a loop (or `lapply`), and regular expressions to create plots for all states that start with "C"? That have "North" in their names? (*Hint*: One way to get a list of states is from the `accident` data frame. Once you've joined in state names as a column called `state_name`, you can use: `unique(accident$state_name)`.)

#### Example R code

Here is some example R code for writing the function to create state-specific plots: 

```{r ccp, eval = FALSE}
make_state_plot <- function(datafr, which_state){
  a <- datafr %>%
    dplyr::filter(state_name == which_state) %>%
    dplyr::mutate(drunk_dr = factor(drunk_dr,
                             labels = c("Unrelated to\ndrunk driving",
                                        "Related to\ndrunk driving")),
           daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
    dplyr::group_by(daytime, month, drunk_dr) %>%
    dplyr::summarize_(accidents = ~ n()) %>%
    ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
                                 group = daytime)) + 
    ggplot2::geom_boxplot() + 
    ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
    ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") + 
    ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
  return(a)
}

# With package::function notation
make_state_plot <- function(datafr, which_state){
  a <- datafr %>%
    dplyr::filter(state_name == which_state) %>%
    dplyr::mutate(drunk_dr = factor(drunk_dr,
                             labels = c("Unrelated to\ndrunk driving",
                                        "Related to\ndrunk driving")),
           daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
    dplyr::group_by(daytime, month, drunk_dr) %>%
    dplyr::summarize_(accidents = ~ n()) %>%
    ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
                                 group = daytime)) + 
    ggplot2::geom_boxplot() + 
    ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
    ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") + 
    ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
  return(a)
}
```

Example of using this function to create plots for all states that start with "C":

```{r ccq, fig.width = 6, fig.height = 3, fig.align = "center", results='hide', fig.show='asis'}
states <- unique(accident$state_name)
c_states <- states[str_detect(states, "^C")]

for(state in c_states){
  state_plot <- make_state_plot(datafr = accident,
                                which_state = state)
  print(state_plot)
}
```

To use `lapply`, the state name needs to be the first argument in the function. We could re-write the function to comply and then use `lapply`:

```{r ccr, fig.width = 6, fig.height = 3, fig.align = "center", results='hide', fig.show='asis'}
make_state_plot <- function(which_state, datafr){
  a <- datafr %>%
    dplyr::filter(state_name == which_state) %>%
    dplyr::mutate(drunk_dr = factor(drunk_dr,
                             labels = c("Unrelated to\ndrunk driving",
                                        "Related to\ndrunk driving")),
           daytime = factor(daytime, labels = c("Nighttime", "Daytime"))) %>%
    dplyr::group_by(daytime, month, drunk_dr) %>%
    dplyr::summarize_(accidents = ~ n()) %>%
    ggplot2::ggplot(ggplot2::aes(x = daytime, y = accidents,
                                 group = daytime)) + 
    ggplot2::geom_boxplot() + 
    ggplot2::facet_wrap(~ drunk_dr, ncol = 2) +
    ggplot2::xlab("Time of day") + ggplot2::ylab("# of monthly accidents") + 
    ggplot2::ggtitle(paste("Fatal accidents in", which_state, "in 2015"))
  return(a)
}

north_states <- states[str_detect(states, "North")]
lapply(north_states, make_state_plot, datafr = accident)
```


### First steps in mapping 

If you have time, you can try some exercises with mapping. 

- Read in and clean up the FARS data and save it to an R object called `accident`. The `accident` data frame should include location (longitude and latitude), state, and whether the accident involved drunk driving. Once you're done cleaning, the dataset should look like this: 

```{r ccs, message = FALSE, warning = FALSE}
state_fips <- read_delim("http://www2.census.gov/geo/docs/reference/state.txt",
                         delim = "|") %>%
  dplyr::rename(state = STATE,
                state_name = STATE_NAME) %>%
  dplyr::select(state, state_name) %>%
  dplyr::mutate(state = as.integer(state))

accident <- read_csv("data/accident.csv") %>%
  dplyr::select(STATE, LATITUDE, LONGITUD, DRUNK_DR) %>%
  dplyr::rename(state = STATE,
                lat = LATITUDE,
                long = LONGITUD,
                drunk_dr = DRUNK_DR) %>%
  dplyr::mutate(drunk_dr = drunk_dr > 0) %>%
  dplyr::left_join(state_fips, by = "state") %>%
  dplyr::select(-state)
accident %>% slice(1:5)
```

- Next, create another R object with state-specific summaries. This object should be a data frame called `state_summaries` that gives the total number of fatal accidents in each state and the percent of all fatal accidents related to drunk driving in each state. The `state_summaries` data frame should look like this when you are done cleaning: 

```{r cct}
state_summaries <- accident %>%
  dplyr::group_by(state_name) %>%
  dplyr::summarize_(n = ~ n(),
            perc_drunk_dr = ~ 100 * mean(drunk_dr)) # Because `drunk_dr` is 
                                            # 0 / 1, the mean is the percent of 
                                            # values where `drunk_dr` == 1.
state_summaries %>% slice(1:5)
```

- Create the following state-level choropleths of number of fatal accidents in 2015 in each state and of the percent of fatal accidents linked to drunk driving in each state. The quickest way to do this is with the `choroplethr` package. 

```{r ccu}
# With `choroplethr` package
library(choroplethr)
state_summaries %>%
  dplyr::mutate(state_name = str_to_lower(state_name)) %>%
  dplyr::rename(region = state_name,
                value = n) %>%
  state_choropleth(title = "Fatal accidents by state, 2015",
                   legend = "# of fatal accidents")

state_summaries %>%
  dplyr::mutate(state_name = str_to_lower(state_name)) %>%
  dplyr::rename(region = state_name,
                value = perc_drunk_dr) %>%
  state_choropleth(title = "Percent of fatal accidents linked\nto drunk driving, 2015",
                   legend = "% of fatal accidents\nlinked to drunk driving")
```

- Now, you will be using the latitude and longitude of each accident to plot accident locations within a state. Check if the columns for latitude and longitude need more cleaning (spoiler alert--- they do). Filter out any values that could not reasonably be within the 50 U.S. states and D.C. 

```{r ccv}
# It turns out that values like 77.7777, 88.8888, and 99.9999 are used as 
# missing values of latitude (similar for 777.777, etc., for longitude)
accident <- accident %>%
  dplyr::filter(lat < 75 & long < 777)
```

- Map the locations of fatal accidents in Texas in 2015 on top of a Google Map base map that shows major roads. An example map is shown below. (*Hint:* Use the `get_map` and `ggmap` functions for this part.)

```{r ccw, message = FALSE, warning = FALSE}
texas <- get_map(location = "texas", zoom = 6, maptype = "roadmap")
ggmap(texas, extent = "device") + 
  geom_point(data = dplyr::filter(accident, state_name == "Texas"), 
             aes(x = long, y = lat), alpha = 0.2, size = 0.5,
             color = "red")
```


- Create a function that will take three inputs: `datafr`, a data frame like `accident` with fatal accident data; `which_state`, a state name; and `drunk_dr`, a logical value (TRUE / FALSE) specifying whether to create separate maps for accidents that were and were not linked to drunk driving. This function should output a map (or two maps if `drunk_dr` is set to TRUE) with points for all fatal accidents in the state. Use `map_data` and create the map yourself, rather than using `choroplethr` for this question. Examples of calling this function are shown below the function definition.

```{r ccx}
map_fatal_accidents <- function(datafr = accident, which_state,
                                drunk_dr = FALSE){
  state_accidents <- datafr %>%
    dplyr::filter(state_name == which_state)
  state_map_data <- ggplot2::map_data("county", region = which_state)
  
  accident_map <- ggplot2::ggplot(state_map_data,
                                  ggplot2::aes(x = long, y = lat, 
                                               group = subregion)) + 
    ggplot2::geom_polygon(color = "gray", fill = NA) + 
    ggplot2::theme_void() + 
    ggplot2::geom_point(data = state_accidents, 
                        ggplot2::aes(x = long, y = lat, group = NULL),
                        alpha = 0.5, size = 0.7 )
  
  if(drunk_dr){
    accident_map <- accident_map + 
      ggplot2::facet_wrap(~ drunk_dr, ncol = 2)
  }
  
  return(accident_map)
}
```

```{r ccy, message = FALSE, warning = FALSE, fig.width = 4, fig.height = 2.75, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Colorado")
```

```{r ccz, message = FALSE, warning = FALSE, fig.width = 4.3, fig.height = 4.5, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Texas")
```

```{r cda, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 2.75, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Colorado", 
                    drunk_dr = TRUE)
```

```{r cdb, message = FALSE, warning = FALSE, fig.width = 8.6, fig.height = 4.5, fig.align = "center"}
map_fatal_accidents(accident, which_state = "Texas",
                    drunk_dr = TRUE)
```