diff --git a/R/learnr-things.R b/R/learnr-things.R new file mode 100644 index 0000000..fee6118 --- /dev/null +++ b/R/learnr-things.R @@ -0,0 +1,16 @@ +embedded_learnr <- function(url, id) { + glue::glue( + '', + .open = "[", .close = "]" + ) +} + +include_iframe_resizer <- function() { + glue::glue( + '' + ) +} diff --git a/_quarto.yml b/_quarto.yml index f40320e..9e8f75c 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -36,8 +36,12 @@ website: right: - syllabus.qmd - schedule.qmd - - text: "Classes and readings" + - text: "Content" file: content/index.qmd + - text: "Lessons" + file: lesson/index.qmd + - text: "Examples" + file: example/index.qmd - text: "Assignments" file: assignment/index.qmd - text: "Resources" @@ -177,6 +181,48 @@ website: - text: "15: Truth, beauty, and data revisited" file: assignment/15-exercise.qmd + - title: "Classes and readings" + contents: + - section: "Overview" + contents: + - lesson/index.qmd + - section: "Foundations" + contents: + - text: "1: Truth, beauty, and data + the tidyverse" + file: lesson/01-lesson.qmd + - text: "2: Graphic design" + file: lesson/02-lesson.qmd + - text: "3: Mapping data to graphics" + file: lesson/03-lesson.qmd + - section: "Core types of graphics" + contents: + - text: "4: Amounts and proportions" + file: lesson/04-lesson.qmd + - text: "5: Themes" + file: lesson/05-lesson.qmd + - text: "6: Uncertainty" + file: lesson/06-lesson.qmd + - text: "7: Relationships" + file: lesson/07-lesson.qmd + - text: "8: Comparisons" + file: lesson/08-lesson.qmd + - text: "9: Annotations" + file: lesson/09-lesson.qmd + - section: "Special applications" + contents: + - text: "10: Interactivity" + file: lesson/10-lesson.qmd + - text: "11: Time" + file: lesson/11-lesson.qmd + - text: "12: Space" + file: lesson/12-lesson.qmd + - text: "13: Text" + file: lesson/13-lesson.qmd + - text: "14: Enhancing graphics" + file: lesson/14-lesson.qmd + - text: "15: Truth, beauty, and data revisited" + file: lesson/15-lesson.qmd + - title: 
"Resources" contents: - section: "Resources" diff --git a/files/img/lesson/file-types/atlanta-night.jpg b/files/img/lesson/file-types/atlanta-night.jpg new file mode 100644 index 0000000..5242f92 Binary files /dev/null and b/files/img/lesson/file-types/atlanta-night.jpg differ diff --git a/files/img/lesson/file-types/atlanta-sign.jpg b/files/img/lesson/file-types/atlanta-sign.jpg new file mode 100644 index 0000000..182b9f5 Binary files /dev/null and b/files/img/lesson/file-types/atlanta-sign.jpg differ diff --git a/files/img/lesson/file-types/butterflies.png b/files/img/lesson/file-types/butterflies.png new file mode 100644 index 0000000..48ae04d Binary files /dev/null and b/files/img/lesson/file-types/butterflies.png differ diff --git a/files/img/lesson/file-types/gsu-logo.png b/files/img/lesson/file-types/gsu-logo.png new file mode 100644 index 0000000..a846f45 Binary files /dev/null and b/files/img/lesson/file-types/gsu-logo.png differ diff --git a/files/img/lesson/file-types/pie_chart.png b/files/img/lesson/file-types/pie_chart.png new file mode 100644 index 0000000..83d3a83 Binary files /dev/null and b/files/img/lesson/file-types/pie_chart.png differ diff --git a/files/img/lesson/file-types/solo.jpg b/files/img/lesson/file-types/solo.jpg new file mode 100644 index 0000000..755dcdb Binary files /dev/null and b/files/img/lesson/file-types/solo.jpg differ diff --git a/files/img/lesson/working-directory.png b/files/img/lesson/working-directory.png new file mode 100644 index 0000000..5a19900 Binary files /dev/null and b/files/img/lesson/working-directory.png differ diff --git a/lesson/01-lesson.qmd b/lesson/01-lesson.qmd new file mode 100644 index 0000000..c339534 --- /dev/null +++ b/lesson/01-lesson.qmd @@ -0,0 +1,89 @@ +--- +title: "Introduction to R and the tidyverse" +date: "2020-05-11" +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(fig.align = "center") +``` + +## Part 1: The basics of R and dplyr + +For the first part of today's lesson, you need 
to work through a few of RStudio's introductory primers. You'll do these in your browser and type code and see results there. + +You'll learn some of the basics of R, as well as some powerful methods for manipulating data with the **dplyr** package. + +Complete these: + +- **The Basics** + - [Visualization Basics](https://rstudio.cloud/learn/primers/1.1) + - [Programming Basics](https://rstudio.cloud/learn/primers/1.2) +- **Work with Data** + - [Working with Tibbles](https://rstudio.cloud/learn/primers/2.1) + - [Isolating Data with dplyr](https://rstudio.cloud/learn/primers/2.2) + - [Deriving Information with dplyr](https://rstudio.cloud/learn/primers/2.3) + +The content from these primers comes from the (free and online!) book [*R for Data Science* by Garrett Grolemund and Hadley Wickham](https://r4ds.had.co.nz/). I highly recommend the book as a reference and for continuing to learn and use R in the future (like running regression models and other types of statistical analysis) + + +## Part 2: Getting familiar with RStudio + +The RStudio primers you just worked through are a great introduction to writing and running R code, but you typically won't type code in a browser when you work with R. Instead, you'll use a nicer programming environment like RStudio, which lets you type and save code in scripts, run code from those scripts, and see the output of that code, all in the same program. + +To get familiar with RStudio, watch this video: + +
+ +
+ + +## Part 3: RStudio Projects + +One of the most powerful and useful aspects of RStudio is its ability to manage projects. + +When you first open R, it is "pointed" at some folder on your computer, and anything you do will be relative to that folder. The technical term for this is a "working directory." + +When you first open RStudio, look in the area right at the top of the Console pane to see your current working directory. Most likely you'll see something cryptic: `~/` + +```{r working-directory, echo=FALSE, out.width="50%"} +knitr::include_graphics("/files/img/lesson/working-directory.png", error = FALSE) +``` + +That tilde sign (`~`) is a shortcut that stands for your user directory. On Windows this is `C:\Users\your_user_name\`; on macOS this is `/Users/your_user_name/`. With the working directory set to `~/`, R is "pointed" at that folder, and anything you save will end up in that folder, and R will expect any data that you load to be there too. + +It's always best to point R at some other directory. If you don't use RStudio, you need to manually set the working directory to where you want it with `setwd()`, and many R scripts in the wild include something like `setwd("C:\\Users\\bill\\Desktop\\Important research project")` at the beginning to change the directory. **THIS IS BAD THOUGH** ([see here for an explanation](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)). If you ever move that directory somewhere else, or run the script on a different computer, or share the project with someone, the path will be wrong and nothing will run and you will be sad. + +The best way to deal with working directories with RStudio is to use RStudio Projects. These are special files that RStudio creates for you that end in a `.Rproj` extension. When you open one of these special files, a new RStudio instance will open up and be pointed at the correct directory automatically. 
If you move the folder later or open it on a different computer, it will work just fine and you will not be sad. + +[Read this super short chapter on RStudio projects.](https://r4ds.had.co.nz/workflow-projects.html) + + +## Part 4: Getting familiar with R Markdown + +To ensure that the analysis and graphics you make are reproducible, you'll do the majority of your work in this class using **R Markdown** files. + +Do the following things: + +1. Watch this video: + +
+ +
+ +  + +2. Skim through the content at these pages: + + - [Using Markdown](/resource/markdown/) + - [Using R Markdown](/resource/rmarkdown/) + - [How it Works](http://rmarkdown.rstudio.com/lesson-2.html) + - [Code Chunks](http://rmarkdown.rstudio.com/lesson-3.html) + - [Inline Code](http://rmarkdown.rstudio.com/lesson-4.html) + - [Markdown Basics](http://rmarkdown.rstudio.com/lesson-8.html) (The [R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) is super useful here.) + - [Output Formats](http://rmarkdown.rstudio.com/lesson-9.html) + +3. Watch this video: + +
+ +
diff --git a/lesson/02-lesson.qmd b/lesson/02-lesson.qmd new file mode 100644 index 0000000..a04a2e3 --- /dev/null +++ b/lesson/02-lesson.qmd @@ -0,0 +1,143 @@ +--- +title: "Graphic design" +date: "2020-05-12" +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(fig.align = "center") +``` + +## File types + +Recall from the [last section of the lecture](/slides/02-slides.html#image-types) that you'll typically work with one of two image file types: bitmap images and vector images. + +Bitmaps store image information as tiny squares, or pixels. Specific file types compress these images in different ways: JPEG files smudge together groups of similarly colored pixels to save repetition, while PNG and GIF files look for fields of the exact same color. + +```{r bitmap-example, echo=FALSE, out.width="30%"} +knitr::include_graphics("/slides/img/02/example-bitmap.png", error = FALSE) +``` + +You use bitmap images for things that go on the internet and when you place images in Word (technically modern versions of Word can handle some types of vector images, but that support isn't universal yet). + +Vector images, on the other hand, do not store image information as pixels. Instead, these use mathematical formulas to draw lines and curves and fill areas with specific colors. This makes them a little more complicated to draw and create, but it also means that you can scale them up or down infinitely—a vector image will look just as crisp on a postage stamp as it would on a billboard. + +```{r vector-example, echo=FALSE, out.width="30%"} +knitr::include_graphics("/slides/img/02/example-vector.png", error = FALSE) +``` + +Here are some general guidelines: + +- If an image has lots of colors (like a photograph), you should use a bitmap file type designed for lots of colors, like JPEG. This is the case regardless of where the image will ultimately end up. If you're putting it on the internet, it needs to be a JPEG. 
If you're blowing it up to fit on a billboard, it will still need to be a JPEG (and you have to use a fancy super high quality camera to get a high enough resolution for that kind of expansion) + +- If an image has a few colors and some text and is not a photograph *and* you're using the image in Word or on the internet, you should use a bitmap file type designed for carefully compressing a few colors, like PNG. + +- If an image has a few colors and some text and is not a photograph *and* you're planning on using it in multiple sizes (like a logo), or using it in fancier production software like Adobe InDesign (for print) or Adobe After Effects (for video), you should use a vector file type like PDF or SVG. + + +## Select the best file type + +Practice deciding what kind of file type you should use by looking at these images and choosing what you think works the best. + +```{r include=FALSE} +library(checkdown) +``` + +
+ +```{r atlanta-sign, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/atlanta-sign.jpg", error = FALSE) +``` + +```{r atlanta-sign-question, echo=FALSE, results="asis"} +check_question("JPG", options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 1, + right = "Correct! This is a photograph, so it should be a JPG. It might seem a little tricky since there are so few colors, but it still needs to be a JPG because the black paint on the brick is actually a range of thousands of different shades of black pixels.", + wrong = "Not quite—this image has a lot of colors in it…") +``` + +
+ +--- + +
+ +```{r gsu-logo, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/gsu-logo.png", error = FALSE) +``` + +```{r gsu-logo-question, echo=FALSE, results="asis"} +check_question(c("PNG", "PDF"), options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 2, + right = "Correct! This is a logo with a few colors in it, so it’s vector-based. If you use a PDF of the logo, you can rescale it infinitely big or small. If you use a PNG, it will work nicely online.", + wrong = "Not quite—this image doesn’t have a lot of colors in it…") +``` + +
+ +--- + +
+ +```{r pie-chart, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/pie_chart.png", error = FALSE) +``` + +```{r pie-chart-question, echo=FALSE, results="asis"} +check_question(c("PNG", "PDF"), options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 3, + right = "Correct! This is a graph with a few colors in it, so it should be vector-based. If you’re using this in a fancy publication or report, use a PDF. If you’re using Word or HTML, use a PNG.", + wrong = "Not quite—this image doesn’t have a lot of colors in it…") +``` + +
+ +--- + +
+ +```{r solo, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/solo.jpg", error = FALSE) +``` + +```{r solo-question, echo=FALSE, results="asis"} +check_question("JPG", options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 4, + right = "Correct! This has a ton of colors in it and is mostly a photograph. You may have been thrown off by the text in the bottom section, or the stylized shapes of the Millennium Falcon’s windows at the top. Those shapes and the text are both vector-based, but because the majority of the image is a photograph, it still needs to be saved as a JPG. To keep the text nice and crisp, it needs to be exported at a high resolution.", + wrong = "Not quite—this image has a lot of colors in it…") +``` + +
+ +--- + +
+ +```{r butterflies, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/butterflies.png", error = FALSE) +``` + +```{r butterflies-question, echo=FALSE, results="asis"} +check_question(c("PNG", "PDF"), options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 5, + right = "Correct! Even though this is very colorful, it should be a PNG or PDF, since it’s vector-based and not a photograph.", + wrong = "Not quite—this image is colorful, but it isn’t a photograph…") +``` + +
+ +--- + +
+ +```{r atlanta-night, echo=FALSE, out.width="60%"} +knitr::include_graphics("/files/img/lesson/file-types/atlanta-night.jpg", error = FALSE) +``` + +```{r atlanta-night-question, echo=FALSE, results="asis"} +check_question("JPG", options = c("PNG", "JPG", "PDF"), type = "radio", + button_label = "Check answer", question_id = 6, + right = "Correct! This is a photograph and should be a JPG.", + wrong = "Not quite—this image has a lot of colors in it…") +``` + +
diff --git a/lesson/03-lesson.qmd b/lesson/03-lesson.qmd new file mode 100644 index 0000000..5ad0b60 --- /dev/null +++ b/lesson/03-lesson.qmd @@ -0,0 +1,40 @@ +--- +title: "Mapping data to graphics" +date: "2020-05-13" +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(fig.align = "center") +``` + +## Part 1: Data visualization with {ggplot2} + +For the first part of today's lesson, you need to work through RStudio's introductory primers for {ggplot2}. You'll do these in your browser and type code and see results there. + +It seems like there are a lot, but they're short and go fairly quickly (especially as you get the hang of the `ggplot()` syntax). Complete these: + +- **Visualize Data** + - [Exploratory Data Analysis](https://rstudio.cloud/learn/primers/3.1) + - [Bar Charts](https://rstudio.cloud/learn/primers/3.2) + - [Histograms](https://rstudio.cloud/learn/primers/3.3) + - [Boxplots and Counts](https://rstudio.cloud/learn/primers/3.4) + - [Scatterplots](https://rstudio.cloud/learn/primers/3.5) + - [Line plots](https://rstudio.cloud/learn/primers/3.6) + - [Overplotting and Big Data](https://rstudio.cloud/learn/primers/3.7) + - [Customize Your Plots](https://rstudio.cloud/learn/primers/3.8) + + +## Part 2: Reshaping data with {tidyr} + +For the last part of today's lesson, you'll work through just one RStudio primer to learn how to use the {tidyr} package to reshape data from wide to long and back to wide. + +Complete this: + +- **Tidy Your Data** + - [Reshape Data](https://rstudio.cloud/learn/primers/4.1) + +::: {.callout-note} +### Pivoting + +Newer versions of **tidyr** have renamed these core functions: `gather()` is now `pivot_longer()` and `spread()` is now `pivot_wider()`. The syntax for these `pivot_*()` functions is *slightly* different from what it was in `gather()` and `spread()`, so you can't just replace the names. 
Fortunately, both `gather()` and `spread()` still work and won't go away for a while, so you can still use them as you learn about reshaping and tidying data. It would be worth learning how the newer `pivot_*()` functions work, eventually, though ([see here for examples](https://tidyr.tidyverse.org/articles/pivot.html)). +::: diff --git a/lesson/04-lesson.qmd b/lesson/04-lesson.qmd new file mode 100644 index 0000000..e857f6f --- /dev/null +++ b/lesson/04-lesson.qmd @@ -0,0 +1,565 @@ +--- +title: "Amounts and proportions" +date: "2022-06-13" +--- + +```{r setup, include=FALSE} +library(tidyverse) +library(gapminder) + +knitr::opts_chunk$set(fig.width = 6, fig.height = 4.5, fig.align = "center", collapse = TRUE) +set.seed(1234) +options(dplyr.summarise.inform = FALSE) +``` + +```{r learnr-setup, echo=FALSE, results="asis"} +source(here::here("R", "learnr-things.R")) +include_iframe_resizer() +``` + +When you visualize proportions with ggplot, you'll typically go through a two-step process: + +1. Summarize the data with {dplyr} (typically with a combination of `group_by()` and `summarize()`) +2. Plot the summarized data + + +## Manipulating data with {dplyr} + +You had some experience with {dplyr} functions in the RStudio primers, but we'll briefly review them here. + +There are 6 important verbs that you'll typically use when working with data: + +- Extract rows/cases with `filter()` +- Extract columns/variables with `select()` +- Arrange/sort rows with `arrange()` +- Make new columns/variables with `mutate()` +- Make group summaries with `group_by %>% summarize()` + +Every {dplyr} verb follows the same pattern. The first argument is always a data frame, and the function always returns a data frame: + +```{r dplyr-template, eval=FALSE} +VERB(DATA_TO_TRANSFORM, STUFF_IT_DOES) +``` + +### Filtering with `filter()` + +The `filter()` function takes two arguments: a data frame to transform, and a set of tests. It will return each row for which the test is TRUE. 
+ +This code, for instance, will look at the `gapminder` dataset and return all rows where `country` is equal to "Denmark": + +```{r filter-denmark} +filter(gapminder, country == "Denmark") +``` + +Notice that there are two equal signs (`==`). This is because it's a logical test, similar to greater than (`>`) or less than (`<`). When you use a single equal sign, you set an argument (like `data = gapminder`); when you use two, you are doing a test. There are lots of different ways to do logical tests: + +| Test | Meaning | +| ----------- | ------------------------ | +| `x < y` | Less than | +| `x > y` | Greater than | +| `x == y` | Equal to | +| `x <= y` | Less than or equal to | +| `x >= y` | Greater than or equal to | +| `x != y` | Not equal to | +| `x %in% y` | In (group membership) | +| `is.na(x)` | Is missing | +| `!is.na(x)` | Is not missing | + +::: {.callout-important} +### Your turn + +Use `filter()` and logical tests to show: + +1. The data for Canada +2. All data for countries in Oceania +3. Rows where life expectancy is greater than 82 + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-1/", + id = "learnr-04-lesson-dplyr1" +) +``` + +You can also use multiple conditions, and these will extract rows that meet every test. By default, if you separate the tests with a comma, R will consider this an "and" test and find rows that are *both* Denmark and greater than 2000. + +```{r filter-denmark-multiple} +filter(gapminder, country == "Denmark", year > 2000) +``` + +You can also use "or" with "`|`" and "not" with "`!`": + +| Operator | Meaning | +| -------- | ------- | +| `a & b` | and | +| `a | b` | or | +| `!a` | not | + + +::: {.callout-important} +### Your turn + +Use `filter()` and logical tests to show: + +1. Canada before 1970 +2. Countries where life expectancy in 2007 is below 50 +3. 
Countries where life expectancy in 2007 is below 50 and that are not in Africa + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-2/", + id = "learnr-04-lesson-dplyr2" +) +``` + +Beware of some common mistakes! You can't collapse multiple tests into one. Instead, use two separate tests: + +```{r filter-multiple, eval=FALSE} +# This won't work! +filter(gapminder, 1960 < year < 1980) + +# This will work +filter(gapminder, 1960 < year, year < 1980) +``` + +Also, you can avoid stringing together lots of tests by using the `%in%` operator, which checks to see if a value is in a list of values. + +```{r filter-in, eval=FALSE} +# This works, but is tedious +filter(gapminder, + country == "Mexico" | country == "Canada" | country == "United States") + +# This is more concise and easier to add other countries later +filter(gapminder, + country %in% c("Mexico", "Canada", "United States")) +``` + +### Adding new columns with `mutate()` + +You create new columns with the `mutate()` function. You can create a single column like this: + +```{r mutate-single} +mutate(gapminder, gdp = gdpPercap * pop) +``` + +And you can create multiple columns by including a comma-separated list of new columns to create: + +```{r mutate-multiple} +mutate(gapminder, gdp = gdpPercap * pop, + pop_mill = round(pop / 1000000)) +``` + +You can also do conditional tests within `mutate()` using the `ifelse()` function. This works like the `=IF` function in Excel. 
Feed the function three arguments: (1) a test, (2) the value if the test is true, and (3) the value if the test is false: + +```{r show-ifelse, eval=FALSE} +ifelse(TEST, VALUE_IF_TRUE, VALUE_IF_FALSE) +``` + +We can create a new column that is a binary indicator for whether the country's row is after 1960: + +```{r mutate-after-1960} +mutate(gapminder, after_1960 = ifelse(year > 1960, TRUE, FALSE)) +``` + +We can also use text labels instead of `TRUE` and `FALSE`: + +```{r mutate-after-1960-text} +mutate(gapminder, + after_1960 = ifelse(year > 1960, "After 1960", "Before 1960")) +``` + +::: {.callout-important} +### Your turn + +Use `mutate()` to: + +1. Add an `africa` column that is TRUE if the country is on the African continent +2. Add a column for logged GDP per capita +3. Add an `africa_asia` column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-3/", + id = "learnr-04-lesson-dplyr3" +) +``` + +### Combining multiple verbs with pipes (`%>%`) + +What if you want to filter to include only rows from 2002 *and* make a new column with the logged GDP per capita? Doing this requires both `filter()` and `mutate()`, so we need to find a way to use both at once. + +One solution is to use intermediate variables for each step: + +```{r pipes-intermediate, eval=FALSE} +gapminder_2002_filtered <- filter(gapminder, year == 2002) + +gapminder_2002_logged <- mutate(gapminder_2002_filtered, log_gdpPercap = log(gdpPercap)) +``` + +That works fine, but your environment panel will start getting full of lots of intermediate data frames. + +Another solution is to nest the functions inside each other. 
Remember that all {dplyr} functions return data frames, so you can feed the results of one into another: + +```{r pipes-nested, eval=FALSE} +filter(mutate(gapminder, log_gdpPercap = log(gdpPercap)), + year == 2002) +``` + +That works too, but it gets *really* complicated once you have even more functions, and it's hard to keep track of which function's arguments go where. I'd avoid doing this entirely. + +One really nice solution is to use a pipe, or `%>%`. **The pipe takes an object on the left and passes it as the first argument of the function on the right**. + +```{r pipe-example, eval=FALSE} +# gapminder will automatically get placed in the _____ spot +gapminder %>% filter(_____, country == "Canada") +``` + +These two lines of code do the same thing: + +```{r pipe-equivalent, eval=FALSE} +filter(gapminder, country == "Canada") + +gapminder %>% filter(country == "Canada") +``` + +Using pipes, you can start with a data frame, pass it to one verb, then pass the output of that verb to the next verb, and so on. **When reading any code with a `%>%`, it's easiest to read the `%>%` as "and then".** This would read: + +> Take the `gapminder` dataset *and then* filter it so that it only has rows from 2002 *and then* add a new column with the logged GDP per capita + +```{r pipes-full-example, eval=FALSE} +gapminder %>% + filter(year == 2002) %>% + mutate(log_gdpPercap = log(gdpPercap)) +``` + +Here's another way to think about pipes more conceptually. This isn't valid R code, obviously, but imagine you're going to take yourself, and then wake up, get out of bed, get dressed, and leave the house. Writing that whole process as nested functions would look like this: + +```{r wake-up-nested, eval=FALSE} +leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"), pants = TRUE, shirt = TRUE), car = TRUE, bike = FALSE) +``` + +Instead of nesting everything, we can use pipes to chain these together. 
This would read: + +> Take myself, *and then* wake up at 8:00, *and then* get out of bed on the correct side, *and then* get dressed with pants and a shirt, *and then* leave the house in a car + +```{r wake-up-pipes, eval=FALSE} +me %>% + wake_up(time = "8:00") %>% + get_out_of_bed(side = "correct") %>% + get_dressed(pants = TRUE, shirt = TRUE) %>% + leave_house(car = TRUE, bike = FALSE) +``` + +### Summarizing data by groups with `group_by() %>% summarize()` + +The `summarize()` verb takes an entire data frame and calculates summary information about it. For instance, this will find the average life expectancy for the whole `gapminder` dataset: + +```{r summarize-full-single} +gapminder %>% summarize(mean_life = mean(lifeExp)) +``` + +You can also make multiple summary variables, just like `mutate()`, and it will return a column for each: + +```{r summarize-full-multiple} +gapminder %>% summarize(mean_life = mean(lifeExp), + min_life = min(lifeExp)) +``` + +::: {.callout-important} +### Your turn + +Use `summarize()` to calculate: + +1. The first (minimum) year in the `gapminder` dataset +2. The last (maximum) year in the dataset +3. The number of rows in the dataset (use the [{dplyr} cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)) +4. 
The number of distinct countries in the dataset (use the [{dplyr} cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)) + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-4/", + id = "learnr-04-lesson-dplyr4" +) +``` + +::: {.callout-important} +### Your turn + +Use `filter()` and `summarize()` to calculate the median life expectancy on the African continent in 2007: +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-5/", + id = "learnr-04-lesson-dplyr5" +) +``` + +Notice that `summarize()` on its own summarizes the whole dataset, so you only get a single row back. These values are the averages and minimums for the entire data frame. If you group your data into separate subgroups, you can use `summarize()` to calculate summary statistics for each group. Do this with `group_by()`. + +The `group_by()` function puts rows into groups based on values in a column. If you run this: + +```{r summarize-groupby} +gapminder %>% group_by(continent) +``` + +…you won't see anything different! R has put the dataset into separate invisible groups behind the scenes, but you haven't done anything with those groups, so nothing has really happened. If you do things with those groups with `summarize()`, though, `group_by()` becomes much more useful. + +For instance, this will take the `gapminder` data frame, group it by continent, and then summarize it by calculating the number of distinct countries in each group. 
It will return *one row for each group*, so there should be a row for each continent: + +```{r summarize-group-distinct} +gapminder %>% + group_by(continent) %>% + summarize(n_countries = n_distinct(country)) +``` + +You can calculate multiple summary statistics, as before: + +```{r summarize-group-distinct-multiple} +gapminder %>% + group_by(continent) %>% + summarize(n_countries = n_distinct(country), + avg_life_exp = mean(lifeExp)) +``` + +::: {.callout-important} +### Your turn + +Find the minimum, maximum, and median life expectancy for each continent: + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-6/", + id = "learnr-04-lesson-dplyr6" +) +``` + +::: {.callout-important} +### Your turn + +Find the minimum, maximum, and median life expectancy for each continent in 2007 only: + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_04-dplyr-7/", + id = "learnr-04-lesson-dplyr7" +) +``` + +Finally, you can group by multiple columns and R will create subgroups for every combination of the groups and return one row for each combination. For instance, we can calculate the average life expectancy by both year and continent and we'll get 60 rows, since there are 5 continents and 12 years (5 × 12 = 60): + +```{r groupby-year-continent} +gapminder %>% + group_by(continent, year) %>% + summarize(avg_life_exp = mean(lifeExp)) +``` + + +### Selecting with `select()` + +The last two verbs are far simpler than `filter()`, `mutate()`, and `group_by() %>% summarize()`. + +You can choose specific columns with the `select()` verb. 
This will only keep two columns: `lifeExp` and `year`: + +```{r select-single} +gapminder %>% select(lifeExp, year) +``` + +You can remove specific columns by prefacing the column names with `-`, like `-lifeExp`: + +```{r omit-single} +gapminder %>% select(-lifeExp) +``` + +You can also rename columns using `select()`. Follow this pattern: `select(new_name = old_name)`. + +```{r rename-select} +gapminder %>% select(year, country, life_expectancy = lifeExp) +``` + +Alternatively, there's a special `rename()` verb that will, um, rename, while keeping all the other columns: + +```{r rename-rename} +gapminder %>% rename(life_expectancy = lifeExp) +``` + +### Arranging data with `arrange()` + +The `arrange()` verb sorts data. By default it sorts in ascending order, putting the lowest values first: + +```{r arrange-single} +gapminder %>% arrange(lifeExp) +``` + +You can reverse that by wrapping the column name with `desc()`: + +```{r arrange-single-desc} +gapminder %>% arrange(desc(lifeExp)) +``` + +You can sort by multiple columns by specifying them in a comma-separated list. For example, we can sort by continent and then sort by life expectancy within the continents: + +```{r arrange-multiple} +gapminder %>% + arrange(continent, desc(lifeExp)) +``` + +### That's it! + +Those are the main verbs you'll deal with in this class. There are dozens of other really useful ones—check out the [{dplyr} and {tidyr} cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for examples. + + +## Changing colors, shapes, and sizes, with `scale_*()` + +Recall from session 3 that the grammar of graphics uses a set of layers to define elements of plots: + +```{r gg-layers, echo=FALSE, out.width="50%", fig.align="center"} +knitr::include_graphics("/slides/img/03/ggplot-layers@4x.png", error = FALSE) +``` + +In tomorrow's session, you'll learn all about the Theme layer. 
Here we'll briefly cover the Scales layer, which we use for changing aspects of the different aesthetics, like using logged axes or changing colors or shapes. + +All the functions that deal with scales conveniently follow the same naming pattern: + +```{r scale-template, eval=FALSE} +scale_AESTHETIC_DETAILS() +``` + +Here are some common scale functions: + +```{r scale-examples, eval=FALSE} +scale_x_continuous() +scale_y_reverse() +scale_color_viridis_c() +scale_shape_manual(values = c(19, 13, 15)) +scale_fill_manual(values = c("red", "orange", "blue")) +``` + +You can see a [list of all of the possible scale functions here](https://ggplot2.tidyverse.org/reference/index.html#section-scales), and you should reference that documentation (and the excellent examples) often when working with these functions. + +As long as you have mapped a variable to an aesthetic with `aes()`, you can use the `scale_*()` functions to deal with it. For instance, in this ggplot, we have mapped variables to `x`, `y`, and `fill`, which means we can use those corresponding scale functions to manipulate how those aesthetics are shown. Here we reverse the y-axis (ew, don't really do this), and we use a discrete viridis color palette: + +```{r plot-continent-counts} +continent_counts <- gapminder %>% + group_by(continent) %>% + summarize(countries = n_distinct(country)) + +ggplot(continent_counts, aes(x = continent, y = countries, fill = continent)) + + geom_col() + + scale_y_reverse() + # lol this is bad; don't do it in real life + scale_fill_viridis_d() +``` + +You can also use different arguments in the scale functions—again, check the documentation for examples. 
For instance, if we want to use the [plasma palette from the viridis package](https://ggplot2.tidyverse.org/reference/scale_viridis.html), we can set that as an option: + +```{r plot-continent-plasma} +ggplot(continent_counts, aes(x = continent, y = countries, fill = continent)) + + geom_col() + + scale_fill_viridis_d(option = "plasma") +``` + +That yellow might be too bright and hard to see, so we can tell ggplot to not use the full range of the palette, ending at 90% of the range instead: + +```{r plot-continent-plasma-9} +ggplot(continent_counts, aes(x = continent, y = countries, fill = continent)) + + geom_col() + + scale_fill_viridis_d(option = "plasma", end = 0.9) +``` + +Instead of letting R calculate the colors from a general palette, you can also specify your own colors with `scale_fill_manual()` and feeding it a list of values—generally as [hex codes](https://www.google.com/search?q=color+picker) or a name from a [list of built-in R colors](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf): + +```{r plot-continent-manual} +ggplot(continent_counts, aes(x = continent, y = countries, fill = continent)) + + geom_col() + + scale_fill_manual(values = c("chartreuse4", "cornsilk4", "black", "#fc03b6", "#5c47d6")) +``` + +Scale functions also work for other aesthetics like `shape` or `color` or `size`. 
For instance, consider this plot, which has all three: + +```{r plot-health-wealth-basic} +gapminder_2007 <- gapminder %>% + filter(year == 2007) + +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10() +``` + +We can change the colors of the points with `scale_color_*()`: + +```{r plot-health-wealth-colors} +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10() + + scale_color_manual(values = c("chartreuse4", "cornsilk4", "black", "#fc03b6", "#5c47d6")) +``` + +We can change the shapes with `scale_shape_*()`. If you run `?pch` in your console or search for `pch` in the help, you can see all the possible shapes. + +```{r plot-health-wealth-shapes} +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10() + + scale_shape_manual(values = c(12, 9, 17, 19, 15)) +``` + +You can change the size with `scale_size_*()`. Here we make it so the smallest possible size is 1 and the largest is 15: + +```{r plot-health-wealth-size} +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10() + + scale_size_continuous(range = c(1, 15)) +``` + +We can even do all three at once: + +```{r plot-health-wealth-everything} +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10() + + scale_color_manual(values = c("chartreuse4", "cornsilk4", "black", "#fc03b6", "#5c47d6")) + + scale_shape_manual(values = c(12, 9, 17, 19, 15)) + + scale_size_continuous(range = c(1, 15)) +``` + +Phew. That's ugly. + +One last thing we can do with scales is format how they show up on the plot. 
Notice how the population legend uses scientific notation like `2.50e+08`. This means you need to move the decimal point 8 places to the right, making it `250000000`. Leaving it in scientific notation isn't great because it makes it really hard to read and interpret. + +If you load the {scales} library (which is installed as part of {tidyverse} but isn't automatically loaded), you can use some neat helper functions to reformat the text that shows up in plots. For instance, we can make it so population is formatted as a number with commas every 3 numbers, and the x-axis is formatted as dollars: + +```{r plot-health-wealth-scale-labels, warning=FALSE, message=FALSE} +library(scales) + +ggplot(gapminder_2007, + aes(x = gdpPercap, y = lifeExp, + color = continent, shape = continent, size = pop)) + + geom_point() + + scale_x_log10(labels = dollar) + + scale_size_continuous(labels = comma) +``` + +[Check the documentation for {scales}](https://scales.r-lib.org/reference/index.html) for details about all the labelling functions it has, including dates, percentages, p-values, LaTeX math, etc. diff --git a/lesson/05-lesson.qmd b/lesson/05-lesson.qmd new file mode 100644 index 0000000..48a8577 --- /dev/null +++ b/lesson/05-lesson.qmd @@ -0,0 +1,312 @@ +--- +title: "Themes" +date: "2022-05-14" +--- + +```{r setup, include=FALSE} +library(ggplot2) +knitr::opts_chunk$set(fig.width = 6, fig.height = 4.5, fig.align = "center", collapse = TRUE) +set.seed(1234) +``` + +```{r learnr-setup, echo=FALSE, results="asis"} +source(here::here("R", "learnr-things.R")) +include_iframe_resizer() +``` + +## Complete ggplot themes + +There are many built-in complete themes that have a good combination of all the different `theme()` options already set for you. By default, ggplot uses `theme_gray()` (also spelled `theme_grey()` for UK English; because the first developer of ggplot (Hadley Wickham) is from New Zealand, British spelling works throughout (e.g. 
you can use `colour` instead of `color`)). + +::: {.callout-important} +### Your turn + +Add `theme_minimal()` to this plot: + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-1/", + id = "learnr-05-lesson-theme1" +) +``` + +Hopefully that was easy! + +If you look at [the documentation for the different theme functions](https://ggplot2.tidyverse.org/reference/ggtheme.html), you'll notice that there are a few optional arguments, like `base_size` and `base_family`. The `base_size` argument changes the base font size for the text in the plot, and it is 11 by default. Changing it to something like 20 will not make all the text in the plot be sized at 20; instead, functions like `theme_minimal()` set the size of plot elements based on the `base_size`. For instance, in `theme_minimal()`, the plot title is set to be 120% of `base_size`, while the caption is 80%. Changing `base_size` will resize all the different elements accordingly. + +::: {.callout-important} +### Your turn + +Modify this plot to use `theme_minimal()` with a base size of 16: + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-2/", + id = "learnr-05-lesson-theme2" +) +``` + +Hopefully that was also fairly straightforward! + + +## Modifying plot elements with `theme()` + +Using a complete theme like `theme_minimal()` or `theme_bw()` is a great starting point for getting a nice, clean, well-designed plot. You'll often need to make adjustments to smaller, more specific parts of the plot though. To do this, you can use the `theme()` function. + +`theme()` is a massive function and has perhaps the most possible arguments of any function in R. It is impossible to remember everything it can possibly do. Fortunately, its documentation is incredible. 
Run `?theme` in your R console to see the help page, or [go to this page online](https://ggplot2.tidyverse.org/reference/theme.html). + +### Deal with general plot elements + +A few arguments to `theme()` don't use any special function; you can just specify settings with text like `"bottom"` or `"right"`. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()` online](https://ggplot2.tidyverse.org/reference/theme.html). Make this plot's legend appear on the bottom instead of the left. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-3/", + id = "learnr-05-lesson-theme3" +) +``` + +### Disable elements completely with `element_blank()` + +Any plot element can be disabled by using `element_blank()`. For instance, if you want to remove the axis ticks, you can use `theme(axis.ticks = element_blank())`. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()` online](https://ggplot2.tidyverse.org/reference/theme.html). Disable the panel grid in this plot. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-4/", + id = "learnr-05-lesson-theme4" +) +``` + +You can also target more specific plot elements. You can specify something like `axis.text`, which applies to all axis text, or you can use `axis.text.y` to only target the text on the y-axis. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()` online](https://ggplot2.tidyverse.org/reference/theme.html). Make the following changes to this plot: + +- Disable the major panel grid for the x-axis +- Disable the minor panel grid for the x-axis +- Disable the minor panel grid for the y-axis + +You should only have three horizontal lines for the grid. 
+ +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-5/", + id = "learnr-05-lesson-theme5" +) +``` + +--- + +Almost every other plot element fits into one of three categories: a rectangle, a line, or text. Changing the settings on these elements requires specific functions that correspond to these categories. + +### Deal with borders and backgrounds with `element_rect()` + +Things like the plot background or the panel background or the legend background are rectangles and can be manipulated with `element_rect()`. If you want the legend box to be yellow with a thin black border, you would use `theme(legend.box.background = element_rect(fill = "yellow", color = "black", size = 1))`. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()`](https://ggplot2.tidyverse.org/reference/theme.html) and the [documentation for `element()`](https://ggplot2.tidyverse.org/reference/element.html) online. Make the following changes to this plot: + +- Fill the plot background with #F2D8CE +- Fill the panel background with #608BA6, and make the border #184759 with size = 5 + +This will be a fairly ugly plot. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-6/", + id = "learnr-05-lesson-theme6" +) +``` + +### Deal with lines with `element_line()` + +Things like the panel grid, tick marks, and axis lines are all lines and can be manipulated with `element_line()`. If you want the x-axis line to be a dotted orange line, you would use `theme(axis.line.x = element_line(color = "orange", linetype = "dotted"))`. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()`](https://ggplot2.tidyverse.org/reference/theme.html) and the [documentation for `element()`](https://ggplot2.tidyverse.org/reference/element.html) online. 
Make the following changes to this plot: + +- Make the major panel gridlines blue and dashed with size = 1 + +This will also be a fairly ugly plot. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-7/", + id = "learnr-05-lesson-theme7" +) +``` + +### Deal with text with `element_text()` + +Finally, anything with text can be manipulated with `element_text()`, and you can specify all sorts of things, including font family (`family`), font weight (`face`), color (`color`), horizontal justification (`hjust`), angle (`angle`), and a bunch of other options. If you want the x-axis text to be italicized and rotated at a 45º angle, you would use `theme(axis.text.x = element_text(face = "italic", angle = 45))`. + +::: {.callout-important} +### Your turn + +Look at the [documentation for `theme()`](https://ggplot2.tidyverse.org/reference/theme.html) and the [documentation for `element()`](https://ggplot2.tidyverse.org/reference/element.html) online. Make the following changes to this plot: + +- Make the y-axis text italic +- Make the plot title right aligned, bold, and colored with #8C7811 +- Make the plot subtitle right aligned + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_05-themes-8/", + id = "learnr-05-lesson-theme8" +) +``` + +## Important note about ordering + +Things like `theme_grey()` or `theme_minimal()` are really just collections of changes to `theme()`, so the order is important when using a complete theme. If you do something like this to turn off the gridlines in the plot panel: + +```{r theme-example1, eval=FALSE} +ggplot(...) + + geom_point(...) + + theme(panel.grid = element_blank()) + + theme_bw() +``` + +…you'll still have panel gridlines! That's because `theme_bw()` turns them on, and you typed it after you turned it off. 
If you want to both use `theme_bw()` and remove the gridlines, you need to make sure any theme adjustments come after `theme_bw()`: + +```{r theme-example2, eval=FALSE} +ggplot(...) + + geom_point(...) + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +## Fonts + +You can use `theme()` to change the fonts as well, though sometimes it's a little tricky to get R to see the fonts on your computer, especially if you use Windows. [This detailed blog post](https://www.andrewheiss.com/blog/2017/09/27/working-with-r-cairo-graphics-custom-fonts-and-ggplot/) explains how to work with custom fonts in ggplot and shows how to get it set up on Windows. It should Just Work™ on macOS. + +In short, as long as you load the fonts correctly, you can specify different fonts either in a complete theme like `theme_minimal(base_family = "Comic Sans MS")` or in `theme()` like `theme(plot.title = element_text(family = "Papyrus"))`. + + +## Reusing themes + +If you want to repeat specific theme settings throughout a document, you can save yourself a ton of typing by storing the results of `theme()` to an object and reusing it. For instance, suppose you want your plots to be based on `theme_minimal()`, have right-aligned title and subtitle text, have the legend at the bottom, and have no minor gridlines. 
You can save all of that into an object named `my_neato_theme` or something, and then reuse it: + +```{r saved-theme, fig.width=6, fig.height=3.6, fig.align="center"} +my_neato_theme <- theme_minimal() + + theme(plot.title = element_text(hjust = 1), + plot.subtitle = element_text(hjust = 1), + legend.position = "bottom", + panel.grid.minor = element_blank()) + +# Make one plot +ggplot(data = mpg, + mapping = aes(x = displ, y = hwy, color = drv)) + + geom_point(size = 3) + + labs(title = "Engine displacement and highway MPG", + subtitle = "Heavier cars get worse mileage") + + my_neato_theme + +# Make another plot +ggplot(data = mpg, + mapping = aes(x = displ, y = hwy, color = cty)) + + geom_point(size = 3) + + labs(title = "Engine displacement and highway MPG", + subtitle = "Points colored by city MPG") + + my_neato_theme +``` + + +## Saving plots + +So far, all your plots have ended up either in RStudio or in a knitted HTML, Word, or PDF document. But what if you want to save just the plot to your computer so you can send it out to the world?! You could take a screenshot, but that won't provide the highest resolution, and that will only save the plot as a bitmap-based PNG, not an infinitely resizable vector-based PDF! + +Fortunately it's pretty easy to save a plot using the special `ggsave()` function. You can specify whatever dimensions you want and whatever file type you want and save the standalone plot to your computer. You should look at the [documentation for `ggsave()`](https://ggplot2.tidyverse.org/reference/ggsave.html) for complete details of all the different options and arguments it can take. Typically, you do something like this. + +First create a plot and store it as an object. We haven't done that yet in this lesson—so far we've just run `ggplot()` and seen the output immediately. If you save the output of `ggplot()` to an object, you actually won't see anything until you run the name of the object. 
+ +```{r ggsave-example} +a_cool_plot <- ggplot(data = mpg, + mapping = aes(x = displ, y = hwy, color = drv)) + + geom_point(size = 3) + + labs(title = "Engine displacement and highway MPG", + subtitle = "Heavier cars get worse mileage") + +# Make sure you run this so you can see the plot +a_cool_plot +``` + +Next you can feed your saved plot to `ggsave()` to save it. It will automatically determine how to save it based on the filename you provide. If you tell it to be `something.png`, R will make a PNG; if you tell it to be `something.pdf`, R will make a PDF, and so on. Common types are PDF, PNG, JPEG (ew though), SVG, TIFF, and others. + +You can also save the plot as multiple files. I typically make PNG and PDF versions of any plots I export like so: + +```{r ggsave-example-1, eval=FALSE} +ggsave(filename = "a_cool_plot.pdf", plot = a_cool_plot, + width = 6, height = 4.5, units = "in") + +ggsave(filename = "a_cool_plot.png", plot = a_cool_plot, + width = 6, height = 4.5, units = "in") +``` + +From a file management perspective, it often makes sense to store all your output in a separate folder in your project, like `output` or `figures` or something. If you want to put saved images in a subfolder, include the name in the file name: + +```{r ggsave-example-2, eval=FALSE} +ggsave(filename = "figures/a_cool_plot.png", plot = a_cool_plot, + width = 6, height = 4.5, units = "in") +``` + +And finally, if you're using custom fonts, you need to add one bit of wizardry to get the fonts to embed correctly in PDFs. This is something you just have to memorize or copy and paste a lot—if you want to know the full details, [see this blog post](https://www.andrewheiss.com/blog/2017/09/27/working-with-r-cairo-graphics-custom-fonts-and-ggplot/). In short, R's default PDF writer doesn't know how to embed fonts and will panic if you make it try. 
R can use a different PDF-writing engine named Cairo that embeds fonts just fine, though, so you need to tell `ggsave()` to use it: + +```{r ggsave-example-3, eval=FALSE} +ggsave(filename = "figures/a_cool_plot.pdf", plot = a_cool_plot, + width = 6, height = 4.5, units = "in", device = cairo_pdf) +``` + + +```{r eval=FALSE, include=FALSE} +ggplot(data = mpg, + mapping = aes(x = displ, y = hwy, color = drv)) + + geom_point(size = 3) + + labs(title = "Engine displacement and highway MPG", + subtitle = "Heavier cars get worse mileage") + + theme_bw(base_size = 16) + + theme( + legend.position = "bottom", + # panel.grid = element_blank(), + # panel.grid.minor = element_blank(), + panel.grid.minor.x = element_blank(), + plot.background = element_rect(fill = "blue"), + axis.text.y = element_text(face = "bold"), + plot.title = element_text(hjust = 1, face = "bold"), + plot.subtitle = element_text(hjust = 1), + axis.ticks = element_blank(), + panel.border = element_rect(color = "pink", size = 5) + ) +``` diff --git a/lesson/06-lesson.qmd b/lesson/06-lesson.qmd new file mode 100644 index 0000000..cdb9f11 --- /dev/null +++ b/lesson/06-lesson.qmd @@ -0,0 +1,207 @@ +--- +title: "Uncertainty" +date: "2022-05-17" +--- + +```{r setup, include=FALSE} +library(tidyverse) +knitr::opts_chunk$set(fig.width = 6, fig.height = 4.5, fig.align = "center", collapse = TRUE) +set.seed(1234) +``` + +```{r learnr-setup, echo=FALSE, results="asis"} +source(here::here("R", "learnr-things.R")) +include_iframe_resizer() +``` + +Throughout this lesson, you'll use the built-in `mpg` dataset to make histograms, density plots, box plots, violin plots, and other graphics that show uncertainty. + +Sorry if `mpg` is getting repetitive! For short interactive things like this, it's easier to use built-in and easy-to-load datasets like `mpg` and `gapminder` instead of loading CSV files, hence our constant reuse of the dataset. 
This is fairly normal too: the majority of examples in R help pages (and in people's blog posts) use things like `mpg` or `gapminder`, or even `iris`, which measures the lengths and widths of a bunch of iris flowers in the 1930s (fun fact! I don't like using `iris` because the data was originally used in an article in the *Annals of Eugenics* (`r emoji::emoji("grimacing")`) in 1936, and the data was collected to advance eugenics, and [there's no good reason to use data like that in 2023](https://armchairecology.blog/iris-dataset/).) + +So we work with cars instead of racist flower data. + +The `mpg` dataset is available in R as soon as you load ggplot2 (or tidyverse). You don't have to run `read_csv()` or anything; it's just there in the background already. + +As a reminder, here are the first few rows of the `mpg` dataset: + +```{r head-mpg} +head(mpg) +``` + + +## Histograms + +When working with histograms, you *always* need to think about the bin width. Histograms calculate the counts of rows within specific ranges of data, and the shape of the histogram will change depending on how wide or narrow these ranges (or bins, or buckets) are. + +::: {.callout-important} +### Your turn + +Change this code to add a specific bin width for city miles per gallon `cty` (hint: `binwidth`). Play around with different widths until you find one that represents the data well. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-1/", + id = "learnr-06-lesson-uncertainty1" +) +``` + +By default, histograms are filled with a dark grey color and the bars have no borders. Additionally, R places the center of the bars at specific numbers: if you have a bin width of 5, for instance, a bar will show the range from 7.5 to 12.5 instead of 5-10 or 10-15. + +::: {.callout-important} +### Your turn + +Do the following: + +1. Add a specific bin width +2. Add a white border (hint: `color`) +3. Fill with #E16462 +4. 
Make it so the bars start at whole numbers like 10 or 20 (hint: `boundary`) + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-2/", + id = "learnr-06-lesson-uncertainty2" +) +``` + +You can add extra aesthetics to encode additional information about the distribution of variables across categories. + +::: {.callout-important} +### Your turn + +Make a histogram of `cty` and fill by `drv` (drive: front, rear, and 4-wheel). Make sure you specify a good bin width. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-3/", + id = "learnr-06-lesson-uncertainty3" +) +``` + +That's too much information! Instead of only filling, you can separate the data into multiple plots. + +::: {.callout-important} +### Your turn + +Make a histogram of `cty`, and both fill *and* facet by `drv`. Make sure you specify a good bin width. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-4/", + id = "learnr-06-lesson-uncertainty4" +) +``` + +## Density plots + +When working with density plots *in this class* you don't need to worry too much about the calculus behind the scenes that creates the curves. But you can change those settings if you really want. + +::: {.callout-important} +### Your turn + +Do the following: + +1. Fill this density plot with #E16462 +2. Add a border (hint: `color`) using #9C3836, with size = 1 +3. Change the bandwidth (hint: `bw`) to 0.5, then 1, then 10 + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-5/", + id = "learnr-06-lesson-uncertainty5" +) +``` + +Like histograms, you can map other variables onto the plot. It's often a good idea to make the curves semi-transparent so you can see the different distributions. 
+ +::: {.callout-important} +### Your turn + +Do the following: + +1. Fill this plot using the `drv` variable +2. Make the density plots 50% transparent + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-6/", + id = "learnr-06-lesson-uncertainty6" +) +``` + +Even with transparency, it's often difficult to interpret density plots like this. As an alternative, you can use the [{ggridges} package](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html) to make ridge plots. Look at the [documentation and examples for {ggridges}](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html) for lots of details about different plots you can make. + +::: {.callout-important} +### Your turn + +Convert this plot into a ridge plot. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-7/", + id = "learnr-06-lesson-uncertainty7" +) +``` + +## Boxes, violins, and dots + +Finally, you can use things like boxplots and violin plots to show the distribution of variables, either by themselves or across categories. + +Box plots show the distribution of a variable by highlighting specific details, like the 25th, 50th (median) and 75th percentile, as well as the assumed minimum, assumed maximum, and outliers: + +![Anatomy of a boxplot](/slides/06-slides_files/figure-html/boxplot-explanation-1.png) + +When making boxplots with ggplot, you need to map the variable of interest to the `x` aesthetic (or `y` if you want a vertical boxplot), and you can optionally map a second categorical variable to the `y` aesthetic (or `x` if you want a vertical boxplot). + +You can adjust the fill and color of the plot, and you can change what counts as outliers with the `coef` argument. 
By default, outliers are any points beyond the 75th percentile + 1.5 × the interquartile range (or below the 25th percentile − 1.5 × IQR), but that's adjustable. + +::: {.callout-important} +### Your turn + +Do the following: + +1. Fill the boxplot with #E6AD3C +2. Color the boxplot with #5ABD51 +3. Change the definition of outliers to be 5 times the IQR + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-8/", + id = "learnr-06-lesson-uncertainty8" +) +``` + +You can also use violin plots instead of boxplots, which show the mirrored density distribution. When doing this, it's often helpful to add other geoms like jittered points to show more of the data. + +::: {.callout-important} +### Your turn + +Do the following: + +1. Change this boxplot to use violins instead +2. Add jittered points with a jittering width of 0.1 and sized at 0.5 + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_06-uncertainty-9/", + id = "learnr-06-lesson-uncertainty9" +) +``` diff --git a/lesson/07-lesson.qmd b/lesson/07-lesson.qmd new file mode 100644 index 0000000..ef6a592 --- /dev/null +++ b/lesson/07-lesson.qmd @@ -0,0 +1,8 @@ +--- +title: "Relationships" +date: "2020-05-19" +--- + +There isn't really a lesson for today, and as we get further into the semester, the need for lessons will continue to decrease. Now that each section is focused on a few specific geoms and how to apply them, you don't need to go through interactive tutorials so much, since you should (hopefully!) be getting the hang of how ggplot works. (IF NOT, please reach out for help on Slack or via e-mail! I'm more than happy and ready to help!) + +For the lesson, [read through the code examples in the example](/example/07-example/) to see how to make dual y-axes, scatterplot matrices, coefficient plots, and marginal effects plots. 
diff --git a/lesson/08-lesson.qmd b/lesson/08-lesson.qmd new file mode 100644 index 0000000..9f2a970 --- /dev/null +++ b/lesson/08-lesson.qmd @@ -0,0 +1,8 @@ +--- +title: "Comparisons" +date: "2020-05-20" +--- + +Like the previous session, there isn't really a lesson today. You're not learning how to use any new functions—you're learning how to apply the geoms you already know in cool and exciting ways. But don't worry! You'll have a lesson for session 9! + +For the lesson, [read through the code examples in the example](/example/08-example/) to see how to make small multiples, sparklines, geofacets, and slopegraphs. diff --git a/lesson/09-lesson.qmd b/lesson/09-lesson.qmd new file mode 100644 index 0000000..e76b159 --- /dev/null +++ b/lesson/09-lesson.qmd @@ -0,0 +1,11 @@ +--- +title: "Annotations" +date: "2020-05-21" +--- + +Ha, so in the video I said there would be interactive lessons, but *I changed my mind!* You're only working with a few new functions this session (`annotate()`, `geom_text()`, `geom_label()`, `geom_text_repel()`, and `geom_label_repel()`), and the best way to figure out how to use them is to use them! + +There are some helpful blog posts and other resources online with examples and explanations. Read through these in addition to the documentation for [`annotate()`](https://ggplot2.tidyverse.org/reference/annotate.html), [`geom_text()/geom_label()`](https://ggplot2.tidyverse.org/reference/geom_text.html) and [{ggrepel}](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html): + +- ["Add shapes with `annotate()`"](https://www.r-graph-gallery.com/233-add-annotations-on-ggplot2-chart.html) +- ["Annotations"](https://ggplot2-book.org/annotations.html) diff --git a/lesson/10-lesson.qmd b/lesson/10-lesson.qmd new file mode 100644 index 0000000..3a4e145 --- /dev/null +++ b/lesson/10-lesson.qmd @@ -0,0 +1,8 @@ +--- +title: "Interactivity" +date: "2020-05-22" +--- + +Again, there's no lesson for this. 
The only way to learn how to use `ggplotly()` and create dashboards with {flexdashboard} is to try them out in RStudio, not in a mini browser-based R session here. + +So [head over to the exercise](/assignment/10-exercise/) to get started! diff --git a/lesson/11-lesson.qmd b/lesson/11-lesson.qmd new file mode 100644 index 0000000..734b779 --- /dev/null +++ b/lesson/11-lesson.qmd @@ -0,0 +1,10 @@ +--- +title: "Time" +date: "2020-05-26" +--- + +Once again, there's no lesson this time. You're all understanding the basics of R and {ggplot2} and {dplyr} *really well* (I'm seriously so impressed and proud of you all!). + +In your exercise today you'll visualize trends in time using one of three different real-world datasets. [In the example](/example/11-example/) I demonstrate how to remove seasonality from time series data, which is a useful skill, but *not always applicable* to every time series dataset. If there's no seasonality in your data, you don't need to remove it. + +So head over to [the example](/example/11-example/) or [the exercise](/assignment/11-exercise/) to get started! diff --git a/lesson/12-lesson.qmd b/lesson/12-lesson.qmd new file mode 100644 index 0000000..2fef47b --- /dev/null +++ b/lesson/12-lesson.qmd @@ -0,0 +1,363 @@ +--- +title: "Space" +date: "2020-05-27" +--- + +```{r setup, include=FALSE} +library(tidyverse) +knitr::opts_chunk$set(fig.align = "center", collapse = TRUE) +``` + +```{r learnr-setup, echo=FALSE, results="asis"} +source(here::here("R", "learnr-things.R")) +include_iframe_resizer() +``` + +There *is* a short lesson today! You'll learn the basics of joining two different datasets together, both vertically and horizontally. 
+ +There are a few imaginary datasets I've created for you to play with: + +```{r create-fake-data, echo=FALSE} +national_data <- tribble( + ~state, ~year, ~unemployment, ~inflation, ~population, + "GA", 2018, 5, 2, 100, + "GA", 2019, 5.3, 1.8, 200, + "GA", 2020, 5.2, 2.5, 300, + "NC", 2018, 6.1, 1.8, 350, + "NC", 2019, 5.9, 1.6, 375, + "NC", 2020, 5.3, 1.8, 400, + "CO", 2018, 4.7, 2.7, 200, + "CO", 2019, 4.4, 2.6, 300, + "CO", 2020, 5.1, 2.5, 400 +) + +puerto_rico_data <- tribble( + ~state, ~unemployment, ~population, ~year, + "PR", 3.1, 150, 2018, + "PR", 3.2, 250, 2019, + "PR", 3.3, 350, 2020 +) + +national_libraries <- tribble( + ~state, ~year, ~libraries, ~schools, + "CO", 2018, 230, 470, + "CO", 2019, 240, 440, + "CO", 2020, 270, 510, + "NC", 2018, 200, 610, + "NC", 2019, 210, 590, + "NC", 2020, 220, 530, +) + +national_data_2019 <- national_data %>% + filter(year == 2019) %>% select(-year) + +national_libraries_2019 <- national_libraries %>% + filter(year == 2019) %>% select(-year) + +state_regions <- tribble( + ~region, ~state, + "Northeast", c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"), + "Midwest", c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"), + "South", c("DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV", "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX"), + "West", c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA") +) %>% unnest(state) %>% + arrange(state) + +x <- tibble(id = c(1, 2, 3), + some_variable = c("x1", "x2", "x3")) + +y <- tibble(id = c(1, 2, 4), + some_other_variable = c("y1", "y2", "y4")) +``` + +```{r} +x +``` + +```{r} +y +``` + +```{r} +national_data +``` + +```{r} +national_data_2019 +``` + +```{r} +national_libraries +``` + +```{r} +national_libraries_2019 +``` + +```{r} +puerto_rico_data +``` + +```{r} +state_regions +``` + + +## Combining datasets vertically + +Recall from the [Lord of the Rings data in exercise 3](/assignment/03-exercise/) that you had to combine 
three different CSV files into one dataset. You used `bind_rows()` to stack these on top of each other. + +```{r eval=FALSE} +lotr <- bind_rows(fellowship, tt, rotk) +``` + +That worked well because each of the individual data frames had the same columns in them, and R was able to line up the matching columns. If columns were missing, R would have placed `NA` in the appropriate locations. + +::: {.callout-important} +### Your turn + +Combine `national_data` and `puerto_rico_data` into a single dataset named `us_data` using `bind_rows()`. Pay attention to what happens with the inflation column. Also notice that the columns in the Puerto Rico data are in a different order. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_12-joining-1/", + id = "learnr-12-lesson-joining1" +) +``` + +## Combining datasets horizontally + +Binding rows vertically is the easiest way to combine two datasets, but most often you won't be doing that. You'll only do this if you're combining datasets that come from the same source, like if a state offers separate CSV files of the same data for each county. + +In most cases, though, you'll need to combine completely different datasets, bringing one or more columns from one into another. With vertical combining, R needs columns with the same names in order to figure out where the data lines up. With horizontal combining, R needs values inside one or more columns to be the same in order to figure out where the data lines up. + +There is technically a function named `bind_cols()`, but you'll rarely want to use it. It doesn't attempt to match any rows; it just glues two datasets together: + +```{r show-bind-cols} +bind_cols(national_data, + # Repeat PR 3 times so that it has the same number of rows as national_data + bind_rows(puerto_rico_data, puerto_rico_data, puerto_rico_data)) +``` + +That's… not great.
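
The same problem shows up on a much smaller scale. As a minimal sketch, assume two tiny tibbles: `x`, and `y_renamed`, a hypothetical copy of `y` whose ID column is renamed to `y_id` so the two ID columns can sit side by side. Their ID values don't line up row by row:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3),
            some_variable = c("x1", "x2", "x3"))

# Hypothetical copy of y with the ID column renamed (an assumption made
# for this sketch) so the two ID columns don't clash when glued together
y_renamed <- tibble(y_id = c(1, 2, 4),
                    some_other_variable = c("y1", "y2", "y4"))

# bind_cols() pairs rows purely by position, not by ID value
glued <- bind_cols(x, y_renamed)
glued

# Show only the rows where the two ID columns disagree
glued %>%
  filter(id != y_id)
```

The third row pairs `id` 3 with `y_id` 4, which is the kind of silent mismatch that position-based gluing can produce.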
+ +Instead, we need to use a function that is more careful about bringing in data. Fortunately there are a few good options: + +- `inner_join()` +- `left_join()` +- `right_join()` + +The **most** helpful way of understanding these different functions [is to go here and stare at the animations for a little while](https://github.com/gadenbuie/tidyexplain#mutating-joins) to see which pieces of which dataset go where. (There are lots of others, like `full_join()`, `semi_join()`, and `anti_join()`, and they have helpful animations, but I rarely use those.) + +For each of these functions, **you need at least one common ID column in both datasets** in order for R to know where things line up. + +Let's practice how these all work and see what the differences between them are. + +## `inner_join()` + +First, go to this page in a new tab and stare at the mesmerizing animation. + +Let's look at two datasets, `x` and `y`: + +```{r} +x +``` + +```{r} +y +``` + +Both datasets have an `id` column that is the same across each (though the values aren't necessarily the same). Because there's a shared column, we can join these two based on that column. + +If we use `inner_join()`, the resulting dataset will only keep the rows from the first where there are matching values from the second: + +```{r} +inner_join(x, y, by = "id") +``` + +Notice how it got rid of the row with `id = 3` from the first and the row with `id = 4` from the second. + +You can also write this with pipes, which is really common when working with {dplyr}: + +```{r} +x %>% + inner_join(y, by = "id") +``` + +Let's say we have two datasets: `national_data_2019` and `national_libraries_2019`: + +```{r} +national_data_2019 +``` + +```{r} +national_libraries_2019 +``` + +We want to bring the libraries and schools columns into the general national data. Notice how both datasets have a state column. 
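
Before joining, it can help to check which values of the shared column actually appear in both datasets. Here's a minimal sketch using the small `x` and `y` datasets (recreated here so the block runs on its own); `shared_ids`, `dropped_from_x`, and `joined` are illustrative names, not something the lesson requires:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3),
            some_variable = c("x1", "x2", "x3"))

y <- tibble(id = c(1, 2, 4),
            some_other_variable = c("y1", "y2", "y4"))

# ID values that appear in both datasets: these rows survive inner_join()
shared_ids <- intersect(x$id, y$id)
shared_ids

# Rows of x with no match in y: these rows are dropped by inner_join()
dropped_from_x <- anti_join(x, y, by = "id")
dropped_from_x

# The inner join itself keeps one row per shared ID
joined <- inner_join(x, y, by = "id")
joined
```

The same kind of check works for any shared ID column, such as a state column.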
+ +::: {.callout-important} +### Your turn + +Create a new dataset named `combined_data` that uses `inner_join()` to merge `national_data_2019` and `national_libraries_2019`. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_12-joining-2/", + id = "learnr-12-lesson-joining2" +) +``` + +## `left_join()` + +Again, go to this page in a new tab and stare at the animation. + +Left joining is less destructive than inner joining. With left joining, any rows in the first dataset that don't have matches in the second *don't* get thrown away and instead are filled with `NA`: + +```{r} +left_join(x, y, by = "id") +``` + +Notice how the row with `id = 4` from the second dataset is gone, but the row with `id = 3` from the first is still there, with `NA` for `some_other_variable`. + +I find this much more useful when combining data. I often have a larger dataset with all the main variables I care about, perhaps with every combination of country and year over 20 years and 180 countries. If I find another dataset I want to join, and it has missing data for some of the years or countries, I don't want the combined data to throw away all the rows from the main big dataset that don't match! I still want those! + +*([Look at this for a real life example](https://stats.andrewheiss.com/canary-ngos/01_get-merge-data.html#final_clean_combined_data): I create a dataset I name `panel_skeleton` that is just all the combinations of countries and years (Afghanistan 1990, Afghanistan 1991, etc.), and then I bring in all sorts of other datasets that match the same countries and years. When there aren't matches, nothing in the skeleton gets thrown away—R just adds missing values instead.)* + +::: {.callout-important} +### Your turn + +Create a new dataset named `combined_data` that uses `left_join()` to merge `national_data_2019` and `national_libraries_2019`. 
+ +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_12-joining-5/", + id = "learnr-12-lesson-joining5" +) +``` + +Left joining is also often surprisingly helpful for recoding lots of variables. Right now in our fake national data, we have a column for state, but it would be nice if we could have a column for region so we could facet or fill or color by region in a plot. Hunting around on the internet, you find this dataset that has a column for region and a column for state abbreviations: + +```{r} +state_regions +``` + +::: {.callout-important} +### Your turn + +Create a new dataset named `national_data_with_region` that uses `left_join()` to combine `national_data_2019` with `state_regions`. + +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_12-joining-3/", + id = "learnr-12-lesson-joining3" +) +``` + +Because `left_join()` only keeps rows from the second dataset that match the first, we don't actually bring in all 51 rows from the `state_regions` data. Only the rows that match the first dataset (`national_data_2019`) come over. We could have done this with some massive recoding (`mutate(region = ifelse(state == "GA" | state == "NC", "South", ifelse(state == "CO", "West", NA)))`), but that's awful. Left joining is far easier here. + +You can also join by multiple columns. So far we've been working with just `national_data_2019`, but if you look at `national_data`, you'll see there are rows for different years across these states: + +```{r} +national_data +``` + +Previously, we've been specifying the ID column with `by = "state"`, but now we have two ID columns: `state` and `year`. We can specify both with `by = c("state", "year")`. + +::: {.callout-important} +### Your turn + +Create a new dataset named `national_data_combined` that uses `left_join()` to combine `national_data` with `national_libraries` by state and year.
+ +::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_12-joining-4/", + id = "learnr-12-lesson-joining4" +) +``` + +If one dataset has things like state and year, but another only has state, `left_join()` will still work, but it will match on state alone. For instance, here's what happens when we join the region data to the yearly national data: + +```{r} +national_data_with_region <- national_data %>% + left_join(state_regions, by = "state") +national_data_with_region +``` + +The "South" region gets added to every row where the state is "GA" or "NC", even though each of those states only appears once in `state_regions`. `left_join()` will still match all the values even if states are repeated. Magic! + +## Common column names + +So far, the column names in both datasets have been the same, which has greatly simplified life. In fact, if the columns have the same name, we can technically leave out the `by` argument and R will guess: + +```{r} +national_data %>% + left_join(national_libraries) +``` + +It's good practice to be specific about the columns you want and actually use `by`, but I will often run `left_join()` without it and then copy the message that it generates ("`by = c("state", "year")`") and paste it into my code. + +But what if the column names don't match? Let's rename the state column in our state/region table for fun: + +```{r} +state_regions_different <- state_regions %>% + rename(ST = state) +state_regions_different +``` + +Now watch what happens when we try to join the datasets: + +```{r error=TRUE} +national_data %>% + left_join(state_regions_different) +``` + +There are no common variables, so we get an error. The `state` and `ST` columns hold the same information, but R doesn't know that because their names differ. + +We have two options: + +1. Rename one of the columns so it matches (either change `state` to `ST` or change `ST` to `state`)
Tell `left_join()` which columns are the same + +We can do option two by modifying the `by` argument like so: + +```{r} +national_data %>% + left_join(state_regions_different, by = c("state" = "ST")) +``` + + +## `right_join()` + +Once again, go to this page in a new tab and watch the animation. + +`right_join()` works exactly like `left_join()`, but in reverse. The *second* dataset is the base data. Any rows in the second dataset that don't match in the first will be kept, and any rows from the first that don't match will get thrown away. + +Watch what happens if we right join `national_data` and `state_regions`: + +```{r} +national_data %>% + right_join(state_regions, by = "state") +``` + +Yikes. R kept all the rows in `state_regions`, brought in the columns from `national_data` and filled most of the new columns with `NA`, and then repeated Colorado (and NC and GA) three times for each of the years from `national_data`. That's a mess. + +If we reverse the order, we'll get the correct merged data: + +```{r} +state_regions %>% + right_join(national_data, by = "state") +``` + +I rarely use `right_join()` because I find it more intuitive to just use `left_join()` since in my head, I'm taking a dataset and stacking columns onto the end of it. If you want to right join instead, neat—just remember to order things correctly. + diff --git a/lesson/13-lesson.qmd b/lesson/13-lesson.qmd new file mode 100644 index 0000000..7ceef61 --- /dev/null +++ b/lesson/13-lesson.qmd @@ -0,0 +1,8 @@ +--- +title: "Text" +date: "2020-05-28" +--- + +There's no lesson for this session. In your exercise today you'll visualize text data using [{tidytext}](https://www.tidytextmining.com/), and the best way to figure that out is to just play with data. + +So head over to [the example](/example/13-example/) to see how it's done, or [the exercise](/assignment/13-exercise/) to get started! 
diff --git a/lesson/14-lesson.qmd b/lesson/14-lesson.qmd new file mode 100644 index 0000000..bf9ade2 --- /dev/null +++ b/lesson/14-lesson.qmd @@ -0,0 +1,8 @@ +--- +title: "Enhancing graphics" +date: "2020-05-29" +--- + +There's no lesson for this session. In your exercise today you'll export a plot from ggplot, open it in a vector editor like [Illustrator](https://www.adobe.com/products/illustrator.html), [Inkscape](https://inkscape.org/), or [Gravit Designer](https://www.designer.io/en/), and make it extra pretty and well-designed. The best way to learn this is by actually doing it. + +So head over to [the example](/example/14-example/) to see how it's done, or [the exercise](/assignment/14-exercise/) to get started! diff --git a/lesson/15-lesson.qmd b/lesson/15-lesson.qmd new file mode 100644 index 0000000..83e596b --- /dev/null +++ b/lesson/15-lesson.qmd @@ -0,0 +1,6 @@ +--- +title: "Truth, beauty, and data revisited" +date: "2020-06-01" +--- + +There's no lesson for this session. You made it to the end of the course! Congratulations! diff --git a/lesson/index.qmd b/lesson/index.qmd new file mode 100644 index 0000000..f5ddc29 --- /dev/null +++ b/lesson/index.qmd @@ -0,0 +1,25 @@ +--- +title: Interactive lessons +--- + +```{r learnr-setup, echo=FALSE, results="asis"} +source(here::here("R", "learnr-things.R")) +include_iframe_resizer() +``` + +Each class session has an interactive lesson that you will work through ***after*** doing the readings and watching the lecture. These lessons are a central part of the class—they will teach you how to use {ggplot2} and other packages in the tidyverse to create beautiful and truthful visualizations with R. + +Interactive code sections look like this. Make changes in the text box and click on the green "Run Code" button to see the results. Sometimes there will be a button with a hint or solution. + +:::puzzle +**Your turn**: Modify the code here to show the relationship between health and wealth for 2002 instead of 2007. 
+::: + +```{r echo=FALSE, results="asis"} +embedded_learnr( + url = "https://andrewheiss.shinyapps.io/datavizm20_00-lesson-example/", + id = "learnr-00-lesson-example1" +) +``` + +If you're curious how this works, each interactive code section is a miniature [Shiny](https://shiny.rstudio.com/) app hosted at [shinyapps.io](https://www.shinyapps.io/). Each app uses [{learnr}](https://rstudio.github.io/learnr/) to provide interactivity, and these {learnr} apps are embedded in this website with some [HTML and JavaScript wizardry](https://desiree.rbind.io/blog/learnr-iframes/).