Move all example headings back a level
andrewheiss committed May 30, 2023
1 parent fbf752d commit 69e00fa
Showing 10 changed files with 68 additions and 88 deletions.
12 changes: 5 additions & 7 deletions example/03-example.qmd
@@ -28,8 +28,6 @@ And I *promise* future examples will not be this long!
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -44,7 +42,7 @@ set.seed(1234)
options(dplyr.summarise.inform = FALSE)
```

### Load and clean data
## Load and clean data

First, we need to load a few libraries: {tidyverse} (as always) and {readxl} for reading Excel files:

@@ -95,7 +93,7 @@ bbc <- bbc_raw %>%
mutate(grant_year_category = factor(grant_year))
```

### Histograms
## Histograms

First let's look at the distribution of grant amounts with a histogram. Map `grant_amount` to the x-axis and don't map anything to the y-axis, since `geom_histogram()` will calculate the y-axis values for us:

@@ -142,7 +140,7 @@ ggplot(bbc, aes(x = grant_amount, fill = grant_year_category)) +

Neat!

### Points
## Points

Next let's look at the data using points, mapping year to the x-axis and grant amount to the y-axis:

@@ -181,7 +179,7 @@ ggplot(bbc, aes(x = grant_year_category, y = grant_amount, color = grant_program

It does! We appear to have two different distributions of grants: small grants have a limit of £30,000, while regular grants have a much higher average amount.

### Boxplots
## Boxplots

We can add summary information to the plot by only changing the `geom` we're using. Switch from `geom_point()` to `geom_boxplot()`:

@@ -190,7 +188,7 @@ ggplot(bbc, aes(x = grant_year_category, y = grant_amount, color = grant_program
geom_boxplot()
```

### Summaries
## Summaries

We can also make smaller summarized datasets with {dplyr} functions like `group_by()` and `summarize()` and plot those. First let's look at grant totals, averages, and counts over time:

16 changes: 7 additions & 9 deletions example/04-example.qmd
@@ -19,8 +19,6 @@ If you want to follow along with this example, you can download the data directl
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -33,7 +31,7 @@ set.seed(1234)
options(dplyr.summarise.inform = FALSE)
```

### Load data
## Load data

There are two CSV files:

@@ -76,7 +74,7 @@ births_2000_2014 <- read_csv(here::here(
births_combined <- bind_rows(births_1994_1999, births_2000_2014)
```

### Wrangle data
## Wrangle data

Let's look at the first few rows of the data to see what we're working with:

@@ -111,7 +109,7 @@ If you look at the data now, you can see the columns are changed and have differ

Our `births` data is now clean and ready to go!

### Bar plot
## Bar plot

First we can look at a bar chart showing the total number of births each day. We need to make a smaller summarized dataset and then we'll plot it:

@@ -156,7 +154,7 @@ ggplot(data = total_births_weekday,
x = NULL, y = "Total births")
```

### Lollipop chart
## Lollipop chart

Since the ends of the bars are often the most important part of the graph, we can use a lollipop chart to emphasize them. We'll keep all the same code from our bar chart and make a few changes:

@@ -181,7 +179,7 @@ ggplot(data = total_births_weekday,
```


### Strip plot
## Strip plot

However, we want to \#barbarplots! (Though they're arguably okay here, since they show totals and not averages). Let's show all the data with points. We'll use the full dataset now, map x to weekday, y to births, and change `geom_col()` to `geom_point()`. We'll tell `geom_point()` to jitter the points randomly.

@@ -195,7 +193,7 @@ ggplot(data = births,

There are some interesting points in the low ends, likely because of holidays like Labor Day and Memorial Day (for the Mondays) and Thanksgiving (for the Thursday). If we had a column that indicated whether a day was a holiday, we could color by that and it would probably explain most of those low numbers. Unfortunately we don't have that column, and it'd be hard to make. Some holidays are constant (Halloween is always October 31), but some aren't (Thanksgiving is the fourth Thursday in November, so we'd need to find out which November 20-somethingth each year is the fourth Thursday, and good luck doing that at scale).
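For the holidays that follow a fixed rule, building that column is at least possible. Here's a minimal sketch for Thanksgiving only, using base R dates. The `fourth_thursday_nov()` helper is hypothetical (it is not part of the original example), and the 1994–2014 range is taken from the two CSV files described above:

```r
# Hypothetical helper (not from the original example): return the date of
# the fourth Thursday in November for a given year
fourth_thursday_nov <- function(year) {
  # All 30 days of November in that year
  nov_days <- seq(as.Date(sprintf("%d-11-01", year)), by = "day", length.out = 30)
  # format(x, "%u") gives the ISO weekday number: 1 = Monday, ..., 4 = Thursday
  thursdays <- nov_days[format(nov_days, "%u") == "4"]
  thursdays[4]
}

# One Thanksgiving date per year covered by the births data
thanksgivings <- do.call(c, lapply(1994:2014, fourth_thursday_nov))
```

If the cleaned data had a proper date column, you could then flag those rows with `%in%` and color by the result. Other holidays (and the days around them) would each need their own rule, so this only covers a small part of the problem.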

### Beeswarm plot
## Beeswarm plot

We can add some structure to these points if we use the [{ggbeeswarm} package](https://github.com/eclarke/ggbeeswarm), with either `geom_beeswarm()` or `geom_quasirandom()`. `geom_quasirandom()` actually works better here since there are so many points—`geom_beeswarm()` makes the clusters of points way too wide.

@@ -210,7 +208,7 @@ ggplot(data = births,
guides(color = "none")
```

### Heatmap
## Heatmap

Finally, let's use something non-traditional to show the average births by day in a somewhat proportional way. We can calculate the average number of births every day and then make a heatmap that fills each square by that average, thus showing the relative differences in births per day.
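As a rough sketch of that idea, here's the summarize-then-tile pattern on a tiny made-up dataset. The column names (`month`, `date_of_month`, `births`) and all the values are assumptions for illustration, not the real births data:

```r
library(dplyr)
library(ggplot2)

# Toy data: two observations for each of three calendar days (made-up numbers)
births_toy <- tibble(
  month = c(1, 1, 1, 1, 2, 2),
  date_of_month = c(1, 1, 2, 2, 1, 1),
  births = c(8000, 9000, 11000, 12000, 10000, 10500)
)

# Average births for each month/day combination
avg_births_toy <- births_toy %>%
  group_by(month, date_of_month) %>%
  summarize(avg_births = mean(births), .groups = "drop")

# One tile per calendar day, filled by the average
heatmap_sketch <- ggplot(avg_births_toy,
                         aes(x = factor(date_of_month), y = factor(month),
                             fill = avg_births)) +
  geom_tile() +
  coord_equal()
```

`coord_equal()` keeps the tiles square, which makes the grid read more like a calendar.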

10 changes: 4 additions & 6 deletions example/06-example.qmd
@@ -18,8 +18,6 @@ If you want to follow along with this example, you can download the data below (
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -31,7 +29,7 @@ knitr::opts_chunk$set(fig.width = 6, fig.height = 3.6, fig.align = "center", col
set.seed(1234)
```

### Load and clean data
## Load and clean data

First, we load the libraries we'll be using:

@@ -61,7 +59,7 @@ weather_atl <- weather_atl_raw %>%

Now we're ready to go!

### Histograms
## Histograms

We can first make a histogram of wind speed. We'll use a bin width of 1 and color the edges of the bars white:

@@ -98,7 +96,7 @@ ggplot(weather_atl, aes(x = windSpeed, fill = Month)) +

Neat! January, March, and April appear to have the most variation in windy days, with a few wind-less days and a few very-windy days, while August was very wind-less.

### Density plots
## Density plots

The code to create a density plot is nearly identical to what we used for the histogram—the only thing we change is the `geom` layer:

@@ -195,7 +193,7 @@ ggplot(weather_atl_long, aes(x = temp, y = fct_rev(Month),
Super neat! We can see much wider temperature disparities during the summer, with large gaps between high and low, and relatively equal high/low temperatures during the winter.


### Box, violin, and rain cloud plots
## Box, violin, and rain cloud plots

Finally, we can look at the distribution of variables with box plots, violin plots, and other similar graphs. First, we'll make a box plot of windspeed, filled by the `Day` variable we made indicating weekday:

20 changes: 9 additions & 11 deletions example/07-example.qmd
@@ -17,8 +17,6 @@ If you want to follow along with this example, you can download the data below (
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -31,7 +29,7 @@ set.seed(1234)
options("digits" = 2, "width" = 150)
```

### Load and clean data
## Load and clean data

First, we load the libraries we'll be using:

@@ -52,7 +50,7 @@ weather_atl <- read_csv("data/atl-weather-2019.csv")
weather_atl <- read_csv(here::here("files", "data", "external_data", "atl-weather-2019.csv"))
```

### Legal dual y-axes
## Legal dual y-axes

It is fine (and often helpful!) to use two y-axes if the two different scales measure the same thing, like counts and percentages, Fahrenheit and Celsius, pounds and kilograms, inches and centimeters, etc.
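To make the "same thing, two scales" idea concrete, here's a minimal sketch with made-up temperatures (the `temps_toy` data and its column names are assumptions for illustration). The second axis is just a transformation of the first, using the standard Fahrenheit-to-Celsius formula inside ggplot's `sec_axis()`:

```r
library(ggplot2)

# Toy data: made-up Fahrenheit temperatures over five days
temps_toy <- data.frame(
  day = 1:5,
  temp_f = c(32, 50, 68, 86, 104)
)

dual_axis_plot <- ggplot(temps_toy, aes(x = day, y = temp_f)) +
  geom_line() +
  scale_y_continuous(
    name = "Fahrenheit",
    # The right-hand axis is a direct conversion of the left-hand one:
    # C = (F - 32) * 5 / 9
    sec.axis = sec_axis(~ (. - 32) * 5 / 9, name = "Celsius")
  )
```

Because the Celsius axis is computed from the Fahrenheit axis, the two can never disagree about where a point sits, which is what makes this kind of dual axis legal.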

@@ -96,7 +94,7 @@ ggplot(weather_atl, aes(x = time, y = temperatureHigh)) +
theme_minimal()
```

### Combining plots
## Combining plots

A good alternative to using two y-axes is to use two plots instead. The [{patchwork} package](https://github.com/thomasp85/patchwork) makes this *really* easy to do with R. There are other similar packages that do this, like {cowplot} and {gridExtra}, but I've found that {patchwork} is the easiest to use *and* it actually aligns the different plot elements like axis lines and legends (yay alignment in CRAP!). The [documentation for {patchwork}](https://patchwork.data-imaginist.com/articles/guides/assembly.html) is really great and full of examples—you should check it out to see all the things you can do with it!

@@ -146,7 +144,7 @@ temp_plot + humidity_plot +
plot_layout(ncol = 1, heights = c(0.7, 0.3))
```

### Scatterplot matrices
## Scatterplot matrices

We can visualize the correlations between pairs of variables with the `ggpairs()` function in the {GGally} package. For instance, how correlated are high and low temperatures, humidity, wind speed, and the chance of precipitation? We first make a smaller dataset with just those columns, and then we feed that dataset into `ggpairs()` to see all the correlation information:

@@ -170,7 +168,7 @@ ggpairs(weather_correlations) +
```


### Correlograms
## Correlograms

Scatterplot matrices typically include way too much information to be used in actual publications. I use them when doing my own analysis just to see how different variables are related, but I rarely polish them up for public consumption. In the readings for today, Claus Wilke showed a type of plot called a [*correlogram*](https://clauswilke.com/dataviz/visualizing-associations.html#associations-correlograms) which *is* more appropriate for publication.

@@ -255,7 +253,7 @@ ggplot(things_to_correlate_long,
```


### Simple regression
## Simple regression

We can also visualize the relationships between variables using regression. Simple regression is easy to visualize, since you're only working with an X and a Y. For instance, what's the relationship between humidity and high temperatures during the summer?

@@ -295,7 +293,7 @@ ggplot(weather_atl_summer,

And indeed, as humidity increases, temperatures decrease.

### Coefficient plots
## Coefficient plots

But if we use multiple variables in the model, it gets really hard to visualize the results since we're working with multiple dimensions. Instead, we can use coefficient plots to see the individual coefficients in the model.

@@ -334,7 +332,7 @@ ggplot(model_tidied,

Neat! Now we can see how big these different coefficients are and how close they are to zero. Wind speed has a big significant effect on temperature. The others are all very close to zero.

### Marginal effects plots
## Marginal effects plots

::: {.callout-tip}
### 2023 update!
@@ -455,7 +453,7 @@ ggplot(predicted_values_fancy, aes(x = windSpeed, y = .fitted)) +
That's so neat! Temperatures go down slightly as cloud cover increases. If we wanted to improve the model, we'd add an interaction term between cloud cover and windspeed so that each line would have a different slope in addition to a different intercept, but that's beyond the scope of this class.


### Predicted values and marginal effects in 2023
## Predicted values and marginal effects in 2023

Instead of using `expand_grid()` and `augment()` to create and plug in a mini dataset of variables to move up and down, we can use [the {marginaleffects} package](https://vincentarelbundock.github.io/marginaleffects/) to simplify life!

12 changes: 5 additions & 7 deletions example/08-example.qmd
@@ -24,8 +24,6 @@ If you want to skip the data downloading, you can download the data below (you'l
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -38,7 +36,7 @@ set.seed(1234)
options("digits" = 2, "width" = 150)
```

### Load and clean data
## Load and clean data

First, we load the libraries we'll be using:

@@ -106,7 +104,7 @@ wdi_clean <- wdi_raw %>%
head(wdi_clean)
```

### Small multiples
## Small multiples

First we can make some small multiples plots and show life expectancy over time for a handful of countries. We'll make a list of some countries chosen at random while I scrolled through the data, and then filter our data to include only those rows. We then plot life expectancy, faceting by country.

@@ -167,7 +165,7 @@ ggplot(life_expectancy_eu, aes(x = year, y = life_expectancy)) +

Neat!

### Sparklines
## Sparklines

Sparklines are just line charts (or bar charts) that are really really small.

@@ -206,7 +204,7 @@ You can then use those saved tiny plots in your text.
> Both India <img class="img-inline" src="/example/08-example_files/figure-html/india-spark-1.png" width = "100"/> and China <img class="img-inline" src="/example/08-example_files/figure-html/china-spark-1.png" width = "100"/> have seen increased CO<sub>2</sub> emissions over the past 20 years.

### Slopegraphs
## Slopegraphs

We can make a slopegraph to show changes in GDP per capita between two time periods. We need to first filter our WDI data to include only the start and end years (here 1995 and 2015). Then, to make sure that we're using complete data, we'll get rid of any country that has missing data for either 1995 or 2015. The `group_by(...) %>% filter(...) %>% ungroup()` pipeline does this, with the `!any(is.na(gdp_per_cap))` test keeping a country's rows only when none of that country's `gdp_per_cap` values are missing.
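Here's a minimal sketch of that grouped filter on a tiny made-up dataset, so you can see which rows survive. The `gdp_toy` data is an assumption for illustration only. Country B is missing its 1995 value, so both of its rows get dropped:

```r
library(dplyr)

# Toy data standing in for the WDI data (values are made up)
gdp_toy <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1995, 2015, 1995, 2015),
  gdp_per_cap = c(1000, 2000, NA, 1500)
)

gdp_complete <- gdp_toy %>%
  filter(year %in% c(1995, 2015)) %>%
  group_by(country) %>%
  # Evaluated once per country: TRUE only if no gdp_per_cap value is missing,
  # so a country is kept or dropped as a whole
  filter(!any(is.na(gdp_per_cap))) %>%
  ungroup()
```

Because the test runs within each group, a single `NA` removes the whole country rather than just the one row, which is what a slopegraph needs, since every line has to have both endpoints.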

@@ -289,7 +287,7 @@ ggplot(gdp_south_asia, aes(x = year, y = gdp_per_cap, group = country, color = c
```


### Bump charts
## Bump charts

Finally, we can make a bump chart that shows changes in rankings over time. We'll look at CO<sub>2</sub> emissions in South Asia. First we need to calculate a new variable that shows the rank of each country within each year. We can do this if we group by year and then use the `rank()` function to rank countries by the `co2_emissions` column.
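Here's a minimal sketch of ranking within each year, using a tiny made-up dataset (the `co2_toy` values are assumptions for illustration). One choice to watch: `rank()` ranks in ascending order, so this sketch negates the column to give the largest emitter rank 1. Whether rank 1 should mean the most or the least emissions is a judgment call, and the full example may do it differently:

```r
library(dplyr)

# Toy data: three countries in two years (made-up emissions)
co2_toy <- tibble(
  country = rep(c("A", "B", "C"), times = 2),
  year = rep(c(2004, 2014), each = 3),
  co2_emissions = c(10, 30, 20, 25, 15, 40)
)

co2_ranked <- co2_toy %>%
  group_by(year) %>%
  # rank() is ascending by default; negating makes the biggest emitter rank 1
  mutate(ranking = rank(-co2_emissions)) %>%
  ungroup()
```

Grouping by year first means each year gets its own set of ranks from 1 to 3, which is what lets the lines in a bump chart cross as countries trade places.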

8 changes: 3 additions & 5 deletions example/09-example.qmd
@@ -24,8 +24,6 @@ If you want to skip the data downloading, you can download the data below (you'l
</div>


## Complete code

::: {.callout-important}
### Slight differences from the video

@@ -39,7 +37,7 @@ options("digits" = 2, "width" = 150)
```


### Load data
## Load data

First, we load the libraries we'll be using:

@@ -70,7 +68,7 @@ wdi_clean <- wdi_co2_raw %>%
filter(region != "Aggregates")
```

### Clean and reshape data
## Clean and reshape data

Next we'll do some substantial filtering and reshaping so that we can end up with the rankings of CO~2~ emissions in 1995 and 2014. I annotate as much as possible below so you can see what's happening in each step.

@@ -133,7 +131,7 @@ And here's what it looks like now:
head(co2_rankings)
```

### Plot the data and annotate
## Plot the data and annotate

I use IBM Plex Sans in this plot. You can [download it from Google Fonts](https://fonts.google.com/specimen/IBM+Plex+Sans).
