From 69e00fa0d7ea3386bb4eeabbb141aef592350eba Mon Sep 17 00:00:00 2001 From: Andrew Heiss Date: Mon, 29 May 2023 22:15:05 -0400 Subject: [PATCH] Move all example headings back a level --- example/03-example.qmd | 12 +++++------- example/04-example.qmd | 16 +++++++--------- example/06-example.qmd | 10 ++++------ example/07-example.qmd | 20 +++++++++----------- example/08-example.qmd | 12 +++++------- example/09-example.qmd | 8 +++----- example/10-example.qmd | 12 +++++------- example/11-example.qmd | 12 +++++------- example/12-example.qmd | 28 +++++++++++++--------------- example/13-example.qmd | 26 ++++++++++++-------------- 10 files changed, 68 insertions(+), 88 deletions(-) diff --git a/example/03-example.qmd b/example/03-example.qmd index 77e87ed..716222b 100644 --- a/example/03-example.qmd +++ b/example/03-example.qmd @@ -28,8 +28,6 @@ And I *promise* future examples will not be this long! -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -44,7 +42,7 @@ set.seed(1234) options(dplyr.summarise.inform = FALSE) ``` -### Load and clean data +## Load and clean data First, we need to load a few libraries: {tidyverse} (as always) and {readxl} for reading Excel files: @@ -95,7 +93,7 @@ bbc <- bbc_raw %>% mutate(grant_year_category = factor(grant_year)) ``` -### Histograms +## Histograms First let's look at the distribution of grant amounts with a histogram. Map `grant_amount` to the x-axis and don't map anything to the y-axis, since `geom_histogram()` will calculate the y-axis values for us: @@ -142,7 +140,7 @@ ggplot(bbc, aes(x = grant_amount, fill = grant_year_category)) + Neat! -### Points +## Points Next let's look at the data using points, mapping year to the x-axis and grant amount to the y-axis: @@ -181,7 +179,7 @@ ggplot(bbc, aes(x = grant_year_category, y = grant_amount, color = grant_program It does! We appear to have two different distributions of grants: small grants have a limit of £30,000, while regular grants have a much higher average amount. -### Boxplots +## Boxplots We can add summary information to the plot by only changing the `geom` we're using. Switch from `geom_point()` to `geom_boxplot()`: @@ -190,7 +188,7 @@ ggplot(bbc, aes(x = grant_year_category, y = grant_amount, color = grant_program geom_boxplot() ``` -### Summaries +## Summaries We can also make smaller summarized datasets with {dplyr} functions like `group_by()` and `summarize()` and plot those. First let's look at grant totals, averages, and counts over time: diff --git a/example/04-example.qmd b/example/04-example.qmd index cffdc08..6913182 100644 --- a/example/04-example.qmd +++ b/example/04-example.qmd @@ -19,8 +19,6 @@ If you want to follow along with this example, you can download the data directl -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -33,7 +31,7 @@ set.seed(1234) options(dplyr.summarise.inform = FALSE) ``` -### Load data +## Load data There are two CSV files: @@ -76,7 +74,7 @@ births_2000_2014 <- read_csv(here::here( births_combined <- bind_rows(births_1994_1999, births_2000_2014) ``` -### Wrangle data +## Wrangle data Let's look at the first few rows of the data to see what we're working with: @@ -111,7 +109,7 @@ If you look at the data now, you can see the columns are changed and have differ Our `births` data is now clean and ready to go! -### Bar plot +## Bar plot First we can look at a bar chart showing the total number of births each day. We need to make a smaller summarized dataset and then we'll plot it: @@ -156,7 +154,7 @@ ggplot(data = total_births_weekday, x = NULL, y = "Total births") ``` -### Lollipop chart +## Lollipop chart Since the ends of the bars are often the most important part of the graph, we can use a lollipop chart to emphasize them. We'll keep all the same code from our bar chart and make a few changes: @@ -181,7 +179,7 @@ ggplot(data = total_births_weekday, ``` -### Strip plot +## Strip plot However, we want to \#barbarplots! (Though they're arguably okay here, since they show totals and not averages). Let's show all the data with points. We'll use the full dataset now, map x to weekday, y to births, and change `geom_col()` to `geom_point()`. We'll tell `geom_point()` to jitter the points randomly. @@ -195,7 +193,7 @@ ggplot(data = births, There are some interesting points in the low ends, likely because of holidays like Labor Day and Memorial Day (for the Mondays) and Thanksgiving (for the Thursday). If we had a column that indicated whether a day was a holiday, we could color by that and it would probably explain most of those low numbers. Unfortunately we don't have that column, and it'd be hard to make. Some holidays are constant (Halloween is always October 31), but some aren't (Thanksgiving is the fourth Thursday in November, so we'd need to find out which November 20-somethingth each year is the fourth Thursday, and good luck doing that at scale). -### Beeswarm plot +## Beeswarm plot We can add some structure to these points if we use the [{ggbeeswarm} package](https://github.com/eclarke/ggbeeswarm), with either `geom_beeswarm()` or `geom_quasirandom()`. `geom_quasirandom()` actually works better here since there are so many points—`geom_beeswarm()` makes the clusters of points way too wide. @@ -210,7 +208,7 @@ ggplot(data = births, guides(color = "none") ``` -### Heatmap +## Heatmap Finally, let's use something non-traditional to show the average births by day in a somewhat proportional way. We can calculate the average number of births every day and then make a heatmap that fills each square by that average, thus showing the relative differences in births per day. diff --git a/example/06-example.qmd b/example/06-example.qmd index 34ab0d3..aa7a9ab 100644 --- a/example/06-example.qmd +++ b/example/06-example.qmd @@ -18,8 +18,6 @@ If you want to follow along with this example, you can download the data below ( -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -31,7 +29,7 @@ knitr::opts_chunk$set(fig.width = 6, fig.height = 3.6, fig.align = "center", col set.seed(1234) ``` -### Load and clean data +## Load and clean data First, we load the libraries we'll be using: @@ -61,7 +59,7 @@ weather_atl <- weather_atl_raw %>% Now we're ready to go! -### Histograms +## Histograms We can first make a histogram of wind speed. We'll use a bin width of 1 and color the edges of the bars white: @@ -98,7 +96,7 @@ ggplot(weather_atl, aes(x = windSpeed, fill = Month)) + Neat! January, March, and April appear to have the most variation in windy days, with a few wind-less days and a few very-windy days, while August was very wind-less. -### Density plots +## Density plots The code to create a density plot is nearly identical to what we used for the histogram—the only thing we change is the `geom` layer: @@ -195,7 +193,7 @@ ggplot(weather_atl_long, aes(x = temp, y = fct_rev(Month), Super neat! We can see much wider temperature disparities during the summer, with large gaps between high and low, and relatively equal high/low temperatures during the winter. -### Box, violin, and rain cloud plots +## Box, violin, and rain cloud plots Finally, we can look at the distribution of variables with box plots, violin plots, and other similar graphs. First, we'll make a box plot of windspeed, filled by the `Day` variable we made indicating weekday: diff --git a/example/07-example.qmd b/example/07-example.qmd index 38cc93a..dd13ff4 100644 --- a/example/07-example.qmd +++ b/example/07-example.qmd @@ -17,8 +17,6 @@ If you want to follow along with this example, you can download the data below ( -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -31,7 +29,7 @@ set.seed(1234) options("digits" = 2, "width" = 150) ``` -### Load and clean data +## Load and clean data First, we load the libraries we'll be using: @@ -52,7 +50,7 @@ weather_atl <- read_csv("data/atl-weather-2019.csv") weather_atl <- read_csv(here::here("files", "data", "external_data", "atl-weather-2019.csv")) ``` -### Legal dual y-axes +## Legal dual y-axes It is fine (and often helpful!) to use two y-axes if the two different scales measure the same thing, like counts and percentages, Fahrenheit and Celsius, pounds and kilograms, inches and centimeters, etc. @@ -96,7 +94,7 @@ ggplot(weather_atl, aes(x = time, y = temperatureHigh)) + theme_minimal() ``` -### Combining plots +## Combining plots A good alternative to using two y-axes is to use two plots instead. The [{patchwork} package](https://github.com/thomasp85/patchwork) makes this *really* easy to do with R. There are other similar packages that do this, like {cowplot} and {gridExtra}, but I've found that {patchwork} is the easiest to use *and* it actually aligns the different plot elements like axis lines and legends (yay alignment in CRAP!). The [documentation for {patchwork}](https://patchwork.data-imaginist.com/articles/guides/assembly.html) is really great and full of examples—you should check it out to see all the things you can do with it! @@ -146,7 +144,7 @@ temp_plot + humidity_plot + plot_layout(ncol = 1, heights = c(0.7, 0.3)) ``` -### Scatterplot matrices +## Scatterplot matrices We can visualize the correlations between pairs of variables with the `ggpairs()` function in the {GGally} package. For instance, how correlated are high and low temperatures, humidity, wind speed, and the chance of precipitation? We first make a smaller dataset with just those columns, and then we feed that dataset into `ggpairs()` to see all the correlation information: @@ -170,7 +168,7 @@ ggpairs(weather_correlations) + ``` -### Correlograms +## Correlograms Scatterplot matrices typically include way too much information to be used in actual publications. I use them when doing my own analysis just to see how different variables are related, but I rarely polish them up for public consumption. In the readings for today, Claus Wilke showed a type of plot called a [*correlogram*](https://clauswilke.com/dataviz/visualizing-associations.html#associations-correlograms) which *is* more appropriate for publication. @@ -255,7 +253,7 @@ ggplot(things_to_correlate_long, ``` -### Simple regression +## Simple regression We can also visualize the relationships between variables using regression. Simple regression is easy to visualize, since you're only working with an X and a Y. For instance, what's the relationship between humidity and high temperatures during the summer? @@ -295,7 +293,7 @@ ggplot(weather_atl_summer, And indeed, as humidity increases, temperatures decrease. -### Coefficient plots +## Coefficient plots But if we use multiple variables in the model, it gets really hard to visualize the results since we're working with multiple dimensions. Instead, we can use coefficient plots to see the individual coefficients in the model. @@ -334,7 +332,7 @@ ggplot(model_tidied, Neat! Now we can see how big these different coefficients are and how close they are to zero. Wind speed has a big significant effect on temperature. The others are all very close to zero. -### Marginal effects plots +## Marginal effects plots ::: {.callout-tip} ### 2023 update! @@ -455,7 +453,7 @@ ggplot(predicted_values_fancy, aes(x = windSpeed, y = .fitted)) + That's so neat! Temperatures go down slightly as cloud cover increases. If we wanted to improve the model, we'd add an interaction term between cloud cover and windspeed so that each line would have a different slope in addition to a different intercept, but that's beyond the scope of this class. -### Predicted values and marginal effects in 2023 +## Predicted values and marginal effects in 2023 Instead of using `expand_grid()` and `augment()` to create and plug in a mini dataset of variables to move up and down, we can use [the {marginaleffects} package](https://vincentarelbundock.github.io/marginaleffects/) to simplify life! diff --git a/example/08-example.qmd b/example/08-example.qmd index 20c4b2f..29fdec8 100644 --- a/example/08-example.qmd +++ b/example/08-example.qmd @@ -24,8 +24,6 @@ If you want to skip the data downloading, you can download the data below (you'l -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -38,7 +36,7 @@ set.seed(1234) options("digits" = 2, "width" = 150) ``` -### Load and clean data +## Load and clean data First, we load the libraries we'll be using: @@ -106,7 +104,7 @@ wdi_clean <- wdi_raw %>% head(wdi_clean) ``` -### Small multiples +## Small multiples First we can make some small multiples plots and show life expectancy over time for a handful of countries. We'll make a list of some countries chosen at random while I scrolled through the data, and then filter our data to include only those rows. We then plot life expectancy, faceting by country. @@ -167,7 +165,7 @@ ggplot(life_expectancy_eu, aes(x = year, y = life_expectancy)) + Neat! -### Sparklines +## Sparklines Sparklines are just line charts (or bar charts) that are really really small. @@ -206,7 +204,7 @@ You can then use those saved tiny plots in your text. > Both India and China have seen increased CO2 emissions over the past 20 years. -### Slopegraphs +## Slopegraphs We can make a slopegraph to show changes in GDP per capita between two time periods. We need to first filter our WDI to include only the start and end years (here 1995 and 2015). Then, to make sure that we're using complete data, we'll get rid of any country that has missing data for either 1995 or 2015. The `group_by(...) %>% filter(...) %>% ungroup()` pipeline does this, with the `!any(is.na(gdp_per_cap))` test keeping any rows where any of the `gdp_per_cap` values are not missing for the whole country. @@ -289,7 +287,7 @@ ggplot(gdp_south_asia, aes(x = year, y = gdp_per_cap, group = country, color = c ``` -### Bump charts +## Bump charts Finally, we can make a bump chart that shows changes in rankings over time. We'll look at CO2 emissions in South Asia. First we need to calculate a new variable that shows the rank of each country within each year. We can do this if we group by year and then use the `rank()` function to rank countries by the `co2_emissions` column. diff --git a/example/09-example.qmd b/example/09-example.qmd index 33d5d39..187ab9d 100644 --- a/example/09-example.qmd +++ b/example/09-example.qmd @@ -24,8 +24,6 @@ If you want to skip the data downloading, you can download the data below (you'l -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -39,7 +37,7 @@ options("digits" = 2, "width" = 150) ``` -### Load data +## Load data First, we load the libraries we'll be using: @@ -70,7 +68,7 @@ wdi_clean <- wdi_co2_raw %>% filter(region != "Aggregates") ``` -### Clean and reshape data +## Clean and reshape data Next we'll do some substantial filtering and reshaping so that we can end up with the rankings of CO~2~ emissions in 1995 and 2014. I annotate as much as possible below so you can see what's happening in each step. @@ -133,7 +131,7 @@ And here's what it looks like now: head(co2_rankings) ``` -### Plot the data and annotate +## Plot the data and annotate I use IBM Plex Sans in this plot. You can [download it from Google Fonts](https://fonts.google.com/specimen/IBM+Plex+Sans). diff --git a/example/10-example.qmd b/example/10-example.qmd index e8eeb95..7e9a6bb 100644 --- a/example/10-example.qmd +++ b/example/10-example.qmd @@ -22,15 +22,13 @@ If you want to skip the data downloading, you can download the data below (you'l There is no video for this one, since it really only involves feeding a few ggplot plots fed into `ggplotly()`. -## Complete code - ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 6, fig.height = 3.6, fig.align = "center", collapse = TRUE) set.seed(1234) options("digits" = 2, "width" = 150) ``` -### Get and clean data +## Get and clean data First, we load the libraries we'll be using: @@ -62,7 +60,7 @@ wdi_clean <- wdi_parl_raw %>% ``` -### Creating a basic interactive chart +## Creating a basic interactive chart Let's make a chart that shows the distribution of the proportion of women in national parliaments in 2019, by continent. We'll use a strip plot with jittered points. @@ -100,7 +98,7 @@ ggplotly(static_plot) Not *everything* translates over to JavaScript—the caption is gone now, and the legend is back (which is fine, I guess, since the legend is interactive). But still, this is magic. -### Modifying the tooltip +## Modifying the tooltip Right now, the default tooltip you see when you hover over the points includes the actual proportion of women in parliament for each point, along with the continent, which is neat, but it'd be great if we could see the country name too. The tooltip picks up the information to include from the variables we use in `aes()`, and we never map the `country` column to any aesthetic, so it doesn't show up. @@ -129,7 +127,7 @@ ggplotly(static_plot_toolip, tooltip = "text") Now we should just see the country names in the tooltips! -### Including more information in the tooltip +## Including more information in the tooltip We have country names, but we lost the values in the x-axis. Rwanda has the highest proportion of women in parliament, but what's the exact number? It's somewhere above 60%, but that's all we can see now. @@ -198,7 +196,7 @@ htmlwidgets::saveWidget(interactive_plot, "fancy_plot.html") ``` -### Making a dashboard with {flexdashboard} +## Making a dashboard with {flexdashboard} The [documentation for {flexdashboard} is so great and complete](https://rmarkdown.rstudio.com/flexdashboard/) that I'm not going to include a full example here. There is also a brief overview in [chapter 5 of the official R Markdown book](https://bookdown.org/yihui/rmarkdown/dashboards.html). You can also watch [this really quick video here](https://www.youtube.com/watch?v=_oDfBVr9wmQ). She uses a package called {dimple} instead of {plotly}, which doesn't work with ggplot like `ggplotly()`, so *ignore her code* about `dimple()` and use your `ggplotly()` skills instead. You can search YouTube for a bunch of other short tutorial videos, too. diff --git a/example/11-example.qmd b/example/11-example.qmd index c0464f6..79be99b 100644 --- a/example/11-example.qmd +++ b/example/11-example.qmd @@ -24,8 +24,6 @@ If you want to skip the data downloading, you can download the data below (you'l -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -38,7 +36,7 @@ set.seed(1234) options("digits" = 2, "width" = 150) ``` -### Get data +## Get data First, we load the libraries we'll be using: @@ -98,7 +96,7 @@ fred_raw <- read_csv("data/fred_raw.csv") fred_raw <- read_csv(here::here(fred_path)) ``` -### Look at and clean data +## Look at and clean data The data we get from FRED is in a slightly different format than we're used to with `WDI()`, but with good reason. With World Bank data, you get data for every country and every year, so there are rows for Afghanistan 2000, Afghanistan 2001, etc. You then get a column for each of the variables you want (population, life expectancy, GDP/capita, etc.) @@ -158,7 +156,7 @@ All better. We can make as many subsets of the long, tidy, raw data as we want. -### Plotting time +## Plotting time Let's plot some of these and see what the trends look like. We'll just use `geom_line()`. @@ -199,7 +197,7 @@ Yikes COVID-19. There, we visualized time. `r emoji::emoji("check")` -### Improving graphics +## Improving graphics These were simple graphs and they're kind of helpful, but they're not incredibly informative. We can clean these up a little. First we can change the labels and themes and colors: @@ -323,7 +321,7 @@ ggplot(unemployment_claims_only, aes(x = date, y = price)) + theme(plot.title = element_text(face = "bold")) ``` -### Decomposition +## Decomposition The mechanics of decomposing and forecasting time series goes beyond the scope of this class, but there are lots of resources you can use to learn more, including [this phenomenal free textbook](https://otexts.com/fpp3/). diff --git a/example/12-example.qmd b/example/12-example.qmd index e0839a0..eae19ae 100644 --- a/example/12-example.qmd +++ b/example/12-example.qmd @@ -63,7 +63,7 @@ Alternatively, instead of using these index numbers, you can use any of the name I use a lot of different shapefiles in this example. To save you from having to go find and download each individual one, you can download this zip file: -- [{{< fa file-archive >}} `shapefiles.zip`](https://datavizm20.s3.amazonaws.com/shapefiles.zip) +- [{{< fa file-archive >}} `shapefiles.zip`](/files/data/external_data/shapefiles.zip) Unzip this and put all the contained folders in a folder named `data` if you want to follow along. **You don't need to follow along!** @@ -106,8 +106,6 @@ These shapefiles all came from these sources: -## Complete code - ::: {.callout-important} ### Slight differences from the video @@ -120,7 +118,7 @@ set.seed(1234) options("digits" = 2, "width" = 150) ``` -### Load and look at data +## Load and look at data First we'll load the libraries we're going to use: @@ -199,7 +197,7 @@ ga_schools <- read_sf(here::here("files", "data", "external_data", "maps", geocoded_addresses <- read_csv(here::here(geocoded_path)) ``` -### Basic plotting +## Basic plotting If you look at the `world_map` dataset in RStudio, you'll see it's just a standard data frame with `r nrow(world_map)` rows and `r ncol(world_map)` columns. The last column is the magical `geometry` column with the latitude/longitude details for the borders for every country. RStudio only shows you 50 columns at a time in the RStudio viewer, so you'll need to move to the next page of columns with the » button in the top left corner. @@ -240,7 +238,7 @@ ggplot() + theme_void() ``` -### World map with different projections +## World map with different projections Changing projections is trivial: add a `coord_sf()` layer where you specify the CRS you want to use. @@ -278,7 +276,7 @@ ggplot() + theme_void() ``` -### US map with different projections +## US map with different projections This same process works for any shapefile. The map of the US can also be projected differently—two common projections are NAD83 and Albers. We'll take the `us_states` dataset, remove Alaska, Hawaii, and Puerto Rico (they're so far from the rest of the lower 48 states that they make an unusable map—see the next section for a way to include them), and plot it. @@ -299,7 +297,7 @@ ggplot() + theme_void() ``` -### US map with non-continguous parts +## US map with non-continguous parts Plotting places like Alaska, Hawaii, and Puerto Rico gets a little tricky since they're far away from the contiguous 48 states. There's an easy way to handle it though! @@ -335,7 +333,7 @@ ggplot() + ``` -### Individual states +## Individual states Again, because these shapefiles are really just fancy data frames, we can filter them with normal dplyr functions. Let's plot just Georgia: @@ -372,7 +370,7 @@ ggplot() + Perfect. -### Plotting multiple shapefile layers +## Plotting multiple shapefile layers The state shapefiles from the Census Bureau only include state boundaries. If we want to see counties in Georgia, we need to download and load the Census's county shapefiles (which we did above). We can then add a second `geom_sf()` layer for the counties. @@ -412,7 +410,7 @@ ggplot() + theme_void() ``` -### Plotting multiple shapefile layers when some are bigger than the parent shape +## Plotting multiple shapefile layers when some are bigger than the parent shape So far we've been able to filter out states and counties that we don't want to plot using `filter()`, which works because the shapefiles have geometry data for each state or county. But what if you're plotting stuff that doesn't follow state or county boundaries, like freeways, roads, rivers, or lakes? @@ -494,7 +492,7 @@ ggplot() + Heck yeah. That's a great map. This is basically what [Kieran Healy did here](https://kieranhealy.org/prints/rivers/), but he used [even more detailed shapefiles from the US Geological Survey](https://www.usgs.gov/core-science-systems/ngp/national-hydrography). -### Plotting schools in Georgia +## Plotting schools in Georgia Shapefiles are not limited to just lines and areas—they can also contain points. I made a free account at the Georgia GIS Clearinghouse, searched for "schools" and found a shapefile of all the K–12 schools in 2009. [This is the direct link to the page](https://data.georgiaspatial.org/index.asp?body=preview&dataId=41516), but it only works if you're logged in to their system. [This is the official metadata for the shapefile](https://data.georgiaspatial.org/data/statewide/other/schools_2009.html), which you can see if you're not logged in, but you can't download anything. It's a dumb system and other states are a lot better at offering their GIS data (like, [here's a shapefile of all of Utah's schools and libraries](https://gis.utah.gov/data/society/schools-libraries/) as of 2017, publicly accessible without an account). @@ -561,7 +559,7 @@ ga_schools_fixed %>% ``` -### Making your own geoencoded data +## Making your own geoencoded data So, plotting shapefiles with `geom_sf()` is magical because {sf} deals with all of the projection issues for us automatically and it figures out how to plot all the latitude and longitude data for us automatically. But lots of data *doesn't* some as shapefiles. The [rats data from mini project 1](/assignment/01-mini-project.qmd), for instance, has two columns indicating the latitude and longitude of each rat sighting, but those are stored as just numbers. If we try to use `geom_sf()` with the rat data, it won't work. We need that magical `geometry` column. @@ -609,7 +607,7 @@ ggplot() + theme_void() ``` -### Automatic geoencoding by address +## Automatic geoencoding by address Using `st_as_sf()` is neat when you have latitude and longitude data already, but what if you have a list of addresses or cities instead, with no fancy geographic information? It's easy enough to right click on Google Maps, but you don't really want to do that hundreds of times for large-scale data. @@ -687,7 +685,7 @@ ggplot() + theme_void() ``` -### Plotting other data on maps +## Plotting other data on maps So far we've just plotted whatever data the shapefile creators decided to include and publish in their data. But what if you want to visualize some other variable on a map? We can do this by combining our shapefile data with any other kind of data, as long as the two have a shared column. For instance, we can make a choropleth map of life expectancy with data from the World Bank. diff --git a/example/13-example.qmd b/example/13-example.qmd index 8ba6a75..e314dac 100644 --- a/example/13-example.qmd +++ b/example/13-example.qmd @@ -34,8 +34,6 @@ If you want to see other examples of text visualizations with the {tidytext} pac -## Complete code - ::: {.callout-important} ### Big differences from the video @@ -49,7 +47,7 @@ options("digits" = 2, "width" = 150) options(dplyr.summarise.inform = FALSE) ``` -### Get data +## Get data First, as always, we'll load the libraries we'll be using: @@ -113,7 +111,7 @@ books_raw <- read_csv("data/books_raw.csv") ```` -### Clean data +## Clean data The data you get from Project Gutenberg comes in a tidy format, with a column for the book id, a column for the title, and a column for text. Sometimes this text column will be divided by lines in the book; sometimes it might be an entire page or paragraph or chapter. It all depends on how the book is formatted at Project Gutenberg. @@ -164,9 +162,9 @@ We could also figure out a systematic way to indicate acts and scenes, but that' Now that we have tidy text data, let's do stuff with it! -### Tokens and word counts +## Tokens and word counts -#### Single words +### Single words One way we can visualize text is to look at word frequencies and find the most common words. This is even more important when looking across documents. @@ -222,7 +220,7 @@ These results aren't terribly surprising. "lear" is the most common word in *Kin (Sharp-eyed readers will notice that the words aren't actually in perfect order! That's because some common words are repeated across the plays, like "lord" and "sir". However, each category in a factor can only have one possible position in the orer, so because "lord" is the second most common word in *Hamlet* it also appears as #2 in *Macbeth* and *King Lear*. You can fix this with the `reorder_within()` function in {tidytext}—see [Julia Silge's tutorial here](https://juliasilge.com/blog/reorder-within/) for how to use it.) -#### Bigrams +### Bigrams We can also look at pairs of words instead of single words. To do this, we need to change a couple arguments in `unnest_tokens()`, but otherwise everything else stays the same. In order to remove stopwords, we need to split the bigram column into two columns (`word1` and `word2`) with `separate()`, filter each of those columns, and then combine the word columns back together as `bigram` with `unite()` @@ -264,7 +262,7 @@ ggplot(top_bigrams, aes(y = fct_rev(bigram), x = n, fill = title)) + There are some neat trends here. "Lord Hamlet" is the most common pair of words in *Hamlet* (not surprisingly), but in Macbeth the repeated "knock knock" (the first non-name repeated pair) is a well-known plot point and reoccurring symbolic theme throughout the play. -### Bigrams and probability +## Bigrams and probability We can replicate the ["She Giggles, He Gallops"](https://pudding.cool/2017/08/screen-direction/) idea by counting the bigrams that match "he X" and "she X". @@ -339,7 +337,7 @@ ggplot(plot_word_ratios, aes(y = word, x = logratio, color = logratio < 0)) + Shakespeare doesn't use a lot of fancy verbs in his plays, so we're left with incredibly common verbs like "should" and "comes" and "was". Oh well. -### Term frequency-inverse document frequency (tf-idf) +## Term frequency-inverse document frequency (tf-idf) We can determine which words are the most unique for each book/document in our corpus using by calculating the tf-idf (term frequency-inverse document frequency) score for each term. The tf-idf is the product of the term frequency and the inverse document frequency: @@ -390,7 +388,7 @@ ggplot(tragedy_tf_idf_plot, Not surprisingly, the most unique words for each play happen to be the names of the characters in those plays. -### Sentiment analysis +## Sentiment analysis In the video, I plotted the sentiment of *Little Women* across the book, but it wasn't a very interesting plot. We'll try with Shakespeare here instead. @@ -459,11 +457,11 @@ ggplot(tragedies_split_into_lines, Neat. They're all really sad and negative, except for the beginning of Romeo and Juliet where the two lovers meet and fall in love. Then everyone dies later. -### Neat extra stuff +## Neat extra stuff None of this stuff was in the video, but it's useful to know and see how to do it. It all generally comes from the [*Tidy Text Mining* book](https://www.tidytextmining.com/) by Julia Silge and David Robinson -#### Part of speech tagging +### Part of speech tagging R has no way of knowing if words are nouns, verbs, or adjectives. You can algorithmically predict what part of speech each word is using a part-of-speech tagger, like [spaCy](https://spacy.io/) or [Stanford's Natural Langauge Processing (NLP) library](https://nlp.stanford.edu/). @@ -588,11 +586,11 @@ ggplot(main_characters_by_chapter, aes(x = prop, y = "1", fill = fct_rev(name))) ``` -#### Topic modeling and fingerprinting +### Topic modeling and fingerprinting If you want to see some examples of topic modeling with Latent Dirichlet Allocation (LDA) or text fingerprinting based on sentence length and counts of hapax legomena ([based on this article](https://kops.uni-konstanz.de/bitstream/handle/123456789/5492/Literature_Fingerprinting.pdf)), see these examples from a previous version of this class: [topic modeling](https://datavizf18.classes.andrewheiss.com/class/11-class/#topic-modeling) and [fingerprinting](https://datavizf18.classes.andrewheiss.com/class/11-class/#fingerprinting). -#### Text features +### Text features Finally, you can use [the {textfeatures} package](https://github.com/mkearney/textfeatures) to find all sorts of interesting numeric statistics about text, like the number of exclamation points, commas, digits, characters per word, uppercase letters, lowercase letters, and more!