From 9acb2f9eec71cc8f385e1f040e17a9e470372cdd Mon Sep 17 00:00:00 2001
From: lcolladotor
Date: Thu, 5 Sep 2024 10:45:32 -0400
Subject: [PATCH] Add a link to https://blog.r-project.org/2020/02/16/stringsasfactors/

---
 .../index/execute-results/html.json | 4 ++--
 posts/08-managing-data-frames-with-tidyverse/index.qmd | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json
index 8441317..561bbaa 100644
--- a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json
+++ b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "bc0d7c2da9e8db6693ac978ee5d64a21",
+  "hash": "1313a0fdd88f8dbbae232b7a24d8518d",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \" An introduction to data frames in R and the managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of a `tibble` and `data.frame` data objects in R\n- Learn about the dplyr R package to manage data frames\n- Recognize the key verbs to manage data frames in dplyr\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns, can often be tedious operations in R whose syntax is not very intuitive. 
The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from data.frame, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` is notorious for treating strings as factors; this will not happen with tibbles\n- **Variable names remain unchanged** - In base R, creating `data.frames` will remove spaces from names, converting them to periods or add \"x\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print first ten rows and columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.\n\n## Creating a tibble\n\nThe tibble package is part of the `tidyverse` and can thus be loaded in (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibbl`e.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor the example, here we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing\n```\n\n\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6940 8\n```\n\n\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>%\n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n\n\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n\n\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column z of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you willl continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n\n\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: 
{.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work either a `data.frame` or `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formely RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package as a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n\n\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n\n\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n\n\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n\n\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame and the distribution of the `pm25tmean2` values is.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n\n\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n \n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n\n\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == chic` |\n| `!=` | Does not equal | `city != chic` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n\n\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n\n\n:::\n:::\n\n\nColumns can be arranged in descending order too by useing the special `desc()` operator.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n\n\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n\n\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n\n\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n\n\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n\n\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago,\n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)\n))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n\n\n:::\n:::\n\n\nNote that there are only two columns in 
the transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is?\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a separate data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years,\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n\n\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know what are the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quantiles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint,\n o3 = mean(o3tmean2, na.rm = TRUE),\n no2 = mean(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n \n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 18.8 25.8\n```\n\n\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence 
of nested function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>%\n second() %>%\n third()\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>%\n mutate(year = as.POSIXlt(date)$year + 1900) %>%\n group_by(year) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n\n\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%\n group_by(month) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n\n\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 73 61.5 1989-06-25 NA 53 34.3 28.4\n 2 chic 33 30.4 1998-01-27 NA 34 1.47 36.4\n 3 chic 18 2.3 1998-03-11 NA 20 16.1 29.9\n 4 chic 21 -2.5 1990-02-24 NA 75 29.1 11.5\n 5 chic 58 51.1 2001-12-04 14.3 58 6.96 28.3\n 6 chic 62.5 46.8 1995-05-07 NA 44 37.3 27.1\n 7 chic 68.5 69.4 1992-08-07 NA 61.5 30.4 30.1\n 8 chic 46 27.4 2003-11-30 5.6 12 16.2 19.1\n 9 chic 38 22.6 2002-11-02 10.8 21 11.0 21.8\n10 chic 39.5 29.8 1990-12-10 NA 
24.5 11.4 27.7\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` pacfkage provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar there are a few additional benefits\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package\n\n- `dplyr` can be integrated with the `data.table` package for large fast tables\n\nThe `dplyr` package is handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n\n\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-09-05\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse 
* 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \" An introduction to data frames in R and the managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
[dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of `tibble` and `data.frame` data objects in R\n- Learn about the dplyr R package to manage data frames\n- Recognize the key verbs to manage data frames in dplyr\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns can often be tedious in R, with syntax that is not very intuitive. The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. 
Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from data.frame, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` was notorious for treating strings as factors; this will not happen with tibbles. As of R version 4.0 this is no longer the case as noted by Kurt Hornik on the [R blog](https://blog.r-project.org/2020/02/16/stringsasfactors/).\n- **Variable names remain unchanged** - In base R, creating `data.frames` will remove spaces from names, converting them to periods or add \"x\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print first ten rows and columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits on your screen.\n\n## Creating a tibble\n\nThe tibble package is part of the `tidyverse` and can thus be loaded in (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibble`.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor this example, we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing\n```\n\n\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6940 8\n```\n\n\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>%\n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n\n\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n\n\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column `z` of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you will continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n\n\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: 
{.cell-output .cell-output-stdout}\n\n```\n[1] 50 82 122 170 226\n```\n\n\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work with either a `data.frame` or `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formerly RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame in a way that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package has a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user, and you do not need to worry about them.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n\n\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n\n\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n\n\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n\n\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n\n\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame, and the distribution of the `pm25tmean2` values is summarized below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n\n\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n \n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n\n\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == \"chic\"` |\n| `!=` | Does not equal | `city != \"chic\"` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
Reordering rows of a data frame (while preserving the corresponding order of the other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n\n\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n\n\n:::\n:::\n\n\nRows can also be arranged in descending order of a column by using the `desc()` helper function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n\n\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n\n\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n\n\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and should probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n\n\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n\n\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago,\n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)\n))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n\n\n:::\n:::\n\n\nNote that there are only two columns in 
the transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is?\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a separate data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years,\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n\n\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know what are the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quantiles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint,\n o3 = mean(o3tmean2, na.rm = TRUE),\n no2 = mean(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n \n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 18.8 25.8\n```\n\n\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence 
of nested function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>%\n second() %>%\n third()\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>%\n mutate(year = as.POSIXlt(date)$year + 1900) %>%\n group_by(year) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n\n\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%\n group_by(month) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n\n\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 73 49.9 1988-07-04 NA 51 47.1 38.1\n 2 chic 56 50.6 2000-06-04 12.8 20 28.0 14.4\n 3 chic 66 47.4 1988-07-01 NA 19 22.8 20.9\n 4 chic 74 61.8 2002-09-07 32.0 59.5 38.8 40.1\n 5 chic 51 46.5 2002-03-08 13.6 12 19.0 23.2\n 6 chic 66 51.5 2001-10-03 16.2 66.5 35.5 21.5\n 7 chic 34 26.1 1989-03-09 NA 77 8.54 50.3\n 8 chic 79 68.9 2005-08-12 NA 27 30.8 24.0\n 9 chic 45 35 2002-10-30 7.43 9 20.2 20.1\n10 chic 44 35 1998-02-22 27.8 
39.1 11.8 23.3\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again, the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n\n\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar there are a few additional benefits:\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package.\n\n- `dplyr` can be integrated with the `data.table` package for large fast tables.\n\nThe `dplyr` package is a handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n\n\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-09-05\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse 
* 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/posts/08-managing-data-frames-with-tidyverse/index.qmd b/posts/08-managing-data-frames-with-tidyverse/index.qmd index ff8422b..e5cb066 100644 --- a/posts/08-managing-data-frames-with-tidyverse/index.qmd +++ b/posts/08-managing-data-frames-with-tidyverse/index.qmd @@ -86,7 +86,7 @@ To see a full vignette about tibbles and how they differ from data.frame, you wi We will summarize some of the most important points here: -- **Input type remains unchanged** - `data.frame` is notorious for treating strings as factors; this will not happen with tibbles +- **Input type remains unchanged** - `data.frame` was notorious for treating strings as factors; this will not happen with tibbles. As of R version 4.0 this is no longer the case as noted by Kurt Hornik on the [R blog](https://blog.r-project.org/2020/02/16/stringsasfactors/). - **Variable names remain unchanged** - In base R, creating `data.frames` will remove spaces from names, converting them to periods or add "x" before numeric column names. Creating tibbles will not change variable (column) names. - **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names. - **Tibbles print first ten rows and columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.