diff --git a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json index 083d7be..ce18255 100644 --- a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json +++ b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "f1d609aa26ae665593d37bd091989900", + "hash": "27583eecf8d325ba0fbc85dc5d42cee3", "result": { - "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. ---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. 
[Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. [Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and Rstudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"Low-level\" programming languages (e.g. 
Fortran, C, etc) often have more efficient code, they can also be harder to learn because it is designed to be close to a machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R Community is very dynamic, helpful and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) on Twitter, [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicates active development of the software and ensures that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new feature, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like lattice and ggplot2 allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages and is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also many packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50 year old technology**, going back to the original S system developed at Bell Labs. There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). 
This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called a Rmarkdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following setting\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **workspace**\n - Uncheck restore .RData into workspace at startup\n - Save workspace to .RData on exit : \"Never\"\n- For **History**\n - Uncheck \"Always save history (even when not saving .RData)\n - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio,go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you tried to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which of the packages above `ggplot2` needs (and thus installs if it is not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. 
To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github()\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary('ggplot2')\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio are the [R Studio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. 
If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n::: {.cell-output-display}\n![](https://github.com/djnavarro/art/raw/master/static/gallery/water-colours/watercolour_splash.jpg)\n:::\n:::\n\n\n\\['Water Colours' from Danielle Navarro \\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n 
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. 
---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. [Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. [Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and RStudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"low-level\" programming languages (e.g. 
Fortran, C, etc.) often have more efficient code, they can also be harder to learn because they are designed to be close to machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts, with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R community is very dynamic, helpful, and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) hashtags on Twitter, the [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and the [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements, or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open-source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open-source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases are made as needed. The frequent releases and regular release cycle indicate active development of the software and ensure that bugs are addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new features, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other, newer graphics systems, like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is useful for interactive work but also contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who creates new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform, and thousands of people around the world have come together to make contributions to R, to develop packages, and to help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now, and there is considerable activity on websites like Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package, which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also many packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50-year-old technology**, going back to the original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). 
This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity have continued to grow over time, and the amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level, one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is a super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called an R Markdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your R Markdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/package viewer is a handy browser: the Files tab works like Finder or File Explorer for your current files, the Plots tab is where your plots appear, and you can also browse installed packages and help files there. The environment / history pane contains the list of objects you have created and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
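If you are unsure which R version you currently have, you can check from within R itself. A minimal sketch (assuming only base R, nothing extra needs to be installed):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Print the version of R you are running\nR.version.string\n\n## Compare against the recommended minimum; TRUE means you are recent enough\ngetRversion() >= \"4.0.0\"\n```\n:::\n\n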
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following settings\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **Workspace**\n    - Uncheck \"Restore .RData into workspace at startup\"\n    - Set \"Save workspace to .RData on exit\" to \"Never\"\n- For **History**\n    - Uncheck \"Always save history (even when not saving .RData)\"\n    - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons:\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
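One of the questions above asked how many R packages are on CRAN today; you can get the current count from R itself. A small sketch (it requires internet access, and the number changes daily, so no fixed answer is shown):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## List every package currently available on CRAN and count the rows\ncran_pkgs <- available.packages(repos = \"https://cran.r-project.org\")\nnrow(cran_pkgs)\n```\n:::\n\n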
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio, go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (these are called dependencies), and they are installed automatically. It usually does not matter whether you use single or double quotation marks around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you try to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which packages `ggplot2` needs (and thus installs if they are not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest version. The downside is that packages under development can often be buggy and may not work right. 
To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Hypothetical placeholder: replace \"user/repo\" with the GitHub\n## \"owner/repository\" of the package you want to install\nremotes::install_github(\"user/repo\")\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance, to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"ggplot2\")\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R without RStudio to do pretty much every task, including all the ones we cover in this class, RStudio is very useful: it has lots of features that make your R coding life easier and it has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio is the [RStudio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. 
If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n::: {.cell-output-display}\n![](https://github.com/djnavarro/art/raw/master/static/gallery/water-colours/watercolour_splash.jpg)\n:::\n:::\n\n\n\\['Water Colours' from Danielle Navarro \\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n 
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/05-literate-programming/index/execute-results/html.json b/_freeze/posts/05-literate-programming/index/execute-results/html.json index a5f3c3b..658affc 100644 --- a/_freeze/posts/05-literate-programming/index/execute-results/html.json +++ b/_freeze/posts/05-literate-programming/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4e89eeba2590cab5734e048c12e3697e", + "hash": "781f56af33f440fc1c8df4db96a05562", "result": { - "markdown": "---\ntitle: \"05 - Literate Statistical Programming\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to literate statistical programming tools including R Markdown\"\nimage: https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This 
lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/05-literate-programming/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define literate programming\n- Recognize differences between available tools for literate programming\n- Know how to work efficiently within RStudio for literate programming\n- Create an R Markdown document\n:::\n\n# Introduction\n\nOne basic idea to make writing reproducible reports easier is what's known as *literate statistical programming* (or sometimes called [literate statistical practice](http://www.r-project.org/conferences/DSC-2001/Proceedings/Rossini.pdf)). This comes from the idea of [literate programming](https://en.wikipedia.org/wiki/Literate_programming) in the area of writing computer programs.\n\nThe idea is to **think of a report or a publication as a stream of text and code**.\n\n- The text is readable by people and the code is readable by computers.\n\n- The analysis is described in a series of text and code chunks.\n\n- Each kind of code chunk will do something like load some data or compute some results.\n\n- Each text chunk will relay something in a human readable language.\n\nThere might also be **presentation code** that formats tables and figures and there's article text that explains what's going on around all this code. 
This stream of text and code is a literate statistical program or a literate statistical analysis.\n\n### Weaving and Tangling\n\nLiterate programs by themselves are a bit difficult to work with, but they can be processed in two important ways.\n\nLiterate programs can be **weaved** to produce human readable documents like PDFs or HTML web pages, and they can be **tangled** to produce machine-readable \"documents\", or in other words, machine readable code.\n\nThe basic idea behind literate programming is that, in order to generate the different kinds of output you might need, **you only need a single source document**---you can weave and tangle to get the rest.\n\nIn order to use a system like this, you need a documentation language that's human readable, and you need a programming language that's machine readable (or can be compiled/interpreted into something that's machine readable).\n\n### Sweave\n\nOne of the original literate programming systems in R that was designed to do this was called Sweave. Sweave enables users to combine R code with a document markup language called LaTeX.\n\n**Sweave files end in `.Rnw`** and have R code weaved through the document:\n\n``` \n<<>>=\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n@\n```\n\nOnce you have created your `.Rnw` file, Sweave will process the file, executing the R chunks and replacing them with output as appropriate before creating the PDF document.\n\nIt was originally developed by Friedrich Leisch, who is a member of the R Core team, and the code base is still maintained by R Core. 
The Sweave system comes with any installation of R.\n\nThere are many limitations to the original Sweave system.\n\n- One of the limitations is that it is **focused primarily on LaTeX**, which is not a documentation language that many people are familiar with.\n- Therefore, it **can be difficult to learn this type of markup language** if you're not already in a field that uses it regularly.\n- Sweave also **lacks a lot of features that people find useful** like caching, multiple plots per page, and mixing programming languages.\n\nInstead, folks have **moved towards using something called knitr**, which offers everything Sweave does and extends it further.\n\n- With Sweave, additional tools are required for advanced operations, whereas knitr supports more internally. We'll discuss knitr below.\n\n### rmarkdown\n\nAnother choice for literate programming is to build documents based on the [Markdown](https://en.wikipedia.org/wiki/Markdown) language. A markdown file is a plain text file that is typically given the extension `.md`. The [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) R package takes an R Markdown file (`.Rmd`) and weaves together R code chunks like this:\n\n```` \n```{r plot1, height=4, width=5, eval=FALSE, echo=TRUE}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n````\n\n::: callout-tip\nThe best resource for learning about R Markdown is this book by Yihui Xie, J. J. 
Allaire, and Garrett Grolemund:\n\n- \n\nThe R Markdown Cookbook by Yihui Xie, Christophe Dervieux, and Emily Riederer is really good too:\n\n- \n\nThe authors of the 2nd book describe the motivation for the 2nd book as:\n\n> \"However, we have received comments from our readers and publisher that it would be beneficial to provide more practical and relatively short examples to show the interesting and useful usage of R Markdown, because it can be daunting to find out how to achieve a certain task from the aforementioned reference book (put another way, that book is too dry to read). As a result, this cookbook was born.\"\n:::\n\nBecause this lecture is built in a `.qmd` file (which is very similar to a `.Rmd` file), let's demonstrate how this works. I am going to change `eval=FALSE` to `eval=TRUE`.\n\n\n::: {.cell height='4' width='5'}\n\n```{.r .cell-code}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/plot2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Questions\n\n1. Why do we not see the back ticks \\`\\`\\` anymore in the code chunk above that made the plot?\n2. What do you think we should do if we want to have the code executed, but we want to hide the code that made it?\n:::\n\nBefore we leave this section, I find that there is quite a bit of terminology to understand the magic behind `rmarkdown` that can be confusing, so let's break it down:\n\n- [Pandoc](https://pandoc.org). Pandoc is a command line tool with no GUI that converts documents (e.g. from a number of different markup formats to many other formats, such as .doc, .pdf, etc.). It is completely independent from R (but does come bundled with RStudio).\n- [Markdown](https://en.wikipedia.org/wiki/Markdown) (**markup language**). Markdown is a lightweight [markup language](https://en.wikipedia.org/wiki/Markup_language) with plain text formatting syntax designed so that it can be converted to HTML and many other formats. 
A markdown file is a plain text file that is typically given the extension `.md`. It is completely independent from R.\n- [`markdown`](https://CRAN.R-project.org/package=markdown) (**R package**). `markdown` is an R package which converts `.md` files into HTML. It is no longer recommended for use, as it has been surpassed by [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (discussed below).\n- R Markdown (**markup language**). R Markdown is an extension of the markdown syntax. R Markdown files are plain text files that typically have the file extension `.Rmd`.\n- [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (**R package**). The R package `rmarkdown` is a library that uses pandoc to process and convert `.Rmd` files into a number of different formats. Its core function is `rmarkdown::render()`. **Note**: this package only deals with the markdown language. If the input file is e.g. `.Rhtml` or `.Rnw`, then you need to use `knitr` prior to calling pandoc (see below).\n\n::: callout-tip\nCheck out the R Markdown Quick Tour for more:\n\n- \n:::\n\n![Artwork by Allison Horst on RMarkdown](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png){width=\"80%\"}\n\n### knitr\n\nOne of the alternatives that has come up in recent times is something called `knitr`.\n\n- The `knitr` package for R takes a lot of these ideas of literate programming and updates and improves upon them.\n- `knitr` still uses R as its programming language, but it allows you to mix other programming languages in.\n- You can also use a variety of documentation languages now, such as LaTeX, markdown and HTML.\n- `knitr` was developed by Yihui Xie while he was a graduate student at Iowa State and it has become a very popular package for writing literate statistical programs.\n\nKnitr takes a plain text document with embedded code, executes the code and 'knits' the results back into the document.\n\nFor example, it converts\n\n- An R 
Markdown (`.Rmd`) file into a standard markdown file (`.md`)\n- An `.Rnw` (Sweave) file into `.tex` format.\n- An `.Rhtml` file into `.html`.\n\nThe core function is `knitr::knit()` and by default this will look at the input document and try to guess what type it is, e.g. `Rnw`, `Rmd`, etc.\n\nThis core function performs three roles:\n\n- A **source parser**, which looks at the input document and detects which parts are code that the user wants to be evaluated.\n- A **code evaluator**, which evaluates this code.\n- An **output renderer**, which writes the results of evaluation back to the document in a format which is interpretable by the raw output type. For instance, if the input file is an `.Rmd`, the output renderer marks up the output of code evaluation in `.md` format.\n\n\n::: {.cell layout-align=\"center\" preview='true'}\n::: {.cell-output-display}\n![Converting an Rmd file to many outputs using knitr and pandoc](https://d33wubrfki0l68.cloudfront.net/61d189fd9cdf955058415d3e1b28dd60e1bd7c9b/9791d/images/rmarkdownflow.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[[Source](https://rmarkdown.rstudio.com/authoring_quick_tour.html)\\]\n\nAs seen in the figure above, from there pandoc is used to convert e.g. a `.md` file into many other file formats, such as `.html`, etc.\n\nSo in summary:\n\n> \"R Markdown stands on the shoulders of knitr and Pandoc. The former executes the computer code embedded in Markdown, and converts R Markdown to Markdown. The latter renders Markdown to the output format you want (such as PDF, HTML, Word, and so on).\"\n\n\\[[Source](https://bookdown.org/yihui/rmarkdown/)\\]\n\n# Create and Knit Your First R Markdown Document\n\n\n\nWhen creating your first R Markdown document, in RStudio you can\n\n1. Go to File \\> New File \\> R Markdown...\n\n2. Feel free to edit the Title\n\n3. Make sure to select \"Default Output Format\" to be HTML\n\n4. Click \"OK\". 
RStudio creates the R Markdown document and places some boilerplate text in there just so you can see how things are set up.\n\n5. Click the \"Knit\" button (or go to File \\> Knit Document) to make sure you can create the HTML output\n\nIf you successfully knit your first R Markdown document, then congratulations!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Mission accomplished!](https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif){width=60%}\n:::\n:::\n\n\n# Websites and Books in R Markdown\n\nNow that you are on the road to using R Markdown documents, it is important to know about other wonderful things you can do with these documents. For example, let's say you have multiple `.Rmd` documents that you want to put together into a website, blog, book, etc.\n\nThere are primarily two ways to build multiple `.Rmd` documents together:\n\n1. [**blogdown**](https://bookdown.org/yihui/blogdown/) for building websites\n2. [**bookdown**](https://bookdown.org/yihui/bookdown/) for authoring books\n\nIn this section, we briefly introduce both packages, but it's worth mentioning that the [**rmarkdown** package also has a built-in site generator](https://bookdown.org/yihui/rmarkdown/rmarkdown-site.html) to build websites.\n\n### blogdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![blogdown logo](https://bookdown.org/yihui/blogdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/blogdown/images/logo.png)\\]\n\nThe `blogdown` R package is built on top of R Markdown and supports multi-page HTML output; you can write a blog post or a general page in an Rmd document or a plain Markdown document.\n\n- These source documents (e.g. `.Rmd` or `.md`) are built into a static website (i.e. 
a bunch of static HTML files, images and CSS files).\n- Using this folder of files, it is very easy to publish it to any web server as a website.\n- Also, it is easy to maintain because it is only a single folder.\n\n::: callout-tip\nFor example, my personal website was built in blogdown:\n\n- \n\nOther really great examples can be found here:\n\n- \n:::\n\nOther advantages include the content likely being reproducible, easier to maintain, and easy to convert pages to e.g. PDF or other formats in the future if you do not want to convert to HTML files.\n\nBecause it is based on the Markdown syntax, it is easy to write technical documents, including math equations, insert figures or tables with captions, cross-reference with figure or table numbers, add citations, and present theorems or proofs.\n\nHere's a video you can watch of someone making a blogdown website.\n\n

\n\n\n\n

\n\n\\[[Source](https://www.youtube.com/watch?v=AADnslLpzJ4) on YouTube\\]\n\n### bookdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![book logo](https://bookdown.org/yihui/bookdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nSimilar to `blogdown`, the `bookdown` R package is built on top of R Markdown, but also offers features like multi-page HTML output, numbering and cross-referencing figures/tables/sections/equations, inserting parts/appendices, and imported the GitBook style () to create elegant and appealing HTML book pages. Share\n\n::: callout-tip\nFor example, the previous version of this course was built in bookdown:\n\n- \n\nAnother example is the [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) book that the JHU Data Science Lab wrote. The github repo that contains all the `.Rmd` files can be found [here](https://github.com/jhudsl/tidyversecourse).\n\n- \n- \n:::\n\n**Note**: Even though the word \"book\" is in \"bookdown\", this package is not only for books. 
It really can be anything that consists of multiple `.Rmd` documents meant to be read in a linear sequence, such as course handouts, study notes, a software manual, a dissertation/thesis, or even a diary.\n\n- https://bookdown.org/yihui/rmarkdown/basics-examples.html#examples-books\n\n### distill\n\nThere is another great way to build blogs or websites using the [distill for R Markdown](https://rstudio.github.io/distill/).\n\n- \n\nDistill for R Markdown combines the technical authoring features of the [Distill web framework](https://github.com/distillpub/template) (optimized for scientific and technical communication) with [R Markdown](https://rmarkdown.rstudio.com), enabling a fully reproducible workflow based on literate programming [@knuth1984].\n\nDistill articles include:\n\n- Reader-friendly typography that adapts well to mobile devices.\n- Features essential to technical writing like LaTeX math, citations, and footnotes.\n- Flexible figure layout options (e.g. displaying figures at a larger width than the article text).\n- Attractively rendered tables with optional support for pagination.\n- Support for a wide variety of diagramming tools for illustrating concepts.\n- The ability to incorporate JavaScript and D3-based interactive visualizations.\n- A variety of ways to publish articles, including support for publishing sets of articles as a Distill website or as a Distill blog.\n\nThe course website from last year was built in Distill for R Markdown:\n\n- Website: \n- Github: \n\nSome other cool things about distill are the use of footnotes and asides.\n\nFor example [^1]. The number of the footnote will be automatically generated.\n\n[^1]: This will become a hover-able footnote\n\nYou can also optionally include notes in the gutter of the article (immediately to the right of the article text). To do this, use the `<aside>` tag.\n\n\n\nYou can also include figures in the gutter. 
Just enclose the code chunk which generates the figure in an aside tag.\n\n# Tips and tricks in R Markdown in RStudio\n\nHere are shortcuts and tips on efficiently using RStudio to improve how you write code.\n\n### Run code\n\nIf you want to run a code chunk:\n\n``` \ncommand + Enter on Mac\nCtrl + Enter on Windows\n```\n\n### Insert a comment in R and R Markdown\n\nTo insert a comment:\n\n``` \ncommand + Shift + C on Mac\nCtrl + Shift + C on Windows\n```\n\nThis shortcut can be used both for:\n\n- R code when you want to comment your code. It will add a `#` at the beginning of the line\n- for text in R Markdown. It will add `<!-- -->` around the text\n\nNote that if you want to comment more than one line, select all the lines you want to comment, then use the shortcut. If you want to uncomment a comment, apply the same shortcut.\n\n### Knit an R Markdown document\n\nYou can knit R Markdown documents by using this shortcut:\n\n``` \ncommand + Shift + K on Mac\nCtrl + Shift + K on Windows\n```\n\n### Code snippets\n\nA code snippet is usually a few characters long and is used as a shortcut to insert a common piece of code. You simply type a few characters, then press `Tab`, and it will expand them into a larger piece of code. `Tab` is then used again to navigate through the code where customization is required. For instance, if you type `fun` then press `Tab`, it will insert the skeleton code required to create a function:\n\n``` \nname <- function(variables) {\n \n}\n```\n\nPressing `Tab` again will jump through the placeholders for you to edit them. So you can first edit the name of the function, then the variables, and finally the code inside the function (try it yourself!).\n\nThere are many code snippets by default in RStudio. 
Here are the code snippets I use most often:\n\n- `lib` to call `library()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package)\n```\n:::\n\n\n- `mat` to create a matrix\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data, nrow = rows, ncol = cols)\n```\n:::\n\n\n- `if`, `el`, and `ei` to create conditional expressions such as `if() {}`, `else {}` and `else if () {}`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (condition) {\n \n}\n\nelse {\n \n}\n\nelse if (condition) {\n \n}\n```\n:::\n\n\n- `fun` to create a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname <- function(variables) {\n \n}\n```\n:::\n\n\n- `for` to create for loops\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (variable in vector) {\n \n}\n```\n:::\n\n\n- `ts` to insert a comment with the current date and time (useful if you have very long code and share it with others so they see when it has been edited)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Tue Jan 21 20:20:14 2020 ------------------------------\n```\n:::\n\n\nYou can see all default code snippets and add yours by clicking on Tools \\> Global Options... \\> Code (left sidebar) \\> Edit Snippets...\n\n### Ordered list in R Markdown\n\nIn R Markdown, when creating an ordered list such as this one:\n\n1. Item 1\n2. Item 2\n3. Item 3\n\nInstead of bothering with the numbers and typing\n\n``` \n1. Item 1\n2. Item 2\n3. Item 3\n```\n\nyou can simply type\n\n``` \n1. Item 1\n1. Item 2\n1. Item 3\n```\n\nfor the exact same result (try it yourself or check the code of this article!). This way you do not need to bother which number is next when creating a new item.\n\nTo go even further, any numeric will actually render the same result as long as the first item is the number you want to start from. For example, you could type:\n\n``` \n1. Item 1\n7. Item 2\n3. Item 3\n```\n\nwhich renders\n\n1. Item 1\n2. Item 2\n3. 
Item 3\n\nHowever, I suggest using the number you want to start from (e.g. `1.`) for all items, because if you move an item with a different number to the top, the list will start from that new number. For instance, if we move `7. Item 2` from the previous list to the top, the list becomes:\n\n``` \n7. Item 2\n1. Item 1\n3. Item 3\n```\n\nwhich incorrectly renders\n\n7. Item 2\n8. Item 1\n9. Item 3\n\n### New code chunk in R Markdown\n\nWhen editing R Markdown documents, you will need to insert a new R code chunk many times. The following shortcuts will make your life easier:\n\n``` \ncommand + option + I on Mac (or command + alt + I depending on your keyboard)\nCtrl + ALT + I on Windows\n```\n\n### Reformat code\n\nClear and readable code is always easier and faster to read (and looks more professional when you share it with collaborators). To automatically apply the most common coding guidelines, such as white spaces, indents, etc., use:\n\n``` \ncmd + Shift + A on Mac\nCtrl + Shift + A on Windows\n```\n\nSo, for example, the following code, which does not respect the guidelines (and which is not easy to read):\n\n``` \n1+1\n for(i in 1:10){if(!i%%2){next}\nprint(i)\n }\n```\n\nbecomes much more neat and readable:\n\n``` \n1 + 1\nfor (i in 1:10) {\n if (!i %% 2) {\n next\n }\n print(i)\n}\n```\n\n### RStudio addins\n\nRStudio addins are extensions which provide a simple mechanism for executing advanced R functions from within RStudio. In simpler words, when executing an addin (by clicking a button in the Addins menu), the corresponding code is executed without you having to write the code. 
RStudio addins have the advantage that they allow you to execute complex and advanced code much more easily than if you had to write it yourself.\n\n::: callout-tip\n**For more information about RStudio addins, check out**:\n\n- \n- \n:::\n\n### Others\n\nSimilar to many other programs, you can also use:\n\n- `command + Shift + N` on Mac and `Ctrl + Shift + N` on Windows to open a new R Script\n- `command + S` on Mac and `Ctrl + S` on Windows to save your current script or R Markdown document\n\nCheck out Tools --\\> Keyboard Shortcuts Help to see a long list of these shortcuts.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: questions\n### Questions\n\n1. What is literate programming?\n\n2. What was the first literate statistical programming tool to weave together a statistical language (R) with a markup language (LaTeX)?\n\n3. What is `knitr` and how is it different from other literate statistical programming tools?\n\n4. 
Where can you find a list of other commands that help make your code writing more efficient in RStudio?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [RMarkdown Tips and Tricks](https://indrajeetpatil.github.io/RmarkdownTips/) by Indrajeet Patil\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"05 - Literate Statistical Programming\"\nauthor:\n - 
name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to literate statistical programming tools including R Markdown\"\nimage: https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/05-literate-programming/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define literate programming\n- Recognize differences between available tools for literate programming\n- Know how to work efficiently within RStudio for literate programming\n- Create an R Markdown document\n:::\n\n# Introduction\n\nOne basic idea to make writing reproducible reports easier is what's known as *literate statistical programming* (or sometimes called [literate statistical practice](http://www.r-project.org/conferences/DSC-2001/Proceedings/Rossini.pdf)). 
This comes from the idea of [literate programming](https://en.wikipedia.org/wiki/Literate_programming) in the area of writing computer programs.\n\nThe idea is to **think of a report or a publication as a stream of text and code**.\n\n- The text is readable by people and the code is readable by computers.\n\n- The analysis is described in a series of text and code chunks.\n\n- Each code chunk will do something like load some data or compute some results.\n\n- Each text chunk will relay something in a human readable language.\n\nThere might also be **presentation code** that formats tables and figures, and there's article text that explains what's going on around all this code. This stream of text and code is a literate statistical program or a literate statistical analysis.\n\n### Weaving and Tangling\n\nLiterate programs by themselves are a bit difficult to work with, but they can be processed in two important ways.\n\nLiterate programs can be **weaved** to produce human readable documents like PDFs or HTML web pages, and they can be **tangled** to produce machine-readable \"documents\", or in other words, machine readable code.\n\nThe basic idea behind literate programming is that, in order to generate the different kinds of output you might need, **you only need a single source document**---you can weave and tangle to get the rest.\n\nIn order to use a system like this, you need a documentation language that's human readable, and you need a programming language that's machine readable (or can be compiled/interpreted into something that's machine readable).\n\n### Sweave\n\nOne of the original literate programming systems in R that was designed to do this was called Sweave. 
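\n\nSweave implements both operations described above: weaving an `.Rnw` source file into a `.tex` document, and tangling it into a plain R script. A minimal sketch using the `Sweave()` and `Stangle()` functions that ship with base R (in the `utils` package), assuming a literate source file named `analysis.Rnw` exists in your working directory (the file name here is just an example):\n\n``` \n## Weave: run the R chunks and write analysis.tex\nutils::Sweave(\"analysis.Rnw\")\n\n## Tangle: extract only the R code and write analysis.R\nutils::Stangle(\"analysis.Rnw\")\n```\n\n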
Sweave enables users to combine R code with a documentation language called LaTeX.\n\n**Sweave files end in `.Rnw`** and have R code weaved through the document:\n\n``` \n<<>>=\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n@\n```\n\nOnce you have created your `.Rnw` file, Sweave will process the file, executing the R chunks and replacing them with output as appropriate before creating the PDF document.\n\nIt was originally developed by Friedrich Leisch, who is a member of R Core, and the code base is still maintained by R Core. The Sweave system comes with any installation of R.\n\nThere are many limitations to the original Sweave system.\n\n- One of the limitations is that it is **focused primarily on LaTeX**, which is not a documentation language that many people are familiar with.\n- Therefore, it **can be difficult to learn this type of markup language** if you're not already in a field that uses it regularly.\n- Sweave also **lacks a lot of features that people find useful** like caching, multiple plots per page, and mixing programming languages.\n\nInstead, folks have **moved towards using something called knitr**, which offers everything Sweave does and extends it further.\n\n- With Sweave, additional tools are required for advanced operations, whereas knitr supports more internally. We'll discuss knitr below.\n\n### rmarkdown\n\nAnother choice for literate programming is to build documents based on the [Markdown](https://en.wikipedia.org/wiki/Markdown) language. A markdown file is a plain text file that is typically given the extension `.md`. The [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) R package takes an R Markdown file (`.Rmd`) and weaves together R code chunks like this:\n\n```` \n```{r plot1, height=4, width=5, eval=FALSE, echo=TRUE}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n````\n\n::: callout-tip\nThe best resource for learning about R Markdown is this book by Yihui Xie, J. J. 
Allaire, and Garrett Grolemund:\n\n- \n\nThe R Markdown Cookbook by Yihui Xie, Christophe Dervieux, and Emily Riederer is really good too:\n\n- \n\nThe authors of the second book describe its motivation as:\n\n> \"However, we have received comments from our readers and publisher that it would be beneficial to provide more practical and relatively short examples to show the interesting and useful usage of R Markdown, because it can be daunting to find out how to achieve a certain task from the aforementioned reference book (put another way, that book is too dry to read). As a result, this cookbook was born.\"\n:::\n\nBecause this lecture is built in a `.qmd` file (which is very similar to a `.Rmd` file), let's demonstrate how this works. I am going to change `eval=FALSE` to `eval=TRUE`.\n\n\n::: {.cell height='4' width='5'}\n\n```{.r .cell-code}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/plot2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Questions\n\n1. Why do we not see the back ticks \\`\\`\\` anymore in the code chunk above that made the plot?\n2. What do you think we should do if we want to have the code executed, but we want to hide the code that made it?\n:::\n\nBefore we leave this section, I find that there is quite a bit of confusing terminology behind the magic of `rmarkdown`, so let's break it down:\n\n- [Pandoc](https://pandoc.org). Pandoc is a command line tool with no GUI that converts documents (e.g. from a number of different markup formats to many other formats, such as .doc, .pdf, etc.). It is completely independent from R (but does come bundled with RStudio).\n- [Markdown](https://en.wikipedia.org/wiki/Markdown) (**markup language**). Markdown is a lightweight [markup language](https://en.wikipedia.org/wiki/Markup_language) with plain text formatting syntax designed so that it can be converted to HTML and many other formats. 
A markdown file is a plain text file that is typically given the extension `.md`. It is completely independent from R.\n- [`markdown`](https://CRAN.R-project.org/package=markdown) (**R package**). `markdown` is an R package which converts `.md` files into HTML. It is no longer recommended for use, as it has been surpassed by [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (discussed below).\n- R Markdown (**markup language**). R Markdown is an extension of the markdown syntax. R Markdown files are plain text files that typically have the file extension `.Rmd`.\n- [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (**R package**). The R package `rmarkdown` is a library that uses pandoc to process and convert `.Rmd` files into a number of different formats. Its core function is `rmarkdown::render()`. **Note**: this package only deals with the markdown language. If the input file is e.g. `.Rhtml` or `.Rnw`, then you need to use `knitr` prior to calling pandoc (see below).\n\n::: callout-tip\nCheck out the R Markdown Quick Tour for more:\n\n- \n:::\n\n![Artwork by Allison Horst on RMarkdown](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png){width=\"80%\"}\n\n### knitr\n\nOne of the alternatives that has come up in recent times is `knitr`.\n\n- The `knitr` package for R takes a lot of these ideas of literate programming and updates and improves upon them.\n- `knitr` still uses R as its programming language, but it allows you to mix other programming languages in.\n- You can also use a variety of documentation languages now, such as LaTeX, markdown and HTML.\n- `knitr` was developed by Yihui Xie while he was a graduate student at Iowa State and it has become a very popular package for writing literate statistical programs.\n\nKnitr takes a plain text document with embedded code, executes the code and 'knits' the results back into the document.\n\nFor example, it converts\n\n- An R 
Markdown (`.Rmd`) file into a standard markdown file (`.md`)\n- An `.Rnw` (Sweave) file into `.tex` format.\n- An `.Rhtml` file into `.html`.\n\nThe core function is `knitr::knit()` and by default this will look at the input document and try to guess what type it is, e.g. `Rnw`, `Rmd`, etc.\n\nThis core function performs three roles:\n\n- A **source parser**, which looks at the input document and detects which parts are code that the user wants to be evaluated.\n- A **code evaluator**, which evaluates this code.\n- An **output renderer**, which writes the results of evaluation back to the document in a format which is interpretable by the raw output type. For instance, if the input file is an `.Rmd`, the output renderer marks up the output of code evaluation in `.md` format.\n\n\n::: {.cell layout-align=\"center\" preview='true'}\n::: {.cell-output-display}\n![Converting an Rmd file to many outputs using knitr and pandoc](https://d33wubrfki0l68.cloudfront.net/61d189fd9cdf955058415d3e1b28dd60e1bd7c9b/9791d/images/rmarkdownflow.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[[Source](https://rmarkdown.rstudio.com/authoring_quick_tour.html)\\]\n\nAs seen in the figure above, from there pandoc is used to convert, e.g., a `.md` file into many other file formats, such as `.html`, etc.\n\nSo in summary:\n\n> \"R Markdown stands on the shoulders of knitr and Pandoc. The former executes the computer code embedded in Markdown, and converts R Markdown to Markdown. The latter renders Markdown to the output format you want (such as PDF, HTML, Word, and so on).\"\n\n\\[[Source](https://bookdown.org/yihui/rmarkdown/)\\]\n\n# Create and Knit Your First R Markdown Document\n\n\n\nWhen creating your first R Markdown document, in RStudio you can\n\n1. Go to File \\> New File \\> R Markdown...\n\n2. Feel free to edit the Title\n\n3. Make sure to select \"Default Output Format\" to be HTML\n\n4. Click \"OK\". 
RStudio creates the R Markdown document and places some boilerplate text in there just so you can see how things are set up.\n\n5. Click the \"Knit\" button (or go to File \\> Knit Document) to make sure you can create the HTML output\n\nIf you successfully knit your first R Markdown document, then congratulations!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Mission accomplished!](https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif){width=60%}\n:::\n:::\n\n\n# Websites and Books in R Markdown\n\nNow that you are on the road to using R Markdown documents, it is important to know about other wonderful things you can do with these documents. For example, let's say you have multiple `.Rmd` documents that you want to put together into a website, blog, book, etc.\n\nThere are primarily two ways to build multiple `.Rmd` documents together:\n\n1. [**blogdown**](https://bookdown.org/yihui/blogdown/) for building websites\n2. [**bookdown**](https://bookdown.org/yihui/bookdown/) for authoring books\n\nIn this section, we briefly introduce both packages, but it's worth mentioning that the [**rmarkdown** package also has a built-in site generator](https://bookdown.org/yihui/rmarkdown/rmarkdown-site.html) to build websites.\n\n### blogdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![blogdown logo](https://bookdown.org/yihui/blogdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nThe `blogdown` R package is built on top of R Markdown and supports multi-page HTML output, letting you write a blog post or a general page as an `.Rmd` document or a plain Markdown document.\n\n- These source documents (e.g. `.Rmd` or `.md`) are built into a static website (i.e. 
a bunch of static HTML files, images and CSS files).\n- Using this folder of files, it is very easy to publish it to any web server as a website.\n- Also, it is easy to maintain because it is only a single folder.\n\n::: callout-tip\nFor example, my personal website was built in blogdown:\n\n- \n\nOther really great examples can be found here:\n\n- \n:::\n\nOther advantages include the content likely being reproducible, easier to maintain, and easy to convert pages to e.g. PDF or other formats in the future if you do not want to convert to HTML files.\n\nBecause it is based on the Markdown syntax, it is easy to write technical documents: you can include math equations, insert figures or tables with captions, cross-reference by figure or table numbers, add citations, and present theorems or proofs.\n\nHere's a video you can watch of someone making a blogdown website.\n\n

\n\n\n\n

\n\n\\[[Source](https://www.youtube.com/watch?v=AADnslLpzJ4) on YouTube\\]\n\n### bookdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![book logo](https://bookdown.org/yihui/bookdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nSimilar to `blogdown`, the `bookdown` R package is built on top of R Markdown, but also offers features like multi-page HTML output, numbering and cross-referencing figures/tables/sections/equations, inserting parts/appendices, and importing the GitBook style () to create elegant and appealing HTML book pages.\n\n::: callout-tip\nFor example, the previous version of this course was built in bookdown:\n\n- \n\nAnother example is the [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) book that the JHU Data Science Lab wrote. The GitHub repo that contains all the `.Rmd` files can be found [here](https://github.com/jhudsl/tidyversecourse).\n\n- \n- \n:::\n\n**Note**: Even though the word \"book\" is in \"bookdown\", this package is not only for books. 
It really can be anything that consists of multiple `.Rmd` documents meant to be read in a linear sequence, such as course handouts, study notes, a software manual, a thesis or dissertation, or even a diary.\n\n- https://bookdown.org/yihui/rmarkdown/basics-examples.html#examples-books\n\n### distill\n\nThere is another great way to build blogs or websites using [distill for R Markdown](https://rstudio.github.io/distill/).\n\n- \n\nDistill for R Markdown combines the technical authoring features of the [Distill web framework](https://github.com/distillpub/template) (optimized for scientific and technical communication) with [R Markdown](https://rmarkdown.rstudio.com), enabling a fully reproducible workflow based on literate programming [@knuth1984].\n\nDistill articles include:\n\n- Reader-friendly typography that adapts well to mobile devices.\n- Features essential to technical writing like LaTeX math, citations, and footnotes.\n- Flexible figure layout options (e.g. displaying figures at a larger width than the article text).\n- Attractively rendered tables with optional support for pagination.\n- Support for a wide variety of diagramming tools for illustrating concepts.\n- The ability to incorporate JavaScript and D3-based interactive visualizations.\n- A variety of ways to publish articles, including support for publishing sets of articles as a Distill website or as a Distill blog.\n\nThe course website from last year was built in Distill for R Markdown:\n\n- Website: \n- Github: \n\nSome other cool things about distill are the use of footnotes and asides.\n\nFor example [^1]. The number of the footnote will be automatically generated.\n\n[^1]: This will become a hover-able footnote\n\nYou can also optionally include notes in the gutter of the article (immediately to the right of the article text). To do this, use the aside tag.\n\n\n\nYou can also include figures in the gutter. 
Just enclose the code chunk which generates the figure in an aside tag.\n\n# Tips and tricks in R Markdown in RStudio\n\nHere are shortcuts and tips on efficiently using RStudio to improve how you write code.\n\n### Run code\n\nIf you want to run a code chunk:\n\n``` \ncommand + Enter on Mac\nCtrl + Enter on Windows\n```\n\n### Insert a comment in R and R Markdown\n\nTo insert a comment:\n\n``` \ncommand + Shift + C on Mac\nCtrl + Shift + C on Windows\n```\n\nThis shortcut can be used both for:\n\n- R code when you want to comment your code. It will add a `#` at the beginning of the line\n- for text in R Markdown. It will add `<!-- -->` around the text\n\nNote that if you want to comment more than one line, select all the lines you want to comment, then use the shortcut. If you want to uncomment a comment, apply the same shortcut.\n\n### Knit an R Markdown document\n\nYou can knit R Markdown documents by using this shortcut:\n\n``` \ncommand + Shift + K on Mac\nCtrl + Shift + K on Windows\n```\n\n### Code snippets\n\nA code snippet is usually a few characters long and is used as a shortcut to insert a common piece of code. You simply type a few characters, then press `Tab`, and it will expand them into a larger piece of code. `Tab` is then used again to navigate through the code where customization is required. For instance, if you type `fun` then press `Tab`, it will insert the skeleton code required to create a function:\n\n``` \nname <- function(variables) {\n \n}\n```\n\nPressing `Tab` again will jump through the placeholders for you to edit them. So you can first edit the name of the function, then the variables, and finally the code inside the function (try it yourself!).\n\nThere are many code snippets by default in RStudio. 
Here are the code snippets I use most often:\n\n- `lib` to call `library()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package)\n```\n:::\n\n\n- `mat` to create a matrix\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data, nrow = rows, ncol = cols)\n```\n:::\n\n\n- `if`, `el`, and `ei` to create conditional expressions such as `if() {}`, `else {}` and `else if () {}`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (condition) {\n ## Case 1\n} else if (condition) {\n ## Case 2\n} else if (condition) {\n ## Case 3\n}\n```\n:::\n\n\n- `fun` to create a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname <- function(variables) {\n\n}\n```\n:::\n\n\n- `for` to create for loops\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (variable in vector) {\n\n}\n```\n:::\n\n\n- `ts` to insert a comment with the current date and time (useful if you have very long code and share it with others so they see when it has been edited)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Tue Jan 21 20:20:14 2020 ------------------------------\n```\n:::\n\n\nYou can see all default code snippets and add yours by clicking on Tools \\> Global Options... \\> Code (left sidebar) \\> Edit Snippets...\n\n### Ordered list in R Markdown\n\nIn R Markdown, when creating an ordered list such as this one:\n\n1. Item 1\n2. Item 2\n3. Item 3\n\nInstead of bothering with the numbers and typing\n\n``` \n1. Item 1\n2. Item 2\n3. Item 3\n```\n\nyou can simply type\n\n``` \n1. Item 1\n1. Item 2\n1. Item 3\n```\n\nfor the exact same result (try it yourself or check the code of this article!). This way you do not need to bother which number is next when creating a new item.\n\nTo go even further, any numeric will actually render the same result as long as the first item is the number you want to start from. For example, you could type:\n\n``` \n1. Item 1\n7. Item 2\n3. Item 3\n```\n\nwhich renders\n\n1. Item 1\n2. Item 2\n3. 
Item 3\n\nHowever, I suggest using the number you want to start from (e.g. `1.`) for all items, because if you move an item with a different number to the top, the list will start from that new number. For instance, if we move `7. Item 2` from the previous list to the top, the list becomes:\n\n``` \n7. Item 2\n1. Item 1\n3. Item 3\n```\n\nwhich incorrectly renders\n\n7. Item 2\n8. Item 1\n9. Item 3\n\n### New code chunk in R Markdown\n\nWhen editing R Markdown documents, you will need to insert a new R code chunk many times. The following shortcuts will make your life easier:\n\n``` \ncommand + option + I on Mac (or command + alt + I depending on your keyboard)\nCtrl + ALT + I on Windows\n```\n\n### Reformat code\n\nClear and readable code is always easier and faster to read (and looks more professional when you share it with collaborators). To automatically apply the most common coding guidelines, such as white spaces, indents, etc., use:\n\n``` \ncmd + Shift + A on Mac\nCtrl + Shift + A on Windows\n```\n\nSo, for example, the following code, which does not respect the guidelines (and which is not easy to read):\n\n``` \n1+1\n for(i in 1:10){if(!i%%2){next}\nprint(i)\n }\n```\n\nbecomes much more neat and readable:\n\n``` \n1 + 1\nfor (i in 1:10) {\n if (!i %% 2) {\n next\n }\n print(i)\n}\n```\n\n### RStudio addins\n\nRStudio addins are extensions which provide a simple mechanism for executing advanced R functions from within RStudio. In simpler words, when executing an addin (by clicking a button in the Addins menu), the corresponding code is executed without you having to write the code. 
RStudio addins have the advantage that they allow you to execute complex and advanced code much more easily than if you had to write it yourself.\n\n::: callout-tip\n**For more information about RStudio addins, check out**:\n\n- \n- \n:::\n\n### Others\n\nSimilar to many other programs, you can also use:\n\n- `command + Shift + N` on Mac and `Ctrl + Shift + N` on Windows to open a new R Script\n- `command + S` on Mac and `Ctrl + S` on Windows to save your current script or R Markdown document\n\nCheck out Tools --\\> Keyboard Shortcuts Help to see a long list of these shortcuts.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: questions\n### Questions\n\n1. What is literate programming?\n\n2. What was the first literate statistical programming tool to weave together a statistical language (R) with a markup language (LaTeX)?\n\n3. What is `knitr` and how is it different from other literate statistical programming tools?\n\n4. 
Where can you find a list of other commands that help make your code writing more efficient in RStudio?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [RMarkdown Tips and Tricks](https://indrajeetpatil.github.io/RmarkdownTips/) by Indrajeet Patil\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git 
a/_freeze/posts/06-reference-management/index/execute-results/html.json b/_freeze/posts/06-reference-management/index/execute-results/html.json index 60fbf6e..bf3f70d 100644 --- a/_freeze/posts/06-reference-management/index/execute-results/html.json +++ b/_freeze/posts/06-reference-management/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "16b0cad09428b521aecb026163f42472", + "hash": "7d57048944742a579c008b905e80a842", "result": { - "markdown": "---\ntitle: \"06 - Reference management\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to use citations and incorporate references from a bibliography in R Markdown.\"\nimage: https://www.bibtex.com/img/bibtex-format-700x402.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/06-reference-management/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. Authoring in [R Markdown from RStudio](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)\n2. Citations from [Reproducible Research in R](https://monashdatafluency.github.io/r-rep-res/citations.html) from the [Monash Data Fluency](https://monashdatafluency.github.io) initiative\n3. 
Bibliography from [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/bibliography.html)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know what types of bibliography file formats can be used in a R Markdown file\n- Learn how to add citations to a R Markdown file\n- Know how to change the citation style (e.g. APA, Chicago, etc)\n:::\n\n# Introduction\n\nFor almost any data analysis, especially if it is meant for publication in the academic literature, you will have to cite other people's work and include the references (bibliographies or citations) in your work. In this class, you are likely to need to include references and cite other people's work like in a regular research paper.\n\nR provides nice function `citation()` that helps us generating citation blob for R packages that we have used. Let's try generating citation text for rmarkdown package by using the following command\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncitation(\"rmarkdown\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTo cite package 'rmarkdown' in publications use:\n\n Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K,\n Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). _rmarkdown:\n Dynamic Documents for R_. R package version 2.24,\n .\n\n Xie Y, Allaire J, Grolemund G (2018). _R Markdown: The Definitive\n Guide_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN\n 9781138359338, .\n\n Xie Y, Dervieux C, Riederer E (2020). _R Markdown Cookbook_. Chapman\n and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837,\n .\n\nTo see these entries in BibTeX format, use 'print(,\nbibtex=TRUE)', 'toBibtex(.)', or set\n'options(citation.bibtex.max=999)'.\n```\n:::\n:::\n\n\nI assume you are familiar with how citing references works, and hopefully, you are already using a reference manager. 
If not, let me know in the discussion boards.\n\nTo have something that plays well with R Markdown, you need file format that stores all the references. Click here to learn more other possible file formats available to you to use within a R Markdown file:\n\n- \n\n### Citation management software\n\nAs you can see, there are ton of file formats including `.medline` (MEDLINE), `.bib` (BibTeX), `.ris` (RIS), `.enl` (EndNote).\n\nI will not discuss underlying citational management software itself, but I will talk briefly how you might create one of these file formats.\n\nIf you recall the output from `citation(\"rmarkdown\")` above, we might consider manually copying and pasting the output into a citation management software, but instead we can use `write_bib()` function from `knitr` package to create a bibliography file ending in `.bib`.\n\nLet's run the following code in order to generate a `my-refs.bib` file\n\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::write_bib(\"rmarkdown\", file = \"my-refs.bib\")\n```\n:::\n\n\nNow we can see we have the file saved locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\" \"index.rmarkdown\" \"my-refs.bib\" \n```\n:::\n:::\n\n\nIf you open up the `my-refs.bib` file, you will see\n\n``` \n@Manual{R-rmarkdown,\n title = {rmarkdown: Dynamic Documents for R},\n author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},\n year = {2021},\n note = {R package version 2.8},\n url = {https://CRAN.R-project.org/package=rmarkdown},\n}\n\n@Book{rmarkdown2018,\n title = {R Markdown: The Definitive Guide},\n author = {Yihui Xie and J.J. 
Allaire and Garrett Grolemund},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2018},\n note = {ISBN 9781138359338},\n url = {https://bookdown.org/yihui/rmarkdown},\n}\n\n@Book{rmarkdown2020,\n title = {R Markdown Cookbook},\n author = {Yihui Xie and Christophe Dervieux and Emily Riederer},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2020},\n note = {ISBN 9780367563837},\n url = {https://bookdown.org/yihui/rmarkdown-cookbook},\n}\n```\n\n::: resources\n**Note there are three keys that we will use later on**:\n\n- `R-rmarkdown`\n- `rmarkdown2018`\n- `rmarkdown2020`\n:::\n\n### Linking `.bib` file with `.rmd` (and `.qmd`) files\n\nIn order to use references within a R Markdown file, you will need to specify the name and a location of a bibliography file using the bibliography metadata field in a YAML metadata section. For example:\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\n---\n```\n\nYou can include multiple reference files using the following syntax, alternatively you can concatenate two bib files into one.\n\n``` yaml\n---\nbibliography: [\"my-refs1.bib\", \"my-refs2.bib\"]\n---\n```\n\n### Inline citation\n\nNow we can start using those bib keys that we have learned just before, using the following syntax\n\n- `[@key]` for single citation\n- `[@key1; @key2]` multiple citation can be separated by semi-colon\n- `[-@key]` in order to suppress author name, and just display the year\n- `[see @key1 p 12; also this ref @key2]` is also a valid syntax\n\nLet's start by citing the `rmarkdown` package using the following code and press `Knit` button:\n\n------------------------------------------------------------------------\n\nI have been using the amazing Rmarkdown package [@R-rmarkdown]! 
I should also go and read [@rmarkdown2018; and @rmarkdown2020] books.\n\n------------------------------------------------------------------------\n\nPretty cool, eh??\n\n### Citation styles\n\nBy default, Pandoc will use a Chicago author-date format for citations and references.\n\nTo use another style, you will need to specify a CSL (Citation Style Language) file in the `csl` metadata field, e.g.,\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\ncsl: biomed-central.csl\n---\n```\n\n::: resources\nTo find your required formats, we recommend using the [Zotero Style Repository](https://www.zotero.org/styles), which makes it easy to search for and download your desired style.\n:::\n\nCSL files can be tweaked to meet custom formatting requirements. For example, we can change the number of authors required before \"et al.\" is used to abbreviate them. This can be simplified through the use of visual editors such as the one available at https://editor.citationstyles.org.\n\n### Other cool features\n\n#### Add an item to a bibliography without using it\n\nBy default, the bibliography will only display items that are directly referenced in the document. If you want to include items in the bibliography without actually citing them in the body text, you can define a dummy nocite metadata field and put the citations there.\n\n``` yaml\n---\nnocite: |\n @item1, @item2\n---\n```\n\n#### Add all items to the bibliography\n\nIf we do not wish to explicitly state all of the items within the bibliography but would still like to show them in our references, we can use the following syntax:\n\n``` yaml\n---\nnocite: '@*'\n---\n```\n\nThis will force all items to be displayed in the bibliography.\n\n::: resources\nYou can also have an appendix appear after bibliography. For more on this, see:\n\n- \n:::\n\n# Other useful tips\n\nWe have learned that inside your file that contains all your references (e.g. 
`my-refs.bib`), typically each reference gets a key, which is a shorthand that is generated by the reference manager or you can create yourself.\n\nFor instance, I use a format of lower-case first author last name followed by 4 digit year for each reference followed by a keyword (e.g name of a software package). Alternatively, you can omit the keyword. But note that if I cite a paper by the same first author that was published in the same year, then a lower case letter is added to the end. For instance, for a paper that I wrote as 1st author in 2010, my bibtex key might be `hicks2022` or `hicks2022a`. You can decide what scheme to use, just pick one and use it *forever*.\n\nIn your R Markdown document, you can then cite the reference by adding the key, such as `...in the paper by Hicks et al. [@hicks2022]...`.\n\n# Post-lecture materials\n\n### Practice\n\nHere are some post-lecture tasks to practice some of the material discussed.\n\n::: callout-note\n### Questions\n\n**Try out the following:**\n\n1. What do you notice that's different when you run `citation(\"tidyverse\")` (compared to `citation(\"rmarkdown\")`)?\n\n2. Install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(c(\"bibtex\", \"RefManageR\")\n```\n:::\n\n\nWhat do they do? How might they be helpful to you in terms of reference management?\n\n3. Instead of using a `.bib` file, try using a different bibliography file format in an R Markdown document.\n\n4. 
Practice using a different CSL file to change the citation style.\n:::\n\n### Additional Resources\n\n::: callout-tip\n- Add here.\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n\n:::\n\n\n\\[Add here.\\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"06 - Reference management\"\nauthor:\n - name: Leonardo Collado Torres\n url: 
http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to use citations and incorporate references from a bibliography in R Markdown.\"\nimage: https://www.bibtex.com/img/bibtex-format-700x402.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/06-reference-management/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. Authoring in [R Markdown from RStudio](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)\n2. Citations from [Reproducible Research in R](https://monashdatafluency.github.io/r-rep-res/citations.html) from the [Monash Data Fluency](https://monashdatafluency.github.io) initiative\n3. Bibliography from [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/bibliography.html)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know what types of bibliography file formats can be used in an R Markdown file\n- Learn how to add citations to an R Markdown file\n- Know how to change the citation style (e.g. 
APA, Chicago, etc)\n:::\n\n# Introduction\n\nFor almost any data analysis, especially if it is meant for publication in the academic literature, you will have to cite other people's work and include the references (bibliographies or citations) in your work. In this class, you are likely to need to include references and cite other people's work like in a regular research paper.\n\nR provides a nice function, `citation()`, that helps us generate a citation blob for the R packages that we have used. Let's try generating the citation text for the `rmarkdown` package by using the following command\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncitation(\"rmarkdown\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTo cite package 'rmarkdown' in publications use:\n\n Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K,\n Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). _rmarkdown:\n Dynamic Documents for R_. R package version 2.24,\n .\n\n Xie Y, Allaire J, Grolemund G (2018). _R Markdown: The Definitive\n Guide_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN\n 9781138359338, .\n\n Xie Y, Dervieux C, Riederer E (2020). _R Markdown Cookbook_. Chapman\n and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837,\n .\n\nTo see these entries in BibTeX format, use 'print(,\nbibtex=TRUE)', 'toBibtex(.)', or set\n'options(citation.bibtex.max=999)'.\n```\n:::\n:::\n\n\nI assume you are familiar with how citing references works, and hopefully, you are already using a reference manager. If not, let me know in the discussion boards.\n\nTo have something that plays well with R Markdown, you need a file format that stores all the references. 
Click here to learn more about other possible file formats available to you for use within an R Markdown file:\n\n- \n\n### Citation management software\n\nAs you can see, there are a ton of file formats, including `.medline` (MEDLINE), `.bib` (BibTeX), `.ris` (RIS), `.enl` (EndNote).\n\nI will not discuss the underlying citation management software itself, but I will briefly talk about how you might create one of these file formats.\n\nIf you recall the output from `citation(\"rmarkdown\")` above, we might consider manually copying and pasting the output into citation management software, but instead we can use the `write_bib()` function from the `knitr` package to create a bibliography file ending in `.bib`.\n\nLet's run the following code in order to generate a `my-refs.bib` file\n\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::write_bib(\"rmarkdown\", file = \"my-refs.bib\")\n```\n:::\n\n\nNow we can see we have the file saved locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\"       \"index.rmarkdown\" \"my-refs.bib\"    \n```\n:::\n:::\n\n\nIf you open up the `my-refs.bib` file, you will see\n\n``` \n@Manual{R-rmarkdown,\n title = {rmarkdown: Dynamic Documents for R},\n author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},\n year = {2021},\n note = {R package version 2.8},\n url = {https://CRAN.R-project.org/package=rmarkdown},\n}\n\n@Book{rmarkdown2018,\n title = {R Markdown: The Definitive Guide},\n author = {Yihui Xie and J.J. 
Allaire and Garrett Grolemund},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2018},\n note = {ISBN 9781138359338},\n url = {https://bookdown.org/yihui/rmarkdown},\n}\n\n@Book{rmarkdown2020,\n title = {R Markdown Cookbook},\n author = {Yihui Xie and Christophe Dervieux and Emily Riederer},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2020},\n note = {ISBN 9780367563837},\n url = {https://bookdown.org/yihui/rmarkdown-cookbook},\n}\n```\n\n::: resources\n**Note there are three keys that we will use later on**:\n\n- `R-rmarkdown`\n- `rmarkdown2018`\n- `rmarkdown2020`\n:::\n\n### Linking `.bib` file with `.rmd` (and `.qmd`) files\n\nIn order to use references within an R Markdown file, you will need to specify the name and location of a bibliography file using the bibliography metadata field in a YAML metadata section. For example:\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\n---\n```\n\nYou can include multiple reference files using the following syntax; alternatively, you can concatenate two `.bib` files into one.\n\n``` yaml\n---\nbibliography: [\"my-refs1.bib\", \"my-refs2.bib\"]\n---\n```\n\n### Inline citation\n\nNow we can start using those bib keys that we just created, with the following syntax\n\n- `[@key]` for a single citation\n- `[@key1; @key2]` for multiple citations, separated by semicolons\n- `[-@key]` to suppress the author name and just display the year\n- `[see @key1 p 12; also this ref @key2]` is also a valid syntax\n\nLet's start by citing the `rmarkdown` package using the following code and pressing the `Knit` button:\n\n------------------------------------------------------------------------\n\nI have been using the amazing Rmarkdown package [@R-rmarkdown]! 
I should also go and read [@rmarkdown2018; and @rmarkdown2020] books.\n\n------------------------------------------------------------------------\n\nPretty cool, eh??\n\n### Citation styles\n\nBy default, Pandoc will use a Chicago author-date format for citations and references.\n\nTo use another style, you will need to specify a CSL (Citation Style Language) file in the `csl` metadata field, e.g.,\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\ncsl: biomed-central.csl\n---\n```\n\n::: resources\nTo find your required formats, we recommend using the [Zotero Style Repository](https://www.zotero.org/styles), which makes it easy to search for and download your desired style.\n:::\n\nCSL files can be tweaked to meet custom formatting requirements. For example, we can change the number of authors required before \"et al.\" is used to abbreviate them. This can be simplified through the use of visual editors such as the one available at https://editor.citationstyles.org.\n\n### Other cool features\n\n#### Add an item to a bibliography without using it\n\nBy default, the bibliography will only display items that are directly referenced in the document. If you want to include items in the bibliography without actually citing them in the body text, you can define a dummy nocite metadata field and put the citations there.\n\n``` yaml\n---\nnocite: |\n @item1, @item2\n---\n```\n\n#### Add all items to the bibliography\n\nIf we do not wish to explicitly state all of the items within the bibliography but would still like to show them in our references, we can use the following syntax:\n\n``` yaml\n---\nnocite: '@*'\n---\n```\n\nThis will force all items to be displayed in the bibliography.\n\n::: resources\nYou can also have an appendix appear after bibliography. For more on this, see:\n\n- \n:::\n\n# Other useful tips\n\nWe have learned that inside your file that contains all your references (e.g. 
`my-refs.bib`), typically each reference gets a key, which is a shorthand that is generated by the reference manager or that you can create yourself.\n\nFor instance, I use a format of the lower-case first-author last name followed by the 4-digit year for each reference, followed by a keyword (e.g. the name of a software package). Alternatively, you can omit the keyword. But note that if I cite a paper by the same first author that was published in the same year, then a lower-case letter is added to the end. For instance, for a paper that I wrote as first author in 2022, my BibTeX key might be `hicks2022` or `hicks2022a`. You can decide what scheme to use, just pick one and use it *forever*.\n\nIn your R Markdown document, you can then cite the reference by adding the key, such as `...in the paper by Hicks et al. [@hicks2022]...`.\n\n# Post-lecture materials\n\n### Practice\n\nHere are some post-lecture tasks to practice some of the material discussed.\n\n::: callout-note\n### Questions\n\n**Try out the following:**\n\n1. What do you notice that's different when you run `citation(\"tidyverse\")` (compared to `citation(\"rmarkdown\")`)?\n\n2. Install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(c(\"bibtex\", \"RefManageR\"))\n```\n:::\n\n\nWhat do they do? How might they be helpful to you in terms of reference management?\n\n3. Instead of using a `.bib` file, try using a different bibliography file format in an R Markdown document.\n\n4. 
Practice using a different CSL file to change the citation style.\n:::\n\n### Additional Resources\n\n::: callout-tip\n- Add here.\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n\n:::\n\n\n\\[Add here.\\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git 
a/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json b/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json index 75acc3f..4d72a5a 100644 --- a/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json +++ b/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "da2ed2dbc83c00a905134a7a4a7fecbb", + "hash": "53f6a7572e4a213383e32cd28c924565", "result": { - "markdown": "---\ntitle: \"07 - Reading and Writing data\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to get data in and out of R using relative paths\"\ncategories: [module 2, week 2, R, programming, readr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/07-reading-and-writing-data/index.qmd).*\n\n\n::: {.cell}\n\n:::\n\n\n\n\n> \"When writing code, you're always collaborating with future-you; and past-you doesn't respond to emails\". ---*Hadley Wickham*\n\n\\[[Source](https://fivebooks.com/best-books/computer-science-data-science-hadley-wickham/)\\]\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know the difference between relative and absolute paths\n- Be able to read and write text / csv files in R\n- Be able to read and write R data objects in R\n- Be able to calculate memory requirements for R objects\n- Use modern R packages for reading and writing data\n:::\n\n# Introduction\n\nThis lesson introduces **ways to read and write data** (e.g. `.txt` and `.csv` files) using base R functions as well as more modern R packages, such as `readr`, which is typically [10x faster than base R](https://r4ds.had.co.nz/data-import.html#compared-to-base-r).\n\nWe will also briefly describe different ways of reading and writing other data types, such as Excel files, Google spreadsheets, or SQL databases.\n\n# Relative versus absolute paths\n\nWhen you are starting a data analysis, you can create a new `.Rproj` file that asks RStudio to change the path (location on your computer) to the `.Rproj` location.\n\nLet's try this out. 
In RStudio, click `Project: (None)` in the top right corner and `New Project`.\n\nAfter opening up a `.Rproj` file, you can test this by\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n:::\n\n\nWhen you open up someone else's R code or analysis, you might also see the\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd()\n```\n:::\n\n\nfunction being used, which explicitly tells R to change to an absolute path, that is, the absolute location of the directory to move into.\n\nFor example, say I want to clone a GitHub repo from my colleague Brian, which has 100 R script files, and in every one of those files at the top is:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"C:\\Users\\Brian\\path\\only\\that\\Brian\\has\")\n```\n:::\n\n\nThe problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (`C:\\Users\\Brian\\path\\only\\that\\Brian\\has`) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. `/Users/Stephanie/Documents/path/only/I/have`).\n\n1. This is an unsustainable practice.\n2. I can go in and manually edit the path, but this assumes I know how to set a working directory. Not everyone does.\n\nSo instead of absolute paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"/Users/bcaffo/data\")\nsetwd(\"~/Desktop/files/data\")\nsetwd(\"C:\\\\Users\\\\Michelle\\\\Downloads\")\n```\n:::\n\n\nA better idea is to use relative paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"../data\")\nsetwd(\"../files\")\nsetwd(\"..\\tmp\")\n```\n:::\n\n\nWithin R, an even better idea is to use the [here](https://github.com/r-lib/here) R package, which will recognize the top-level directory of a Git repo and supports building all paths relative to it. 
For more on project-oriented workflow suggestions, read [this post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) from Jenny Bryan.\n\n![Artwork by Allison Horst on setwd() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/cracked_setwd.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### The `here` package\n\nIn her post, Jenny Bryan writes\n\n> \"I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work.\"\n\nInstead of using `setwd()` at the top of your `.R` or `.Rmd` file, she suggests:\n\n- Organize each logical project into a folder on your computer.\n- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named `.here`. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.\n- Use the `here()` function from the `here` package to build the path when you read or write a file. Create paths relative to the top-level directory.\n- Whenever you work on this project, launch the R process from the project's top-level directory. If you launch R from the shell, `cd` to the correct folder first.\n\nLet's test this out. We can use `getwd()` to see our current working directory path and the files available using `list.files()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/posts/07-reading-and-writing-data\"\n```\n:::\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\"       \"index.rmarkdown\"\n```\n:::\n:::\n\n\nOK so our current location is in the reading and writing lectures sub-folder of the `jhustatcomputing2023` course repository. 
Let's try using the `here` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n\nlist.files(here::here())\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"_freeze\"                    \"_post_template.qmd\"        \n [3] \"_quarto.yml\"                \"_site\"                     \n [5] \"data\"                       \"gh-pages\"                  \n [7] \"icon_32.png\"                \"images\"                    \n [9] \"index.qmd\"                  \"jhustatcomputing2023.Rproj\"\n[11] \"lectures.qmd\"               \"posts\"                     \n[13] \"profile.jpg\"                \"projects\"                  \n[15] \"projects.qmd\"               \"README.md\"                 \n[17] \"resources.qmd\"              \"schedule.qmd\"              \n[19] \"scripts\"                    \"site_libs\"                 \n[21] \"styles.css\"                 \"syllabus.qmd\"              \n[23] \"videos\"                    \n```\n:::\n\n```{.r .cell-code}\nlist.files(here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\"       \"b_lyrics.RDS\"            \n [3] \"bmi_pm25_no2_sim.csv\"     \"chicago.rds\"             \n [5] \"chocolate.RDS\"            \"flights.csv\"             \n [7] \"maacs_sim.csv\"            \"sales.RDS\"               \n [9] \"storms_2004.csv.gz\"       \"team_standings.csv\"      \n[11] \"ts_lyrics.RDS\"            \"tuesdata_rainfall.RDS\"   \n[13] \"tuesdata_temperature.RDS\"\n```\n:::\n:::\n\n\nNow we see that using the `here::here()` function builds a *relative* path (relative to the `.Rproj` file in our `jhustatcomputing2023` repository). We also see that there are several `.csv` files in the `data` folder. We will learn how to read those files into R in the next section.\n\n![Artwork by Allison Horst on here package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/here.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Finding and creating files locally\n\nOne last thing. 
If you want to download a file, one way is to use the `file.exists()`, `dir.create()` and `list.files()` functions.\n\n- `file.exists(here(\"my\", \"relative\", \"path\"))`: logical test if the file exists\n- `dir.create(here(\"my\", \"relative\", \"path\"))`: create a folder\n- `list.files(here(\"my\", \"relative\", \"path\"))`: list contents of folder\n- `file.create(here(\"my\", \"relative\", \"path\"))`: create a file\n- `file.remove(here(\"my\", \"relative\", \"path\"))`: delete a file\n\nFor example, I can put all this together by\n\n1. Checking to see if a file exists in my path. If not, then\n2. Creating a directory in that path.\n3. Listing the files in the path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif(!file.exists(here(\"my\", \"relative\", \"path\"))){\n    dir.create(here(\"my\", \"relative\", \"path\"))\n}\nlist.files(here(\"my\", \"relative\", \"path\"))\n```\n:::\n\n\nLet's put relative paths to use while reading and writing data.\n\n# Reading data in base R\n\nIn this section, we're going to demonstrate the essential functions you need to know to read and write (or save) data in R.\n\n## txt or csv\n\nThere are a few primary functions for reading data in base R.\n\n- `read.table()`, `read.csv()`: for reading tabular data\n- `readLines()`: for reading lines of a text file\n\nThere are analogous functions for writing data to files\n\n- `write.table()`: for writing tabular data to text files (i.e. 
CSV) or connections\n- `writeLines()`: for writing character data line-by-line to a file or connection\n\nLet's try reading some data into R with the `read.csv()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(here(\"data\", \"team_standings.csv\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Standing Team\n1 1 Spain\n2 2 Netherlands\n3 3 Germany\n4 4 Uruguay\n5 5 Argentina\n6 6 Brazil\n7 7 Ghana\n8 8 Paraguay\n9 9 Japan\n10 10 Chile\n11 11 Portugal\n12 12 USA\n13 13 England\n14 14 Mexico\n15 15 South Korea\n16 16 Slovakia\n17 17 Ivory Coast\n18 18 Slovenia\n19 19 Switzerland\n20 20 South Africa\n21 21 Australia\n22 22 New Zealand\n23 23 Serbia\n24 24 Denmark\n25 25 Greece\n26 26 Italy\n27 27 Nigeria\n28 28 Algeria\n29 29 France\n30 30 Honduras\n31 31 Cameroon\n32 32 North Korea\n```\n:::\n:::\n\n\nWe can use the `$` symbol to pick out a specific column:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$Team\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"Spain\" \"Netherlands\" \"Germany\" \"Uruguay\" \"Argentina\" \n [6] \"Brazil\" \"Ghana\" \"Paraguay\" \"Japan\" \"Chile\" \n[11] \"Portugal\" \"USA\" \"England\" \"Mexico\" \"South Korea\" \n[16] \"Slovakia\" \"Ivory Coast\" \"Slovenia\" \"Switzerland\" \"South Africa\"\n[21] \"Australia\" \"New Zealand\" \"Serbia\" \"Denmark\" \"Greece\" \n[26] \"Italy\" \"Nigeria\" \"Algeria\" \"France\" \"Honduras\" \n[31] \"Cameroon\" \"North Korea\" \n```\n:::\n:::\n\n\nWe can also ask for the full paths for specific files\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhere(\"data\", \"team_standings.csv\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/data/team_standings.csv\"\n```\n:::\n:::\n\n\n::: callout-note\n### Questions\n\n- What happens when you use `readLines()` function with the `team_standings.csv` data?\n- How would you only read in the first 5 lines?\n:::\n\n## R code\n\nSometimes, someone will give you a file that 
ends in `.R`.\n\nThis is what's called an **R script file**. It may contain code someone has written (maybe even you!), for example, a function that you can use with your data. In this case, you want the function available for you to use.\n\nTo use the function, **you first have to read the function from the R script file into R**.\n\nYou can check to see if the function is already loaded in R by looking at the Environment tab.\n\nThe function you want to use is\n\n- `source()`: for reading in R code files\n\nFor example, it might be something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(here::here('functions.R'))\n```\n:::\n\n\n## R objects\n\nAlternatively, you might be interested in reading and writing R objects.\n\nWriting data in e.g. `.txt`, `.csv` or Excel file formats is good if you want to open these files with other analysis software, such as Excel. However, these formats do not preserve data structures, such as column data types (numeric, character or factor). In order to do that, the data should be written out in an R data format.\n\nThere are several types of R data file formats to be aware of:\n\n- `.RData`: Stores **multiple** R objects\n- `.Rda`: This is short for `.RData` and is equivalent.\n- `.Rds`: Stores a **single** R object\n\n::: callout-note\n### Question\n\n**Why is saving data as an R object useful?**\n\nSaving data into R data formats can **typically** considerably reduce the size of large files by compression.\n:::\n\nNext, we will learn how to read and save\n\n1. A single R object\n2. Multiple R objects\n3. 
Your entire workspace in a specified file\n\n### Reading in data from files\n\n- `load()`: for reading in single or multiple R objects (opposite of `save()`) with a `.Rda` or `.RData` file format (objects keep the names they were saved with)\n- `readRDS()`: for reading in a single object with a `.Rds` file format (can rename objects)\n- `unserialize()`: for reading single R objects in binary form\n\n### Writing data to files\n\n- `save()`: for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.\n- `saveRDS()`: for saving a single object\n- `serialize()`: for converting an R object into a binary format for outputting to a connection (or file).\n- `save.image()`: short for 'save my current workspace'; while this **sounds** nice, it's not terribly useful for reproducibility (hence not suggested); it's also what happens when you try to quit R and it asks if you want to save your workspace.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Save data into R data file formats: RDS and RDATA](http://www.sthda.com/sthda/RDoc/images/save-data-into-r-data-formats.png)\n:::\n:::\n\n\n\\[[Source](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata)\\]\n\n### Example\n\nLet's try an example. 
Let's save a vector of length 5 into the two file formats.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\nsave(x, file=here(\"data\", \"x.Rda\"))\nsaveRDS(x, file=here(\"data\", \"x.Rds\"))\nlist.files(path=here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\" \"x.Rda\" \n[15] \"x.Rds\" \n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `readRDS()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x1 <- readRDS(here(\"data\", \"x.Rds\"))\nnew_x1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4 5\n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `load()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\nnew_x2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x\"\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\n`load()` simply returns the name of the objects loaded. 
Not the values.\n:::\n\nLet's clean up our space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rds\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nrm(x)\n```\n:::\n\n\n::: callout-note\n### Question\n\nWhat do you think this code will do?\n\n**Hint**: set `eval=TRUE` to see the result\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\ny <- x^2\nsave(x, y, file = here(\"data\", \"x.Rda\"))\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\n```\n:::\n\n\nWhen you are done:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n:::\n\n:::\n\n## Other data types\n\nThere are, of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.\n\nFor example, check out\n\n- [`DBI`](https://github.com/r-dbi/DBI) for relational databases\n- [`haven`](https://haven.tidyverse.org) for SPSS, Stata, and SAS data\n- [`httr`](https://github.com/r-lib/httr) for web APIs\n- [`readxl`](https://readxl.tidyverse.org) for `.xls` and `.xlsx` sheets\n- [`googlesheets4`](https://googlesheets4.tidyverse.org) for Google Sheets\n- [`googledrive`](https://googledrive.tidyverse.org) for Google Drive files\n- [`rvest`](https://github.com/tidyverse/rvest) for web scraping\n- [`jsonlite`](https://github.com/jeroen/jsonlite#jsonlite) for JSON\n- [`xml2`](https://github.com/r-lib/xml2) for XML.\n\n## Reading data files with `read.table()`\n\n
\n\nThe `read.table()` function is one of the most commonly used functions for reading data. The help file for `read.table()` is worth reading in its entirety if only because the function gets used a lot (run `?read.table` in R).\n\n**I know, I know**, everyone always says to read the help file, but this one is actually worth reading.\n\nThe `read.table()` function has a few important arguments:\n\n- `file`, the name of a file, or a connection\n- `header`, logical indicating if the file has a header line\n- `sep`, a string indicating how the columns are separated\n- `colClasses`, a character vector indicating the class of each column in the dataset\n- `nrows`, the number of rows in the dataset. By default `read.table()` reads an entire file.\n- `comment.char`, a character string indicating the comment character. This defaults to `\"#\"`. If there are no commented lines in your file, it's worth setting this to be the empty string `\"\"`.\n- `skip`, the number of lines to skip from the beginning\n- `stringsAsFactors`, should character variables be coded as factors? This defaults to `FALSE`. However, back in the \"old days\", it defaulted to `TRUE`. The reason was that, if you had data stored as strings, those strings typically represented levels of a categorical variable. Now, we have lots of text data that does not always represent categorical variables. So you may want to set this to be `FALSE` in those cases. If you *always* want this to be `FALSE`, you can set a global option via `options(stringsAsFactors = FALSE)`.\n\nI've never seen more heat generated on discussion forums about an R function argument than about the `stringsAsFactors` argument. 
**Seriously**.\n\nFor small to moderately sized datasets, you can usually call `read.table()` without specifying any other arguments\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.table(\"foo.txt\")\n```\n:::\n\n\n::: callout-tip\n### Note\n\n`foo.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`\n:::\n\nIn this case, R will automatically:\n\n- skip lines that begin with a \\#\n- figure out how many rows there are (and how much memory needs to be allocated)\n- figure out what type of variable is in each column of the table.\n\nTelling R all these things directly makes R run faster and more efficiently.\n\n::: callout-tip\n### Note\n\nThe `read.csv()` function is identical to `read.table()` except that some of the defaults are set differently (like the `sep` argument).\n:::\n\n
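To see how these arguments fit together, here is a small self-contained sketch (it writes a throwaway file with `tempfile()`, so no real dataset is needed) that sets `header`, `sep`, `colClasses`, and `comment.char` explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Write a tiny space-separated file to a temporary location\ntmp <- tempfile(fileext = \".txt\")\nwriteLines(c(\"id score\", \"1 3.5\", \"2 4.2\"), tmp)\n\n## Spell out the arguments instead of relying on the defaults\ndat <- read.table(tmp,\n    header = TRUE,\n    sep = \" \",\n    colClasses = c(\"integer\", \"numeric\"),\n    comment.char = \"\"\n)\nstr(dat)\nfile.remove(tmp)\n```\n:::\n\n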
\n\n## Reading in larger datasets with `read.table()`\n\n
\n\nWith much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.\n\n- Read the help page for `read.table()`, which contains many hints\n- Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.\n- Set `comment.char = \"\"` if there are no commented lines in your file.\n- Use the `colClasses` argument. Specifying this option instead of using the default can make `read.table()` run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are \"numeric\", for example, then you can just set `colClasses = \"numeric\"`. A quick and dirty way to figure out the classes of each column is the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\", colClasses = classes)\n```\n:::\n\n\n**Note**: `datatable.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n\n- Set `nrows`. This does not make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool `wc` to calculate the number of lines in a file.\n\nIn general, when using R with larger datasets, it's also useful to know a few things about your system.\n\n- How much memory is available on your system?\n- What other applications are in use? Can you close any of them?\n- Are there other users logged into the same system?\n- What operating system are you using? Some operating systems can limit the amount of memory a single process can access.\n\n
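If you prefer to stay within R rather than shelling out to `wc`, one rough way to get a value for `nrows` is to count lines with `readLines()` first. Here is a sketch using a throwaway file made with `tempfile()` (counting every line, header included, gives the mild overestimate mentioned above):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Create a small stand-in file; a real dataset would be much larger\ntmp <- tempfile(fileext = \".txt\")\nwriteLines(c(\"a b\", \"1 2\", \"3 4\"), tmp)\n\n## Count the lines; the header makes this a mild overestimate of the data rows\nn_lines <- length(readLines(tmp))\n\ndat <- read.table(tmp, header = TRUE, nrows = n_lines)\nnrow(dat)\nfile.remove(tmp)\n```\n:::\n\n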
\n\n# Calculating Memory Requirements for R Objects\n\nBecause **R stores all of its objects in physical memory**, it is important to be cognizant of how much memory is being used up by all of the data objects residing in your workspace.\n\nOne situation where it is particularly important to understand memory requirements is when you are reading a new dataset into R. Fortunately, it is easy to make a back-of-the-envelope calculation of how much memory will be required by a new dataset.\n\nFor example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?\n\nWell, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that information, you can do the following calculation\n\n1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes\n\n= 1,440,000,000 / 2^20^ bytes/MB\n\n= 1,373.29 MB\n\n= 1.34 GB\n\nSo the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware of\n\n- what other programs might be running on your computer, using up RAM\n- what other R objects might already be taking up RAM in your workspace\n\nReading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. 
You'll thank me later.\n\n# Using the `readr` package\n\nThe `readr` package was developed by Posit (formerly known as RStudio) to deal with reading in large flat files quickly.\n\nThe package provides replacements for functions like `read.table()` and `read.csv()`. The analogous functions in `readr` are `read_table()` and `read_csv()`. These **functions are often much faster than their base R analogues** and provide a few other nice features such as progress meters.\n\nFor example, the package includes a variety of functions in the `read_*()` family that allow you to read in data from different formats of flat files. The following table gives a guide to several functions in the `read_*()` family.\n\n\n::: {.cell}\n::: {.cell-output-display}\n|`readr` function |Use |\n|:----------------|:--------------------------------------------|\n|`read_csv()` |Reads comma-separated file |\n|`read_csv2()` |Reads semicolon-separated file |\n|`read_tsv()` |Reads tab-separated file |\n|`read_delim()` |General function for reading delimited files |\n|`read_fwf()` |Reads fixed width files |\n|`read_log()` |Reads log files |\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nIn this code, I used the `kable()` function from the `knitr` package to create the summary table in a table format, rather than as basic R output.\n\nThis function is very useful **for formatting basic tables in R markdown documents**. 
For more complex tables, check out the `pander` and `xtable` packages.\n:::\n\nFor the most part, you can use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`.\n\nIn addition, if there are non-fatal problems that occur while reading in the data, you will **get a warning and the returned data frame will have some information about which rows/observations triggered the warning**.\n\nThis can be very helpful for \"debugging\" problems with your data before you get neck-deep in data analysis.\n\n## Advantages\n\nThe advantage of the `read_csv()` function is perhaps better understood from an historical perspective.\n\n- R's built-in `read.csv()` function similarly reads CSV files, but the `read_csv()` function in `readr` builds on that by **removing some of the quirks and \"gotchas\"** of `read.csv()` as well as **dramatically optimizing the speed** with which it can read data into R.\n- The `read_csv()` function also adds some nice user-oriented features like a progress meter and a compact method for specifying column types.\n\n## Example\n\nA typical call to `read_csv()` will look as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readr)\nteams <- read_csv(here(\"data\", \"team_standings.csv\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 32 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): Team\ndbl (1): Standing\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n```{.r .cell-code}\nteams\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 32 × 2\n Standing Team \n <dbl> <chr> \n 1 1 Spain \n 2 2 Netherlands\n 3 3 Germany \n 4 4 Uruguay \n 5 5 Argentina \n 6 6 Brazil \n 7 7 Ghana \n 8 8 Paraguay \n 9 9 Japan \n10 10 Chile \n# ℹ 22 more rows\n```\n:::\n:::\n\n\nBy default, `read_csv()` will open a CSV 
file and read it in line-by-line. Similar to `read.table()`, you can tell the function to `skip` lines or which lines are comments:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n \n1 1 2 3\n```\n:::\n:::\n\n\nAlternatively, you can use the `comment` argument:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n \n1 1 2 3\n```\n:::\n:::\n\n\nIt will also (**by default**), **read in the first few rows of the table** in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv()` help page:\n\n> If 'NULL', all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. 
If the imputation fails, you'll need to supply the correct types yourself.\n\nYou can specify the type of each column with the `col_types` argument.\n\n::: callout-tip\n### Note\n\nIn general, it is a good idea to **specify the column types explicitly**.\n\nThis rules out any possible guessing errors on the part of `read_csv()`.\n\nAlso, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.\n:::\n\nHere is an example of how to specify the column types explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nteams <- read_csv(here(\"data\", \"team_standings.csv\"), \n col_types = \"cc\")\n```\n:::\n\n\nNote that the `col_types` argument accepts a compact representation. Here `\"cc\"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function **will also read compressed files** automatically.\n\nThere is no need to decompress the file first or use the `gzfile` connection function.\n:::\n\nThe following call reads a bzip2-compressed CSV file containing download logs from the RStudio CRAN mirror.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n n_max = 10)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 10 Columns: 10\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (6): r_version, r_arch, r_os, package, version, country\ndbl (2): size, ip_id\ndate (1): date\ntime (1): time\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nNote that the 
warnings indicate that `read_csv()` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n col_types = \"ccicccccci\", \n n_max = 10)\nlogs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 10\n date time size r_version r_arch r_os package version country ip_id\n \n 1 2016-07-19 22:00… 1.89e6 3.3.0 x86_64 ming… data.t… 1.9.6 US 1\n 2 2016-07-19 22:00… 4.54e4 3.3.1 x86_64 ming… assert… 0.1 US 2\n 3 2016-07-19 22:00… 1.43e7 3.3.1 x86_64 ming… stringi 1.1.1 DE 3\n 4 2016-07-19 22:00… 1.89e6 3.3.1 x86_64 ming… data.t… 1.9.6 US 4\n 5 2016-07-19 22:00… 3.90e5 3.3.1 x86_64 ming… foreach 1.4.3 US 4\n 6 2016-07-19 22:00… 4.88e4 3.3.1 x86_64 linu… tree 1.0-37 CO 5\n 7 2016-07-19 22:00… 5.25e2 3.3.1 x86_64 darw… surviv… 2.39-5 US 6\n 8 2016-07-19 22:00… 3.23e6 3.3.1 x86_64 ming… Rcpp 0.12.5 US 2\n 9 2016-07-19 22:00… 5.56e5 3.3.1 x86_64 ming… tibble 1.1 US 2\n10 2016-07-19 22:00… 1.52e5 3.3.1 x86_64 ming… magrit… 1.5 US 2\n```\n:::\n:::\n\n\nYou can **specify the column type in a more detailed fashion** by using the various `col_*()` functions.\n\nFor example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a `Date` object.\n\nIf we wanted to just read in that first column, we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogdates <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n col_types = cols_only(date = col_date()),\n n_max = 10)\nlogdates\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 1\n date \n \n 1 2016-07-19\n 2 2016-07-19\n 3 2016-07-19\n 4 2016-07-19\n 5 2016-07-19\n 6 2016-07-19\n 7 2016-07-19\n 8 2016-07-19\n 9 2016-07-19\n10 2016-07-19\n```\n:::\n:::\n\n\nNow the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the 
`lubridate` package).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function has a `progress` option that defaults to `TRUE`.\n\nThis option provides a nice progress meter while the CSV file is being read.\n\nHowever, if you are using `read_csv()` in a function, or perhaps embedding it in a loop, it is probably best to set `progress = FALSE`.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What is the point of reference for using relative paths with the `here::here()` function?\n\n2. Why was the argument `stringsAsFactors=TRUE` historically used?\n\n3. What is the difference between `.Rds` and `.Rda` file formats?\n\n4. What function in `readr` would you use to read a file where fields were separated with \"\\|\"?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 
2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"07 - Reading and Writing data\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to get data in and out of R using relative 
paths\"\ncategories: [module 2, week 2, R, programming, readr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/07-reading-and-writing-data/index.qmd).*\n\n\n::: {.cell}\n\n:::\n\n\n\n\n> \"When writing code, you're always collaborating with future-you; and past-you doesn't respond to emails.\" ---*Hadley Wickham*\n\n\\[[Source](https://fivebooks.com/best-books/computer-science-data-science-hadley-wickham/)\\]\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know the difference between relative and absolute paths\n- Be able to read and write text / csv files in R\n- Be able to read and write R data objects in R\n- Be able to calculate memory requirements for R objects\n- Use modern R packages for reading and writing data\n:::\n\n# Introduction\n\nThis lesson introduces **ways to read and write data** (e.g. 
`.txt` and `.csv` files) using base R functions as well as more modern R packages, such as `readr`, which is typically [10x faster than base R](https://r4ds.had.co.nz/data-import.html#compared-to-base-r).\n\nWe will also briefly describe different ways of reading and writing other data types, such as Excel files, Google spreadsheets, or SQL databases.\n\n# Relative versus absolute paths\n\nWhen you are starting a data analysis, you can create a new `.Rproj` file that asks RStudio to change the path (location on your computer) to the `.Rproj` location.\n\nLet's try this out. In RStudio, click `Project: (None)` in the top right corner and `New Project`.\n\nAfter opening up a `.Rproj` file, you can test this by\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n:::\n\n\nWhen you open up someone else's R code or analysis, you might also see the\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd()\n```\n:::\n\n\nfunction being used, which explicitly tells R to change to an absolute path (the absolute location of the directory to move into).\n\nFor example, say I want to clone a GitHub repo from my colleague Brian, which has 100 R script files, and in every one of those files at the top is:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"C:\\\\Users\\\\Brian\\\\path\\\\only\\\\that\\\\Brian\\\\has\")\n```\n:::\n\n\nThe problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (`C:\\Users\\Brian\\path\\only\\that\\Brian\\has`) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. `/Users/leocollado/Documents/path/only/I/have`).\n\n1. This is an unsustainable practice.\n2. I can go in and manually edit the path, but this assumes I know how to set a working directory. 
Not everyone does.\n\nSo instead of absolute paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"/Users/bcaffo/data\")\nsetwd(\"~/Desktop/files/data\")\nsetwd(\"C:\\\\Users\\\\Michelle\\\\Downloads\")\n```\n:::\n\n\nA better idea is to use relative paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"../data\")\nsetwd(\"../files\")\nsetwd(\"..\\\\tmp\")\n```\n:::\n\n\nWithin R, an even better idea is to use the [here](https://github.com/r-lib/here) R package, which will recognize the top-level directory of a Git repo and supports building all paths relative to it. For more on project-oriented workflow suggestions, read [this post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) from Jenny Bryan.\n\n![Artwork by Allison Horst on setwd() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/cracked_setwd.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### The `here` package\n\nIn her post, Jenny Bryan writes\n\n> \"I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work.\"\n\nInstead of using `setwd()` at the top of your `.R` or `.Rmd` file, she suggests:\n\n- Organize each logical project into a folder on your computer.\n- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named `.here`. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.\n- Use the `here()` function from the `here` package to build the path when you read or write a file. Create paths relative to the top-level directory.\n- Whenever you work on this project, launch the R process from the project's top-level directory. If you launch R from the shell, `cd` to the correct folder first.\n\nLet's test this out. 
We can use `getwd()` to see our current working directory path and `list.files()` to see the files available\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/posts/07-reading-and-writing-data\"\n```\n:::\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\" \"index.rmarkdown\"\n```\n:::\n:::\n\n\nOK so our current location is in the reading and writing lectures sub-folder of the `jhustatcomputing2023` course repository. Let's try using the `here` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n\nlist.files(here::here())\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"_freeze\" \"_post_template.qmd\" \n [3] \"_quarto.yml\" \"_site\" \n [5] \"data\" \"gh-pages\" \n [7] \"icon_32.png\" \"images\" \n [9] \"index.qmd\" \"jhustatcomputing2023.Rproj\"\n[11] \"lectures.qmd\" \"posts\" \n[13] \"profile.jpg\" \"projects\" \n[15] \"projects.qmd\" \"README.md\" \n[17] \"resources.qmd\" \"schedule.qmd\" \n[19] \"scripts\" \"site_libs\" \n[21] \"styles.css\" \"syllabus.qmd\" \n[23] \"videos\" \n```\n:::\n\n```{.r .cell-code}\nlist.files(here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\"\n```\n:::\n:::\n\n\nNow we see that the `here::here()` function builds a *relative* path (relative to the `.Rproj` file in our `jhustatcomputing2023` repository). We also see there are several `.csv` files in the `data` folder. 
We will learn how to read those files into R in the next section.\n\n![Artwork by Allison Horst on here package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/here.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Finding and creating files locally\n\nOne last thing. If you want to download a file, one way is to use the `file.exists()`, `dir.create()`, and `list.files()` functions.\n\n- `file.exists(here(\"my\", \"relative\", \"path\"))`: logical test if the file exists\n- `dir.create(here(\"my\", \"relative\", \"path\"))`: create a folder\n- `list.files(here(\"my\", \"relative\", \"path\"))`: list contents of folder\n- `file.create(here(\"my\", \"relative\", \"path\"))`: create a file\n- `file.remove(here(\"my\", \"relative\", \"path\"))`: delete a file\n\nFor example, I can put all this together by\n\n1. Checking to see if a file exists in my path. If not, then\n2. Creating a directory in that path.\n3. Listing the files in the path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!file.exists(here(\"my\", \"relative\", \"path\"))) {\n dir.create(here(\"my\", \"relative\", \"path\"))\n}\nlist.files(here(\"my\", \"relative\", \"path\"))\n```\n:::\n\n\nLet's put relative paths to use while reading and writing data.\n\n# Reading data in base R\n\nIn this section, we're going to demonstrate the essential functions you need to know to read and write (or save) data in R.\n\n## txt or csv\n\nThere are a few primary functions in base R for reading data.\n\n- `read.table()`, `read.csv()`: for reading tabular data\n- `readLines()`: for reading lines of a text file\n\nThere are analogous functions for writing data to files\n\n- `write.table()`: for writing tabular data to text files (i.e. 
CSV) or connections\n- `writeLines()`: for writing character data line-by-line to a file or connection\n\nLet's try reading some data into R with the `read.csv()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(here(\"data\", \"team_standings.csv\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Standing Team\n1 1 Spain\n2 2 Netherlands\n3 3 Germany\n4 4 Uruguay\n5 5 Argentina\n6 6 Brazil\n7 7 Ghana\n8 8 Paraguay\n9 9 Japan\n10 10 Chile\n11 11 Portugal\n12 12 USA\n13 13 England\n14 14 Mexico\n15 15 South Korea\n16 16 Slovakia\n17 17 Ivory Coast\n18 18 Slovenia\n19 19 Switzerland\n20 20 South Africa\n21 21 Australia\n22 22 New Zealand\n23 23 Serbia\n24 24 Denmark\n25 25 Greece\n26 26 Italy\n27 27 Nigeria\n28 28 Algeria\n29 29 France\n30 30 Honduras\n31 31 Cameroon\n32 32 North Korea\n```\n:::\n:::\n\n\nWe can use the `$` symbol to pick out a specific column:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$Team\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"Spain\" \"Netherlands\" \"Germany\" \"Uruguay\" \"Argentina\" \n [6] \"Brazil\" \"Ghana\" \"Paraguay\" \"Japan\" \"Chile\" \n[11] \"Portugal\" \"USA\" \"England\" \"Mexico\" \"South Korea\" \n[16] \"Slovakia\" \"Ivory Coast\" \"Slovenia\" \"Switzerland\" \"South Africa\"\n[21] \"Australia\" \"New Zealand\" \"Serbia\" \"Denmark\" \"Greece\" \n[26] \"Italy\" \"Nigeria\" \"Algeria\" \"France\" \"Honduras\" \n[31] \"Cameroon\" \"North Korea\" \n```\n:::\n:::\n\n\nWe can also ask for the full paths for specific files\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhere(\"data\", \"team_standings.csv\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/data/team_standings.csv\"\n```\n:::\n:::\n\n\n::: callout-note\n### Questions\n\n- What happens when you use `readLines()` function with the `team_standings.csv` data?\n- How would you only read in the first 5 lines?\n:::\n\n## R code\n\nSometimes, someone will give you a file that 
ends in `.R`.\n\nThis is what's called an **R script file**. It may contain code someone has written (maybe even you!), for example, a function that you can use with your data. In this case, you want the function available for you to use.\n\nTo use the function, **you first have to read the function from the R script file into R**.\n\nYou can check to see if the function is already loaded in R by looking at the Environment tab.\n\nThe function you want to use is\n\n- `source()`: for reading in R code files\n\nFor example, it might be something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(here::here(\"functions.R\"))\n```\n:::\n\n\n## R objects\n\nAlternatively, you might be interested in reading and writing R objects.\n\nWriting data in e.g. `.txt`, `.csv` or Excel file formats is good if you want to open these files with other analysis software, such as Excel. However, these formats do not preserve data structures, such as column data types (numeric, character or factor). In order to do that, the data should be written out in an R data format.\n\nThere are several types of R data file formats to be aware of:\n\n- `.RData`: Stores **multiple** R objects\n- `.Rda`: This is short for `.RData` and is equivalent.\n- `.Rds`: Stores a **single** R object\n\n::: callout-note\n### Question\n\n**Why is saving data as an R object useful?**\n\nSaving data into R data formats can **typically** reduce the size of large files considerably through compression.\n:::\n\nNext, we will learn how to read and save\n\n1. A single R object\n2. Multiple R objects\n3. 
Your entire workspace in a specified file\n\n### Reading in data from files\n\n- `load()`: for reading in single or multiple R objects (opposite of `save()`) with a `.Rda` or `.RData` file format (objects are restored with their original names)\n- `readRDS()`: for reading in a single object with a `.Rds` file format (you can assign the object to a new name)\n- `unserialize()`: for reading single R objects in binary form\n\n### Writing data to files\n\n- `save()`: for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.\n- `saveRDS()`: for saving a single object\n- `serialize()`: for converting an R object into a binary format for outputting to a connection (or file).\n- `save.image()`: short for 'save my current workspace'; while this **sounds** nice, it's not terribly useful for reproducibility (hence not suggested); it's also what happens when you try to quit R and it asks if you want to save your workspace.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Save data into R data file formats: RDS and RDATA](http://www.sthda.com/sthda/RDoc/images/save-data-into-r-data-formats.png)\n:::\n:::\n\n\n\\[[Source](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata)\\]\n\n### Example\n\nLet's try an example. 
Let's save a vector of length 5 into the two file formats.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\nsave(x, file = here(\"data\", \"x.Rda\"))\nsaveRDS(x, file = here(\"data\", \"x.Rds\"))\nlist.files(path = here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\" \"x.Rda\" \n[15] \"x.Rds\" \n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `readRDS()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x1 <- readRDS(here(\"data\", \"x.Rds\"))\nnew_x1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4 5\n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `load()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\nnew_x2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x\"\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\n`load()` simply returns the name of the objects loaded. 
Not the values.\n:::\n\nLet's clean up our workspace.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rds\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nrm(x)\n```\n:::\n\n\n::: callout-note\n### Question\n\nWhat do you think this code will do?\n\n**Hint**: set the chunk option `eval=TRUE` to see the result\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\ny <- x^2\nsave(x, y, file = here(\"data\", \"x.Rda\"))\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\n```\n:::\n\n\nWhen you are done:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n:::\n\n:::\n\n## Other data types\n\nThere are, of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.\n\nFor example, check out\n\n- [`DBI`](https://github.com/r-dbi/DBI) for relational databases\n- [`haven`](https://haven.tidyverse.org) for SPSS, Stata, and SAS data\n- [`httr`](https://github.com/r-lib/httr) for web APIs\n- [`readxl`](https://readxl.tidyverse.org) for `.xls` and `.xlsx` sheets\n- [`googlesheets4`](https://googlesheets4.tidyverse.org) for Google Sheets\n- [`googledrive`](https://googledrive.tidyverse.org) for Google Drive files\n- [`rvest`](https://github.com/tidyverse/rvest) for web scraping\n- [`jsonlite`](https://github.com/jeroen/jsonlite#jsonlite) for JSON\n- [`xml2`](https://github.com/r-lib/xml2) for XML.\n\n## Reading data files with `read.table()`\n\n
\n\nFor details on reading data with `read.table()`, click here.\n\nThe `read.table()` function is one of the most commonly used functions for reading data. The help file for `read.table()` is worth reading in its entirety if only because the function gets used a lot (run `?read.table` in R).\n\n**I know, I know**, everyone always says to read the help file, but this one is actually worth reading.\n\nThe `read.table()` function has a few important arguments:\n\n- `file`, the name of a file, or a connection\n- `header`, logical indicating if the file has a header line\n- `sep`, a string indicating how the columns are separated\n- `colClasses`, a character vector indicating the class of each column in the dataset\n- `nrows`, the number of rows in the dataset. By default `read.table()` reads an entire file.\n- `comment.char`, a character string indicating the comment character. This defaults to `\"#\"`. If there are no commented lines in your file, it's worth setting this to be the empty string `\"\"`.\n- `skip`, the number of lines to skip from the beginning\n- `stringsAsFactors`, should character variables be coded as factors? This defaults to `FALSE`. However, back in the \"old days\", it defaulted to `TRUE`. The reasoning was that, if data were stored as strings, those strings usually represented levels of a categorical variable. Now we have lots of text data, and text does not always represent a categorical variable. So you may want to set this to be `FALSE` in those cases. If you *always* want this to be `FALSE`, you can set a global option via `options(stringsAsFactors = FALSE)` (note that this global option is deprecated as of R 4.0.0).\n\nI've never seen more heat generated on discussion forums over an R function argument than over the `stringsAsFactors` argument. 
**Seriously**.\n\nFor small to moderately sized datasets, you can usually call `read.table()` without specifying any other arguments\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.table(\"foo.txt\")\n```\n:::\n\n\n::: callout-tip\n### Note\n\n`foo.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n:::\n\nIn this case, R will automatically:\n\n- skip lines that begin with a \\#\n- figure out how many rows there are (and how much memory needs to be allocated)\n- figure out what type of variable is in each column of the table.\n\nTelling R all these things directly makes R run faster and more efficiently.\n\n::: callout-tip\n### Note\n\nThe `read.csv()` function is identical to `read.table()` except that some of the defaults are set differently (like the `sep` argument).\n:::\n\n
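For instance, the following two calls should be (roughly) equivalent; as in the note above, `foo.txt` is not a real dataset and is only used for illustration:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## read.csv() is essentially read.table() with different defaults,\n## e.g. header = TRUE and sep = \",\"\ndata1 <- read.csv(\"foo.txt\")\ndata2 <- read.table(\"foo.txt\", header = TRUE, sep = \",\")\n```\n:::\n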
\n\n## Reading in larger datasets with `read.table()`\n\n
\n\nFor details on reading larger datasets with `read.table()`, click here.\n\nWith much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.\n\n- Read the help page for `read.table()`, which contains many hints\n- Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.\n- Set `comment.char = \"\"` if there are no commented lines in your file.\n- Use the `colClasses` argument. Specifying this option instead of using the default can make `read.table()` run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are \"numeric\", for example, then you can just set `colClasses = \"numeric\"`. A quick and dirty way to figure out the classes of each column is the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\", colClasses = classes)\n```\n:::\n\n\n**Note**: `datatable.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n\n- Set `nrows`. This does not make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool `wc` to calculate the number of lines in a file.\n\nIn general, when using R with larger datasets, it's also useful to know a few things about your system.\n\n- How much memory is available on your system?\n- What other applications are in use? Can you close any of them?\n- Are there other users logged into the same system?\n- What operating system are you using? Some operating systems can limit the amount of memory a single process can access.\n\n
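Putting a few of these hints together, a sketch might look like the following (again, `datatable.txt` is not a real dataset; the line count from `wc -l` is a mild overestimate of the number of data rows, since it includes the header line):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Estimate the number of lines with the Unix tool wc,\n## then read with explicit colClasses, nrows, and comment.char\nn <- scan(text = system(\"wc -l datatable.txt\", intern = TRUE), n = 1)\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\",\n    colClasses = classes, nrows = n, comment.char = \"\"\n)\n```\n:::\n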
\n\n# Calculating Memory Requirements for R Objects\n\nBecause **R stores all of its objects in physical memory**, it is important to be cognizant of how much memory is being used up by all of the data objects residing in your workspace.\n\nOne situation where it is particularly important to understand memory requirements is when you are reading a new dataset into R. Fortunately, it is easy to make a back-of-the-envelope calculation of how much memory will be required by a new dataset.\n\nFor example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?\n\nWell, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that information, you can do the following calculation\n\n1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes\n\n= 1,440,000,000 / 2^20^ bytes/MB\n\n= 1,373.29 MB\n\n= 1.34 GB\n\nSo the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware of\n\n- what other programs might be running on your computer, using up RAM\n- what other R objects might already be taking up RAM in your workspace\n\nReading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. 
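\n\nThe same back-of-the-envelope arithmetic is easy to reproduce in R itself:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrows <- 1500000\ncols <- 120\nbytes <- rows * cols * 8 ## 8 bytes per double precision number\nbytes / 2^20 ## about 1373.29 MB\nbytes / 2^30 ## about 1.34 GB\n```\n:::\n\n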
You'll thank me later.\n\n# Using the `readr` package\n\nThe `readr` package was developed by Posit (formerly known as RStudio) to deal with reading in large flat files quickly.\n\nThe package provides replacements for functions like `read.table()` and `read.csv()`. The analogous functions in `readr` are `read_table()` and `read_csv()`. These **functions are often much faster than their base R analogues** and provide a few other nice features such as progress meters.\n\nFor example, the package includes a variety of functions in the `read_*()` family that allow you to read in data from different formats of flat files. The following table gives a guide to several functions in the `read_*()` family.\n\n\n::: {.cell}\n::: {.cell-output-display}\n|`readr` function |Use |\n|:----------------|:--------------------------------------------|\n|`read_csv()` |Reads comma-separated file |\n|`read_csv2()` |Reads semicolon-separated file |\n|`read_tsv()` |Reads tab-separated file |\n|`read_delim()` |General function for reading delimited files |\n|`read_fwf()` |Reads fixed width files |\n|`read_log()` |Reads log files |\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nIn this code, I used the `kable()` function from the `knitr` package to create the summary table in a table format, rather than as basic R output.\n\nThis function is very useful **for formatting basic tables in R markdown documents**. 
For more complex tables, check out the `pander` and `xtable` packages.\n:::\n\nFor the most part, you can use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`.\n\nIn addition, if there are non-fatal problems that occur while reading in the data, you will **get a warning and the returned data frame will have some information about which rows/observations triggered the warning**.\n\nThis can be very helpful for \"debugging\" problems with your data before you get neck-deep in data analysis.\n\n## Advantages\n\nThe advantage of the `read_csv()` function is perhaps better understood from a historical perspective.\n\n- R's built-in `read.csv()` function similarly reads CSV files, but the `read_csv()` function in `readr` builds on that by **removing some of the quirks and \"gotchas\"** of `read.csv()` as well as **dramatically optimizing the speed** with which it can read data into R.\n- The `read_csv()` function also adds some nice user-oriented features like a progress meter and a compact method for specifying column types.\n\n## Example\n\nA typical call to `read_csv()` will look as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readr)\nteams <- read_csv(here(\"data\", \"team_standings.csv\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 32 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): Team\ndbl (1): Standing\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n```{.r .cell-code}\nteams\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 32 × 2\n Standing Team \n <dbl> <chr> \n 1 1 Spain \n 2 2 Netherlands\n 3 3 Germany \n 4 4 Uruguay \n 5 5 Argentina \n 6 6 Brazil \n 7 7 Ghana \n 8 8 Paraguay \n 9 9 Japan \n10 10 Chile \n# ℹ 22 more rows\n```\n:::\n:::\n\n\nBy default, `read_csv()` will open a CSV 
file and read it in line-by-line. Similar to `read.table()`, you can tell the function to `skip` lines or which lines are comments:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n <dbl> <dbl> <dbl>\n1 1 2 3\n```\n:::\n:::\n\n\nAlternatively, you can use the `comment` argument:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n <dbl> <dbl> <dbl>\n1 1 2 3\n```\n:::\n:::\n\n\nIt will also (**by default**) **read in the first few rows of the table** in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv()` help page:\n\n> If 'NULL', all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. 
If the imputation fails, you'll need to supply the correct types yourself.\n\nYou can specify the type of each column with the `col_types` argument.\n\n::: callout-tip\n### Note\n\nIn general, it is a good idea to **specify the column types explicitly**.\n\nThis rules out any possible guessing errors on the part of `read_csv()`.\n\nAlso, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.\n:::\n\nHere is an example of how to specify the column types explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nteams <- read_csv(here(\"data\", \"team_standings.csv\"),\n col_types = \"cc\"\n)\n```\n:::\n\n\nNote that the `col_types` argument accepts a compact representation. Here `\"cc\"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function **will also read compressed files** automatically.\n\nThere is no need to decompress the file first or use the `gzfile` connection function.\n:::\n\nThe following call reads a bzip2-compressed CSV file containing download logs from the RStudio CRAN mirror.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n n_max = 10\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 10 Columns: 10\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (6): r_version, r_arch, r_os, package, version, country\ndbl (2): size, ip_id\ndate (1): date\ntime (1): time\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nNote that the 
warnings indicate that `read_csv()` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n col_types = \"ccicccccci\",\n n_max = 10\n)\nlogs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 10\n date time size r_version r_arch r_os package version country ip_id\n \n 1 2016-07-19 22:00… 1.89e6 3.3.0 x86_64 ming… data.t… 1.9.6 US 1\n 2 2016-07-19 22:00… 4.54e4 3.3.1 x86_64 ming… assert… 0.1 US 2\n 3 2016-07-19 22:00… 1.43e7 3.3.1 x86_64 ming… stringi 1.1.1 DE 3\n 4 2016-07-19 22:00… 1.89e6 3.3.1 x86_64 ming… data.t… 1.9.6 US 4\n 5 2016-07-19 22:00… 3.90e5 3.3.1 x86_64 ming… foreach 1.4.3 US 4\n 6 2016-07-19 22:00… 4.88e4 3.3.1 x86_64 linu… tree 1.0-37 CO 5\n 7 2016-07-19 22:00… 5.25e2 3.3.1 x86_64 darw… surviv… 2.39-5 US 6\n 8 2016-07-19 22:00… 3.23e6 3.3.1 x86_64 ming… Rcpp 0.12.5 US 2\n 9 2016-07-19 22:00… 5.56e5 3.3.1 x86_64 ming… tibble 1.1 US 2\n10 2016-07-19 22:00… 1.52e5 3.3.1 x86_64 ming… magrit… 1.5 US 2\n```\n:::\n:::\n\n\nYou can **specify the column type in a more detailed fashion** by using the various `col_*()` functions.\n\nFor example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a `Date` object.\n\nIf we wanted to just read in that first column, we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogdates <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n col_types = cols_only(date = col_date()),\n n_max = 10\n)\nlogdates\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 1\n date \n \n 1 2016-07-19\n 2 2016-07-19\n 3 2016-07-19\n 4 2016-07-19\n 5 2016-07-19\n 6 2016-07-19\n 7 2016-07-19\n 8 2016-07-19\n 9 2016-07-19\n10 2016-07-19\n```\n:::\n:::\n\n\nNow the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the 
`lubridate` package).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function has a `progress` option that defaults to `TRUE`.\n\nThis option provides a nice progress meter while the CSV file is being read.\n\nHowever, if you are using `read_csv()` in a function, or perhaps embedding it in a loop, it is probably best to set `progress = FALSE`.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What is the point of reference for using relative paths with the `here::here()` function?\n\n2. Why was the argument `stringsAsFactors=TRUE` historically used?\n\n3. What is the difference between `.Rds` and `.Rda` file formats?\n\n4. What function in `readr` would you use to read a file where fields were separated with \"\\|\"?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 
2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json index fda21c2..9d27133 100644 --- a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json +++ b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json @@ -1,7 +1,7 @@ { - 
"hash": "692ae9c13383582e0de2ed88c813e47b", + "hash": "14e8698f5056b6c674e42b8c203edc6c", "result": { - "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An introduction to data frames in R and managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
[dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of `tibble` over `data.frame` data objects in R\n- Learn about the dplyr R package to manage data frames\n- Recognize the key verbs to manage data frames in dplyr\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns can often be tedious in R, with syntax that is not very intuitive. The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. 
Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from data.frame, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` was notorious for treating strings as factors (the default behavior before R 4.0.0); this will not happen with tibbles\n- **Variable names remain unchanged** - In base R, creating `data.frames` will convert spaces in names to periods and add \"X\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print the first ten rows and only the columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.\n\n## Creating a tibble\n\nThe tibble package is part of the `tidyverse` and can thus be loaded (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibble`.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor this example, we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` directory of the course repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6940 8\n```\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>% \n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column z of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you willl continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work with either a `data.frame` or a `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formerly RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package has a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
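In particular, `dplyr::filter()` and `dplyr::lag()` mask `stats::filter()` and `stats::lag()` from base R. If you ever need one of the masked functions, one option is to call it explicitly with its package prefix and the `::` operator, which is unambiguous no matter which packages are loaded. As a minimal sketch (the `tmpd > 80` condition is just an arbitrary illustration):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## dplyr's filter(), called explicitly through its namespace\ndplyr::filter(chicago, tmpd > 80)\n\n## base R's time-series filter() is still reachable despite the masking\nstats::filter(1:10, rep(1 / 3, 3))\n```\n:::\n\n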
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also supports a special syntax that lets you specify variable names based on patterns. So, for example, if we wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame and the distribution of the `pm25tmean2` values is as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n <date> <dbl> <dbl>\n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == \"chic\"` |\n| `!=` | Does not equal | `city != \"chic\"` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n:::\n:::\n\n\nColumns can be arranged in descending order too by using the special `desc()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n <chr> <dbl> <dbl> <date> <dbl>\n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and should probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n <chr> <dbl> <dbl> <date> <dbl>\n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago, \n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n:::\n:::\n\n\nNote that there are only two columns in the 
transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a grouped data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years, pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n <dbl> <dbl> <dbl> <dbl>\n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quintiles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint, o3 = mean(o3tmean2, na.rm = TRUE), \n no2 = mean(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n <fct> <dbl> <dbl>\n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 <NA> 18.8 25.8\n```\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the operations get buried in a sequence of nested 
function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>% second %>% third\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>% \n mutate(year = as.POSIXlt(date)$year + 1900) %>% \n group_by(year) %>% \n summarize(pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>% \n group_by(month) %>% \n summarize(pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 45 25.7 2002-04-26 9.9 30.2 23.9 26.3\n 2 chic 52.5 41.4 1993-04-23 NA 63.5 24.6 40.1\n 3 chic 13 9.3 2005-01-27 10.8 22 20.7 27.3\n 4 chic 41 43.4 1993-11-12 NA 51.5 10.0 25.3\n 5 chic 52.5 39 1996-05-02 NA 38 22.5 32.1\n 6 chic 65.5 51.8 1990-09-27 NA 45.5 19.6 40.9\n 7 chic 46 31 2000-11-05 12.1 26 14.3 24.7\n 8 chic 86.5 73.4 1990-07-04 NA 60.6 52.2 12.8\n 9 chic 10 7.75 1992-12-24 NA 39 6.82 21.9\n10 chic 33 20 1992-10-19 NA 30 9.30 
33.6\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again, the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar, there are a few additional benefits:\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package.\n\n- `dplyr` can be integrated with the `data.table` package for large, fast tables.\n\nThe `dplyr` package is a handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 
2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \" An introduction to data frames in R and the managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
[dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of `tibble` objects over `data.frame` objects in R\n- Learn about the `dplyr` R package to manage data frames\n- Recognize the key verbs to manage data frames in `dplyr`\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns can often be tedious in R, with syntax that is not very intuitive. The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. 
Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from `data.frame`, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` is notorious for treating strings as factors; this will not happen with tibbles\n- **Variable names remain unchanged** - In base R, creating `data.frames` will remove spaces from names, converting them to periods, or will add \"X\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print first ten rows and columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.\n\n## Creating a tibble\n\nThe `tibble` package is part of the `tidyverse` and can thus be loaded in (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibble`.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor this example, we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` directory of the course repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6940 8\n```\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>%\n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column `z` of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you will continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work with either a `data.frame` or `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formerly RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are:\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package has a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
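If you ever need to be explicit about which package a function comes from, you can use the `package::function()` notation. As a quick, hypothetical sketch (using the `chicago` dataset loaded earlier):\n\n::: {.cell}\n\n```{.r .cell-code}\n## dplyr::filter() masks stats::filter() once dplyr is loaded;\n## the :: notation makes it unambiguous which function we mean\ndplyr::filter(chicago, tmpd > 80)\n```\n:::\n\n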
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame, and the distribution of the `pm25tmean2` values is as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n \n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == \"chic\"` |\n| `!=` | Does not equal | `city != \"chic\"` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
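For comparison, the base R idiom for this combines `order()` with bracket indexing. A minimal sketch (shown here without its output):\n\n::: {.cell}\n\n```{.r .cell-code}\n## base R equivalent of arrange(chicago, date):\n## order() returns the row permutation that sorts by date\nchicago[order(chicago$date), ]\n```\n:::\n\n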
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n:::\n:::\n\n\nColumns can be arranged in descending order too by using the special `desc()` operator.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and should probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago,\n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)\n))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n:::\n:::\n\n\nNote that there are only two columns in the 
transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know the average annual level of PM2.5.\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a separate data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years,\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quintiles.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint,\n o3 = mean(o3tmean2, na.rm = TRUE),\n no2 = mean(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n \n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 <NA> 18.8 25.8\n```\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the operations got buried in a sequence of nested 
function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>%\n second() %>%\n third()\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>%\n mutate(year = as.POSIXlt(date)$year + 1900) %>%\n group_by(year) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%\n group_by(month) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 49 40.2 2000-09-25 6.6 7 17.2 15.5\n 2 chic 35 24.1 1989-11-02 NA 25 8.83 17.3\n 3 chic 63.5 54.4 1996-04-18 NA 54 30.5 26.7\n 4 chic 70 65.9 1997-06-19 NA 60.5 32.4 39.9\n 5 chic 54 50.6 2005-11-05 27.2 32 11.5 18.2\n 6 chic 86.5 73.4 1990-07-04 NA 60.6 52.2 12.8\n 7 chic 74 74.6 1987-08-14 NA 49.5 24.2 18.6\n 8 chic 34.5 29.1 1995-11-27 NA 25 6.57 29.3\n 9 chic 73 61.2 1995-09-13 NA 46 25.3 26.5\n10 chic 79 64.6 2005-07-31 20.8 29.5 
40.8 20.2\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again, the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar, there are a few additional benefits:\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package\n\n- `dplyr` can be integrated with the `data.table` package for large, fast tables\n\nThe `dplyr` package is a handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 
2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json b/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json index dd7cfe1..745ed89 100644 --- a/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json +++ b/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2d7b312f39fbff603576c0df9c26dee5", + "hash": "4988f121f5166e5d1440e31a42847a58", "result": { - "markdown": "---\ntitle: \"09 - Tidy data and the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to tidy data and how to convert between wide and long data with the tidyr R package\"\ncategories: [module 2, week 2, R, programming, tidyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/09-tidy-data-and-the-tidyverse/index.qmd).*\n\n\n\n> \"Happy families are all alike; every unhappy family is unhappy in its own way.\" ---- Leo Tolstoy\n\n> \"Tidy datasets are all alike, but every messy dataset is messy in its own way.\" ---- Hadley Wickham\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n2. \n3. [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Define tidy data\n- Be able to transform non-tidy data into tidy data\n- Be able to transform wide data into long data\n- Be able to separate character columns into multiple columns\n- Be able to unite multiple character columns into one column\n:::\n\n# Tidy data\n\nAs we learned in the last lesson, one unifying concept of the tidyverse is the notion of **tidy data**.\n\nAs defined by Hadley Wickham in his 2014 paper published in the *Journal of Statistical Software*, a [tidy dataset](https://www.jstatsoft.org/article/view/v059i10) has the following properties:\n\n1. Each variable forms a column.\n\n2. Each observation forms a row.\n\n3. 
Each type of observational unit forms a table.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe **purpose of defining tidy data** is to highlight the fact that **most data do not start out life as tidy**.\n\nIn fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).\n\n- Once a dataset is tidy, it **can be used as input into a variety of other functions** that may transform, model, or visualize the data.\n\n::: callout-tip\n### Example\n\nAs a quick example, consider the following data illustrating **religion and income survey data** with the number of respondents with income range in column name.\n\nThis is in a classic table format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\nrelig_income\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 11\n religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`\n \n 1 Agnostic 27 34 60 81 76 137 122\n 2 Atheist 12 27 37 52 35 70 73\n 3 Buddhist 27 21 30 34 33 58 62\n 4 Catholic 418 617 732 670 638 1116 949\n 5 Don’t k… 15 14 15 11 10 35 21\n 6 Evangel… 575 869 1064 982 881 1486 949\n 7 Hindu 1 9 7 9 11 34 47\n 8 Histori… 228 244 236 238 197 223 131\n 9 Jehovah… 20 27 24 24 21 30 15\n10 Jewish 19 19 25 25 30 95 69\n11 Mainlin… 289 495 619 655 651 1107 939\n12 Mormon 29 40 48 51 56 112 85\n13 Muslim 6 7 9 10 9 23 16\n14 Orthodox 13 17 23 32 32 47 38\n15 Other C… 9 7 11 13 13 14 18\n16 Other F… 20 33 40 46 49 63 46\n17 Other W… 5 2 3 4 2 7 3\n18 Unaffil… 217 299 374 365 341 528 407\n# ℹ 3 more variables: `$100-150k` , `>150k` ,\n# `Don't know/refused` \n```\n:::\n:::\n\n:::\n\nWhile this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not 
tidy.\n\n**This format violates the tidy form** because there are variables in the columns.\n\n- In this case, the variables are religion, income bracket, and number of respondents; the third variable (the number of respondents) appears inside the body of the table rather than in its own column.\n\nConverting this dataset to tidy format would give us\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n\nSome of these functions you have seen before; others might be new to you. Let's talk about each one in the context of the `tidyverse` R packages.\n\n# The \"Tidyverse\"\n\nThere are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. 
Many (but not all) of these packages are written by Hadley Wickham and **the collection of packages is often referred to as the \"tidyverse\"** because of their **dependence on and presumption of tidy data**.\n\n::: callout-tip\n### Note\n\nA subset of the \"Tidyverse\" packages includes:\n\n- [ggplot2](https://cran.r-project.org/package=ggplot2): a plotting system based on the grammar of graphics\n\n- [magrittr](https://cran.r-project.org/package=magrittr): defines the `%>%` operator for chaining functions together in a series of operations on data\n\n- [dplyr](https://cran.r-project.org/package=dplyr): a suite of (fast) functions for working with data frames\n\n- [tidyr](https://cran.r-project.org/package=tidyr): easily tidy data with `pivot_wider()` and `pivot_longer()` functions (also `separate()` and `unite()`)\n\nA complete list can be found here ().\n:::\n\nWe will be using these packages quite a bit.\n\nThe \"tidyverse\" package can be used to install all of the packages in the tidyverse at once.\n\nFor example, instead of starting an R script with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(readr)\nlibrary(ggplot2)\n```\n:::\n\n\nYou can start with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nIn the example above, let's talk about what we did using the `pivot_longer()` function.\n\nWe will also talk about `pivot_wider()`.\n\n### `pivot_longer()`\n\nThe `tidyr` package includes functions to convert a data frame between *long* and *wide* formats.\n\n- **Wide format** data tends to have different attributes or variables describing an observation placed in separate columns.\n- **Long format** data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.\n\n::: callout-tip\n### Example\n\nIn the section above, we showed an example that used `pivot_longer()` to convert data 
into a tidy format.\n\nThe **key problem** with the tidiness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.\n\nTo **fix this**, you can use the `pivot_longer()` function to **gather values spread across several columns into a single column**, here with the column names gathered into an `income` column.\n\n**Note**: when gathering, exclude any columns that you do not want \"gathered\" (`religion` in this case) by including the column names with a minus sign in the `pivot_longer()` function.\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Gather everything EXCEPT religion to tidy data\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n:::\n\nEven if your data is in a tidy format, `pivot_longer()` is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.\n\n### `pivot_wider()`\n\nThe `pivot_wider()` function is less commonly needed to tidy data. 
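\n\nAs a minimal sketch of the mechanics (using a hypothetical toy tibble, not evaluated here), `pivot_wider()` is the inverse of `pivot_longer()`: it spreads the entries of one column into new column names and fills those new columns with the matching values.\n\n::: {.cell}\n\n```{.r .cell-code}\n## Hypothetical toy data: spread \"key\" into columns \"a\" and \"b\"\ndf_long <- tibble(\n id = c(1, 1, 2, 2),\n key = c(\"a\", \"b\", \"a\", \"b\"),\n value = c(10, 20, 30, 40)\n)\npivot_wider(df_long, names_from = \"key\", values_from = \"value\")\n## yields one row per id, with columns id, a, and b\n```\n:::\n\n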
It can, however, be useful for creating summary tables.\n\n::: callout-tip\n### Example\n\nYou use the `summarize()` function in `dplyr` to summarize the total number of respondents per income category.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income)) %>% \n group_by(income) %>% \n summarize(total_respondents = sum(respondents)) %>%\n pivot_wider(names_from = \"income\", \n values_from = \"total_respondents\") %>%\n knitr::kable()\n```\n\n::: {.cell-output-display}\n| <$10k| >150k| $10-20k| $100-150k| $20-30k| $30-40k| $40-50k| $50-75k| $75-100k| Don't know/refused|\n|-----:|-----:|-------:|---------:|-------:|-------:|-------:|-------:|--------:|------------------:|\n| 1930| 2608| 2781| 3197| 3357| 3302| 3085| 5185| 3990| 6121|\n:::\n:::\n\n:::\n\nNotice in this example how `pivot_wider()` has been used at the **very end of the code sequence** to convert the summarized data into a shape that **offers a better tabular presentation for a report**.\n\n::: callout-tip\n### Note\n\nIn the `pivot_wider()` call, you first specify the name of the column to use for the new column names (`income` in this example) and then specify the column to use for the cell values (`total_respondents` here).\n:::\n\n::: callout-tip\n### Example of `pivot_longer()`\n\nLet's try another dataset. 
This dataset contains an excerpt of the [Gapminder data](https://cran.r-project.org/web/packages/gapminder/README.html#gapminder) on life expectancy, GDP per capita, and population by country.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nIf we wanted to make `lifeExp`, `pop` and `gdpPercap` (all measurements that we observe) go from a wide table into a long table, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nOne more! 
Try using `pivot_longer()` to convert the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.\n\nAfterward, use `group_by()` and `summarize()` to calculate the average revenue for each company across all years and all quarters.\n\n**Bonus**: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n \"company\" = rep(1:3, each=4), \n \"year\" = rep(2006:2009, 3),\n \"Q1\" = sample(x = 0:100, size = 12),\n \"Q2\" = sample(x = 0:100, size = 12),\n \"Q3\" = sample(x = 0:100, size = 12),\n \"Q4\" = sample(x = 0:100, size = 12),\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 6\n company year Q1 Q2 Q3 Q4\n \n 1 1 2006 8 55 99 17\n 2 1 2007 9 57 98 48\n 3 1 2008 20 40 77 24\n 4 1 2009 42 68 61 26\n 5 2 2006 100 84 13 3\n 6 2 2007 86 93 17 93\n 7 2 2008 97 83 62 62\n 8 2 2009 46 12 25 79\n 9 3 2006 5 48 81 41\n10 3 2007 53 73 73 34\n11 3 2008 81 39 49 84\n12 3 2009 90 69 30 56\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself \n```\n:::\n\n:::\n\n### `separate()` and `unite()`\n\nThe `tidyr` package also contains two useful functions:\n\n- `unite()`: combine contents of two or more columns into a single column\n- `separate()`: separate contents of a column into two or more columns\n\nFirst, we combine the first three columns into one new column using `unite()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>% \n unite(col=\"country_continent_year\", \n country:year, \n sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 4\n country_continent_year lifeExp pop gdpPercap\n \n 1 Afghanistan_Asia_1952 28.8 8425333 779.\n 2 Afghanistan_Asia_1957 30.3 9240934 821.\n 3 Afghanistan_Asia_1962 32.0 10267083 853.\n 4 Afghanistan_Asia_1967 34.0 11537966 836.\n 5 Afghanistan_Asia_1972 36.1 13079460 740.\n 6 Afghanistan_Asia_1977 38.4 14880372 786.\n 7 
Afghanistan_Asia_1982 39.9 12881816 978.\n 8 Afghanistan_Asia_1987 40.8 13867957 852.\n 9 Afghanistan_Asia_1992 41.7 16317921 649.\n10 Afghanistan_Asia_1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nNext, we show how to split that single column back into three columns with `separate()`, using the `col`, `into`, and `sep` arguments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>% \n unite(col=\"country_continent_year\", \n country:year, \n sep=\"_\") %>% \n separate(col=\"country_continent_year\", \n into=c(\"country\", \"continent\", \"year\"), \n sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Using prose, describe how the variables and observations are organised in a tidy dataset versus a non-tidy dataset.\n\n2. What do the `extra` and `fill` arguments do in `separate()`? Experiment with the various options for the following two toy datasets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(x = c(\"a,b,c\", \"d,e,f,g\", \"h,i,j\")) %>% \n separate(x, c(\"one\", \"two\", \"three\"))\n\ntibble(x = c(\"a,b,c\", \"d,e\", \"f,g,i\")) %>% \n separate(x, c(\"one\", \"two\", \"three\"))\n```\n:::\n\n\n3. Both `unite()` and `separate()` have a `remove` argument. What does it do? Why would you set it to `FALSE`?\n\n4. 
Compare and contrast `separate()` and `extract()`. Why are there three variations of separation (by position, by separator, and with groups), but only one `unite()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n- https://r4ds.had.co.nz/tidy-data.html\n- [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n gapminder * 1.0.0 2023-03-10 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 
2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"09 - Tidy data and the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to tidy data and how to convert between wide and long data with the tidyr R package\"\ncategories: [module 2, week 2, R, programming, tidyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/09-tidy-data-and-the-tidyverse/index.qmd).*\n\n\n\n> \"Happy families are all alike; every unhappy family is unhappy in its own way.\" ---- Leo Tolstoy\n\n> \"Tidy datasets are all alike, but every messy dataset is messy in its own way.\" ---- Hadley Wickham\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n2. \n3. 
[tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Define tidy data\n- Be able to transform non-tidy data into tidy data\n- Be able to transform wide data into long data\n- Be able to separate character columns into multiple columns\n- Be able to unite multiple character columns into one column\n:::\n\n# Tidy data\n\nAs we learned in the last lesson, one unifying concept of the tidyverse is the notion of **tidy data**.\n\nAs defined by Hadley Wickham in his 2014 paper published in the *Journal of Statistical Software*, a [tidy dataset](https://www.jstatsoft.org/article/view/v059i10) has the following properties:\n\n1. Each variable forms a column.\n\n2. Each observation forms a row.\n\n3. Each type of observational unit forms a table.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe **purpose of defining tidy data** is to highlight the fact that **most data do not start out life as tidy**.\n\nIn fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).\n\n- Once a dataset is tidy, it **can be used as input into a variety of other functions** that may transform, model, or visualize the data.\n\n::: callout-tip\n### Example\n\nAs a quick example, consider the following data illustrating **religion and income survey data** with the number of respondents with income range in column name.\n\nThis is in a classic table format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\nrelig_income\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 18 × 11\n religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`\n \n 1 Agnostic 27 34 60 81 76 137 122\n 2 Atheist 12 27 37 52 35 70 73\n 3 Buddhist 27 21 30 34 33 58 62\n 4 Catholic 418 617 732 670 638 1116 949\n 5 Don’t k… 15 14 15 11 10 35 21\n 6 Evangel… 575 869 1064 982 881 1486 949\n 7 Hindu 1 9 7 9 11 34 47\n 8 Histori… 228 244 236 238 197 223 131\n 9 Jehovah… 20 27 24 24 21 30 15\n10 Jewish 19 19 25 25 30 95 69\n11 Mainlin… 289 495 619 655 651 1107 939\n12 Mormon 29 40 48 51 56 112 85\n13 Muslim 6 7 9 10 9 23 16\n14 Orthodox 13 17 23 32 32 47 38\n15 Other C… 9 7 11 13 13 14 18\n16 Other F… 20 33 40 46 49 63 46\n17 Other W… 5 2 3 4 2 7 3\n18 Unaffil… 217 299 374 365 341 528 407\n# ℹ 3 more variables: `$100-150k` , `>150k` ,\n# `Don't know/refused` \n```\n:::\n:::\n\n:::\n\nWhile this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy.\n\n**This format violates the tidy form** because there are variables in the columns.\n\n- In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.\n\nConverting this data to tidy format would give us\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n\nSome of these functions you have seen before, others might be new to you. 
Let's talk about each one in the context of the `tidyverse` R packages.\n\n# The \"Tidyverse\"\n\nThere are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and **the collection of packages is often referred to as the \"tidyverse\"** because of their **dependence on and presumption of tidy data**.\n\n::: callout-tip\n### Note\n\nA subset of the \"Tidyverse\" packages includes:\n\n- [ggplot2](https://cran.r-project.org/package=ggplot2): a plotting system based on the grammar of graphics\n\n- [magrittr](https://cran.r-project.org/package=magrittr): defines the `%>%` operator for chaining functions together in a series of operations on data\n\n- [dplyr](https://cran.r-project.org/package=dplyr): a suite of (fast) functions for working with data frames\n\n- [tidyr](https://cran.r-project.org/package=tidyr): easily tidy data with `pivot_wider()` and `pivot_longer()` functions (also `separate()` and `unite()`)\n\nA complete list can be found here ().\n:::\n\nWe will be using these packages quite a bit.\n\nThe \"tidyverse\" package can be used to install all of the packages in the tidyverse at once.\n\nFor example, instead of starting an R script with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(readr)\nlibrary(ggplot2)\n```\n:::\n\n\nYou can start with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nIn the example above, let's talk about what we did using the `pivot_longer()` function.\n\nWe will also talk about `pivot_wider()`.\n\n### `pivot_longer()`\n\nThe `tidyr` package includes functions to convert a data frame between *long* and *wide* formats.\n\n- **Wide format** data tends to have different attributes or variables describing an observation placed in separate columns.\n- **Long format** data tends to have different attributes encoded as levels of a single 
variable, followed by another column that contains the values of the observation at those different levels.\n\n::: callout-tip\n### Example\n\nIn the section above, we showed an example that used `pivot_longer()` to convert data into a tidy format.\n\nThe **key problem** with the tidiness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.\n\nTo **fix this**, you can use the `pivot_longer()` function to **gather values spread across several columns into a single column**, here with the column names gathered into an `income` column.\n\n**Note**: when gathering, exclude any columns that you do not want \"gathered\" (`religion` in this case) by including the column names with a minus sign in the `pivot_longer()` function.\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Gather everything EXCEPT religion to tidy data\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n:::\n\nEven if your data is in a tidy format, `pivot_longer()` is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.\n\n### `pivot_wider()`\n\nThe `pivot_wider()` function is less commonly needed to tidy data. 
It can, however, be useful for creating summary tables.\n\n::: callout-tip\n### Example\n\nYou use the `summarize()` function in `dplyr` to summarize the total number of respondents per income category.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income)) %>%\n group_by(income) %>%\n summarize(total_respondents = sum(respondents)) %>%\n pivot_wider(\n names_from = \"income\",\n values_from = \"total_respondents\"\n ) %>%\n knitr::kable()\n```\n\n::: {.cell-output-display}\n| <$10k| >150k| $10-20k| $100-150k| $20-30k| $30-40k| $40-50k| $50-75k| $75-100k| Don't know/refused|\n|-----:|-----:|-------:|---------:|-------:|-------:|-------:|-------:|--------:|------------------:|\n| 1930| 2608| 2781| 3197| 3357| 3302| 3085| 5185| 3990| 6121|\n:::\n:::\n\n:::\n\nNotice in this example how `pivot_wider()` has been used at the **very end of the code sequence** to convert the summarized data into a shape that **offers a better tabular presentation for a report**.\n\n::: callout-tip\n### Note\n\nIn the `pivot_wider()` call, you first specify the name of the column to use for the new column names (`income` in this example) and then specify the column to use for the cell values (`total_respondents` here).\n:::\n\n::: callout-tip\n### Example of `pivot_longer()`\n\nLet's try another dataset. 
This dataset contains an excerpt of the [Gapminder data](https://cran.r-project.org/web/packages/gapminder/README.html#gapminder) on life expectancy, GDP per capita, and population by country.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nIf we wanted to make `lifeExp`, `pop` and `gdpPercap` (all measurements that we observe) go from a wide table into a long table, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nOne more! 
Try using `pivot_longer()` to convert the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.\n\nAfterward, use `group_by()` and `summarize()` to calculate the average revenue for each company across all years and all quarters.\n\n**Bonus**: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n \"company\" = rep(1:3, each = 4),\n \"year\" = rep(2006:2009, 3),\n \"Q1\" = sample(x = 0:100, size = 12),\n \"Q2\" = sample(x = 0:100, size = 12),\n \"Q3\" = sample(x = 0:100, size = 12),\n \"Q4\" = sample(x = 0:100, size = 12),\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 6\n company year Q1 Q2 Q3 Q4\n \n 1 1 2006 99 6 54 47\n 2 1 2007 28 79 90 9\n 3 1 2008 7 72 69 24\n 4 1 2009 16 56 6 100\n 5 2 2006 42 58 75 25\n 6 2 2007 64 1 100 6\n 7 2 2008 43 88 37 77\n 8 2 2009 95 74 17 44\n 9 3 2006 34 47 77 38\n10 3 2007 73 31 31 54\n11 3 2008 4 49 93 0\n12 3 2009 57 4 45 96\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### `separate()` and `unite()`\n\nThe same `tidyr` package also contains two useful functions:\n\n- `unite()`: combine contents of two or more columns into a single column\n- `separate()`: separate contents of a column into two or more columns\n\nFirst, we combine the first three columns into one new column using `unite()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>%\n unite(\n col = \"country_continent_year\",\n country:year,\n sep = \"_\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 4\n country_continent_year lifeExp pop gdpPercap\n \n 1 Afghanistan_Asia_1952 28.8 8425333 779.\n 2 Afghanistan_Asia_1957 30.3 9240934 821.\n 3 Afghanistan_Asia_1962 32.0 10267083 853.\n 4 Afghanistan_Asia_1967 34.0 11537966 836.\n 5 Afghanistan_Asia_1972 36.1 13079460 740.\n 6 Afghanistan_Asia_1977 38.4 14880372 786.\n 7 
Afghanistan_Asia_1982 39.9 12881816 978.\n 8 Afghanistan_Asia_1987 40.8 13867957 852.\n 9 Afghanistan_Asia_1992 41.7 16317921 649.\n10 Afghanistan_Asia_1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nNext, we show how to split that single column back into three separate columns with `separate()`, using the `col`, `into`, and `sep` arguments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>%\n unite(\n col = \"country_continent_year\",\n country:year,\n sep = \"_\"\n ) %>%\n separate(\n col = \"country_continent_year\",\n into = c(\"country\", \"continent\", \"year\"),\n sep = \"_\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Using prose, describe how the variables and observations are organised in a tidy dataset versus a non-tidy dataset.\n\n2. What do the `extra` and `fill` arguments do in `separate()`? Experiment with the various options for the following two toy datasets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(x = c(\"a,b,c\", \"d,e,f,g\", \"h,i,j\")) %>%\n separate(x, c(\"one\", \"two\", \"three\"))\n\ntibble(x = c(\"a,b,c\", \"d,e\", \"f,g,i\")) %>%\n separate(x, c(\"one\", \"two\", \"three\"))\n```\n:::\n\n\n3. Both `unite()` and `separate()` have a `remove` argument. What does it do? 
Why would you set it to FALSE?\n\n4. Compare and contrast `separate()` and `extract()`. Why are there three variations of separation (by position, by separator, and with groups), but only one `unite()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n- https://r4ds.had.co.nz/tidy-data.html\n- [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n gapminder * 1.0.0 2023-03-10 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] 
CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json b/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json index f61dc52..e81da47 100644 --- a/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json +++ 
b/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "9484ca31dfe45edc4a74b72df287baad", + "hash": "c065e9bff64a00dbd0bc879eea567da8", "result": { - "markdown": "---\ntitle: \"10 - Joining data in R\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to relational data and join functions in the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/10-joining-data-in-r/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define relational data and keys\n- Be able to define the three types of join functions for relational data\n- Be able to implement mutational join functions\n:::\n\n# Relational data\n\nData analyses rarely involve only a single table of data.\n\nTypically you have many tables of data, and you **must combine the datasets** to answer the questions that you are interested in.\n\nCollectively, **multiple tables of data are called relational data** because it is the *relations*, not just the individual datasets, that are important.\n\nRelations are **always defined between a pair of tables**. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.\n\nSometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.\n\nTo work with relational data you **need verbs that work with pairs of tables**.\n\n::: callout-tip\n### Three important families of verbs\n\nThere are three families of verbs designed to work with relational data:\n\n- [**Mutating joins**](https://r4ds.had.co.nz/relational-data.html#mutating-joins): A mutating join allows you to **combine variables from two tables**. It first matches observations by their keys, then copies across variables from one table to the other on the right side of the table (similar to `mutate()`). We will discuss a few of these below.\n - See @sec-mutjoins for Table of mutating joins.\n- [**Filtering joins**](https://r4ds.had.co.nz/relational-data.html#filtering-joins): Filtering joins **match observations** in the same way as mutating joins, **but affect the observations, not the variables** (i.e. 
filter observations from one data frame based on whether or not they match an observation in the other).\n - Two types: `semi_join(x, y)` and `anti_join(x, y)`.\n- [**Set operations**](https://r4ds.had.co.nz/relational-data.html#set-operations): Treat **observations as if they were set elements**. Typically used less frequently, but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:\n - Examples of set operations: `intersect(x, y)`, `union(x, y)`, and `setdiff(x, y)`.\n:::\n\n## Keys\n\nThe **variables used to connect each pair of tables** are called **keys**. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation.\n\n::: callout-tip\n### Note\n\nThere are two types of keys:\n\n- A **primary key** uniquely identifies an observation in its own table.\n- A **foreign key** uniquely identifies an observation in another table.\n:::\n\nLet's consider an example to help us understand the difference between a **primary key** and **foreign key**.\n\n## Example of keys\n\nImagine you are conducting a study and **collecting data on subjects and a health outcome**.\n\nOften, subjects will **make multiple visits** (a so-called longitudinal study) and so we will record the outcome for each visit. 
Similarly, we may record other information about them, such as the kind of housing they live in.\n\n### The first table\n\nThis code creates a simple table with some made-up data about some hypothetical subjects' outcomes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\noutcomes <- tibble(\n id = rep(c(\"a\", \"b\", \"c\"), each = 3),\n visit = rep(0:2, 3),\n outcome = rnorm(3 * 3, 3)\n)\n\nprint(outcomes)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 5.04\n2 a 1 2.18\n3 a 2 4.33\n4 b 0 3.33\n5 b 1 2.32\n6 b 2 2.06\n7 c 0 1.98\n8 c 1 1.28\n9 c 2 2.29\n```\n:::\n:::\n\n\nNote that subjects are labeled by a unique identifier in the `id` column.\n\n### A second table\n\nHere is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects' housing situation by recording the type of house they live in.\n\n\n::: {.cell exercise='true'}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\")\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\n::: callout-note\n### Question\n\nWhat are the **primary key** and **foreign key**?\n\n- The `outcomes$id` is a **primary key** because it uniquely identifies each subject in the `outcomes` table.\n- The `subjects$id` is a **foreign key** because it appears in the `subjects` table where it matches each subject to a unique `id`.\n:::\n\n# Mutating joins {#sec-mutjoins}\n\nThe `dplyr` package provides a set of **functions for joining two data frames** into a single data frame based on a set of key columns.\n\nThere are several functions in the `*_join()` family.\n\n- These functions all merge together two data frames\n- They differ in how they handle observations that exist in one but not 
both data frames.\n\nHere, are the **four functions from this family** that you will likely use the most often:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n|Function |What it includes in merged data frame |\n|:--------------|:---------------------------------------------------------------------------------------------------------|\n|`left_join()` |Includes all observations in the left data frame, whether or not there is a match in the right data frame |\n|`right_join()` |Includes all observations in the right data frame, whether or not there is a match in the left data frame |\n|`inner_join()` |Includes only observations that are in both data frames |\n|`full_join()` |Includes all observations from both data frames |\n:::\n:::\n\n\n![](https://d33wubrfki0l68.cloudfront.net/aeab386461820b029b7e7606ccff1286f623bae1/ef0d4/diagrams/join-venn.png)\n\n\\[[Source from R for Data Science](https://r4ds.had.co.nz/relational-data#relational-data)\\]\n\n## Left Join\n\nRecall the `outcomes` and `subjects` datasets above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\noutcomes\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 5.04\n2 a 1 2.18\n3 a 2 4.33\n4 b 0 3.33\n5 b 1 2.32\n6 b 2 2.06\n7 c 0 1.98\n8 c 1 1.28\n9 c 2 2.29\n```\n:::\n\n```{.r .cell-code}\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\nSuppose we want to create a table that combines the information about houses (`subjects`) with the information about the outcomes (`outcomes`).\n\nWe can use the `left_join()` function to merge the `outcomes` and `subjects` tables and produce the output above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = \"id\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 detached\n2 a 1 2.18 detached\n3 a 2 4.33 
detached\n4 b 0 3.33 rowhouse\n5 b 1 2.32 rowhouse\n6 b 2 2.06 rowhouse\n7 c 0 1.98 rowhouse\n8 c 1 1.28 rowhouse\n9 c 2 2.29 rowhouse\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `by` argument indicates the column (or columns) that the two tables have in common.\n:::\n\n### Left Join with Incomplete Data\n\nIn the previous examples, the `subjects` table didn't have a `visit` column. But what if it did? Maybe people move around during the study. We could imagine a table like this one.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n visit = c(0, 1, 0),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\"),\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n id visit house \n \n1 a 0 detached\n2 b 1 rowhouse\n3 c 0 rowhouse\n```\n:::\n:::\n\n\nWhen we left join the tables now, we get:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(outcomes, subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 detached\n2 a 1 2.18 \n3 a 2 4.33 \n4 b 0 3.33 \n5 b 1 2.32 rowhouse\n6 b 2 2.06 \n7 c 0 1.98 rowhouse\n8 c 1 1.28 \n9 c 2 2.29 \n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTwo things to point out here:\n\n1. If we do not have information about a subject's housing in a given visit, the `left_join()` function automatically inserts an `NA` value to indicate that it is missing.\n\n2. We can \"join\" on multiple variables (e.g. here we joined on the `id` and the `visit` columns).\n:::\n\nWe may even have a situation where we are missing housing data for a subject completely. 
The following table has no information about subject `a`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"b\", \"c\"),\n visit = c(1, 0),\n house = c(\"rowhouse\", \"rowhouse\"),\n)\n\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 3\n id visit house \n \n1 b 1 rowhouse\n2 c 0 rowhouse\n```\n:::\n:::\n\n\nBut we can still join the tables together and the `house` values for subject `a` will all be `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 \n2 a 1 2.18 \n3 a 2 4.33 \n4 b 0 3.33 \n5 b 1 2.32 rowhouse\n6 b 2 2.06 \n7 c 0 1.98 rowhouse\n8 c 1 1.28 \n9 c 2 2.29 \n```\n:::\n:::\n\n\n::: callout-tip\n### Important\n\nThe bottom line for `left_join()` is that it **always retains the values in the \"left\" argument** (in this case the `outcomes` table).\n\n- If there are no corresponding values in the \"right\" argument, `NA` values will be filled in.\n:::\n\n## Inner Join\n\nThe `inner_join()` function only **retains the rows of both tables** that have corresponding values. 
Here we can see the difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninner_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.32 rowhouse\n2 c 0 1.98 rowhouse\n```\n:::\n:::\n\n\n## Right Join\n\nThe `right_join()` function is like the `left_join()` function except that it **gives priority to the \"right\" hand argument**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nright_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.32 rowhouse\n2 c 0 1.98 rowhouse\n```\n:::\n:::\n\n\n# Summary\n\n- `left_join()` is useful for merging a \"large\" data frame with a \"smaller\" one while retaining all the rows of the \"large\" data frame\n\n- `inner_join()` gives you the intersection of the rows between two data frames\n\n- `right_join()` is like `left_join()` with the arguments reversed (likely only useful at the end of a pipeline)\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. If you had three data frames to combine with a shared key, how would you join them using the verbs you now know?\n\n2. Using `df1` and `df2` below, what is the difference between `inner_join(df1, df2)`, `semi_join(df1, df2)` and `anti_join(df1, df2)`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create first example data frame\ndf1 <- data.frame(ID = 1:3,\n X1 = c(\"a1\", \"a2\", \"a3\"))\n# Create second example data frame\ndf2 <- data.frame(ID = 2:4, \n X2 = c(\"b1\", \"b2\", \"b3\"))\n```\n:::\n\n\n3. Try changing the order from the above e.g. `inner_join(df2, df1)`, `semi_join(df2, df1)` and `anti_join(df2, df1)`. What changed? 
What did not change?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr * 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 
2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"10 - Joining data in R\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to relational data and join functions in the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/10-joining-data-in-r/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define relational data and keys\n- Be able to define the three types of join functions for relational data\n- Be able to implement mutating join functions\n:::\n\n# Relational data\n\nData analyses rarely involve only a single table of data.\n\nTypically you have many tables of data, and you **must combine the datasets** to answer the questions that you are interested in.\n\nCollectively, **multiple tables of data are called relational data** because it is the *relations*, not just the individual datasets, that are important.\n\nRelations are **always defined between a pair of tables**. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.\n\nSometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.\n\nTo work with relational data you **need verbs that work with pairs of tables**.\n\n::: callout-tip\n### Three important families of verbs\n\nThere are three families of verbs designed to work with relational data:\n\n- [**Mutating joins**](https://r4ds.had.co.nz/relational-data.html#mutating-joins): A mutating join allows you to **combine variables from two tables**. 
It first matches observations by their keys, then copies variables from one table into the other, adding them as new columns on the right (similar to `mutate()`). We will discuss a few of these below.\n - See @sec-mutjoins for a table of mutating joins.\n- [**Filtering joins**](https://r4ds.had.co.nz/relational-data.html#filtering-joins): Filtering joins **match observations** in the same way as mutating joins, **but affect the observations, not the variables** (i.e. filter observations from one data frame based on whether or not they match an observation in the other).\n - Two types: `semi_join(x, y)` and `anti_join(x, y)`.\n- [**Set operations**](https://r4ds.had.co.nz/relational-data.html#set-operations): Treat **observations as if they were set elements**. Used less frequently, but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:\n - Examples of set operations: `intersect(x, y)`, `union(x, y)`, and `setdiff(x, y)`.\n:::\n\n## Keys\n\nThe **variables used to connect each pair of tables** are called **keys**. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation.\n\n::: callout-tip\n### Note\n\nThere are two types of keys:\n\n- A **primary key** uniquely identifies an observation in its own table.\n- A **foreign key** uniquely identifies an observation in another table.\n:::\n\nLet's consider an example to help us understand the difference between a **primary key** and a **foreign key**.\n\n## Example of keys\n\nImagine you are conducting a study and **collecting data on subjects and a health outcome**.\n\nOften, subjects will **make multiple visits** (a so-called longitudinal study) and so we will record the outcome for each visit. 
Similarly, we may record other information about them, such as the kind of housing they live in.\n\n### The first table\n\nThis code creates a simple table with some made-up data about some hypothetical subjects' outcomes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\noutcomes <- tibble(\n id = rep(c(\"a\", \"b\", \"c\"), each = 3),\n visit = rep(0:2, 3),\n outcome = rnorm(3 * 3, 3)\n)\n\nprint(outcomes)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 3.07\n2 a 1 3.25\n3 a 2 3.93\n4 b 0 2.18\n5 b 1 2.91\n6 b 2 2.83\n7 c 0 1.49\n8 c 1 2.56\n9 c 2 1.46\n```\n:::\n:::\n\n\nNote that subjects are labeled by a unique identifier in the `id` column.\n\n### A second table\n\nHere is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects' housing situation by recording the type of house they live in.\n\n\n::: {.cell exercise='true'}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\")\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\n::: callout-note\n### Question\n\nWhat is the **primary key** and **foreign key**?\n\n- The `subjects$id` is a **primary key** because it uniquely identifies each subject in the `subjects` table.\n- The `outcomes$id` is a **foreign key** because it appears in the `outcomes` table, where it matches each observation to a unique subject in `subjects`.\n:::\n\n# Mutating joins {#sec-mutjoins}\n\nThe `dplyr` package provides a set of **functions for joining two data frames** into a single data frame based on a set of key columns.\n\nThere are several functions in the `*_join()` family.\n\n- These functions all merge together two data frames\n- They differ in how they handle observations that exist in one but not 
both data frames.\n\nHere are the **four functions from this family** that you will likely use most often:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n|Function |What it includes in merged data frame |\n|:--------------|:---------------------------------------------------------------------------------------------------------|\n|`left_join()` |Includes all observations in the left data frame, whether or not there is a match in the right data frame |\n|`right_join()` |Includes all observations in the right data frame, whether or not there is a match in the left data frame |\n|`inner_join()` |Includes only observations that are in both data frames |\n|`full_join()` |Includes all observations from both data frames |\n:::\n:::\n\n\n![](https://d33wubrfki0l68.cloudfront.net/aeab386461820b029b7e7606ccff1286f623bae1/ef0d4/diagrams/join-venn.png)\n\n\\[[Source from R for Data Science](https://r4ds.had.co.nz/relational-data#relational-data)\\]\n\n## Left Join\n\nRecall the `outcomes` and `subjects` datasets above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\noutcomes\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 3.07\n2 a 1 3.25\n3 a 2 3.93\n4 b 0 2.18\n5 b 1 2.91\n6 b 2 2.83\n7 c 0 1.49\n8 c 1 2.56\n9 c 2 1.46\n```\n:::\n\n```{.r .cell-code}\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\nSuppose we want to create a table that combines the information about houses (`subjects`) with the information about the outcomes (`outcomes`).\n\nWe can use the `left_join()` function to merge the `outcomes` and `subjects` tables and produce the table shown below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = \"id\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 detached\n2 a 1 3.25 detached\n3 a 2 3.93 
detached\n4 b 0 2.18 rowhouse\n5 b 1 2.91 rowhouse\n6 b 2 2.83 rowhouse\n7 c 0 1.49 rowhouse\n8 c 1 2.56 rowhouse\n9 c 2 1.46 rowhouse\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `by` argument indicates the column (or columns) that the two tables have in common.\n:::\n\n### Left Join with Incomplete Data\n\nIn the previous examples, the `subjects` table didn't have a `visit` column. But what if it did? Maybe people move around during the study. We could imagine a table like this one.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n visit = c(0, 1, 0),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\"),\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n id visit house \n \n1 a 0 detached\n2 b 1 rowhouse\n3 c 0 rowhouse\n```\n:::\n:::\n\n\nWhen we left join the tables now, we get:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(outcomes, subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 detached\n2 a 1 3.25 \n3 a 2 3.93 \n4 b 0 2.18 \n5 b 1 2.91 rowhouse\n6 b 2 2.83 \n7 c 0 1.49 rowhouse\n8 c 1 2.56 \n9 c 2 1.46 \n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTwo things to point out here:\n\n1. If we do not have information about a subject's housing in a given visit, the `left_join()` function automatically inserts an `NA` value to indicate that it is missing.\n\n2. We can \"join\" on multiple variables (e.g. here we joined on the `id` and the `visit` columns).\n:::\n\nWe may even have a situation where we are missing housing data for a subject completely. 
The following table has no information about subject `a`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"b\", \"c\"),\n visit = c(1, 0),\n house = c(\"rowhouse\", \"rowhouse\"),\n)\n\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 3\n id visit house \n \n1 b 1 rowhouse\n2 c 0 rowhouse\n```\n:::\n:::\n\n\nBut we can still join the tables together and the `house` values for subject `a` will all be `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 \n2 a 1 3.25 \n3 a 2 3.93 \n4 b 0 2.18 \n5 b 1 2.91 rowhouse\n6 b 2 2.83 \n7 c 0 1.49 rowhouse\n8 c 1 2.56 \n9 c 2 1.46 \n```\n:::\n:::\n\n\n::: callout-tip\n### Important\n\nThe bottom line for `left_join()` is that it **always retains the values in the \"left\" argument** (in this case the `outcomes` table).\n\n- If there are no corresponding values in the \"right\" argument, `NA` values will be filled in.\n:::\n\n## Inner Join\n\nThe `inner_join()` function only **retains the rows of both tables** that have corresponding values. 
Here we can see the difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninner_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.91 rowhouse\n2 c 0 1.49 rowhouse\n```\n:::\n:::\n\n\n## Right Join\n\nThe `right_join()` function is like the `left_join()` function except that it **gives priority to the \"right\" hand argument**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nright_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.91 rowhouse\n2 c 0 1.49 rowhouse\n```\n:::\n:::\n\n\n# Summary\n\n- `left_join()` is useful for merging a \"large\" data frame with a \"smaller\" one while retaining all the rows of the \"large\" data frame\n\n- `inner_join()` gives you the intersection of the rows between two data frames\n\n- `right_join()` is like `left_join()` with the arguments reversed (likely only useful at the end of a pipeline)\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. If you had three data frames to combine with a shared key, how would you join them using the verbs you now know?\n\n2. Using `df1` and `df2` below, what is the difference between `inner_join(df1, df2)`, `semi_join(df1, df2)` and `anti_join(df1, df2)`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create first example data frame\ndf1 <- data.frame(\n ID = 1:3,\n X1 = c(\"a1\", \"a2\", \"a3\")\n)\n# Create second example data frame\ndf2 <- data.frame(\n ID = 2:4,\n X2 = c(\"b1\", \"b2\", \"b3\")\n)\n```\n:::\n\n\n3. Try changing the order from the above e.g. `inner_join(df2, df1)`, `semi_join(df2, df1)` and `anti_join(df2, df1)`. What changed? 
What did not change?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr * 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 
2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/11-plotting-systems/index/execute-results/html.json b/_freeze/posts/11-plotting-systems/index/execute-results/html.json index e10a6cd..d091bad 100644 --- a/_freeze/posts/11-plotting-systems/index/execute-results/html.json +++ b/_freeze/posts/11-plotting-systems/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "33dd70dca1357e6e4f0495a8ce6313a1", + "hash": "6c6916c9091b2c6d74a4ff7c2b8b7db1", "result": { - "markdown": "---\ntitle: \"11 - Plotting Systems\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview of three plotting systems in R\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/11-plotting-systems/index.qmd).*\n\n> The data may not contain the answer. And, if you torture the data long enough, it will tell you anything. ---*John W. Tukey*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. Paul Murrell (2011). *R Graphics*, CRC Press.\n3. Hadley Wickham (2009). *ggplot2*, Springer.\n4. Deepayan Sarkar (2008). *Lattice: Multivariate Data Visualization with R*, Springer.\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to identify and describe the three plotting systems in R\n:::\n\n# Plotting Systems\n\nThere are **three different plotting systems in R** and they each have different characteristics and modes of operation.\n\n::: callout-tip\n### Important\n\nThe three systems are\n\n1. The base plotting system\n2. The lattice system\n3. The ggplot2 system\n\n**This course will focus primarily on the ggplot2 plotting system**. The other two systems are presented for context.\n:::\n\n## The Base Plotting System\n\nThe **base plotting system** is the original plotting system for R. 
The basic model is sometimes **referred to as the \"artist's palette\" model**.\n\nThe idea is that you start with a blank canvas and build up from there.\n\nIn more R-specific terms, you **typically start with the `plot()` function** (or a similar plot-creating function) to *initiate* a plot and then *annotate* the plot with various annotation functions (`text()`, `lines()`, `points()`, `axis()`).\n\nThe base plotting system is **often the most convenient plotting system** to use because it mirrors how we sometimes think of building plots and analyzing data.\n\nIf we do not have a completely well-formed idea of how we want to look at some data, often we will start by \"throwing some data on the page\" and then slowly add more information to it as our thought process evolves.\n\n::: callout-tip\n### Example\n\nWe might look at a simple scatterplot and then decide to add a linear regression line or a smoother to it to highlight the trends.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n:::\n\nIn the code above:\n\n- The `plot()` function creates the initial plot and draws the points (circles) on the canvas.\n- The `lines()` function is used to annotate or add to the plot (in this case it adds a loess smoother to the scatterplot).\n\nNext, we use the `plot()` function to draw the points on the scatterplot and then use the `main` argument to add a main title to the plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone, main = \"my plot\")\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-2-1.png){width=480}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nOne downside with constructing base plots is that you **cannot go backwards 
once the plot has started**.\n\nIt is possible that you could start down the road of constructing a plot and realize later (when it is too late) that you do not have enough room to add a y-axis label or something like that.\n:::\n\nIf you have a specific plot in mind, there is then a need to **plan in advance** to make sure, for example, that you have set your margins to be the right size to fit all of the annotations that you may want to include.\n\nWhile the base plotting system is nice in that it gives you the flexibility to specify these kinds of details to painstaking accuracy, **sometimes it would be nice if the system could just figure it out for you**.\n\n::: callout-tip\n### Note\n\nAnother downside of the base plotting system is that it is **difficult to describe or translate a plot to others because there is no clear graphical language or grammar** that can be used to communicate what you have done.\n\nThe only real way to describe what you have done in a base plot is to just list the series of commands/functions that you have executed, which is not a particularly compact way of communicating things.\n\nThis is one problem that the `ggplot2` package attempts to address.\n:::\n\n::: callout-tip\n### Example\n\nAnother typical base plot is constructed with the following code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(cars)\n\n## Create the plot / draw canvas\nwith(cars, plot(speed, dist))\n\n## Add annotation\ntitle(\"Speed vs. 
Stopping distance\")\n```\n\n::: {.cell-output-display}\n![Base plot with title](index_files/figure-html/unnamed-chunk-3-1.png){width=480}\n:::\n:::\n\n:::\n\nWe will go into more detail on what these functions do in later lessons.\n\n## The Lattice System\n\nThe **lattice plotting system** is implemented in the `lattice` R package, which comes with every installation of R (although it is not loaded by default).\n\nTo **use the lattice plotting functions**, you must first load the `lattice` package with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lattice)\n```\n:::\n\n\nWith the lattice system, **plots are created with a single function call**, such as `xyplot()` or `bwplot()`.\n\nThere is **no real distinction between functions that create or initiate plots** and **functions that annotate plots** because it all happens at once.\n\nLattice plots tend to be **most useful for conditioning types of plots**, i.e. looking at how `y` changes with `x` across levels of `z`.\n\n- e.g. 
these types of plots are useful for looking at multi-dimensional data and often allow you to squeeze a lot of information into a single window or page.\n\nAnother aspect of lattice that makes it different from base plotting is that **things like margins and spacing are set automatically**.\n\nThis is possible because the entire plot is specified at once via a single function call, so all of the available information needed to figure out the spacing and margins is already there.\n\n::: callout-tip\n### Example\n\nHere is a lattice plot that looks at the relationship between life expectancy and income and how that relationship varies by region in the United States.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstate <- data.frame(state.x77, region = state.region)\nxyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))\n```\n\n::: {.cell-output-display}\n![Lattice plot](index_files/figure-html/unnamed-chunk-5-1.png){width=768}\n:::\n:::\n\n:::\n\nYou can see that the entire plot was generated by the call to `xyplot()` and all of the data for the plot were stored in the `state` data frame.\n\nThe **plot itself contains four panels**---one for each region---and **within each panel is a scatterplot** of life expectancy and income.\n\nThe notion of *panels* comes up a lot with lattice plots because you typically have many panels in a lattice plot (each panel typically represents a *condition*, like \"region\").\n\n::: callout-tip\n### Note\n\nDownsides with the lattice system:\n\n- It can sometimes be very **awkward to specify an entire plot** in a single function call (you end up with functions with many, many arguments).\n- **Annotation of panels in plots is not especially intuitive** and can be difficult to explain. 
In particular, the use of custom panel functions and subscripts can be difficult to wield and requires intense preparation.\n- Once a plot is created, **you cannot \"add\" to the plot** (but of course you can just make it again with modifications).\n:::\n\n## The ggplot2 System\n\nThe **ggplot2 plotting system** attempts to split the difference between base and lattice in a number of ways.\n\n::: callout-tip\n### Note\n\nTaking cues from lattice, the ggplot2 system automatically deals with spacings, text, titles but also allows you to annotate by \"adding\" to a plot.\n:::\n\nThe ggplot2 system is implemented in the `ggplot2` package (part of the `tidyverse` package), which is available from CRAN (it does not come with R).\n\nYou can install it from CRAN via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand then load it into R via the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n```\n:::\n\n\nSuperficially, the `ggplot2` functions are similar to `lattice`, but the system is generally easier and more intuitive to use.\n\nThe defaults used in `ggplot2` make many choices for you, but you can still customize plots to your heart's desire.\n\n::: callout-tip\n### Example\n\nA typical plot with the `ggplot2` package looks as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata(mpg)\nmpg %>%\n ggplot(aes(displ, hwy)) + \n geom_point()\n```\n\n::: {.cell-output-display}\n![ggplot2 plot](index_files/figure-html/unnamed-chunk-8-1.png){width=576}\n:::\n:::\n\n:::\n\nThere are additional functions in `ggplot2` that allow you to make arbitrarily sophisticated plots.\n\nWe will discuss more about this in the next lecture.\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info 
───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice * 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 
2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"11 - Plotting Systems\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview of three plotting systems in R\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/11-plotting-systems/index.qmd).*\n\n> The data may not contain the answer. And, if you torture the data long enough, it will tell you anything. ---*John W. 
Tukey*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. Paul Murrell (2011). *R Graphics*, CRC Press.\n3. Hadley Wickham (2009). *ggplot2*, Springer.\n4. Deepayan Sarkar (2008). *Lattice: Multivariate Data Visualization with R*, Springer.\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to identify and describe the three plotting systems in R\n:::\n\n# Plotting Systems\n\nThere are **three different plotting systems in R** and they each have different characteristics and modes of operation.\n\n::: callout-tip\n### Important\n\nThe three systems are\n\n1. The base plotting system\n2. The lattice system\n3. The ggplot2 system\n\n**This course will focus primarily on the ggplot2 plotting system**. The other two systems are presented for context.\n:::\n\n## The Base Plotting System\n\nThe **base plotting system** is the original plotting system for R. 
The basic model is sometimes **referred to as the \"artist's palette\" model**.\n\nThe idea is you start with a blank canvas and build up from there.\n\nIn more R-specific terms, you **typically start with the `plot()` function** (or a similar plot-creating function) to *initiate* a plot and then *annotate* the plot with various annotation functions (`text`, `lines`, `points`, `axis`).\n\nThe base plotting system is **often the most convenient plotting system** to use because it mirrors how we sometimes think of building plots and analyzing data.\n\nIf we do not have a completely well-formed idea of how we want to look at some data, often we will start by \"throwing some data on the page\" and then slowly add more information to it as our thought process evolves.\n\n::: callout-tip\n### Example\n\nWe might look at a simple scatterplot and then decide to add a linear regression line or a smoother to it to highlight the trends.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n:::\n\nIn the code above:\n\n- The `plot()` function creates the initial plot and draws the points (circles) on the canvas.\n- The `lines()` function is used to annotate or add to the plot (in this case it adds a loess smoother to the scatterplot).\n\nNext, we use the `plot()` function to draw the points on the scatterplot and then use the `main` argument to add a main title to the plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone, main = \"my plot\")\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-2-1.png){width=480}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nOne downside with constructing base plots is that you **cannot go backwards 
once the plot has started**.\n\nIt is possible that you could start down the road of constructing a plot and realize later (when it is too late) that you do not have enough room to add a y-axis label or something like that.\n:::\n\nIf you have a specific plot in mind, there is then a need to **plan in advance** to make sure, for example, that you have set your margins to be the right size to fit all of the annotations that you may want to include.\n\nWhile the base plotting system is nice in that it gives you the flexibility to specify these kinds of details to painstaking accuracy, **sometimes it would be nice if the system could just figure it out for you**.\n\n::: callout-tip\n### Note\n\nAnother downside of the base plotting system is that it is **difficult to describe or translate a plot to others because there is no clear graphical language or grammar** that can be used to communicate what you have done.\n\nThe only real way to describe what you have done in a base plot is to just list the series of commands/functions that you have executed, which is not a particularly compact way of communicating things.\n\nThis is one problem that the `ggplot2` package attempts to address.\n:::\n\n::: callout-tip\n### Example\n\nAnother typical base plot is constructed with the following code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(cars)\n\n## Create the plot / draw canvas\nwith(cars, plot(speed, dist))\n\n## Add annotation\ntitle(\"Speed vs. 
Stopping distance\")\n```\n\n::: {.cell-output-display}\n![Base plot with title](index_files/figure-html/unnamed-chunk-3-1.png){width=480}\n:::\n:::\n\n:::\n\nWe will go into more detail on what these functions do in later lessons.\n\n## The Lattice System\n\nThe **lattice plotting system** is implemented in the `lattice` R package which comes with every installation of R (although it is not loaded by default).\n\nTo **use the lattice plotting functions**, you must first load the `lattice` package with the `library` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lattice)\n```\n:::\n\n\nWith the lattice system, **plots are created with a single function call**, such as `xyplot()` or `bwplot()`.\n\nThere is **no real distinction between functions that create or initiate plots** and **functions that annotate plots** because it all happens at once.\n\nLattice plots tend to be **most useful for conditioning types of plots**, i.e. looking at how `y` changes with `x` across levels of `z`.\n\n- e.g. 
these types of plots are useful for looking at multi-dimensional data and often allow you to squeeze a lot of information into a single window or page.\n\nAnother aspect of lattice that makes it different from base plotting is that **things like margins and spacing are set automatically**.\n\nThis is possible because the entire plot is specified at once via a single function call, so all of the available information needed to figure out the spacing and margins is already there.\n\n::: callout-tip\n### Example\n\nHere is a lattice plot that looks at the relationship between life expectancy and income and how that relationship varies by region in the United States.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstate <- data.frame(state.x77, region = state.region)\nxyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))\n```\n\n::: {.cell-output-display}\n![Lattice plot](index_files/figure-html/unnamed-chunk-5-1.png){width=768}\n:::\n:::\n\n:::\n\nYou can see that the entire plot was generated by the call to `xyplot()` and all of the data for the plot were stored in the `state` data frame.\n\nThe **plot itself contains four panels**---one for each region---and **within each panel is a scatterplot** of life expectancy and income.\n\nThe notion of *panels* comes up a lot with lattice plots because you typically have many panels in a lattice plot (each panel typically represents a *condition*, like \"region\").\n\n::: callout-tip\n### Note\n\nDownsides with the lattice system\n\n- It can sometimes be very **awkward to specify an entire plot** in a single function call (you end up with functions with many, many arguments).\n- **Annotation in panels in plots is not especially intuitive** and can be difficult to explain. 
In particular, the use of custom panel functions and subscripts can be difficult to wield and requires intense preparation.\n- Once a plot is created, **you cannot \"add\" to the plot** (but of course you can just make it again with modifications).\n:::\n\n## The ggplot2 System\n\nThe **ggplot2 plotting system** attempts to split the difference between base and lattice in a number of ways.\n\n::: callout-tip\n### Note\n\nTaking cues from lattice, the ggplot2 system automatically deals with spacings, text, titles but also allows you to annotate by \"adding\" to a plot.\n:::\n\nThe ggplot2 system is implemented in the `ggplot2` package (part of the `tidyverse` package), which is available from CRAN (it does not come with R).\n\nYou can install it from CRAN via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand then load it into R via the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n```\n:::\n\n\nSuperficially, the `ggplot2` functions are similar to `lattice`, but the system is generally easier and more intuitive to use.\n\nThe defaults used in `ggplot2` make many choices for you, but you can still customize plots to your heart's desire.\n\n::: callout-tip\n### Example\n\nA typical plot with the `ggplot2` package looks as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata(mpg)\nmpg %>%\n ggplot(aes(displ, hwy)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![ggplot2 plot](index_files/figure-html/unnamed-chunk-8-1.png){width=576}\n:::\n:::\n\n:::\n\nThere are additional functions in `ggplot2` that allow you to make arbitrarily sophisticated plots.\n\nWe will discuss more about this in the next lecture.\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info 
───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice * 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 
2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json b/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json index a32e326..36e05af 100644 --- a/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json +++ b/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7a7a1e3161c0ff96ad2a3bef68a62f06", + "hash": "94c3dfcaf944f3b724e55205af55c601", "result": { - "markdown": "---\ntitle: \"12 - The ggplot2 plotting system: qplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R 
with qplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/12-ggplot2-plotting-system-part-1/index.qmd).*\n\n> \"The greatest value of a picture is when it forces us to notice what we never expected to see.\" ---John Tukey\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the difference between *aesthetics* and *geoms*\n- Become familiar with different types of plots (e.g. scatterplots, boxplots, and histograms)\n- Be able to facet plots into a grid\n:::\n\n# The ggplot2 Plotting System\n\nThe `ggplot2` package in R is **an implementation of *The Grammar of Graphics*** as described by Leland Wilkinson in his book. The **package was originally written by Hadley Wickham** while he was a graduate student at Iowa State University (he still actively maintains the package).\n\nThe package implements what might be considered a third graphics system for R (along with `base` graphics and `lattice`).\n\nThe package is available from [CRAN](http://cran.r-project.org/package=ggplot2) via `install.packages()`; the latest version of the source can be found on the package's [GitHub Repository](https://github.com/hadley/ggplot2). 
Documentation of the package can be found at [the tidyverse web site](https://ggplot2.tidyverse.org).\n\nThe **grammar of graphics** represents **an abstraction of graphics ideas and objects**.\n\nYou can think of this as **developing the verbs, nouns, and adjectives for data graphics**.\n\n::: callout-tip\n### Note\n\nDeveloping such a **grammar allows for a \"theory\" of graphics** on which to build new graphics and graphics objects.\n\nTo quote from Hadley Wickham's book on `ggplot2`, we want to \"shorten the distance from mind to page\". In summary,\n\n> \"...the grammar tells us that a statistical graphic is a **mapping** from data to **aesthetic** attributes (colour, shape, size) of **geometric** objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system\" -- from *ggplot2* book\n:::\n\nYou might ask yourself \"Why do we need a grammar of graphics?\".\n\nWell, for much the same reasons that **having a grammar is useful for spoken languages**. The grammar allows for\n\n- A more compact summary of the base components of a language\n- An extension of the language to handle situations that we have not before seen\n\nIf you think about making a plot with the base graphics system, the plot is **constructed by calling a series of functions that either create or annotate a plot**. 
There's **no convenient agreed-upon way to describe the plot**, except to just recite the series of R functions that were called to create the thing in the first place.\n\n::: callout-tip\n### Example\n\nConsider the following plot, made previously using base graphics.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(airquality, { \n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (base graphics)](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n\nHow would one **describe the creation of this plot**?\n\nWell, we could say that we called the `plot()` function and then added a loess smoother by calling the `lines()` function on the output of `loess.smooth()`.\n\nWhile the base plotting system is convenient and it often mirrors how we think of building plots and analyzing data, there are **drawbacks**:\n\n- You cannot go back once the plot has started (e.g. to adjust margins), so there is in fact a need to plan in advance.\n- It is difficult to \"translate\" a plot to others because there is no formal graphical language; each plot is just a series of R commands.\n:::\n\nHere is the same plot made using `ggplot2` in the `tidyverse`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nairquality %>%\n ggplot(aes(Temp, Ozone)) + \n geom_point() + \n geom_smooth(method = \"loess\", \n se = FALSE) + \n theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (ggplot2)](index_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe output is roughly equivalent and the amount of code is similar, but `ggplot2` allows for a more elegant way of expressing the components of the plot.\n\nIn this case, the plot is a **dataset** (`airquality`) with **aesthetic mappings** (visual properties of the objects in your plot) derived from the `Temp` and `Ozone` variables, a set of 
**points**, and a **smoother**.\n:::\n\nIn a sense, the `ggplot2` system takes many of the cues from the base plotting system and from the `lattice` plotting system, and formalizes the cues a bit.\n\nIt **automatically handles things like margins and spacing**, and also has the concept of \"themes\" which **provide a default set of plotting symbols and colors** (which are all customizable).\n\nWhile `ggplot2` bears a superficial similarity to `lattice`, `ggplot2` is generally easier and more intuitive to use.\n\n## The Basics: `qplot()`\n\nThe `qplot()` function in `ggplot2` is meant to get you going **q**uickly.\n\nIt works much like the `plot()` function in the base graphics system. It **looks for variables to plot within a data frame**, similar to lattice, or in the parent environment.\n\nIn general, it is good to get used to putting your data in a data frame and then passing it to `qplot()`.\n\n::: callout-tip\n### Pro tip\n\nThe `qplot()` function is **somewhat discouraged** in `ggplot2` now and new users are encouraged to use the more general `ggplot()` function (more details in the next lesson).\n\nHowever, the `qplot()` function is still useful and may be easier to use if transitioning from the base plotting system or a different statistical package.\n:::\n\nPlots are made up of\n\n- **aesthetics** (e.g. size, shape, color)\n- **geoms** (e.g. 
points, lines)\n\nFactors play an important role for indicating subsets of the data (if they are to have different properties) so they should be **labeled** properly.\n\nThe `qplot()` function hides much of what goes on underneath, which is okay for most operations, but `ggplot()` is the core function and is very flexible for doing things `qplot()` cannot do.\n\n## Before you start: label your data\n\nOne thing that is always true, but is particularly useful when using `ggplot2`, is that you should always **use informative and descriptive labels on your data**.\n\nMore generally, your data should have appropriate **metadata** so that you can quickly look at a dataset and know\n\n- what are the variables?\n- what do the values of each variable mean?\n\n::: callout-tip\n### Pro tip\n\n- **Each column** of a data frame should **have a meaningful (but concise) variable name** that accurately reflects the data stored in that column\n- Non-numeric or **categorical variables should be coded as factor variables** and have meaningful labels for each level of the factor.\n - It might be common to code a binary variable as a \"0\" or a \"1\", but the problem is that, from quickly looking at the data, it's impossible to know which level of that variable is represented by a \"0\" or a \"1\".\n - Much better to simply label each observation as what it is.\n - If a variable represents temperature categories, it might be better to use \"cold\", \"mild\", and \"hot\" rather than \"1\", \"2\", and \"3\".\n:::\n\nWhile it is sometimes a pain to make sure all of your data are properly labeled, this **investment in time can pay dividends down the road** when you're trying to figure out what you were plotting.\n\nIn other words, including the proper metadata can make your exploratory plots essentially self-documenting.\n\n## ggplot2 \"Hello, world!\"\n\nThis example dataset comes with the `ggplot2` package and contains data on the fuel economy of 38 popular car models from 1999 to 2008.\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse) # this loads the ggplot2 R package\n# library(ggplot2) # an alternative way to just load the ggplot2 R package\nglimpse(mpg)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 234\nColumns: 11\n$ manufacturer \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"…\n$ model \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4 quattro\", \"…\n$ displ 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…\n$ year 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…\n$ cyl 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …\n$ trans \"auto(l5)\", \"manual(m5)\", \"manual(m6)\", \"auto(av)\", \"auto…\n$ drv \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"4\", \"4\", \"4\", \"4\", \"4…\n$ cty 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…\n$ hwy 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…\n$ fl \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p…\n$ class \"compact\", \"compact\", \"compact\", \"compact\", \"compact\", \"c…\n```\n:::\n:::\n\n\nYou can see from the `glimpse()` (part of the `dplyr` package) output that all of the categorical variables (like \"manufacturer\" or \"class\") are **appropriately coded with meaningful labels**.\n\nThis will come in handy when `qplot()` has to label different aspects of a plot.\n\nAlso note that all of the **columns/variables have meaningful names** (if sometimes abbreviated), rather than names like \"X1\" and \"X2\", etc.\n\n::: callout-tip\n### Example\n\nWe can **make a quick scatterplot** using `qplot()` of the engine displacement (`displ`) and the highway miles per gallon (`hwy`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: `qplot()` was deprecated in ggplot2 3.4.0.\n```\n:::\n\n::: {.cell-output-display}\n![Plot of engine displacement and highway mileage using 
the mpg dataset](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n:::\n\nIt has a *very* similar feeling to `plot()` in base R.\n\n::: callout-tip\n### Note\n\nIn the call to `qplot()` you **must specify the `data` argument** so that `qplot()` knows where to look up the variables.\n\nYou must also specify `x` and `y`, but hopefully that part is obvious.\n:::\n\n## Modifying aesthetics\n\nWe can introduce a third variable into the plot by **modifying the color** of the points based on the value of that third variable.\n\nColor (or colour) is one type of **aesthetic** and using the `ggplot2` language:\n\n> \"the color of each point can be mapped to a variable\"\n\nThis sounds technical, but let's give an example.\n\n::: callout-tip\n### Example\n\nWe map the `color` argument to the `drv` variable, which indicates whether a car is front wheel drive, rear wheel drive, or 4-wheel drive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n:::\n\nNow we can see that the front wheel drive cars tend to have lower displacement relative to the 4-wheel or rear wheel drive cars.\n\nAlso, it's clear that the 4-wheel drive cars have the lowest highway gas mileage.\n\n::: callout-tip\n### Note\n\nThe `x` argument and `y` argument are aesthetics too, and they got mapped to the `displ` and `hwy` variables, respectively.\n:::\n\n::: callout-note\n### Question\n\nIn the above plot, I did not name the `x` and `y` arguments explicitly. What happens when you run these code chunks? 
What's the difference?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, displ, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(y = hwy, x = displ, data = mpg, color = drv)\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nLet's try mapping colors in another dataset, namely the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) dataset. These data contain observations for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Palmer penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 344\nColumns: 8\n$ species Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …\n$ bill_depth_mm 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …\n$ flipper_length_mm 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…\n$ body_mass_g 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …\n$ sex male, female, female, NA, female, male, female, male…\n$ year 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n:::\n:::\n\n\nIf we wanted to count the number of penguins for each of the three species, we can use the `count()` function in `dplyr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins 
%>% \n count(species)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n species n\n \n1 Adelie 152\n2 Chinstrap 68\n3 Gentoo 124\n```\n:::\n:::\n\n:::\n\nFor example, we see there are a total of 152 Adelie penguins in the `palmerpenguins` dataset.\n\n::: callout-note\n### Question\n\nIf we wanted to use `qplot()` to map `flipper_length_mm` and `bill_length_mm` to the x and y coordinates, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNow try mapping color to the `species` variable on top of the code you just wrote:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Adding a geom\n\nSometimes it is nice to **add a smoother** to a scatterplot to highlight any trends.\n\nTrends can be difficult to see if the data are very noisy or there are many data points obscuring the view.\n\nA smoother is a **type of \"geom\"** that you can add along with your data points.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n:::\n\nHere it seems that engine displacement and highway mileage have a nonlinear U-shaped relationship, but from the previous plot we know that this is largely due to confounding by the drive class of the car.\n\n::: callout-tip\n### Note\n\nPreviously, we did not have to specify `geom = \"point\"` because that was done automatically.\n\nBut if you want the smoother overlaid with the points, then you need to specify both explicitly.\n:::\n\nLook at what happens if we *do not* include the `point` geom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"smooth\"))\n```\n\n::: 
{.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nSometimes that is the plot you want to show, but in this case it might make more sense to show the data along with the smoother.\n\n::: callout-note\n### Question\n\nLet's **add a smoother** to our `palmerpenguins` dataset example.\n\nUsing the code we previously wrote mapping variables to points and color, add a \"point\" and \"smooth\" geom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Histograms and boxplots\n\nThe `qplot()` function can also be used to plot 1-dimensional data.\n\nBy **specifying a single variable**, `qplot()` will by default make a **histogram**.\n\n::: callout-tip\n### Example\n\nWe can make a histogram of the highway mileage data and stratify on the drive class. So technically this is three histograms on top of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, fill = drv, binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n:::\n\n::: callout-note\n### Question\n\nNotice that I used `fill` here to map color to the `drv` variable. Why is this? 
What happens when you use `color` instead?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\nHaving the different colors for each drive class is nice, but the three histograms can be a bit difficult to separate out.\n\n**Side-by-side boxplots** are one solution to this problem.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(drv, hwy, data = mpg, geom = \"boxplot\")\n```\n\n::: {.cell-output-display}\n![Boxplots of highway mileage by drive class](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\nAnother solution is to plot the histograms in separate panels using facets.\n\n## Facets\n\n**Facets** are a way to **create multiple panels of plots based on the levels of a categorical variable**.\n\nHere, we want to see a histogram of the highway mileages, and the categorical variable is the drive class. We can do that using the `facets` argument to `qplot()`.\n\n::: callout-tip\n### Note\n\nThe `facets` argument **expects a formula type of input**, with a `~` separating the left hand side variable and the right hand side variable.\n\n- The **left hand side** variable indicates how the rows of the panels should be divided\n- The **right hand side** variable indicates how the columns of the panels should be divided\n:::\n\n::: callout-tip\n### Example\n\nHere, we just want three rows of histograms (and just one column), one for each drive class, so we specify `drv` on the left hand side and `.` on the right hand side indicating that there's no variable there (it's empty).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-22-1.png){width=480}\n:::\n:::\n\n:::\n\nWe could also look at **more data using facets**, so instead of histograms we could look at scatter plots of engine displacement and highway mileage by drive class.\n\nHere, we put the `drv` 
variable on the right hand side to indicate that we want a column for each drive class (as opposed to splitting by rows like we did above).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-23-1.png){width=576}\n:::\n:::\n\n\nWhat if you wanted to **add a smoother to each one of those panels**? Simple: you just add the smoother as another **geom**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class w/smoother](index_files/figure-html/unnamed-chunk-24-1.png){width=576}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nWe used a different type of smoother above.\n\nHere, we add a **linear regression line** (a type of smoother) to each group to see if there's any difference.\n:::\n\n::: callout-note\n### Question\n\nLet's facet our `palmerpenguins` dataset example and explore different types of plots.\n\nBuilding off the code we previously wrote, perform the following tasks:\n\n- Facet the plot based on `species` with the three species along rows.\n- Add a linear regression line to each of the `species` types\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNext, make a histogram of the `body_mass_g` for each of the species colored by the three species.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Summary\n\nThe `qplot()` function in `ggplot2` is the analog of `plot()` in base graphics but with many built-in features that the traditional `plot()` does not provide. The syntax is somewhere in between the base and lattice graphics systems. 
The `qplot()` function is useful for quickly putting data on the page/screen, but for ultimate customization, it may make more sense to use some of the lower level functions that we discuss later in the next lesson.\n\n# Post-lecture materials\n\n### Case Study: MAACS Cohort\n\n
\n\nClick here for a case study practicing the `qplot()` function.\n\nThis case study will use data based on the Mouse Allergen and Asthma Cohort Study (MAACS). This study was aimed at characterizing the indoor (home) environment and its relationship with asthma morbidity among children aged 5--17 living in Baltimore, MD. The children all had persistent asthma, defined as having had an exacerbation in the past year. A representative publication of results from this study can be found in this paper by [Lu, et al.](https://pubmed.ncbi.nlm.nih.gov/23403052/)\n\n::: keyideas\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available. For the purposes of this lesson, we have simulated data that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\nHere is a snapshot of what the data look like.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nmaacs <- read_csv(here(\"data\", \"maacs_sim.csv\"), col_types = \"icnn\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 750 × 4\n id mopos pm25 eno\n \n 1 1 yes 6.01 28.8 \n 2 2 no 25.2 17.7 \n 3 3 yes 21.8 43.6 \n 4 4 no 13.4 288. \n 5 5 no 49.4 7.60\n 6 6 no 43.4 12.0 \n 7 7 yes 33.0 79.2 \n 8 8 yes 32.7 34.2 \n 9 9 yes 52.2 12.1 \n10 10 yes 51.9 65.0 \n# ℹ 740 more rows\n```\n:::\n:::\n\n\nThe key variables are:\n\n- `mopos`: an indicator of whether the subject is allergic to mouse allergen (yes/no)\n\n- `pm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter)\n\n- `eno`: exhaled nitric oxide\n\nThe outcome of interest for this analysis will be exhaled nitric oxide (eNO), which is a measure of pulmonary inflammation. 
We can get a sense of how eNO is distributed in this population by making a quick histogram of the variable. Here, we take the log of eNO because of some right-skew in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO](index_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nA quick glance suggests that the histogram is a bit \"fat\", hinting that there might be multiple groups of people being lumped together. We can stratify the histogram by whether they are allergic to mouse.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, fill = mopos)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWe can see from this plot that the non-allergic subjects are shifted slightly to the left, indicating a lower eNO and less pulmonary inflammation. That said, there is significant overlap between the two groups.\n\nAn alternative to histograms is a density smoother, which sometimes can be easier to visualize when there are multiple groups. Here is a density smooth of the entire study population.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\")\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO](index_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\nAnd here are the densities stratified by allergic status. 
We can map the color aesthetic to the `mopos` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\", color = mopos)\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-31-1.png){width=672}\n:::\n:::\n\n\nThese tell the same story as the stratified histograms, which should come as no surprise.\n\nNow we can examine the indoor environment and its relationship to eNO. Here, we use the level of indoor PM2.5 as a measure of indoor environment air quality. We can make a simple scatterplot of PM2.5 and eNO.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![eNO and PM2.5](index_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\nThe relationship appears modest at best, as there is substantial noise in the data. However, one question that we might be interested in is whether allergic individuals are perhaps more sensitive to PM2.5 inhalation than non-allergic individuals. To examine that question we can stratify the data into two groups.\n\nThis first plot uses different plot symbols for the two groups and overlays them on a single canvas. We can do this by mapping the `mopos` variable to the `shape` aesthetic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, shape = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\nBecause there is substantial overlap in the data it is a bit challenging to discern the circles from the triangles. 
Part of the reason might be that all of the symbols are the same color (black).\n\nWe can plot each group a different color to see if that helps.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-34-1.png){width=672}\n:::\n:::\n\n\nThis is slightly better but the substantial overlap makes it difficult to discern any trends in the data. For this we need to add a smoother of some sort. Here we add a linear regression line (a type of smoother) to each group to see if there's any difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-35-1.png){width=672}\n:::\n:::\n\n\nHere we see quite clearly that the red group and the green group exhibit rather different relationships between PM2.5 and eNO. For the non-allergic individuals, there appears to be a slightly negative relationship between PM2.5 and eNO and for the allergic individuals, there is a positive relationship. This suggests a strong interaction between PM2.5 and allergic status, an hypothesis perhaps worth following up on in greater detail than this brief exploratory analysis.\n\nAnother, and perhaps more clear, way to visualize this interaction is to use separate panels for the non-allergic and allergic individuals using the `facets` argument to `qplot()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, facets = . 
~ mopos) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-36-1.png){width=864}\n:::\n:::\n\n\n
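To preview the next lesson, the same faceted plot can be written with the core `ggplot()` function directly. The chunk below is a sketch of the equivalent call (it assumes the simulated `maacs` data loaded above and has not been rendered into this document):\n\n```r\nlibrary(tidyverse)\n\n# scatterplot of log PM2.5 against log eNO, one panel per allergic status,\n# with a linear regression line fit within each panel\nmaacs %>%\n    ggplot(aes(log(pm25), log(eno))) +\n    geom_point() +\n    geom_smooth(method = \"lm\") +\n    facet_grid(. ~ mopos)\n```\n\nCompared with the `qplot()` call, the data, aesthetic mappings, geoms, and facets are spelled out as separate layers, which is the style covered in detail in the next lesson.\n\n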
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What has gone wrong with this code? Why are the points not blue?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = \"blue\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-37-1.png){width=672}\n:::\n:::\n\n\n2. Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`?\n\n3. Map a continuous variable to `color`, `size`, and `shape` aesthetics. How do these aesthetics behave differently for categorical vs. continuous variables?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 
1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n 
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"12 - The ggplot2 plotting system: qplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with qplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/12-ggplot2-plotting-system-part-1/index.qmd).*\n\n> \"The greatest value of a picture is when it forces us to notice what we never expected to see.\" ---John Tukey\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the difference between *aesthetics* and *geoms*\n- Become familiar with different types of plots (e.g. scatterplots, boxplots, and histograms)\n- Be able to facet plots into a grid\n:::\n\n# The ggplot2 Plotting System\n\nThe `ggplot2` package in R is **an implementation of *The Grammar of Graphics*** as described by Leland Wilkinson in his book. The **package was originally written by Hadley Wickham** while he was a graduate student at Iowa State University (he still actively maintains the package).\n\nThe package implements what might be considered a third graphics system for R (along with `base` graphics and `lattice`).\n\nThe package is available from [CRAN](http://cran.r-project.org/package=ggplot2) via `install.packages()`; the latest version of the source can be found on the package's [GitHub Repository](https://github.com/hadley/ggplot2). Documentation of the package can be found at [the tidyverse web site](https://ggplot2.tidyverse.org).\n\nThe **grammar of graphics** represents **an abstraction of graphics ideas and objects**.\n\nYou can think of this as **developing the verbs, nouns, and adjectives for data graphics**.\n\n::: callout-tip\n### Note\n\nDeveloping such a **grammar allows for a \"theory\" of graphics** on which to build new graphics and graphics objects.\n\nTo quote from Hadley Wickham's book on `ggplot2`, we want to \"shorten the distance from mind to page\". In summary,\n\n> \"...the grammar tells us that a statistical graphic is a **mapping** from data to **aesthetic** attributes (colour, shape, size) of **geometric** objects (points, lines, bars). 
The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system\" -- from *ggplot2* book\n:::\n\nYou might ask yourself \"Why do we need a grammar of graphics?\".\n\nWell, for much the same reasons that **having a grammar is useful for spoken languages**. The grammar allows for\n\n- A more compact summary of the base components of a language\n- An extension of the language to handle situations that we have not seen before\n\nIf you think about making a plot with the base graphics system, the plot is **constructed by calling a series of functions that either create or annotate a plot**. There's **no convenient agreed-upon way to describe the plot**, except to just recite the series of R functions that were called to create the thing in the first place.\n\n::: callout-tip\n### Example\n\nConsider the following plot made using base graphics previously.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (base graphics)](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n\nHow would one **describe the creation of this plot**?\n\nWell, we could say that we called the `plot()` function and then added a loess smoother by calling the `lines()` function on the output of `loess.smooth()`.\n\nWhile the base plotting system is convenient and it often mirrors how we think of building plots and analyzing data, there are **drawbacks**:\n\n- You cannot go back once the plot has started (e.g. 
to adjust margins), so there is in fact a need to plan in advance.\n- It is difficult to \"translate\" a plot to others because there is no formal graphical language; each plot is just a series of R commands.\n:::\n\nHere is the same plot made using `ggplot2` in the `tidyverse`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nairquality %>%\n ggplot(aes(Temp, Ozone)) +\n geom_point() +\n geom_smooth(\n method = \"loess\",\n se = FALSE\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (ggplot2)](index_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe output is roughly equivalent and the amount of code is similar, but `ggplot2` allows for a more elegant way of expressing the components of the plot.\n\nIn this case, the plot is a **dataset** (`airquality`) with **aesthetic mappings** (visual properties of the objects in your plot) derived from the `Temp` and `Ozone` variables, a set of **points**, and a **smoother**.\n:::\n\nIn a sense, the `ggplot2` system takes many of the cues from the base plotting system and from the `lattice` plotting system, and formalizes the cues a bit.\n\nIt **automatically handles things like margins and spacing**, and also has the concept of \"themes\" which **provide a default set of plotting symbols and colors** (which are all customizable).\n\nWhile `ggplot2` bears a superficial similarity to `lattice`, `ggplot2` is generally easier and more intuitive to use.\n\n## The Basics: `qplot()`\n\nThe `qplot()` function in `ggplot2` is meant to get you going **q**uickly.\n\nIt works much like the `plot()` function in the base graphics system. 
It **looks for variables to plot within a data frame**, similar to lattice, or in the parent environment.\n\nIn general, it is good to get used to putting your data in a data frame and then passing it to `qplot()`.\n\n::: callout-tip\n### Pro tip\n\nThe `qplot()` function is **somewhat discouraged** in `ggplot2` now and new users are encouraged to use the more general `ggplot()` function (more details in the next lesson).\n\nHowever, the `qplot()` function is still useful and may be easier to use if transitioning from the base plotting system or a different statistical package.\n:::\n\nPlots are made up of\n\n- **aesthetics** (e.g. size, shape, color)\n- **geoms** (e.g. points, lines)\n\nFactors play an important role for indicating subsets of the data (if they are to have different properties) so they should be **labeled** properly.\n\nThe `qplot()` function hides much of what goes on underneath, which is okay for most operations; `ggplot()` is the core function and is very flexible for doing things `qplot()` cannot do.\n\n## Before you start: label your data\n\nOne thing that is always true, but is particularly useful when using `ggplot2`, is that you should always **use informative and descriptive labels on your data**.\n\nMore generally, your data should have appropriate **metadata** so that you can quickly look at a dataset and know\n\n- what are the variables?\n- what do the values of each variable mean?\n\n::: callout-tip\n### Pro tip\n\n- **Each column** of a data frame should **have a meaningful (but concise) variable name** that accurately reflects the data stored in that column\n- Non-numeric or **categorical variables should be coded as factor variables** and have meaningful labels for each level of the factor.\n - It might be common to code a binary variable as a \"0\" or a \"1\", but the problem is that from quickly looking at the data, it's impossible to know which level of that variable is represented by a \"0\" or a \"1\".\n - It is much better to simply label 
each observation as what they are.\n - If a variable represents temperature categories, it might be better to use \"cold\", \"mild\", and \"hot\" rather than \"1\", \"2\", and \"3\".\n:::\n\nWhile it is sometimes a pain to make sure all of your data are properly labeled, this **investment in time can pay dividends down the road** when you're trying to figure out what you were plotting.\n\nIn other words, including the proper metadata can make your exploratory plots essentially self-documenting.\n\n## ggplot2 \"Hello, world!\"\n\nThis example dataset comes with the `ggplot2` package and contains data on the fuel economy of 38 popular car models from 1999 to 2008.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse) # this loads the ggplot2 R package\n# library(ggplot2) # an alternative way to just load the ggplot2 R package\nglimpse(mpg)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 234\nColumns: 11\n$ manufacturer \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"…\n$ model \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4 quattro\", \"…\n$ displ 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…\n$ year 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…\n$ cyl 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …\n$ trans \"auto(l5)\", \"manual(m5)\", \"manual(m6)\", \"auto(av)\", \"auto…\n$ drv \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"4\", \"4\", \"4\", \"4\", \"4…\n$ cty 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…\n$ hwy 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…\n$ fl \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p…\n$ class \"compact\", \"compact\", \"compact\", \"compact\", \"compact\", \"c…\n```\n:::\n:::\n\n\nYou can see from the `glimpse()` (part of the `dplyr` package) output that all of the categorical variables (like \"manufacturer\" or \"class\") are **appropriately coded with meaningful labels**.\n\nThis will come in handy when `qplot()` has to label different aspects of a plot.\n\nAlso note that all of the **columns/variables have meaningful names** (if sometimes abbreviated), rather than names like \"X1\" and \"X2\", etc.\n\n::: callout-tip\n### Example\n\nWe can **make a quick scatterplot** using `qplot()` of the engine displacement (`displ`) and the highway miles per gallon (`hwy`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: `qplot()` was deprecated in ggplot2 3.4.0.\n```\n:::\n\n::: {.cell-output-display}\n![Plot of engine displacement and highway mileage using the mpg dataset](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n:::\n\nIt has a *very* similar feeling to `plot()` in base R.\n\n::: callout-tip\n### Note\n\nIn the call to `qplot()` you **must specify the `data` argument** so that `qplot()` knows where to look up the variables.\n\nYou must also specify `x` and `y`, but hopefully that part is obvious.\n:::\n\n## Modifying aesthetics\n\nWe can introduce a third variable into the plot by **modifying the color** of the points based on the value of that third variable.\n\nColor (or colour) is one type of **aesthetic** and using the `ggplot2` language:\n\n> \"the color of each point can be mapped to a variable\"\n\nThis sounds technical, but let's give an example.\n\n::: callout-tip\n### Example\n\nWe map the `color` argument to the `drv` variable, which indicates whether a car is front wheel drive, rear wheel drive, or 4-wheel drive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n:::\n\nNow we can see that the front wheel drive cars tend to have lower displacement relative to the 4-wheel or rear wheel drive 
cars.\n\nAlso, it's clear that the 4-wheel drive cars have the lowest highway gas mileage.\n\n::: callout-tip\n### Note\n\nThe `x` argument and `y` argument are aesthetics too, and they got mapped to the `displ` and `hwy` variables, respectively.\n:::\n\n::: callout-note\n### Question\n\nIn the above plot, I did not specify the `x` and `y` variables by name. What happens when you run these code chunks? What's the difference?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, displ, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(y = hwy, x = displ, data = mpg, color = drv)\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nLet's try mapping colors in another dataset, namely the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) dataset. These data contain observations for 344 penguins. 
There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Palmer penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 344\nColumns: 8\n$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …\n$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …\n$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…\n$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …\n$ sex               <fct> male, female, female, NA, female, male, female, male…\n$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n:::\n:::\n\n\nIf we wanted to count the number of penguins for each of the three species, we can use the `count()` function in `dplyr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins %>%\n count(species)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n  species       n\n  <fct>     <int>\n1 Adelie      152\n2 Chinstrap    68\n3 Gentoo      124\n```\n:::\n:::\n\n:::\n\nFor example, we see there are a total of 152 Adelie penguins in the `palmerpenguins` dataset.\n\n::: callout-note\n### Question\n\nIf we wanted to use `qplot()` to map `flipper_length_mm` and `bill_length_mm` to the x and y coordinates, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNow try mapping color to the `species` variable on top of the code you just wrote:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Adding a geom\n\nSometimes it is nice to **add a smoother** to a scatterplot to highlight any trends.\n\nTrends can be difficult to see if the data are very noisy or there are many data points obscuring the view.\n\nA smoother is a **type of \"geom\"** that you can add along with your data points.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n:::\n\nHere it seems that engine displacement and highway mileage have a nonlinear U-shaped relationship, but from the previous plot we know that this is largely due to confounding by the drive class of the car.\n\n::: callout-tip\n### Note\n\nPreviously, we did not have to specify `geom = \"point\"` because that was done automatically.\n\nBut if you want the smoother overlaid with the points, then you need to specify both explicitly.\n:::\n\nLook at what happens if we *do not* include the `point` geom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nSometimes that is the plot you want to show, but in this case it might make more sense to show the data along with the smoother.\n\n::: callout-note\n### Question\n\nLet's **add a smoother** to our `palmerpenguins` dataset example.\n\nUsing the code we previously wrote mapping variables to points and color, add a \"point\" and \"smooth\" geom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Histograms and boxplots\n\nThe `qplot()` function can be used to plot 1-dimensional data too.\n\nBy **specifying a single variable**, `qplot()` will by default make a **histogram**.\n\n::: callout-tip\n### Example\n\nWe can make a histogram of the highway mileage data and stratify on the drive class. So technically this is three histograms on top of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, fill = drv, binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n:::\n\n::: callout-note\n### Question\n\nNotice, I used `fill` here to map color to the `drv` variable. Why is this? What happens when you use `color` instead?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\nHaving the different colors for each drive class is nice, but the three histograms can be a bit difficult to separate out.\n\n**Side-by-side boxplots** are one solution to this problem.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(drv, hwy, data = mpg, geom = \"boxplot\")\n```\n\n::: {.cell-output-display}\n![Boxplots of highway mileage by drive class](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\nAnother solution is to plot the histograms in separate panels using facets.\n\n## Facets\n\n**Facets** are a way to **create multiple panels of plots based on the levels of a categorical variable**.\n\nHere, we want to see a histogram of the highway mileages and the categorical variable is the drive class variable. 
We can do that using the `facets` argument to `qplot()`.\n\n::: callout-tip\n### Note\n\nThe `facets` argument **expects a formula type of input**, with a `~` separating the left hand side variable and the right hand side variable.\n\n- The **left hand side** variable indicates how the rows of the panels should be divided\n- The **right hand side** variable indicates how the columns of the panels should be divided\n:::\n\n::: callout-tip\n### Example\n\nHere, we just want three rows of histograms (and just one column), one for each drive class, so we specify `drv` on the left hand side and `.` on the right hand side indicating that there's no variable there (it's empty).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-22-1.png){width=480}\n:::\n:::\n\n:::\n\nWe could also look at **more data using facets**, so instead of histograms we could look at scatter plots of engine displacement and highway mileage by drive class.\n\nHere, we put the `drv` variable on the right hand side to indicate that we want a column for each drive class (as opposed to splitting by rows like we did above).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-23-1.png){width=576}\n:::\n:::\n\n\nWhat if you wanted to **add a smoother to each one of those panels**? Simple, you literally just add the smoother as another **geom**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . 
~ drv) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class w/smoother](index_files/figure-html/unnamed-chunk-24-1.png){width=576}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nWe used a different type of smoother above.\n\nHere, we add a **linear regression line** (a type of smoother) to each group to see if there's any difference.\n:::\n\n::: callout-note\n### Question\n\nLet's facet our `palmerpenguins` dataset example and explore different types of plots.\n\nBuilding off the code we previously wrote, perform the following tasks:\n\n- Facet the plot based on `species` with the three species along rows.\n- Add a linear regression line to each of the `species` types\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNext, make a histogram of the `body_mass_g` variable, colored by the three species.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Summary\n\nThe `qplot()` function in `ggplot2` is the analog of `plot()` in base graphics but with many built-in features that the traditional `plot()` does not provide. The syntax is somewhere in between the base and lattice graphics systems. The `qplot()` function is useful for quickly putting data on the page/screen, but for ultimate customization, it may make more sense to use some of the lower level functions that we discuss later in the next lesson.\n\n# Post-lecture materials\n\n### Case Study: MAACS Cohort\n\n
\n\nClick here for case study practicing the `qplot()` function.\n\nThis case study will use data based on the Mouse Allergen and Asthma Cohort Study (MAACS). This study was aimed at characterizing the indoor (home) environment and its relationship with asthma morbidity amongst children aged 5--17 living in Baltimore, MD. The children all had persistent asthma, defined as having had an exacerbation in the past year. A representative publication of results from this study can be found in this paper by [Lu, et al.](https://pubmed.ncbi.nlm.nih.gov/23403052/)\n\n::: keyideas\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available. For the purposes of this lesson, we have simulated data that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\nHere is a snapshot of what the data look like.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nmaacs <- read_csv(here(\"data\", \"maacs_sim.csv\"), col_types = \"icnn\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 750 × 4\n      id mopos  pm25    eno\n   <int> <chr> <dbl>  <dbl>\n 1     1 yes    6.01  28.8 \n 2     2 no    25.2   17.7 \n 3     3 yes   21.8   43.6 \n 4     4 no    13.4  288.  \n 5     5 no    49.4    7.60\n 6     6 no    43.4   12.0 \n 7     7 yes   33.0   79.2 \n 8     8 yes   32.7   34.2 \n 9     9 yes   52.2   12.1 \n10    10 yes   51.9   65.0 \n# ℹ 740 more rows\n```\n:::\n:::\n\n\nThe key variables are:\n\n- `mopos`: an indicator of whether the subject is allergic to mouse allergen (yes/no)\n\n- `pm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter)\n\n- `eno`: exhaled nitric oxide\n\nThe outcome of interest for this analysis will be exhaled nitric oxide (eNO), which is a measure of pulmonary inflammation. We can get a sense of how eNO is distributed in this population by making a quick histogram of the variable. Here, we take the log of eNO because of some right-skew in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO](index_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nA quick glance suggests that the histogram is a bit \"fat\", suggesting that there might be multiple groups of people being lumped together. We can stratify the histogram by whether they are allergic to mouse.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, fill = mopos)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWe can see from this plot that the non-allergic subjects are shifted slightly to the left, indicating a lower eNO and less pulmonary inflammation. That said, there is significant overlap between the two groups.\n\nAn alternative to histograms is a density smoother, which sometimes can be easier to visualize when there are multiple groups. Here is a density smooth of the entire study population.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\")\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO](index_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\nAnd here are the densities stratified by allergic status. 
We can map the color aesthetic to the `mopos` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\", color = mopos)\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-31-1.png){width=672}\n:::\n:::\n\n\nThese tell the same story as the stratified histograms, which should come as no surprise.\n\nNow we can examine the indoor environment and its relationship to eNO. Here, we use the level of indoor PM2.5 as a measure of indoor environment air quality. We can make a simple scatterplot of PM2.5 and eNO.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![eNO and PM2.5](index_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\nThe relationship appears modest at best, as there is substantial noise in the data. However, one question that we might be interested in is whether allergic individuals are perhaps more sensitive to PM2.5 inhalation than non-allergic individuals. To examine that question we can stratify the data into two groups.\n\nThis first plot uses different plot symbols for the two groups and overlays them on a single canvas. We can do this by mapping the `mopos` variable to the `shape` aesthetic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, shape = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\nBecause there is substantial overlap in the data it is a bit challenging to discern the circles from the triangles. 
Part of the reason might be that all of the symbols are the same color (black).\n\nWe can plot each group a different color to see if that helps.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-34-1.png){width=672}\n:::\n:::\n\n\nThis is slightly better but the substantial overlap makes it difficult to discern any trends in the data. For this we need to add a smoother of some sort. Here we add a linear regression line (a type of smoother) to each group to see if there's any difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-35-1.png){width=672}\n:::\n:::\n\n\nHere we see quite clearly that the red group and the green group exhibit rather different relationships between PM2.5 and eNO. For the non-allergic individuals, there appears to be a slightly negative relationship between PM2.5 and eNO and for the allergic individuals, there is a positive relationship. This suggests a strong interaction between PM2.5 and allergic status, a hypothesis perhaps worth following up on in greater detail than this brief exploratory analysis.\n\nAnother, and perhaps more clear, way to visualize this interaction is to use separate panels for the non-allergic and allergic individuals using the `facets` argument to `qplot()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, facets = . ~ mopos) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-36-1.png){width=864}\n:::\n:::\n\n\n
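With `qplot()` deprecated as of ggplot2 3.4.0 (it emits the warning shown earlier in this lesson), the faceted plot above can also be written with the fuller `ggplot()` interface. This is a sketch only, assuming the `maacs` data frame read in earlier in this case study:\n\n```r\nlibrary(ggplot2)\n\n## Same plot as the qplot() call above: points, one panel per\n## allergic-status group, and a per-group linear regression line\nggplot(maacs, aes(log(pm25), log(eno))) +\n    geom_point() +\n    facet_grid(. ~ mopos) +\n    geom_smooth(method = \"lm\")\n```\n\nThe `ggplot()` syntax itself is covered in detail in the next lesson.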
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What has gone wrong with this code? Why are the points not blue?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = \"blue\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-37-1.png){width=672}\n:::\n:::\n\n\n2. Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`?\n\n3. Map a continuous variable to `color`, `size`, and `shape` aesthetics. How do these aesthetics behave differently for categorical vs. continuous variables?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 
1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n 
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json index 831e44b..d6c02cd 100644 --- a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json +++ b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "76ec5c219411d3408adcbd15aa8e66ab", + "hash": "d55dbad2db95c2293e2f5b900bdc8450", "result": { - "markdown": "---\ntitle: \"13 - The ggplot2 plotting system: ggplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with ggplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/13-ggplot2-plotting-system-part-2/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to build up layers of graphics using `ggplot()`\n- Be able to modify properties of a `ggplot()` including layers and labels\n:::\n\n# The ggplot2 Plotting System\n\nIn this lesson, we will get into a little more of the nitty gritty of **how `ggplot2` builds plots** and how you can customize various aspects of any plot.\n\nPreviously, we used the `qplot()` function to quickly put points on a page.\n\n- The `qplot()` function's syntax is very similar to that of the `plot()` function in base graphics so for those switching over, it makes for an easy transition.\n\nBut it is worth knowing the underlying details of how `ggplot2` works so that you can really exploit its power.\n\n## Basic components of a ggplot2 plot\n\n::: callout-tip\n### Key components\n\nA **`ggplot2` plot** consists of a number of **key components**.\n\n- A **data frame**: stores all of the data that will be displayed on the plot\n\n- **aesthetic mappings**: describe how data are mapped to color, size, shape, location\n\n- **geoms**: geometric objects like points, lines, shapes\n\n- **facets**: describes how conditional/panel plots should be constructed\n\n- **stats**: statistical transformations like binning, quantiles, smoothing\n\n- **scales**: what scale an aesthetic map uses (example: left-handed = red, right-handed = blue)\n\n- **coordinate system**: describes the system in which the locations of the geoms will be drawn\n:::\n\nIt 
is **essential to organize your data into a data frame** before you start with `ggplot2` (and all the **appropriate metadata** so that your data frame is self-describing and your plots will be self-documenting).\n\nWhen **building plots in `ggplot2`** (rather than using `qplot()`), the **\"artist's palette\" model may be the closest analogy**.\n\nEssentially, you start with some raw data, and then you **gradually add bits and pieces to it to create a plot**.\n\n::: callout-tip\n### Note\n\nPlots are built up in layers, with the typical ordering being\n\n1. Plot the data\n2. Overlay a summary\n3. Add metadata and annotation\n:::\n\nFor quick exploratory plots you may not get past step 1.\n\n## Example: BMI, PM2.5, Asthma\n\nTo demonstrate the various pieces of `ggplot2` we will use a running example from the **Mouse Allergen and Asthma Cohort Study (MAACS)**. Here, the question we are interested in is\n\n> \"Are overweight individuals, as measured by body mass index (BMI), more susceptible than normal weight individuals to the harmful effects of PM2.5 on asthma symptoms?\"\n\nThere is a suggestion that overweight individuals may be more susceptible to the negative effects of inhaling PM2.5.\n\nThis would suggest that increases in PM2.5 exposure in the home of an overweight child would be more deleterious to his/her asthma symptoms than they would be in the home of a normal weight child.\n\nWe want to see if we can see that difference in the data from MAACS.\n\n::: callout-tip\n### Note\n\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available.\n\nFor the purposes of this lesson, we have **simulated data** that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\n::: callout-tip\n### Example\n\nWe can look at the data quickly by reading it in as a tibble with `read_csv()` in the `tidyverse` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nmaacs <- read_csv(here(\"data\", \"bmi_pm25_no2_sim.csv\"),\n col_types = \"nnci\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 517 × 4\n   logpm25 logno2_new bmicat        NocturnalSympt\n     <dbl>      <dbl> <chr>                  <int>\n 1   1.25       1.18  normal weight              1\n 2   1.12       1.55  overweight                 0\n 3   1.93       1.43  normal weight              0\n 4   1.37       1.77  overweight                 2\n 5   0.775      0.765 normal weight              0\n 6   1.49       1.11  normal weight              0\n 7   2.16       1.43  normal weight              0\n 8   1.65       1.40  normal weight              0\n 9   1.55       1.81  normal weight              0\n10   2.04       1.35  overweight                 3\n# ℹ 507 more rows\n```\n:::\n:::\n\n:::\n\nThe outcome we will look at here (`NocturnalSympt`) is the number of days in the past 2 weeks where the child experienced asthma symptoms (e.g. coughing, wheezing) while sleeping.\n\nThe other key variables are:\n\n- `logpm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter) on the log scale\n\n- `logno2_new`: indoor nitrogen dioxide (NO2) level on the log scale\n\n- `bmicat`: categorical variable with BMI status\n\n# Building up in layers\n\nFirst, we can **create a `ggplot` object** that stores the dataset and the basic aesthetics for mapping the x- and y-coordinates for the plot.\n\n::: callout-tip\n### Example\n\nHere, we will eventually be plotting the log of PM2.5 and the `NocturnalSympt` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(maacs, aes(x = logpm25, \n y = NocturnalSympt))\nsummary(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ndata: logpm25, logno2_new, bmicat, NocturnalSympt [517x4]\nmapping: x = ~logpm25, y = ~NocturnalSympt\nfaceting: <ggproto object: Class FacetNull, Facet, gg>\n compute_layout: function\n draw_back: function\n draw_front: function\n draw_labels: function\n draw_panels: function\n finish_data: function\n init_scales: function\n map_data: function\n params: list\n setup_data: function\n setup_params: function\n shrink: TRUE\n train_scales: function\n vars: function\n super: <ggproto object: Class FacetNull, Facet, gg>\n```\n:::\n\n```{.r .cell-code}\nclass(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"gg\" \"ggplot\"\n```\n:::\n:::\n\n:::\n\nYou can see above that the object `g` contains the dataset `maacs` and the mappings.\n\nNow, normally if you were to `print()` a `ggplot` object, a plot would appear on the plot device. However, our object `g` actually does not contain enough information to make a plot yet.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\nprint(g)\n```\n\n::: {.cell-output-display}\n![Nothing to see here!](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n## First plot with point layer\n\nTo make a scatter plot, we need to add at least one **geom**, such as points.\n\nHere, we add the `geom_point()` function to create a 
traditional scatter plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\ng + geom_point()\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and days with nocturnal symptoms](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nHow does ggplot know what points to plot? In this case, it can grab them from the data frame `maacs` that served as the input into the `ggplot()` function.\n\n## Adding more layers\n\n### smooth\n\nBecause the data appear rather noisy, it might be better if we added a smoother on top of the points to see if there is a trend in the data with PM2.5.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth()\n```\n\n::: {.cell-output-display}\n![Scatterplot with smoother](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nThe default smoother is a loess smoother, which is flexible and nonparametric but might be too flexible for our purposes. Perhaps we'd prefer a simple linear regression line to highlight any first order trends. 
We can do this by specifying `method = \"lm\"` to `geom_smooth()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Scatterplot with linear regression line](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nHere, we can see there appears to be a slight increasing trend, suggesting that higher levels of PM2.5 are associated with increased days with nocturnal symptoms.\n\n::: callout-note\n### Question\n\nLet's use the `ggplot()` function with our `palmerpenguins` dataset example and make a scatter plot with `flipper_length_mm` on the x-axis, `bill_length_mm` on the y-axis, colored by `species`, and a smoother by adding a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>\n 1 Adelie  Torgersen           39.1          18.7               181        3750\n 2 Adelie  Torgersen           39.5          17.4               186        3800\n 3 Adelie  Torgersen           40.3          18                 195        3250\n 4 Adelie  Torgersen           NA            NA                  NA          NA\n 5 Adelie  Torgersen           36.7          19.3               193        3450\n 6 Adelie  Torgersen           39.3          20.6               190        3650\n 7 Adelie  Torgersen           38.9          17.8               181        3625\n 8 Adelie  Torgersen           39.2          19.6               195        4675\n 9 Adelie  Torgersen           34.1          18.1               193        3475\n10 Adelie  Torgersen           42            20.2               190        4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex <fct>, year <int>\n```\n:::\n:::\n\n:::\n\n### facets\n\nBecause our primary question involves comparing overweight individuals to normal weight individuals, we can **stratify the scatter plot** of PM2.5 and nocturnal symptoms by the BMI category (`bmicat`) variable, which indicates whether an individual is overweight or not.\n\nTo visualize this we can **add a `facet_grid()`**, which takes a formula argument.\n\n::: callout-tip\n### Example\n\nWe want one row and two columns, one column for each weight category. So we specify `bmicat` on the right hand side of the formula passed to `facet_grid()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth(method = \"lm\") +\n facet_grid(. ~ bmicat) \n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and nocturnal symptoms by BMI category](index_files/figure-html/unnamed-chunk-8-1.png){width=864}\n:::\n:::\n\n:::\n\nNow it seems clear that the relationship between PM2.5 and nocturnal symptoms is relatively flat among normal weight individuals, while the relationship is increasing among overweight individuals.\n\nThis plot suggests that overweight individuals may be more susceptible to the effects of PM2.5.\n\n# Modifying geom properties\n\nYou can **modify properties of geoms** by specifying options to their respective `geom_*()` functions.\n\n### map aesthetics to constants\n\n::: callout-tip\n### Example\n\nFor example, here we modify the points in the scatterplot to make the color \"steelblue\", the size larger, and the alpha transparency greater.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(color = \"steelblue\", size = 4, alpha = 1/2)\n```\n\n::: {.cell-output-display}\n![Modifying point color with a constant](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n:::\n\n### map aesthetics to variables\n\nIn addition to setting specific geom attributes to constant values, we can **map aesthetics to variables** in our dataset.\n\nFor example, we can map the aesthetic `color` to the variable `bmicat`, so the points will be colored according to the levels of `bmicat`.\n\nWe use the `aes()` function to indicate this difference from the plot above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(aes(color = bmicat), size = 4, alpha = 1/2)\n```\n\n::: {.cell-output-display}\n![Mapping color to a variable](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Customizing the smooth\n\nWe can also **customize aspects of the geoms**.\n\nFor 
example, we can customize the smoother that we overlay on the points with `geom_smooth()`.\n\nHere we change the line type and increase the size from the default. We also remove the shaded standard error from the line.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat), \n size = 2, \n alpha = 1/2) + \n geom_smooth(size = 4, \n linetype = 3, \n method = \"lm\", \n se = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n:::\n\n::: {.cell-output-display}\n![Customizing a smoother](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\n# Other important stuff\n\n## Changing the theme\n\nThe **default theme for `ggplot2` uses the gray background** with white grid lines.\n\nIf you don't find this suitable, you can use the black and white theme by using the `theme_bw()` function.\n\nThe `theme_bw()` function also allows you to set the typeface for the plot, in case you don't want the default Helvetica. Here we change the typeface to Times.\n\n::: callout-tip\n### Note\n\nFor things that only make sense globally, use `theme()`, i.e. `theme(legend.position = \"none\")`. 
Two standard appearance themes are included\n\n- `theme_gray()`: The default theme (gray background)\n- `theme_bw()`: More stark/plain\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat)) + \n theme_bw(base_family = \"Times\")\n```\n\n::: {.cell-output-display}\n![Modifying the theme for a plot](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n::: callout-note\n### Question\n\nLet's take our `palmerpenguins` scatterplot from above and change out the theme to use `theme_dark()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## Modifying labels\n\n::: callout-tip\n### Note\n\nThere are a variety of **annotations** you can add to a plot, including **different kinds of labels**.\n\n- `xlab()` for x-axis labels\n- `ylab()` for y-axis labels\n- `ggtitle()` for specifying plot titles\n\n`labs()` function is generic and can be used to modify multiple types of labels at once\n:::\n\nHere is an example of modifying the title and the `x` and `y` labels to make the plot a bit more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat)) + \n labs(title = \"MAACS Cohort\") + \n labs(x = expression(\"log \" * PM[2.5]), \n y = \"Nocturnal Symptoms\")\n```\n\n::: {.cell-output-display}\n![Modifying plot 
labels](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## A quick aside about axis limits\n\nOne quick **quirk about `ggplot2`** that caught me up when I first started using the package can be displayed in the following example.\n\nIf you make a lot of time series plots, you often **want to restrict the range of the y-axis** while still plotting all the data.\n\nIn the base graphics system you can do that as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntestdat <- data.frame(x = 1:100, \n y = rnorm(100))\ntestdat[50,2] <- 100 ## Outlier!\nplot(testdat$x, \n testdat$y,\n type = \"l\", \n ylim = c(-3,3))\n```\n\n::: {.cell-output-display}\n![Time series plot with base graphics](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nHere, we have restricted the y-axis range to be between -3 and 3, even though there is a clear outlier in the data.\n\n::: callout-tip\n### Example\n\nWith `ggplot2` the default settings will give you this.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(testdat, aes(x = x, y = y))\ng + geom_line()\n```\n\n::: {.cell-output-display}\n![Time series plot with default settings](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nOne might think that modifying the `ylim()` attribute would give you the same thing as the base plot, but it doesn't (?????)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_line() + \n ylim(-3, 3)\n```\n\n::: {.cell-output-display}\n![Time series plot with modified ylim](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n:::\n\nEffectively, what this does is subset the data so that only observations between -3 and 3 are included, then plot the data.\n\nTo plot the data without subsetting it first and still get the restricted range, you have to do the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_line() + \n coord_cartesian(ylim = c(-3, 3))\n```\n\n::: {.cell-output-display}\n![Time series plot with 
restricted y-axis range](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nAnd now you know!\n\n# Post-lecture materials\n\n### Resources\n\n- The *ggplot2* book by Hadley Wickham\n- The *R Graphics Cookbook* by Winston Chang (examples in base plots and in `ggplot2`)\n- [tidyverse web site](http://ggplot2.tidyverse.org)\n\n### More complex example with `ggplot2`\n\nNow you get the sense that plots in the `ggplot2` system are constructed by successively adding components to the plot, starting with the base dataset and maybe a scatterplot. In this section bleow, you can see a slightly more complicated example with an additional variable.\n\n
\n\nClick here for a slightly more complicated example with `ggplot()`.\n\nNow, we will ask the question\n\n> How does the relationship between PM2.5 and nocturnal symptoms vary by BMI category and nitrogen dioxide (NO2)?\n\nUnlike our previous BMI variable, NO2 is continuous, and so we need to make NO2 categorical so we can condition on it in the plotting. We can use the `cut()` function for this purpose. We will divide the NO2 variable into tertiles.\n\nFirst we need to calculate the tertiles with the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)\n```\n:::\n\n\nThen we need to divide the original `logno2_new` variable into the ranges defined by the cut points computed above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmaacs$no2tert <- cut(maacs$logno2_new, cutpoints)\n```\n:::\n\n\nThe `not2tert` variable is now a categorical factor variable containing 3 levels, indicating the ranges of NO2 (on the log scale).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## See the levels of the newly created factor variable\nlevels(maacs$no2tert)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"(0.342,1.23]\" \"(1.23,1.47]\" \"(1.47,2.17]\" \n```\n:::\n:::\n\n\nThe final plot shows the relationship between PM2.5 and nocturnal symptoms by BMI category and NO2 tertile.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Setup ggplot with data frame\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\n\n## Add layers\ng + geom_point(alpha = 1/3) + \n facet_grid(bmicat ~ no2tert) + \n geom_smooth(method=\"lm\", se=FALSE, col=\"steelblue\") + \n theme_bw(base_family = \"Avenir\", base_size = 10) + \n labs(x = expression(\"log \" * PM[2.5])) + \n labs(y = \"Nocturnal Symptoms\") + \n labs(title = \"MAACS Cohort\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![PM2.5 and nocturnal symptoms by BMI category and 
NO2 tertile](index_files/figure-html/unnamed-chunk-22-1.png){width=864}\n:::\n:::\n\n\n
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What happens if you facet on a continuous variable?\n\n2. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?\n\n3. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\n\n4. What does `geom_col()` do? How is it different to `geom_bar()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 
4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n 
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"13 - The ggplot2 plotting system: ggplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with ggplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/13-ggplot2-plotting-system-part-2/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to build up layers of graphics using `ggplot()`\n- Be able to modify properties of a `ggplot()` including layers and labels\n:::\n\n# The ggplot2 Plotting System\n\nIn this lesson, we will get into a little more of the nitty gritty of **how `ggplot2` builds plots** and how you can customize various aspects of any plot.\n\nPreviously, we used the `qplot()` function to quickly put points on a page.\n\n- The `qplot()` function's syntax is very similar to that of the `plot()` function in base graphics so for those switching over, it makes for an easy transition.\n\nBut it is worth knowing the underlying details of how `ggplot2` works so that you can really exploit its power.\n\n## Basic components of a ggplot2 plot\n\n::: callout-tip\n### Key components\n\nA **`ggplot2` plot** consists of a number of **key components**.\n\n- A **data frame**: stores all of the data that will be displayed on the plot\n\n- **aesthetic mappings**: describe how data are mapped to color, size, shape, location\n\n- **geoms**: geometric objects like points, lines, shapes\n\n- **facets**: describes how conditional/panel plots should be constructed\n\n- **stats**: statistical transformations like binning, quantiles, smoothing\n\n- **scales**: what scale an aesthetic map uses (example: left-handed = red, right-handed = blue)\n\n- **coordinate system**: describes the system in which the locations of the geoms will be drawn\n:::\n\nIt is **essential to organize your data into a data frame** before you start with `ggplot2` (and all the **appropriate metadata** so that your data frame is self-describing and your plots will be self-documenting).\n\nWhen **building plots in `ggplot2`** (rather than using `qplot()`), the **\"artist's palette\" model may be the closest 
analogy**.\n\nEssentially, you start with some raw data, and then you **gradually add bits and pieces to it to create a plot**.\n\n::: callout-tip\n### Note\n\nPlots are built up in layers, with the typical ordering being\n\n1. Plot the data\n2. Overlay a summary\n3. Add metadata and annotation\n:::\n\nFor quick exploratory plots you may not get past step 1.\n\n## Example: BMI, PM2.5, Asthma\n\nTo demonstrate the various pieces of `ggplot2`, we will use a running example from the **Mouse Allergen and Asthma Cohort Study (MAACS)**. Here, the question we are interested in is\n\n> \"Are overweight individuals, as measured by body mass index (BMI), more susceptible than normal weight individuals to the harmful effects of PM2.5 on asthma symptoms?\"\n\nThere is a suggestion that overweight individuals may be more susceptible to the negative effects of inhaling PM2.5.\n\nThis would suggest that increases in PM2.5 exposure in the home of an overweight child would be more deleterious to his/her asthma symptoms than they would be in the home of a normal weight child.\n\nWe want to see whether that difference shows up in the data from MAACS.\n\n::: callout-tip\n### Note\n\nBecause the individual-level data for this study are protected by various U.S. 
privacy laws, we cannot make those data available.\n\nFor the purposes of this lesson, we have **simulated data** that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\n::: callout-tip\n### Example\n\nWe can look at the data quickly by reading it in as a tibble with `read_csv()` in the `tidyverse` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nmaacs <- read_csv(here(\"data\", \"bmi_pm25_no2_sim.csv\"),\n col_types = \"nnci\"\n)\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 517 × 4\n logpm25 logno2_new bmicat NocturnalSympt\n \n 1 1.25 1.18 normal weight 1\n 2 1.12 1.55 overweight 0\n 3 1.93 1.43 normal weight 0\n 4 1.37 1.77 overweight 2\n 5 0.775 0.765 normal weight 0\n 6 1.49 1.11 normal weight 0\n 7 2.16 1.43 normal weight 0\n 8 1.65 1.40 normal weight 0\n 9 1.55 1.81 normal weight 0\n10 2.04 1.35 overweight 3\n# ℹ 507 more rows\n```\n:::\n:::\n\n:::\n\nThe outcome we will look at here (`NocturnalSympt`) is the number of days in the past 2 weeks where the child experienced asthma symptoms (e.g. 
coughing, wheezing) while sleeping.\n\nThe other key variables are:\n\n- `logpm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter) on the log scale\n\n- `logno2_new`: level of nitrogen dioxide (NO2) on the log scale\n\n- `bmicat`: categorical variable with BMI status\n\n# Building up in layers\n\nFirst, we can **create a `ggplot` object** that stores the dataset and the basic aesthetics for mapping the x- and y-coordinates for the plot.\n\n::: callout-tip\n### Example\n\nHere, we will eventually be plotting the log of PM2.5 and the `NocturnalSympt` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(maacs, aes(\n x = logpm25,\n y = NocturnalSympt\n))\nsummary(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ndata: logpm25, logno2_new, bmicat, NocturnalSympt [517x4]\nmapping: x = ~logpm25, y = ~NocturnalSympt\nfaceting: \n compute_layout: function\n draw_back: function\n draw_front: function\n draw_labels: function\n draw_panels: function\n finish_data: function\n init_scales: function\n map_data: function\n params: list\n setup_data: function\n setup_params: function\n shrink: TRUE\n train_scales: function\n vars: function\n super: \n```\n:::\n\n```{.r .cell-code}\nclass(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"gg\" \"ggplot\"\n```\n:::\n:::\n\n:::\n\nYou can see above that the object `g` contains the dataset `maacs` and the mappings.\n\nNow, normally if you were to `print()` a `ggplot` object, a plot would appear on the plot device. However, our object `g` does not yet contain enough information to make a plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\nprint(g)\n```\n\n::: {.cell-output-display}\n![Nothing to see here!](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n## First plot with point layer\n\nTo make a scatter plot, we need to add at least one **geom**, such as points.\n\nHere, we add the `geom_point()` function to create a 
traditional scatter plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\ng + geom_point()\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and days with nocturnal symptoms](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nHow does ggplot know what points to plot? In this case, it can grab them from the data frame `maacs` that served as the input into the `ggplot()` function.\n\n## Adding more layers\n\n### smooth\n\nBecause the data appear rather noisy, it might be better if we added a smoother on top of the points to see if there is a trend in the data with PM2.5.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth()\n```\n\n::: {.cell-output-display}\n![Scatterplot with smoother](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nThe default smoother is a loess smoother, which is flexible and nonparametric but might be too flexible for our purposes. Perhaps we'd prefer a simple linear regression line to highlight any first order trends. 
We can do this by specifying `method = \"lm\"` to `geom_smooth()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Scatterplot with linear regression line](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nHere, we can see there appears to be a slight increasing trend, suggesting that higher levels of PM2.5 are associated with increased days with nocturnal symptoms.\n\n::: callout-note\n### Question\n\nLet's use the `ggplot()` function with our `palmerpenguins` dataset example and make a scatter plot with `flipper_length_mm` on the x-axis, `bill_length_mm` on the y-axis, colored by `species`, and a smoother by adding a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n### facets\n\nBecause our primary question involves comparing overweight individuals to normal weight individuals, we can **stratify the scatter plot** of PM2.5 and nocturnal symptoms by the BMI category (`bmicat`) variable, which indicates whether an individual is overweight or not.\n\nTo visualize this, we can **add a `facet_grid()`**, which takes a formula argument.\n\n::: callout-tip\n### Example\n\nWe want one row and two columns, one column for each weight category. 
So we specify `bmicat` on the right-hand side of the formula passed to `facet_grid()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth(method = \"lm\") +\n facet_grid(. ~ bmicat)\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and nocturnal symptoms by BMI category](index_files/figure-html/unnamed-chunk-8-1.png){width=864}\n:::\n:::\n\n:::\n\nNow it seems clear that the relationship between PM2.5 and nocturnal symptoms is relatively flat among normal weight individuals, while the relationship is increasing among overweight individuals.\n\nThis plot suggests that overweight individuals may be more susceptible to the effects of PM2.5.\n\n# Modifying geom properties\n\nYou can **modify properties of geoms** by specifying options to their respective `geom_*()` functions.\n\n### map aesthetics to constants\n\n::: callout-tip\n### Example\n\nFor example, here we modify the points in the scatterplot to make the color \"steelblue\", the size larger, and the alpha transparency greater.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(color = \"steelblue\", size = 4, alpha = 1 / 2)\n```\n\n::: {.cell-output-display}\n![Modifying point color with a constant](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n:::\n\n### map aesthetics to variables\n\nIn addition to setting specific geom attributes to constant values, we can **map aesthetics to variables** in our dataset.\n\nFor example, we can map the aesthetic `color` to the variable `bmicat`, so the points will be colored according to the levels of `bmicat`.\n\nWe use the `aes()` function to indicate this difference from the plot above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(aes(color = bmicat), size = 4, alpha = 1 / 2)\n```\n\n::: {.cell-output-display}\n![Mapping color to a variable](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Customizing the smooth\n\nWe can also **customize aspects of the geoms**.\n\nFor 
example, we can customize the smoother that we overlay on the points with `geom_smooth()`.\n\nHere we change the line type and increase the size from the default. We also remove the shaded standard error from the line.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat),\n size = 2,\n alpha = 1 / 2\n ) +\n geom_smooth(\n size = 4,\n linetype = 3,\n method = \"lm\",\n se = FALSE\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n:::\n\n::: {.cell-output-display}\n![Customizing a smoother](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\n# Other important stuff\n\n## Changing the theme\n\nThe **default theme for `ggplot2` uses the gray background** with white grid lines.\n\nIf you don't find this suitable, you can use the black and white theme by using the `theme_bw()` function.\n\nThe `theme_bw()` function also allows you to set the typeface for the plot, in case you don't want the default Helvetica. Here we change the typeface to Times.\n\n::: callout-tip\n### Note\n\nFor things that only make sense globally, use `theme()`, i.e. `theme(legend.position = \"none\")`. 
Two standard appearance themes are included\n\n- `theme_gray()`: The default theme (gray background)\n- `theme_bw()`: More stark/plain\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat)) +\n theme_bw(base_family = \"Times\")\n```\n\n::: {.cell-output-display}\n![Modifying the theme for a plot](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n::: callout-note\n### Question\n\nLet's take our `palmerpenguins` scatterplot from above and change out the theme to use `theme_dark()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## Modifying labels\n\n::: callout-tip\n### Note\n\nThere are a variety of **annotations** you can add to a plot, including **different kinds of labels**.\n\n- `xlab()` for x-axis labels\n- `ylab()` for y-axis labels\n- `ggtitle()` for specifying plot titles\n\n`labs()` function is generic and can be used to modify multiple types of labels at once\n:::\n\nHere is an example of modifying the title and the `x` and `y` labels to make the plot a bit more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat)) +\n labs(title = \"MAACS Cohort\") +\n labs(\n x = expression(\"log \" * PM[2.5]),\n y = \"Nocturnal Symptoms\"\n )\n```\n\n::: {.cell-output-display}\n![Modifying plot 
labels](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## A quick aside about axis limits\n\nOne quick **quirk about `ggplot2`** that caught me up when I first started using the package can be displayed in the following example.\n\nIf you make a lot of time series plots, you often **want to restrict the range of the y-axis** while still plotting all the data.\n\nIn the base graphics system you can do that as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntestdat <- data.frame(\n x = 1:100,\n y = rnorm(100)\n)\ntestdat[50, 2] <- 100 ## Outlier!\nplot(testdat$x,\n testdat$y,\n type = \"l\",\n ylim = c(-3, 3)\n)\n```\n\n::: {.cell-output-display}\n![Time series plot with base graphics](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nHere, we have restricted the y-axis range to be between -3 and 3, even though there is a clear outlier in the data.\n\n::: callout-tip\n### Example\n\nWith `ggplot2` the default settings will give you this.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(testdat, aes(x = x, y = y))\ng + geom_line()\n```\n\n::: {.cell-output-display}\n![Time series plot with default settings](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nOne might think that modifying the `ylim()` attribute would give you the same thing as the base plot, but it doesn't (?????)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_line() +\n ylim(-3, 3)\n```\n\n::: {.cell-output-display}\n![Time series plot with modified ylim](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n:::\n\nEffectively, what this does is subset the data so that only observations between -3 and 3 are included, then plot the data.\n\nTo plot the data without subsetting it first and still get the restricted range, you have to do the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_line() +\n coord_cartesian(ylim = c(-3, 3))\n```\n\n::: {.cell-output-display}\n![Time series plot with 
restricted y-axis range](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nAnd now you know!\n\n# Post-lecture materials\n\n### Resources\n\n- The *ggplot2* book by Hadley Wickham\n- The *R Graphics Cookbook* by Winston Chang (examples in base plots and in `ggplot2`)\n- [tidyverse web site](http://ggplot2.tidyverse.org)\n\n### More complex example with `ggplot2`\n\nNow you get the sense that plots in the `ggplot2` system are constructed by successively adding components to the plot, starting with the base dataset and maybe a scatterplot. In this section below, you can see a slightly more complicated example with an additional variable.\n\n
\n\nClick here for a slightly more complicated example with `ggplot()`.\n\nNow, we will ask the question\n\n> How does the relationship between PM2.5 and nocturnal symptoms vary by BMI category and nitrogen dioxide (NO2)?\n\nUnlike our previous BMI variable, NO2 is continuous, and so we need to make NO2 categorical so we can condition on it in the plotting. We can use the `cut()` function for this purpose. We will divide the NO2 variable into tertiles.\n\nFirst we need to calculate the tertiles with the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)\n```\n:::\n\n\nThen we need to divide the original `logno2_new` variable into the ranges defined by the cut points computed above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmaacs$no2tert <- cut(maacs$logno2_new, cutpoints)\n```\n:::\n\n\nThe `no2tert` variable is now a categorical factor variable containing 3 levels, indicating the ranges of NO2 (on the log scale).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## See the levels of the newly created factor variable\nlevels(maacs$no2tert)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"(0.342,1.23]\" \"(1.23,1.47]\" \"(1.47,2.17]\" \n```\n:::\n:::\n\n\nThe final plot shows the relationship between PM2.5 and nocturnal symptoms by BMI category and NO2 tertile.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Setup ggplot with data frame\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\n\n## Add layers\ng + geom_point(alpha = 1 / 3) +\n facet_grid(bmicat ~ no2tert) +\n geom_smooth(method = \"lm\", se = FALSE, col = \"steelblue\") +\n theme_bw(base_family = \"Avenir\", base_size = 10) +\n labs(x = expression(\"log \" * PM[2.5])) +\n labs(y = \"Nocturnal Symptoms\") +\n labs(title = \"MAACS Cohort\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![PM2.5 and nocturnal symptoms by BMI category and 
NO2 tertile](index_files/figure-html/unnamed-chunk-22-1.png){width=864}\n:::\n:::\n\n\n
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What happens if you facet on a continuous variable?\n\n2. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?\n\n3. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\n\n4. What does `geom_col()` do? How is it different to `geom_bar()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 
4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n 
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png index d15ae2e..807f126 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png index cd87fdf..ed79c8a 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png index 0c3c0d5..126a383 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png index a455f86..1f3afa1 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png and 
b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/posts/15-control-structures/index/execute-results/html.json b/_freeze/posts/15-control-structures/index/execute-results/html.json index 326fd20..ff844d6 100644 --- a/_freeze/posts/15-control-structures/index/execute-results/html.json +++ b/_freeze/posts/15-control-structures/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4468daea3a3229a9227b455c6194b86c", + "hash": "5879f7961c587861a8d64dc17e5f9fdf", "result": { - "markdown": "---\ntitle: \"15 - Control Structures\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to control the flow of execution of a series of R expressions\"\ncategories: [module 4, week 4, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/15-control-structures/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to use commonly used control structures including `if`, `while`, `repeat`, and `for`\n- Be able to skip an iteration of a loop using `next`\n- Be able to exit a loop immediately using `break`\n:::\n\n# Control Structures\n\n**Control structures** in R allow you to **control the flow of execution of a series of R expressions**.\n\nBasically, control structures allow you to put some \"logic\" into your R code, rather than just always executing the same R code every time.\n\nControl structures **allow you to respond to inputs or to features of the data** and execute different R expressions accordingly.\n\nCommonly used control structures are\n\n- `if` and `else`: testing a condition and acting on it\n\n- `for`: execute a loop a fixed number of times\n\n- `while`: execute a loop *while* a condition is true\n\n- `repeat`: execute an infinite loop (must `break` out of it to stop)\n\n- `break`: break the execution of a loop\n\n- `next`: skip an interation of a loop\n\n::: callout-tip\n### Pro-tip\n\nMost control structures are not used in interactive sessions, but rather when writing functions or longer expressions.\n\nHowever, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions.\n:::\n\n## `if`-`else`\n\nThe `if`-`else` combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it's true or false.\n\nFor starters, you can just use the `if` statement.\n\n``` r\nif() {\n ## do something\n} \n## Continue with rest of code\n```\n\nThe above code does nothing if the condition is false. 
If you have an action you want to execute when the condition is false, then you need an `else` clause.\n\n``` r\nif() {\n ## do something\n} \nelse {\n ## do something else\n}\n```\n\nYou can have a series of tests by following the initial `if` with any number of `else if`s.\n\n``` r\nif() {\n ## do something\n} else if() {\n ## do something different\n} else {\n ## do something different\n}\n```\n\nHere is an example of a valid if/else structure.\n\nLet's use the `runif(n, min=0, max=1)` function which draws a random value between a min and max value with the default being between 0 and 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(n=1, min=0, max=10) \nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4.495949\n```\n:::\n:::\n\n\nThen, we can write and `if`-`else` statement that tests whethere `x` is greater than 3 or not.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx > 3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\nIf `x` is greater than 3, then the first condition occurs. If `x` is not greater than 3, then the second condition occurs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif(x > 3) {\n y <- 10\n } else {\n y <- 0\n }\n```\n:::\n\n\nFinally, we can auto print `y` to see what the value is.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\nThis expression can also be written a different (but equivalent!) way in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- if(x > 3) {\n 10\n } else { \n 0\n }\n\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNeither way of writing this expression is more correct than the other.\n\nWhich one you use will **depend on your preference** and perhaps those of the team you may be working with.\n:::\n\nOf course, the `else` clause is not necessary. 
You could have a series of if clauses that always get executed if their respective conditions are true.\n\n``` r\nif() {\n\n}\n\nif() {\n\n}\n```\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset and write a if-else statement that\n\n1. Randomly samples a value from a standard normal distribution (**Hint**: check out the `rnorm(n, mean = 0, sd = 1)` function in base R).\n2. If the value is larger than 0, use `dplyr` functions to keep only the `Chinstrap` penguins.\n3. Otherwise, keep only the `Gentoo` penguins.\n4. Re-run the code 10 times and look at output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(tidyverse)\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## `for` Loops\n\n**For loops** are pretty much the only looping construct that you will need in R. 
While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop was not sufficient.\n\nIn R, for loops take an iterator variable and assign it successive values from a sequence or vector.\n\nFor loops are most commonly used for **iterating over the elements of an object** (list, vector, etc.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:10) {\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n[1] 10\n```\n:::\n:::\n\n\nThis **loop takes the `i` variable** and in **each iteration of the loop** gives it values 1, 2, 3, ..., 10, then **executes the code** within the curly braces, and then the loop exits.\n\nThe following three loops all have the same behavior.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor(i in 1:4) {\n ## Print out each element of 'x'\n print(x[i]) \n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nWe can also print just the iteration value (`i`) itself\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor(i in 1:4) {\n ## Print out just 'i'\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n```\n:::\n:::\n\n\n### `seq_along()`\n\nThe `seq_along()` function is **commonly used in conjunction with `for` loops** in order to generate an integer sequence based on the length of an object (or `ncol()` of an R object) (in this case, the object `x`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\" \"c\" \"d\"\n```\n:::\n\n```{.r .cell-code}\nseq_along(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4\n```\n:::\n:::\n\n\nThe 
`seq_along()` function takes in a vector and then **returns a sequence of integers** that is the same length as the input vector. It doesn't matter what class the vector is.\n\nLet's put `seq_along()` and `for` loops together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate a sequence based on length of 'x'\nfor(i in seq_along(x)) { \n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nIt is not necessary to use an index-type variable (i.e. `i`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(babyshark in x) {\n print(babyshark)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(candyisgreat in x) {\n print(candyisgreat)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(RememberToVote in x) {\n print(RememberToVote)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nYou can use any character index you want (but not with symbols or numbers).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(1999 in x) {\n print(1999)\n}\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: :1:5: unexpected numeric constant\n1: for(1999\n ^\n```\n:::\n:::\n\n\nFor one line loops, the curly braces are not strictly necessary.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:4) print(x[i])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nHowever, I like to use curly braces even for one-line loops, because that way if you decide to expand the loop to multiple lines, you won't be burned because you forgot to add curly braces (and you **will** be burned by this).\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset. Here are the tasks:\n\n1. 
Start a `for` loop\n2. Iterate over the columns of `penguins`\n3. For each column, extract the values of that column (**Hint**: check out the `pull()` function in `dplyr`).\n4. Using a `if`-`else` statement, test whether or not the values in the column are numeric or not (**Hint**: remember the `is.numeric()` function to test if a value is numeric).\n5. If they are numeric, compute the column mean. Otherwise, report a `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### Nested `for` loops\n\n`for` loops can be **nested** inside of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(1:6, nrow = 2, ncol = 3)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in seq_len(nrow(x))) {\n for(j in seq_len(ncol(x))) {\n print(x[i, j])\n } \n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 3\n[1] 5\n[1] 2\n[1] 4\n[1] 6\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `j` index goes across the columns. That's why we values 1, 3, etc.\n:::\n\nNested loops are commonly needed for **multidimensional or hierarchical data structures** (e.g. matrices, lists). 
Be careful with nesting though.\n\nNesting beyond 2 to 3 levels often makes it **difficult to read/understand the code**.\n\nIf you find yourself in need of a large number of nested loops, you may want to **break up the loops by using functions** (discussed later).\n\n## `while` Loops\n\n**`while` loops** begin by **testing a condition**.\n\nIf it is true, then they execute the loop body.\n\nOnce the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount <- 0\nwhile(count < 10) {\n print(count)\n count <- count + 1\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n```\n:::\n:::\n\n\n`while` loops can potentially result in infinite loops if not written properly. **Use with care!**\n\nSometimes there will be more than one condition in the test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- 5\nset.seed(1)\n\nwhile(z >= 3 && z <= 10) {\n coin <- rbinom(1, 1, 0.5)\n \n if(coin == 1) { ## random walk\n z <- z + 1\n } else {\n z <- z - 1\n } \n}\nprint(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\nWhat's the difference between using one `&` or two `&&` ?\n\nIf you use only one `&`, these are vectorized operations, meaning they can **return a vector**, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n-2:2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -2 -1 0 1 2\n```\n:::\n\n```{.r .cell-code}\n((-2:2) >= 0) & ((-2:2) <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIf you use two `&&` (as above), then these **conditions are evaluated left to right**. 
For example, in the above code, if `z` were less than 3, the second test would not have been evaluated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\n(-2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n:::\n\n## `repeat` Loops\n\n**`repeat` initiates an infinite loop** right from the start. These are **not commonly used** in statistical or data analysis applications, but they do have their uses.\n\n::: callout-tip\n### IMPORTANT (READ THIS AND DON'T FORGET... I'M SERIOUS... YOU WANT TO REMEMBER THIS.. FOR REALZ PLZ REMEMBER THIS)\n\nThe only way to exit a `repeat` loop is to call `break`.\n:::\n\nOne possible paradigm might be in an iterative algorithm where you may be searching for a solution and you do not want to stop until you are close enough to the solution.\n\nIn this kind of situation, you often don't know in advance how many iterations it's going to take to get \"close enough\" to the solution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx0 <- 1\ntol <- 1e-8\n\nrepeat {\n x1 <- computeEstimate()\n \n if(abs(x1 - x0) < tol) { ## Close enough?\n break\n } else {\n x0 <- x1\n } \n}\n```\n:::\n\n\n::: callout-tip\n### Note\n\nThe above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this demonstration).\n:::\n\n::: callout-tip\n### Pro-tip\n\nThe loop above is a bit **dangerous** because there is no guarantee it will stop.\n\nYou could get in a situation where the values of `x0` and `x1` oscillate back and forth and never converge.\n\nBetter to set a hard limit on the number of iterations by using a `for` loop and then report whether convergence was achieved or not.\n:::\n\n## `next`, `break`\n\n`next` is used to skip an iteration of a loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:100) {\n if(i <= 20) {\n ## Skip the first 20 
iterations\n next \n }\n ## Do something here\n}\n```\n:::\n\n\n`break` is used to exit a loop immediately, regardless of what iteration the loop may be on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:100) {\n print(i)\n\n if(i > 20) {\n ## Stop loop after 20 iterations\n break \n }\t\t\n}\n```\n:::\n\n\n# Summary\n\n- Control structures like `if`, `while`, and `for` allow you to control the flow of an R program\n- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct.\n- Control structures mentioned here are primarily useful for writing programs; for command-line interactive work, the \"apply\" functions are more useful.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write for loops to compute the mean of every column in `mtcars`.\n\n2. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, `files <- dir(\"data/\", pattern = \"\\\\.csv$\", full.names = TRUE)`, and now want to read each one with `read_csv()`. Write the for loop that will load them into a single data frame.\n\n3. What happens if you use `for (nm in names(x))` and `x` has no names? What if only some of the elements are named? 
What if the names are not unique?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 
4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"15 - Control Structures\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to control the flow of execution of a series of R expressions\"\ncategories: [module 4, week 4, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/15-control-structures/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to use commonly used control structures including `if`, `while`, `repeat`, and `for`\n- Be able to skip an iteration of a loop using `next`\n- Be able to exit a loop immediately using `break`\n:::\n\n# Control Structures\n\n**Control structures** in R allow you to **control the flow of execution of a series of R expressions**.\n\nBasically, control structures allow you to put some \"logic\" into your R code, rather than just always executing the same R code every time.\n\nControl structures **allow you to respond to inputs or to features of the data** and execute different R expressions accordingly.\n\nCommonly used control structures are\n\n- `if` and `else`: testing a condition and acting on it\n\n- `for`: execute a loop a fixed number of times\n\n- `while`: execute a loop *while* a condition is true\n\n- `repeat`: execute an infinite loop (must `break` out of it to stop)\n\n- `break`: break the execution of a loop\n\n- `next`: skip an iteration of a loop\n\n::: callout-tip\n### Pro-tip\n\nMost control structures are not used in interactive sessions, but rather when writing functions or longer expressions.\n\nHowever, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions.\n:::\n\n## `if`-`else`\n\nThe `if`-`else` combination is probably the most commonly used control structure in R (or perhaps any language). 
This structure allows you to test a condition and act on it depending on whether it's true or false.\n\nFor starters, you can just use the `if` statement.\n\n``` r\nif() {\n ## do something\n} \n## Continue with rest of code\n```\n\nThe above code does nothing if the condition is false. If you have an action you want to execute when the condition is false, then you need an `else` clause.\n\n``` r\nif() {\n ## do something\n} else {\n ## do something else\n}\n```\n\nYou can have a series of tests by following the initial `if` with any number of `else if`s.\n\n``` r\nif() {\n ## do something\n} else if() {\n ## do something different\n} else {\n ## do something different\n}\n```\n\nHere is an example of a valid if/else structure.\n\nLet's use the `runif(n, min=0, max=1)` function which draws a random value between a min and max value with the default being between 0 and 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(n = 1, min = 0, max = 10)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.521267\n```\n:::\n:::\n\n\nThen, we can write an `if`-`else` statement that tests whether `x` is greater than 3 or not.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx > 3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\nIf `x` is greater than 3, then the first condition occurs. If `x` is not greater than 3, then the second condition occurs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (x > 3) {\n y <- 10\n} else {\n y <- 0\n}\n```\n:::\n\n\nFinally, we can auto print `y` to see what the value is.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\nThis expression can also be written a different (but equivalent!) 
way in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- if (x > 3) {\n 10\n} else {\n 0\n}\n\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNeither way of writing this expression is more correct than the other.\n\nWhich one you use will **depend on your preference** and perhaps those of the team you may be working with.\n:::\n\nOf course, the `else` clause is not necessary. You could have a series of `if` clauses that always get executed if their respective conditions are true.\n\n``` r\nif (<condition1>) {\n\n}\n\nif (<condition2>) {\n\n}\n```\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset and write an `if`-`else` statement that\n\n1. Randomly samples a value from a standard normal distribution (**Hint**: check out the `rnorm(n, mean = 0, sd = 1)` function in base R).\n2. If the value is larger than 0, use `dplyr` functions to keep only the `Chinstrap` penguins.\n3. Otherwise, keep only the `Gentoo` penguins.\n4. Re-run the code 10 times and look at the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(tidyverse)\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n <fct> <fct> <dbl> <dbl> <int> <int>\n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex <fct>, year <int>\n```\n:::\n:::\n\n:::\n\n## `for` Loops\n\n**For loops** are pretty much the only looping construct that you will need in R. 
While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop was not sufficient.\n\nIn R, for loops take an iterator variable and assign it successive values from a sequence or vector.\n\nFor loops are most commonly used for **iterating over the elements of an object** (list, vector, etc.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n[1] 10\n```\n:::\n:::\n\n\nThis **loop takes the `i` variable** and in **each iteration of the loop** gives it values 1, 2, 3, ..., 10, then **executes the code** within the curly braces, and then the loop exits.\n\nThe following three loops all have the same behavior.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor (i in 1:4) {\n ## Print out each element of 'x'\n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nWe can also print just the iteration value (`i`) itself\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor (i in 1:4) {\n ## Print out just 'i'\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n```\n:::\n:::\n\n\n### `seq_along()`\n\nThe `seq_along()` function is **commonly used in conjunction with `for` loops** in order to generate an integer sequence based on the length of an object (or `ncol()` of an R object) (in this case, the object `x`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\" \"c\" \"d\"\n```\n:::\n\n```{.r .cell-code}\nseq_along(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4\n```\n:::\n:::\n\n\nThe 
`seq_along()` function takes in a vector and then **returns a sequence of integers** that is the same length as the input vector. It doesn't matter what class the vector is.\n\nLet's put `seq_along()` and `for` loops together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate a sequence based on length of 'x'\nfor (i in seq_along(x)) {\n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nIt is not necessary to use an index-type variable (i.e. `i`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (babyshark in x) {\n print(babyshark)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (candyisgreat in x) {\n print(candyisgreat)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (RememberToVote in x) {\n print(RememberToVote)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nYou can use almost any name you want for the loop variable, as long as it is a valid R object name (it cannot be a number or start with a symbol).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (1999 in x) {\n print(1999)\n}\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: <text>:1:6: unexpected numeric constant\n1: for (1999\n ^\n```\n:::\n:::\n\n\nFor one-line loops, the curly braces are not strictly necessary.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:4) print(x[i])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nHowever, I like to use curly braces even for one-line loops, because that way if you decide to expand the loop to multiple lines, you won't be burned because you forgot to add curly braces (and you **will** be burned by this).\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset. Here are the tasks:\n\n1. 
Start a `for` loop\n2. Iterate over the columns of `penguins`\n3. For each column, extract the values of that column (**Hint**: check out the `pull()` function in `dplyr`).\n4. Using an `if`-`else` statement, test whether the values in the column are numeric or not (**Hint**: remember the `is.numeric()` function to test if a value is numeric).\n5. If they are numeric, compute the column mean. Otherwise, report `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### Nested `for` loops\n\n`for` loops can be **nested** inside of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(1:6, nrow = 2, ncol = 3)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in seq_len(nrow(x))) {\n for (j in seq_len(ncol(x))) {\n print(x[i, j])\n }\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 3\n[1] 5\n[1] 2\n[1] 4\n[1] 6\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `j` index goes across the columns. That's why we see the values 1, 3, 5 first (the first row), followed by 2, 4, 6 (the second row).\n:::\n\nNested loops are commonly needed for **multidimensional or hierarchical data structures** (e.g. matrices, lists). 
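For example, here is a minimal sketch (the list `dat` is invented purely for illustration) of a nested loop walking a list of vectors, a simple hierarchical structure:\n\n``` r\n## 'dat' is a hypothetical list of numeric vectors of different lengths\ndat <- list(a = c(1, 2), b = c(10, 20, 30))\n\nfor (i in seq_along(dat)) {\n for (j in seq_along(dat[[i]])) {\n ## print the vector name, the position, and the value\n cat(names(dat)[i], j, dat[[i]][j], \"\\n\")\n }\n}\n```\n\nThe outer loop picks a vector and the inner loop visits each of its elements.\n\n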
Be careful with nesting though.\n\nNesting beyond 2 to 3 levels often makes it **difficult to read/understand the code**.\n\nIf you find yourself in need of a large number of nested loops, you may want to **break up the loops by using functions** (discussed later).\n\n## `while` Loops\n\n**`while` loops** begin by **testing a condition**.\n\nIf it is true, then they execute the loop body.\n\nOnce the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount <- 0\nwhile (count < 10) {\n print(count)\n count <- count + 1\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n```\n:::\n:::\n\n\n`while` loops can potentially result in infinite loops if not written properly. **Use with care!**\n\nSometimes there will be more than one condition in the test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- 5\nset.seed(1)\n\nwhile (z >= 3 && z <= 10) {\n coin <- rbinom(1, 1, 0.5)\n\n if (coin == 1) { ## random walk\n z <- z + 1\n } else {\n z <- z - 1\n }\n}\nprint(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\nWhat's the difference between using one `&` or two `&&` ?\n\nIf you use only one `&`, these are vectorized operations, meaning they can **return a vector**, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n-2:2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -2 -1 0 1 2\n```\n:::\n\n```{.r .cell-code}\n((-2:2) >= 0) & ((-2:2) <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIf you use two `&&` (as above), then these **conditions are evaluated left to right**. 
For example, in the above code, if `z` were less than 3, the second test would not have been evaluated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\n(-2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n:::\n\n## `repeat` Loops\n\n**`repeat` initiates an infinite loop** right from the start. These are **not commonly used** in statistical or data analysis applications, but they do have their uses.\n\n::: callout-tip\n### IMPORTANT (READ THIS AND DON'T FORGET... I'M SERIOUS... YOU WANT TO REMEMBER THIS.. FOR REALZ PLZ REMEMBER THIS)\n\nThe only way to exit a `repeat` loop is to call `break`.\n:::\n\nOne possible paradigm might be in an iterative algorithm where you may be searching for a solution and you do not want to stop until you are close enough to the solution.\n\nIn this kind of situation, you often don't know in advance how many iterations it's going to take to get \"close enough\" to the solution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx0 <- 1\ntol <- 1e-8\n\nrepeat {\n x1 <- computeEstimate()\n\n if (abs(x1 - x0) < tol) { ## Close enough?\n break\n } else {\n x0 <- x1\n }\n}\n```\n:::\n\n\n::: callout-tip\n### Note\n\nThe above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this demonstration).\n:::\n\n::: callout-tip\n### Pro-tip\n\nThe loop above is a bit **dangerous** because there is no guarantee it will stop.\n\nYou could get in a situation where the values of `x0` and `x1` oscillate back and forth and never converge.\n\nBetter to set a hard limit on the number of iterations by using a `for` loop and then report whether convergence was achieved or not.\n:::\n\n## `next`, `break`\n\n`next` is used to skip an iteration of a loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:100) {\n if (i <= 20) {\n ## Skip the first 20 
iterations\n next\n }\n ## Do something here\n}\n```\n:::\n\n\n`break` is used to exit a loop immediately, regardless of what iteration the loop may be on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:100) {\n print(i)\n\n if (i > 20) {\n ## Stop loop after 20 iterations\n break\n }\n}\n```\n:::\n\n\n# Summary\n\n- Control structures like `if`, `while`, and `for` allow you to control the flow of an R program\n- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct.\n- Control structures mentioned here are primarily useful for writing programs; for command-line interactive work, the \"apply\" functions are more useful.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write for loops to compute the mean of every column in `mtcars`.\n\n2. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, `files <- dir(\"data/\", pattern = \"\\\\.csv$\", full.names = TRUE)`, and now want to read each one with `read_csv()`. Write the for loop that will load them into a single data frame.\n\n3. What happens if you use `for (nm in names(x))` and `x` has no names? What if only some of the elements are named? 
What if the names are not unique?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 
4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/16-functions/index/execute-results/html.json b/_freeze/posts/16-functions/index/execute-results/html.json index 21be540..d99d8f2 100644 --- a/_freeze/posts/16-functions/index/execute-results/html.json +++ b/_freeze/posts/16-functions/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2bcbe1b7adebc05021228f054529b187", + "hash": "4e89da80ee98fc8abb67c9e1b35b137b", "result": { - "markdown": "---\ntitle: \"16 - Functions\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to writing functions in R\"\ncategories: [module 4, week 4, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/16-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know how to create a **function** using `function()` in R\n- Know how to define **named arguments** inside a function with default values\n- Be able to use named matching or **positional matching** in the argument list\n- Understand what **lazy evaluation** is\n- Understand the special `...` argument in a function definition\n:::\n\n# Introduction\n\nWriting functions is a **core activity** of an R programmer. It represents the key step of the transition from a mere \"user\" to a developer who creates new functionality for R.\n\n**Functions** are often used to **encapsulate a sequence of expressions that need to be executed numerous times**, perhaps under slightly different conditions.\n\nFunctions are also often written **when code must be shared with others or the public**.\n\nWriting a function allows a developer to create an interface to the code that is explicitly specified with a set of **arguments** (or parameters).\n\nThis interface provides an **abstraction of the code** to potential users. 
This abstraction simplifies the users' lives because it relieves them from having to know every detail of how the code operates.\n\nIn addition, the creation of an interface allows the developer to **communicate to the user the aspects of the code that are important** or are most relevant.\n\n## Functions in R\n\nFunctions in R are \"first class objects\", which means that they can be treated much like any other R object.\n\n::: callout-tip\n### Important facts about R functions\n\n- Functions can be passed as arguments to other functions.\n - This is very handy for the various apply functions, like `lapply()` and `sapply()`.\n- Functions can be nested, so that you can define a function inside of another function.\n:::\n\nIf you are familiar with a common language like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.\n\n## Your First Function\n\nFunctions are defined using the `function()` directive and are **stored as R objects** just like anything else.\n\n::: callout-tip\n### Important\n\nIn particular, functions are R objects of class `function`.\n\nHere's a simple function that takes no arguments and does nothing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n ## This is an empty function\n}\n## Functions have their own class\nclass(f) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"function\"\n```\n:::\n\n```{.r .cell-code}\n## Execute this function\nf() \n```\n\n::: {.cell-output .cell-output-stdout}\n```\nNULL\n```\n:::\n:::\n\n:::\n\nNot very interesting, but it is a start!\n\nThe next thing we can do is **create a function** that actually has a non-trivial **function body**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n # this is the function body\n hello <- \"Hello, world!\\n\"\n cat(hello) \n}\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\n`cat()` is useful and 
preferable to `print()` in several settings. One reason is that it interprets escape characters such as `\\n`, printing an actual new line instead of the literal characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhello <- \"Hello, world!\\n\"\n\nprint(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"Hello, world!\\n\"\n```\n:::\n\n```{.r .cell-code}\ncat(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n:::\n\nThe last aspect of a basic function is the **function arguments**.\n\nThese are **the options that you can specify to the user** that the user may explicitly set.\n\nFor this basic function, we can add an argument that determines how many times \"Hello, world!\" is printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n for(i in seq_len(num)) {\n hello <- \"Hello, world!\\n\"\n cat(hello) \n }\n}\nf(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n:::\n\n\nObviously, we **could have just cut-and-pasted** the `cat(\"Hello, world!\\n\")` code three times to achieve the same effect, but then we wouldn't be programming, would we?\n\nAlso, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see \"Hello, world!\".\n\n::: callout-tip\n### Pro-tip\n\nIf you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function.\n:::\n\nFinally, the function above doesn't **return** anything.\n\nIt just prints \"Hello, world!\" to the console `num` number of times and then exits.\n\nBut often it is useful **if a function returns something** that perhaps can be fed into another section of code.\n\nThis next function returns the total number of characters printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n hello <- \"Hello, world!\\n\"\n for(i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n 
chars\n}\nmeaningoflife <- f(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n\n```{.r .cell-code}\nprint(meaningoflife)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 42\n```\n:::\n:::\n\n\nIn the above function, we did not have to indicate anything special in order for the function to return the number of characters.\n\nIn R, the **return value of a function** is always the very **last expression that is evaluated**.\n\nBecause the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function.\n\n::: callout-tip\n### Note\n\nThere is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this lesson).\n:::\n\nFinally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in f(): argument \"num\" is missing, with no default\n```\n:::\n:::\n\n\nWe can modify this behavior by setting a **default value** for the argument `num`.\n\n**Any function argument can have a default value**, if you wish to specify it.\n\nSometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. 
This relieves the user from having to specify the value of that argument every single time the function is called.\n\nHere, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print \"Hello, world!\" to the console once.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num = 1) {\n hello <- \"Hello, world!\\n\"\n for(i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n chars\n}\n\n\nf() ## Use default value for 'num'\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n\n```{.r .cell-code}\nf(2) ## Use user-specified value\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nRemember that the function still returns the number of characters printed to the console.\n\n::: callout-tip\n### Pro-tip\n\nThe `formals()` function returns a list of all the formal arguments of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nformals(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$num\n[1] 1\n```\n:::\n:::\n\n:::\n\n## Summary\n\nWe have written a function that\n\n- has one *formal argument* named `num` with a *default value* of 1. 
The *formal arguments* are the arguments included in the function definition.\n\n- prints the message \"Hello, world!\" to the console a number of times indicated by the argument `num`\n\n- *returns* the number of characters printed to the console\n\n# Arguments\n\n## Named arguments\n\nAbove, we have learned that functions have **named arguments**, which can optionally have default values.\n\nBecause all function arguments have names, they can be specified using their name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf(num = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nSpecifying an argument by its name is sometimes useful **if a function has many arguments** and it may not always be clear which argument is being specified.\n\nHere, our function only has one argument so there's no confusion.\n\n## Argument matching\n\nCalling an **R function with multiple arguments** can be done in a variety of ways.\n\nThis may be confusing at first, but it's really handy when doing interactive work at the command line. 
R function arguments can be matched **positionally** or **by name**.\n\n- **Positional matching** just means that R assigns the first value to the first argument, the second value to the second argument, etc.\n\nSo, in the following call to `rnorm()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (n, mean = 0, sd = 1) \n```\n:::\n\n```{.r .cell-code}\nmydata <- rnorm(100, 2, 1) ## Generate some data\n```\n:::\n\n\n100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching.\n\nThe following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent.\n\n::: callout-tip\n### Note\n\n`sd(x, na.rm = FALSE)` has two arguments:\n\n- `x` indicates the vector of numbers\n- `na.rm` is a logical indicating whether missing values should be removed or not (default is `FALSE`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Positional match first argument, default for 'na.rm'\nsd(mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n\n```{.r .cell-code}\n## Specify 'x' argument by name, default for 'na.rm'\nsd(x = mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(x = mydata, na.rm = FALSE) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n:::\n\nWhen **specifying the function arguments by name**, it **doesn't matter in what order** you specify them.\n\nIn the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(na.rm = FALSE, x = mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n\nYou **can mix positional 
matching with matching by name**.\n\nWhen an argument is matched by name, **it is \"taken out\" of the argument list** and the remaining unnamed arguments are matched in the order that they are listed in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsd(na.rm = FALSE, mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n\nHere, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified.\n\n::: callout-tip\n### Pro-tip\n\nThe `args()` function displays the argument names and corresponding default values of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (num = 1) \nNULL\n```\n:::\n:::\n\n:::\n\nBelow is the argument list for the `lm()` function, which fits linear models to a dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (formula, data, subset, weights, na.action, method = \"qr\", \n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, \n contrasts = NULL, offset, ...) 
\nNULL\n```\n:::\n:::\n\n\nThe following two calls are equivalent.\n\n``` r\nlm(data = mydata, y ~ x, model = FALSE, 1:100)\nlm(y ~ x, mydata, 1:100, model = FALSE)\n```\n\n::: callout-tip\n### Pro-tip\n\nEven though it's legal, I don't recommend messing around with the order of the arguments too much, since it can lead to some confusion.\n:::\n\nMost of the time, **named arguments are helpful**:\n\n- On the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list\n- If you can remember the name of the argument and not its position on the argument list\n\nFor example, **plotting functions** often have a lot of options to allow for customization, but this makes it difficult to remember exactly the position of every argument on the argument list.\n\nFunction arguments can also be **partially matched**, which is useful for interactive work.\n\n::: callout-tip\n### Pro-tip\n\nThe order of operations when given an argument is\n\n1. Check for exact match for a named argument\n2. Check for a partial match\n3. Check for a positional match\n:::\n\n**Partial matching should be avoided when writing longer code or programs**, because it may lead to confusion if someone is reading the code. However, partial matching is very useful when calling functions interactively that have very long argument names.\n\n## Lazy Evaluation\n\nArguments to functions are **evaluated lazily**, so they are evaluated only as needed in the body of the function.\n\nIn this example, the function `f()` has two arguments: `a` and `b`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n a^2\n} \nf(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n:::\n\n\nThis **function never actually uses the argument `b`**, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`.\n\nThis behavior can be good or bad. 
It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error.\n\nThis example also shows lazy evaluation at work, but does eventually result in an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n print(a)\n print(b)\n}\nf(45)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 45\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError in f(45): argument \"b\" is missing, with no default\n```\n:::\n:::\n\n\nNotice that \"45\" got printed before the error was triggered! This is because `b` did not have to be evaluated until after `print(a)`.\n\nOnce the function tried to evaluate `print(b)`, it had to throw an error.\n\n## The `...` Argument\n\nThere is a **special argument in R known as the `...` argument**, which indicates **a variable number of arguments** that are usually passed on to other functions.\n\nThe `...` argument is **often used when extending another function** and you do not want to copy the entire argument list of the original function.\n\nFor example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = \"l\"` (the original default was `type = \"p\"`).\n\n``` r\nmyplot <- function(x, y, type = \"l\", ...) {\n plot(x, y, type = type, ...) ## Pass '...' to 'plot' function\n}\n```\n\nGeneric functions use `...` so that extra arguments can be passed to methods.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, ...) \nUseMethod(\"mean\")\n\n\n```\n:::\n:::\n\n\nThe `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. 
This is clear in functions like `paste()` and `cat()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one two three\"\n```\n:::\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\", \"four\", \"five\", sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one_two_three_four_five\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nBecause `paste()` prints out text to the console by combining multiple character vectors together, it is impossible for this function to know in advance how many character vectors will be passed to the function by the user.\n\nSo the first argument in the function is `...`.\n\n## Arguments Coming After the `...` Argument\n\nOne catch with `...` is that any **arguments that appear after** `...` on the argument list **must be named explicitly and cannot be partially matched or matched positionally**.\n\nTake a look at the arguments to the `paste()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nWith the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used.\n\nHere, I specify that I want \"a\" and \"b\" to be pasted together and separated by a colon.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", sep = \":\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a:b\"\n```\n:::\n:::\n\n\nIf I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", se = \":\")\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"a b :\"\n```\n:::\n:::\n\n\n# Functions are for humans and computers\n\nAs you start to write your own functions, it's important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for **human readers**.\n\nThis section discusses some things that you should bear in mind when writing functions that humans can understand.\n\n## The name of a function is important\n\nIn an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.\n\nThe **function names** should be **verbs**, and **arguments** should be **nouns**.\n\nThere are some exceptions:\n\n- nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`).\n- A good sign that a noun might be a better choice is if you are using a very broad verb like \"get\", \"compute\", \"calculate\", or \"determine\". Use your best judgement and do not be afraid to rename a function if you figure out a better name later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n```\n:::\n\n\n## snake_case vs camelCase\n\nIf your function name is composed of multiple words, **use \"snake_case\"**, where each lowercase word is separated by an underscore.\n\n\"camelCase\" is a popular alternative. It does not really matter which one you pick, the important thing is to be consistent: **pick one or the other and stick with it**.\n\nR itself is not very consistent, but there is nothing you can do about that. 
Make sure you do not fall into the same trap by making your code as consistent as possible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Never do this!\ncol_mins <- function(x, y) {}\nrowMaxes <- function(x, y) {}\n```\n:::\n\n\n## Use a common prefix\n\nIf you have a family of functions that do similar things, make sure they have consistent names and arguments.\n\nIt's a good idea to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Good\ninput_select()\ninput_checkbox()\ninput_text()\n\n# Not so good\nselect_input()\ncheckbox_input()\ntext_input()\n```\n:::\n\n\n## Avoid overriding exisiting functions\n\nWhere possible, avoid overriding existing functions and variables.\n\nIt is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Don't do this!\nT <- FALSE\nc <- 10\nmean <- function(x) sum(x)\n```\n:::\n\n\n## Use comments\n\nUse **comments** are lines starting with #. They can explain the \"why\" of your code.\n\nYou generally should avoid comments that explain the \"what\" or the \"how\". 
If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear.\n\n- Do you need to add some intermediate variables with useful names?\n- Do you need to break out a subcomponent of a large function so you can name it?\n\nHowever, your code can never capture the reasoning behind your decisions:\n\n- Why did you choose this approach instead of an alternative?\n- What else did you try that didn't work?\n\nIt's a great idea to capture that sort of thinking in a comment.\n\n# Environment\n\nThe last component of a function is its **environment**.\n\nThis is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work.\n\nThe **environment of a function** controls how R finds the value associated with a name.\n\nFor example, take this function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(x) {\n x + y\n} \n```\n:::\n\n\nIn many programming languages, this would be an error, because `y` is not defined inside the function.\n\nIn R, this is valid code because R uses rules called **lexical scoping** to find the value associated with a name.\n\nSince `y` is not defined inside the function, R will look in the environment where the function was defined:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 100\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 110\n```\n:::\n\n```{.r .cell-code}\ny <- 1000\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1010\n```\n:::\n:::\n\n\nThis behavior seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it does not cause too many problems (especially if you regularly restart R to get to a clean slate).\n\nThe **advantage of this behavior** is that from a language standpoint **it allows R to be very consistent**.\n\n- Every name is looked up using the same set 
of rules.\n\nFor `f()` that includes the behavior of two things that you might not expect: `{` and `+`. This allows you to do devious things like:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`+` <- function(x, y) {\n if (runif(1) < 0.1) {\n sum(x, y)\n } else {\n sum(x, y) * 1.1\n }\n}\ntable(replicate(1000, 1 + 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n 3 3.3 \n100 900 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(`+`)\n```\n:::\n\n\nThis is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like `ggplot2` and `dplyr` possible.\n\n::: callout-tip\n### More resources\n\nIf you are interested in learning more about scoping, check out\n\n- \n- \n:::\n\n# Summary\n\n- Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object\n\n- Functions have can be defined with named arguments; these function arguments can have default values\n\n- Functions arguments can be specified by name or by position in the argument list\n\n- Functions always return the last expression evaluated in the function body\n\n- A variable number of arguments can be specified using the special `...` argument in a function definition.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(is.na(x))\n\nx / sum(x, na.rm = TRUE)\n```\n:::\n\n\n2. 
Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to \"Little Bunny Foo Foo\". There is a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.\n\n3. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.\n\n4. What does the `trim` argument to `mean()` do? When might you use it?\n\n5. The default value for the method argument to `cor()` is `c(\"pearson\", \"kendall\", \"spearman\")`. What does that mean? What value is used by default?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n 
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"16 - Functions\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to writing functions in R\"\ncategories: [module 4, week 4, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/16-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know how to create a **function** using `function()` in R\n- Know how to define **named arguments** inside a function with default values\n- Be able to use named matching or **positional matching** in the argument list\n- Understand what **lazy evaluation** is\n- Understand the special `...` argument in a function definition\n:::\n\n# Introduction\n\nWriting functions is a **core activity** of an R programmer. It represents the key step of the transition from a mere \"user\" to a developer who creates new functionality for R.\n\n**Functions** are often used to **encapsulate a sequence of expressions that need to be executed numerous times**, perhaps under slightly different conditions.\n\nFunctions are also often written **when code must be shared with others or the public**.\n\nWriting a function allows a developer to create an interface to the code that is explicitly specified with a set of **arguments** (or parameters).\n\nThis interface provides an **abstraction of the code** to potential users. 
This abstraction simplifies the users' lives because it relieves them from having to know every detail of how the code operates.\n\nIn addition, the creation of an interface allows the developer to **communicate to the user the aspects of the code that are important** or are most relevant.\n\n## Functions in R\n\nFunctions in R are \"first class objects\", which means that they can be treated much like any other R object.\n\n::: callout-tip\n### Important facts about R functions\n\n- Functions can be passed as arguments to other functions.\n - This is very handy for the various apply functions, like `lapply()` and `sapply()`.\n- Functions can be nested, so that you can define a function inside of another function.\n:::\n\nIf you are familiar with common languages like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.\n\n## Your First Function\n\nFunctions are defined using the `function()` directive and are **stored as R objects** just like anything else.\n\n::: callout-tip\n### Important\n\nIn particular, functions are R objects of class `function`.\n\nHere's a simple function that takes no arguments and does nothing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n ## This is an empty function\n}\n## Functions have their own class\nclass(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"function\"\n```\n:::\n\n```{.r .cell-code}\n## Execute this function\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nNULL\n```\n:::\n:::\n\n:::\n\nNot very interesting, but it is a start!\n\nThe next thing we can do is **create a function** that actually has a non-trivial **function body**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n # this is the function body\n hello <- \"Hello, world!\\n\"\n cat(hello)\n}\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\n`cat()` is useful and preferable 
to `print()` in several settings. One reason is that `cat()` interprets escape sequences such as `\n` as new lines, while `print()` displays them literally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhello <- \"Hello, world!\\n\"\n\nprint(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"Hello, world!\\n\"\n```\n:::\n\n```{.r .cell-code}\ncat(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n:::\n\nThe last aspect of a basic function is the **function arguments**.\n\nThese are **the options that you offer to the user**, which the user may explicitly set.\n\nFor this basic function, we can add an argument that determines how many times \"Hello, world!\" is printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n for (i in seq_len(num)) {\n hello <- \"Hello, world!\\n\"\n cat(hello)\n }\n}\nf(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n:::\n\n\nObviously, we **could have just cut-and-pasted** the `cat(\"Hello, world!\\n\")` code three times to achieve the same effect, but then we wouldn't be programming, would we?\n\nAlso, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see \"Hello, world!\".\n\n::: callout-tip\n### Pro-tip\n\nIf you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function.\n:::\n\nFinally, the function above doesn't **return** anything.\n\nIt just prints \"Hello, world!\" to the console `num` times and then exits.\n\nBut often it is useful **if a function returns something** that perhaps can be fed into another section of code.\n\nThis next function returns the total number of characters printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n hello <- \"Hello, world!\\n\"\n for (i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n 
chars\n}\nmeaningoflife <- f(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n\n```{.r .cell-code}\nprint(meaningoflife)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 42\n```\n:::\n:::\n\n\nIn the above function, we did not have to indicate anything special in order for the function to return the number of characters.\n\nIn R, the **return value of a function** is always the very **last expression that is evaluated**.\n\nBecause the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function.\n\n::: callout-tip\n### Note\n\nThere is a `return()` function that can be used to return a value explicitly from a function, but it is rarely used in R (we will discuss it a bit later in this lesson).\n:::\n\nFinally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in f(): argument \"num\" is missing, with no default\n```\n:::\n:::\n\n\nWe can modify this behavior by setting a **default value** for the argument `num`.\n\n**Any function argument can have a default value**, if you wish to specify it.\n\nSometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. 
This relieves the user from having to specify the value of that argument every single time the function is called.\n\nHere, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print \"Hello, world!\" to the console once.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num = 1) {\n hello <- \"Hello, world!\\n\"\n for (i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n chars\n}\n\n\nf() ## Use default value for 'num'\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n\n```{.r .cell-code}\nf(2) ## Use user-specified value\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nRemember that the function still returns the number of characters printed to the console.\n\n::: callout-tip\n### Pro-tip\n\nThe `formals()` function returns a list of all the formal arguments of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nformals(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$num\n[1] 1\n```\n:::\n:::\n\n:::\n\n## Summary\n\nWe have written a function that\n\n- has one *formal argument* named `num` with a *default value* of 1. 
The *formal arguments* are the arguments included in the function definition.\n\n- prints the message \"Hello, world!\" to the console a number of times indicated by the argument `num`\n\n- *returns* the number of characters printed to the console\n\n# Arguments\n\n## Named arguments\n\nAbove, we have learned that functions have **named arguments**, which can optionally have default values.\n\nBecause all function arguments have names, they can be specified using their name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf(num = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nSpecifying an argument by its name is sometimes useful **if a function has many arguments** and it may not always be clear which argument is being specified.\n\nHere, our function only has one argument so there's no confusion.\n\n## Argument matching\n\nCalling an **R function with multiple arguments** can be done in a variety of ways.\n\nThis may be confusing at first, but it's really handy when doing interactive work at the command line. 
R function arguments can be matched **positionally** or **by name**.\n\n- **Positional matching** just means that R assigns the first value to the first argument, the second value to the second argument, etc.\n\nSo, in the following call to `rnorm()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (n, mean = 0, sd = 1) \n```\n:::\n\n```{.r .cell-code}\nmydata <- rnorm(100, 2, 1) ## Generate some data\n```\n:::\n\n\n100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching.\n\nThe following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent.\n\n::: callout-tip\n### Note\n\n`sd(x, na.rm = FALSE)` has two arguments:\n\n- `x` indicates the vector of numbers\n- `na.rm` is a logical indicating whether missing values should be removed or not (default is `FALSE`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Positional match first argument, default for 'na.rm'\nsd(mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n\n```{.r .cell-code}\n## Specify 'x' argument by name, default for 'na.rm'\nsd(x = mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(x = mydata, na.rm = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n:::\n\nWhen **specifying the function arguments by name**, it **doesn't matter in what order** you specify them.\n\nIn the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(na.rm = FALSE, x = mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n\nYou **can mix positional matching 
with matching by name**.\n\nWhen an argument is matched by name, **it is \"taken out\" of the argument list** and the remaining unnamed arguments are matched in the order that they are listed in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsd(na.rm = FALSE, mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n\nHere, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified.\n\n::: callout-tip\n### Pro-tip\n\nThe `args()` function displays the argument names and corresponding default values of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (num = 1) \nNULL\n```\n:::\n:::\n\n:::\n\nBelow is the argument list for the `lm()` function, which fits linear models to a dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (formula, data, subset, weights, na.action, method = \"qr\", \n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, \n contrasts = NULL, offset, ...) 
\nNULL\n```\n:::\n:::\n\n\nThe following two calls are equivalent.\n\n``` r\nlm(data = mydata, y ~ x, model = FALSE, 1:100)\nlm(y ~ x, mydata, 1:100, model = FALSE)\n```\n\n::: callout-tip\n### Pro-tip\n\nEven though it's legal, I don't recommend messing around with the order of the arguments too much, since it can lead to some confusion.\n:::\n\nMost of the time, **named arguments are helpful**:\n\n- On the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list\n- If you can remember the name of the argument and not its position on the argument list\n\nFor example, **plotting functions** often have a lot of options to allow for customization, but this makes it difficult to remember exactly the position of every argument on the argument list.\n\nFunction arguments can also be **partially matched**, which is useful for interactive work.\n\n::: callout-tip\n### Pro-tip\n\nThe order of operations when given an argument is\n\n1. Check for exact match for a named argument\n2. Check for a partial match\n3. Check for a positional match\n:::\n\n**Partial matching should be avoided when writing longer code or programs**, because it may lead to confusion if someone is reading the code. However, partial matching is very useful when calling functions interactively that have very long argument names.\n\n## Lazy Evaluation\n\nArguments to functions are **evaluated lazily**, so they are evaluated only as needed in the body of the function.\n\nIn this example, the function `f()` has two arguments: `a` and `b`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n a^2\n}\nf(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n:::\n\n\nThis **function never actually uses the argument `b`**, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`.\n\nThis behavior can be good or bad. 
It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error.\n\nThis example also shows lazy evaluation at work, but does eventually result in an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n print(a)\n print(b)\n}\nf(45)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 45\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError in f(45): argument \"b\" is missing, with no default\n```\n:::\n:::\n\n\nNotice that \"45\" got printed first before the error was triggered! This is because `b` did not have to be evaluated until after `print(a)`.\n\nOnce the function tried to evaluate `print(b)`, the function had to throw an error.\n\n## The `...` Argument\n\nThere is a **special argument in R known as the `...` argument**, which indicates **a variable number of arguments** that are usually passed on to other functions.\n\nThe `...` argument is **often used when extending another function** and you do not want to copy the entire argument list of the original function.\n\nFor example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = \"l\"` (the original default was `type = \"p\"`).\n\n``` r\nmyplot <- function(x, y, type = \"l\", ...) {\n plot(x, y, type = type, ...) ## Pass '...' to 'plot' function\n}\n```\n\nGeneric functions use `...` so that extra arguments can be passed to methods.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, ...) \nUseMethod(\"mean\")\n\n\n```\n:::\n:::\n\n\nThe `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. 
This is clear in functions like `paste()` and `cat()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one two three\"\n```\n:::\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\", \"four\", \"five\", sep = \"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one_two_three_four_five\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nBecause `paste()` builds its result by combining an arbitrary number of character vectors together, it is impossible for this function to know in advance how many character vectors will be passed to the function by the user.\n\nSo the first argument in the function is `...`.\n\n## Arguments Coming After the `...` Argument\n\nOne catch with `...` is that any **arguments that appear after** `...` on the argument list **must be named explicitly and cannot be partially matched or matched positionally**.\n\nTake a look at the arguments to the `paste()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nWith the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used.\n\nHere, I specify that I want \"a\" and \"b\" to be pasted together and separated by a colon.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", sep = \":\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a:b\"\n```\n:::\n:::\n\n\nIf I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", se = \":\")\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"a b :\"\n```\n:::\n:::\n\n\n# Functions are for humans and computers\n\nAs you start to write your own functions, it's important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for **human readers**.\n\nThis section discusses some things that you should bear in mind when writing functions that humans can understand.\n\n## The name of a function is important\n\nIn an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.\n\n**Function names** should be **verbs**, and **arguments** should be **nouns**.\n\nThere are some exceptions:\n\n- Nouns are ok if the function computes a very well-known noun (e.g. `mean()` is better than `compute_mean()`).\n- A good sign that a noun might be a better choice is if you are using a very broad verb like \"get\", \"compute\", \"calculate\", or \"determine\". Use your best judgement and do not be afraid to rename a function if you figure out a better name later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n```\n:::\n\n\n## snake_case vs camelCase\n\nIf your function name is composed of multiple words, **use \"snake_case\"**, where each lowercase word is separated by an underscore.\n\n\"camelCase\" is a popular alternative. It does not really matter which one you pick; the important thing is to be consistent: **pick one or the other and stick with it**.\n\nR itself is not very consistent, but there is nothing you can do about that. 
Make sure you do not fall into the same trap: make your own code as consistent as possible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Never do this!\ncol_mins <- function(x, y) {}\nrowMaxes <- function(x, y) {}\n```\n:::\n\n\n## Use a common prefix\n\nIf you have a family of functions that do similar things, make sure they have consistent names and arguments.\n\nIt's a good idea to use a common prefix to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Good\ninput_select()\ninput_checkbox()\ninput_text()\n\n# Not so good\nselect_input()\ncheckbox_input()\ntext_input()\n```\n:::\n\n\n## Avoid overriding existing functions\n\nWhere possible, avoid overriding existing functions and variables.\n\nThis is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Don't do this!\nT <- FALSE\nc <- 10\nmean <- function(x) sum(x)\n```\n:::\n\n\n## Use comments\n\n**Comments** are lines starting with `#`. They can explain the \"why\" of your code.\n\nYou generally should avoid comments that explain the \"what\" or the \"how\". 
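\n\nAs a small sketch (with hypothetical variable names), compare a comment that merely restates the \"what\" with one that records the \"why\":\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Redundant: restates what the code already says\nx <- x[!is.na(x)] # remove missing values from x\n\n## Better: records the reasoning behind the decision\n## The modeling step below cannot handle missing values,\n## so we drop them here instead of imputing them.\nx <- x[!is.na(x)]\n```\n:::\n\n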
If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear.\n\n- Do you need to add some intermediate variables with useful names?\n- Do you need to break out a subcomponent of a large function so you can name it?\n\nHowever, your code can never capture the reasoning behind your decisions:\n\n- Why did you choose this approach instead of an alternative?\n- What else did you try that didn't work?\n\nIt's a great idea to capture that sort of thinking in a comment.\n\n# Environment\n\nThe last component of a function is its **environment**.\n\nThis is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work.\n\nThe **environment of a function** controls how R finds the value associated with a name.\n\nFor example, take this function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(x) {\n x + y\n}\n```\n:::\n\n\nIn many programming languages, this would be an error, because `y` is not defined inside the function.\n\nIn R, this is valid code because R uses rules called **lexical scoping** to find the value associated with a name.\n\nSince `y` is not defined inside the function, R will look in the environment where the function was defined:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 100\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 110\n```\n:::\n\n```{.r .cell-code}\ny <- 1000\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1010\n```\n:::\n:::\n\n\nThis behavior seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it does not cause too many problems (especially if you regularly restart R to get to a clean slate).\n\nThe **advantage of this behavior** is that from a language standpoint **it allows R to be very consistent**.\n\n- Every name is looked up using the same set 
of rules.\n\nFor `f()`, that includes the behavior of two things that you might not expect: `{` and `+`. This allows you to do devious things like:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`+` <- function(x, y) {\n if (runif(1) < 0.1) {\n sum(x, y)\n } else {\n sum(x, y) * 1.1\n }\n}\ntable(replicate(1000, 1 + 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n 3 3.3 \n 82 918 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(`+`)\n```\n:::\n\n\nThis is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like `ggplot2` and `dplyr` possible.\n\n::: callout-tip\n### More resources\n\nIf you are interested in learning more about scoping, check out\n\n- \n- \n:::\n\n# Summary\n\n- Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object\n\n- Functions can be defined with named arguments; these function arguments can have default values\n\n- Function arguments can be specified by name or by position in the argument list\n\n- Functions always return the last expression evaluated in the function body\n\n- A variable number of arguments can be specified using the special `...` argument in a function definition.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(is.na(x))\n\nx / sum(x, na.rm = TRUE)\n```\n:::\n\n\n2. 
Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to \"Little Bunny Foo Foo\". There is a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.\n\n3. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.\n\n4. What does the `trim` argument to `mean()` do? When might you use it?\n\n5. The default value for the method argument to `cor()` is `c(\"pearson\", \"kendall\", \"spearman\")`. What does that mean? What value is used by default?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n 
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/17-loop-functions/index/execute-results/html.json b/_freeze/posts/17-loop-functions/index/execute-results/html.json index 07a66d5..9941616 100644 --- a/_freeze/posts/17-loop-functions/index/execute-results/html.json +++ b/_freeze/posts/17-loop-functions/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "d6faa0c6c51529d5dc808757f7036a0a", + "hash": "d7a61d6652db2a9bb07ed8e7a7941d8f", "result": { - "markdown": "---\ntitle: \"17 - Vectorization and loop functionals\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to vectorization and loop functionals\"\ncategories: [module 4, week 5, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/17-loop-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand how to perform vector arithmetics in R\n- Implement the five functional loops in R (as alternatives to, e.g., `for` loops)\n:::\n\n# Vectorization\n\nWriting `for` and `while` loops is useful and easy to understand, but in R we rarely use them.\n\nAs you learn more R, you will realize that **vectorization** is preferred over for-loops since it results in shorter and clearer code.\n\n## Vector arithmetics\n\n### Rescaling a vector\n\nIn R, arithmetic operations on **vectors occur element-wise**. For a quick example, suppose we have height in inches:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)\n```\n:::\n\n\nand want to convert to centimeters.\n\nNotice what happens when we multiply inches by 2.54:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches * 2.54\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80\n```\n:::\n:::\n\n\nIn the line above, we **multiplied each element** by 2.54.\n\nSimilarly, if we want to compute, for each entry, how many inches taller or shorter it is than 69 inches (the average height for males), we can subtract 69 from every entry like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches - 69\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0 -7 -3 1 1 4 -2 4 -2 1\n```\n:::\n:::\n\n\n### Two vectors\n\nIf we have **two vectors of the same length**, and we sum them in R, they will be **added entry by entry** as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\ny <- 1:10 \nx + y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 2 4 6 8 10 12 14 16 18 20\n```\n:::\n:::\n\n\nThe same holds for other mathematical operations, such as `-`, `*` and `/`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 
1:10\nsqrt(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427\n [9] 3.000000 3.162278\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 1:10\nx*y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n\n# Functional loops\n\nWhile `for` loops are perfectly valid, when you operate in an element-wise fashion there is often no need for them, because we can apply what are called functional loops.\n\n**Functional loops** are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here is a list of them:\n\n- `lapply()`: Loop over a list and evaluate a function on each element\n\n- `sapply()`: Same as `lapply` but try to simplify the result\n\n- `apply()`: Apply a function over the margins of an array\n\n- `tapply()`: Apply a function over subsets of a vector\n\n- `mapply()`: Multivariate version of `lapply` (won't cover)\n\nAn auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.\n\n## `lapply()`\n\nThe `lapply()` function does the following simple series of operations:\n\n1. it loops over a list, iterating over each element in that list\n2. it applies a *function* to each element of the list (a function that you specify)\n3. and returns a list (the `l` in `lapply()` is for \"list\").\n\nThis function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.\n\nThe body of the `lapply()` function can be seen here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, FUN, ...) 
\n{\n FUN <- match.fun(FUN)\n if (!is.vector(X) || is.object(X)) \n X <- as.list(X)\n .Internal(lapply(X, FUN))\n}\n\n\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe actual looping is done internally in C code for efficiency reasons.\n:::\n\nIt is important to remember that `lapply()` always returns a list, regardless of the class of the input.\n\n::: callout-tip\n### Example\n\nHere's an example of applying the `mean()` function to all elements of a list. If the original list has names, the names will be preserved in the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:5, b = rnorm(10))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2 3 4 5\n\n$b\n [1] 1.2229485 0.1878172 -0.7560246 -0.8520380 1.7012165 -0.6487224\n [7] -0.6863177 0.2483112 -0.4892715 -1.3013705\n```\n:::\n\n```{.r .cell-code}\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 3\n\n$b\n[1] -0.1373451\n```\n:::\n:::\n\n\nNotice that here we are passing the `mean()` function as an argument to the `lapply()` function.\n:::\n\n**Functions in R can be** used this way and can be **passed back and forth as arguments** just like any other object in R.\n\nWhen you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are **calling** a function.\n\n::: callout-tip\n### Example\n\nHere is another example of using `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.08347435\n\n$c\n[1] 0.7113253\n\n$d\n[1] 5.115736\n```\n:::\n:::\n\n:::\n\nYou can use `lapply()` to evaluate a function multiple times, each time with a different argument.\n\nNext is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random 
numbers.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 0.5243483\n\n[[2]]\n[1] 0.6741892 0.2195450\n\n[[3]]\n[1] 0.5812777 0.9505693 0.7417778\n\n[[4]]\n[1] 0.7255782 0.5353819 0.2978647 0.7711454\n```\n:::\n:::\n\n\n::: callout-tip\n### What happened?\n\nWhen you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you are applying.\n\nIn the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.\n:::\n\nFunctions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.\n\n::: callout-note\n### Question\n\nIn the example above I used the default values for `min` and `max`.\n\n- How would you be able to specify different values for them in the context of `lapply()`?\n:::\n\nHere is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.\n\nHere, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif, min = 0, max = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 3.788686\n\n[[2]]\n[1] 7.370226 2.895551\n\n[[3]]\n[1] 4.742728 5.781041 7.590877\n\n[[4]]\n[1] 9.5078106 4.9129254 0.9233773 1.9651523\n```\n:::\n:::\n\n\nSo now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.\n\nThe `lapply()` function (and its friends) makes heavy use of *anonymous* functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated \"on the fly\" as you are using `lapply()`. 
Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.\n\n::: callout-tip\n### Example\n\nHere I am creating a list that contains two matrices.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2)) \nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n [,1] [,2]\n[1,] 1 3\n[2,] 2 4\n\n$b\n [,1] [,2]\n[1,] 1 4\n[2,] 2 5\n[3,] 3 6\n```\n:::\n:::\n\n\nSuppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(x, function(elt) { elt[,1] })\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\nNotice that I put the `function()` definition right in the call to `lapply()`.\n:::\n\nThis is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.\n\nFor example, I could have done the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(elt) {\n elt[, 1]\n}\nlapply(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNow the function is no longer anonymous; its name is `f`.\n:::\n\nWhether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to `lapply()`, then it is probably simpler to use an anonymous function.\n\n## `sapply()`\n\nThe `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. 
Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:\n\n- If the result is a list where every element is length 1, then a vector is returned\n\n- If the result is a list where every element is a vector of the same length (\\> 1), a matrix is returned.\n\n- If it can't figure things out, a list is returned\n\nHere's the result of calling `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.1887374\n\n$c\n[1] 1.304012\n\n$d\n[1] 5.012149\n```\n:::\n:::\n\n\nNotice that `lapply()` returns a list (as usual), but that each element of the list has length 1.\n\nHere's the result of calling `sapply()` on the same list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(x, mean) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n a b c d \n2.5000000 0.1887374 1.3040116 5.0121492 \n```\n:::\n:::\n\n\nBecause the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.\n\n## `split()`\n\nThe `split()` function takes a vector or other objects and splits it into groups determined by a factor or list of factors.\n\nThe arguments to `split()` are\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(split)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, f, drop = FALSE, ...) \n```\n:::\n:::\n\n\nwhere\n\n- `x` is a vector (or list) or data frame\n- `f` is a factor (or coerced to one) or a list of factors\n- `drop` indicates whether empty factors levels should be dropped\n\nThe combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. 
The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as \"map-reduce\" in other contexts.\n\nHere we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to \"generate levels\" in a factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\nf <- gl(3, 10) # generate factor levels\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n [1] 0.3894649 -0.9253587 1.4289026 -0.4602414 -0.6726225 3.0199478\n [7] 1.2191391 0.3484649 0.8080023 0.3774965\n\n$`2`\n [1] 0.8782538 0.4346257 0.5970352 0.1635630 0.3839225 0.8525622 0.4988508\n [8] 0.8624590 0.2599047 0.1006897\n\n$`3`\n [1] 0.20936028 -0.17011167 0.34303862 1.04290024 0.80930785 3.20756177\n [7] 0.55197354 0.49465007 0.06888936 -0.41865555\n```\n:::\n:::\n\n\nA common idiom is `split` followed by an `lapply`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(split(x, f), mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] 0.5533195\n\n$`2`\n[1] 0.5031867\n\n$`3`\n[1] 0.6138915\n```\n:::\n:::\n\n\n### Splitting a Data Frame\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasets)\nhead(airquality)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Ozone Solar.R Wind Temp Month Day\n1 41 190 7.4 67 5 1\n2 36 118 8.0 72 5 2\n3 12 149 12.6 74 5 3\n4 18 313 11.5 62 5 4\n5 NA NA 14.3 56 5 5\n6 28 NA 14.9 66 5 6\n```\n:::\n:::\n\n\nWe can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- split(airquality, airquality$Month)\nstr(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 5\n $ 
5:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...\n ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...\n ..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...\n ..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...\n ..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 6:'data.frame':\t30 obs. of 6 variables:\n ..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...\n ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...\n ..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...\n ..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...\n ..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n $ 7:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...\n ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...\n ..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...\n ..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...\n ..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 8:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...\n ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...\n ..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...\n ..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...\n ..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 9:'data.frame':\t30 obs. 
of 6 variables:\n ..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...\n ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...\n ..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...\n ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...\n ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n```\n:::\n:::\n\n\nThen we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`5`\n Ozone Solar.R Wind \n NA NA 11.62258 \n\n$`6`\n Ozone Solar.R Wind \n NA 190.16667 10.26667 \n\n$`7`\n Ozone Solar.R Wind \n NA 216.483871 8.941935 \n\n$`8`\n Ozone Solar.R Wind \n NA NA 8.793548 \n\n$`9`\n Ozone Solar.R Wind \n NA 167.4333 10.1800 \n```\n:::\n:::\n\n\nUsing `sapply()` might be better here for a more readable output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone NA NA NA NA NA\nSolar.R NA 190.16667 216.483871 NA 167.4333\nWind 11.62258 10.26667 8.941935 8.793548 10.1800\n```\n:::\n:::\n\n\nUnfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")], \n na.rm = TRUE)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone 23.61538 29.44444 59.115385 59.961538 31.44828\nSolar.R 181.29630 190.16667 216.483871 171.857143 167.43333\nWind 11.62258 10.26667 8.941935 8.793548 10.18000\n```\n:::\n:::\n\n\n## tapply\n\n`tapply()` is used to apply a function over subsets of a vector. 
It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the \"t\" in `tapply()` refers to \"table\", but that is unconfirmed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(tapply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `tapply()` are as follows:\n\n- `X` is a vector\n- `INDEX` is a factor or a list of factors (or else they are coerced to factors)\n- `FUN` is a function to be applied\n- `...` contains other arguments to be passed to `FUN`\n- `simplify`, should we simplify the result?\n\n::: callout-tip\n### Example\n\nGiven a vector of numbers, one simple operation is to take group means.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Simulate some data\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\n## Define some groups with a factor variable\nf <- gl(3, 10) \nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n\n```{.r .cell-code}\ntapply(x, f, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 1 2 3 \n0.1087117 0.5112950 1.2957052 \n```\n:::\n:::\n\n:::\n\nWe can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the `range()` (min and max) of each sub-group.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntapply(x, f, range)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] -0.8435817 1.0096577\n\n$`2`\n[1] 0.05537228 0.83863420\n\n$`3`\n[1] -0.8717398 2.0814386\n```\n:::\n:::\n\n\n## `apply()`\n\nThe `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). 
However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(apply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, MARGIN, FUN, ..., simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `apply()` are\n\n- `X` is an array\n- `MARGIN` is an integer vector indicating which margins should be \"retained\".\n- `FUN` is a function to be applied\n- `...` is for other arguments to be passed to `FUN`\n\n::: callout-tip\n### Example\n\nHere I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 0.84236865 -0.1124551 -0.7392402 -0.1509353 0.9928648 0.6976390\n[2,] 0.02815865 1.6436612 -1.1889939 -0.4767309 -1.1571802 0.6857492\n[3,] 0.01051357 -0.2242306 0.6610240 0.4123025 -0.2424516 1.3800597\n[4,] 0.72109478 0.2949731 0.9722644 0.7911567 0.5616605 -0.6849934\n[5,] 1.15863939 -0.8584817 0.5789411 1.2627121 -0.5413249 -0.4740383\n[6,] 1.32952051 0.4994495 -0.2406128 -1.2990888 1.1405130 -0.4864257\n [,7] [,8] [,9] [,10]\n[1,] 0.9033576 -0.29090608 0.54385628 -1.5146458\n[2,] 0.6788826 0.92735898 -0.16479486 -2.7966950\n[3,] -2.0091145 -0.01351013 0.60310429 1.4103354\n[4,] 0.2323455 0.83083908 -1.08912322 0.6458769\n[5,] -1.5478440 0.05488507 0.03434319 0.1834132\n[6,] -0.1445403 -0.62788083 0.44295763 -0.2051684\n```\n:::\n\n```{.r .cell-code}\napply(x, 2, mean) ## Take the mean of each column\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.15369716 0.23398541 -0.18962509 -0.05729131 0.01771256 -0.16540884\n [7] -0.16674585 0.40803348 -0.13061682 -0.09736721\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nI can also compute the sum of each 
row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum) ## Take the sum of each row\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.1719039 -1.8205843 1.9880325 3.2760942 -0.1487549 0.4087238\n [7] -4.0732499 2.1231502 0.4270388 -0.9871043 -0.9797849 -3.0278088\n[13] 0.6285069 2.3373183 1.6973524 -3.8844200 4.2352140 -0.3202212\n[19] -1.7755148 -1.1484223\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nIn both calls to `apply()`, the return value was a vector of numbers.\n:::\n\nYou've probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?\n\nThe `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain.\n\nSo when taking the mean of each column, I specify\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 2, mean)\n```\n:::\n\n\nbecause I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum)\n```\n:::\n\n\nbecause I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).\n\n### Col/Row Sums and Means\n\n::: callout-tip\n### Pro-tip\n\nFor the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.\n\n- `rowSums` = `apply(x, 1, sum)`\n- `rowMeans` = `apply(x, 1, mean)`\n- `colSums` = `apply(x, 2, sum)`\n- `colMeans` = `apply(x, 2, mean)`\n:::\n\nThe shortcut functions are heavily optimized and hence are **much** faster, but you probably won't notice unless you're using a large matrix.\n\nAnother nice aspect of these functions is that they are a bit more descriptive. 
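\n\nAs a quick sketch (creating a fresh numeric matrix `x` for the check), you can verify that the shortcuts agree with their `apply()` equivalents:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## The optimized shortcuts return the same values as apply()\nx <- matrix(rnorm(200), 20, 10)\nall.equal(rowSums(x), apply(x, 1, sum))\nall.equal(colMeans(x), apply(x, 2, mean))\n```\n:::\n\n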
It's arguably more clear to write `colMeans(x)` in your code than `apply(x, 2, mean)`.\n\n### Other Ways to Apply\n\nYou can do more than take sums and means with the `apply()` function.\n\n::: callout-tip\n### Example\n\nFor example, you can compute quantiles of the rows of a matrix using the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] -0.8222286 -0.06264826 0.4138949 -1.4608268 0.67442318 -0.24230780\n[2,] -0.1836086 -0.35550379 1.0199434 -0.2441630 -0.46562697 0.09719075\n[3,] 1.6292807 -1.49980763 -2.3034693 -0.6154384 -0.04040846 0.21809278\n[4,] -0.5413514 0.58643377 0.8135796 1.7305934 0.69119103 0.33314754\n[5,] -0.9640885 0.38238569 0.2424066 -0.4919602 0.90386972 1.13194597\n[6,] -0.1440487 1.68226290 2.2330038 -0.4490778 0.08801544 -1.29481321\n [,7] [,8] [,9] [,10]\n[1,] -0.38935245 -0.77293702 0.07677024 0.12423212\n[2,] 0.06646333 0.28363323 -0.11256258 1.18545766\n[3,] 0.22432215 -0.09068388 0.71109748 -0.09385239\n[4,] 0.89742043 -0.70635739 1.58637686 -1.89408900\n[5,] 1.28360089 0.08283728 -0.43722835 -0.60018278\n[6,] -1.15045126 -2.19382879 -0.20375594 -0.32560297\n```\n:::\n\n```{.r .cell-code}\n## Get row quantiles\napply(x, 1, quantile, probs = c(0.25, 0.75)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n25% -0.6770409 -0.2290244 -0.4850419 -0.3227267 -0.4782773 -0.97510790\n75% 0.1123667 0.2370226 0.2227648 0.8764602 0.7734987 0.02999941\n [,7] [,8] [,9] [,10] [,11] [,12]\n25% -0.64819251 -0.05986007 -1.2978967 -1.11450694 -0.4172096 -1.2630159\n75% 0.04747862 0.38562818 0.2370868 -0.08617673 0.7178610 0.9225057\n [,13] [,14] [,15] [,16] [,17] [,18]\n25% -0.4066897 -0.7840147 -0.9955023 -0.4760713 -0.9869851 -1.36026776\n75% 0.7307004 0.3125165 0.9180706 0.8888653 0.8109465 0.04670028\n [,19] [,20]\n25% -0.5785061 -0.4817867\n75% 0.3695974 
0.5743219\n```\n:::\n:::\n\n\nNotice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.\n:::\n\n## Vectorizing a Function\n\nLet's talk about how we can **"vectorize" a function**.\n\nWhat this means is that we can take a function that typically only takes single arguments and create a new function that can take vector arguments.\n\nThis is often needed when you want to plot functions.\n\n::: callout-tip\n### Example\n\nHere's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\\sum_{i=1}^n(x_i-\\mu)^2/\\sigma^2$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq <- function(mu, sigma, x) {\n sum(((x - mu) / sigma)^2)\n}\n```\n:::\n\n\nThis function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.\n\nIn many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- rnorm(100) ## Generate some data\nsumsq(mu=1, sigma=1, x) ## This works (returns one value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 167.2459\n```\n:::\n:::\n\n\nHowever, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq(1:10, 1:10, x) ## This is not what we want\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 107.6407\n```\n:::\n:::\n\n:::\n\nThere's even a function in R called `Vectorize()` that **automatically can create a vectorized version of your function**.\n\nSo we could create a `vsumsq()` function that is fully vectorized as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvsumsq <- Vectorize(sumsq, c(\"mu\", \"sigma\"))\nvsumsq(1:10, 1:10, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 167.24586 
113.22134 104.28054 101.51027 100.39215 99.87343 99.61394\n [8] 99.48004 99.41187 99.38001\n```\n:::\n:::\n\n\nPretty cool, right?\n\n# Summary\n\n- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form\n\n- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.\n\n- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere\n\n- The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write a function `compute_s_n()` that for any given `n` computes the sum\n\n$$\nS_n = 1^2 + 2^2 + 3^2 + \\ldots + n^2\n$$\n\nReport the value of the sum when $n$ = 10.\n\n2. Define an empty numerical vector `s_n` of size 25 using `s_n <- vector(\"numeric\", 25)` and store in it the results of $S_1, S_2, \\ldots, S_n$ using a for-loop.\n\n3. Repeat Q2, but this time use `sapply()`.\n\n4. Plot `s_n` versus `n`. 
Use points defined by $n= 1, \\ldots, 25$\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"17 - Vectorization and loop functionals\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: 
https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to vectorization and loop functionals\"\ncategories: [module 4, week 5, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/17-loop-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand how to perform vector arithmetics in R\n- Implement the 5 functional loops in R (vs. e.g. `for` loops)\n:::\n\n# Vectorization\n\nWriting `for` and `while` loops is useful and easy to understand, but in R we rarely use them.\n\nAs you learn more R, you will realize that **vectorization** is preferred over for-loops since it results in shorter and clearer code.\n\n## Vector arithmetics\n\n### Rescaling a vector\n\nIn R, arithmetic operations on **vectors occur element-wise**. 
For a quick example, suppose we have height in inches:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)\n```\n:::\n\n\nand want to convert to centimeters.\n\nNotice what happens when we multiply inches by 2.54:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches * 2.54\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80\n```\n:::\n:::\n\n\nIn the line above, we **multiplied each element** by 2.54.\n\nSimilarly, if for each entry we want to compute how many inches taller or shorter than 69 inches (the average height for males), we can subtract it from every entry like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches - 69\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0 -7 -3 1 1 4 -2 4 -2 1\n```\n:::\n:::\n\n\n### Two vectors\n\nIf we have **two vectors of the same length**, and we sum them in R, they will be **added entry by entry** as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\ny <- 1:10\nx + y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 2 4 6 8 10 12 14 16 18 20\n```\n:::\n:::\n\n\nThe same holds for other mathematical operations, such as `-`, `*` and `/`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\nsqrt(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427\n [9] 3.000000 3.162278\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 1:10\nx * y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n\n# Functional loops\n\nWhile `for` loops are perfectly valid, when you use vectorization in an element-wise fashion, there is no need for `for` loops because we can apply what are called functional loops.\n\n**Functional loops** are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. 
Here is a list of them:\n\n- `lapply()`: Loop over a list and evaluate a function on each element\n\n- `sapply()`: Same as `lapply` but try to simplify the result\n\n- `apply()`: Apply a function over the margins of an array\n\n- `tapply()`: Apply a function over subsets of a vector\n\n- `mapply()`: Multivariate version of `lapply` (won't cover)\n\nAn auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.\n\n## `lapply()`\n\nThe `lapply()` function does the following simple series of operations:\n\n1. it loops over a list, iterating over each element in that list\n2. it applies a *function* to each element of the list (a function that you specify)\n3. and returns a list (the `l` in `lapply()` is for \"list\").\n\nThis function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.\n\nThe body of the `lapply()` function can be seen here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, FUN, ...) \n{\n FUN <- match.fun(FUN)\n if (!is.vector(X) || is.object(X)) \n X <- as.list(X)\n .Internal(lapply(X, FUN))\n}\n\n\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe actual looping is done internally in C code for efficiency reasons.\n:::\n\nIt is important to remember that `lapply()` always returns a list, regardless of the class of the input.\n\n::: callout-tip\n### Example\n\nHere's an example of applying the `mean()` function to all elements of a list. 
If the original list has names, the names will be preserved in the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:5, b = rnorm(10))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2 3 4 5\n\n$b\n [1] -0.6113707 0.5950531 0.6319343 0.5595441 0.3188799 -0.4400711\n [7] 1.6687028 0.4501791 1.4356856 -0.3858270\n```\n:::\n\n```{.r .cell-code}\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 3\n\n$b\n[1] 0.422271\n```\n:::\n:::\n\n\nNotice that here we are passing the `mean()` function as an argument to the `lapply()` function.\n:::\n\n**Functions in R can be** used this way and can be **passed back and forth as arguments** just like any other object in R.\n\nWhen you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are **calling** a function.\n\n::: callout-tip\n### Example\n\nHere is another example of using `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.1655327\n\n$c\n[1] 0.9767504\n\n$d\n[1] 4.951283\n```\n:::\n:::\n\n:::\n\nYou can use `lapply()` to evaluate a function multiple times, each with a different argument.\n\nNext is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 0.5924944\n\n[[2]]\n[1] 0.8660588 0.3277243\n\n[[3]]\n[1] 0.5009080 0.2951163 0.6264905\n\n[[4]]\n[1] 0.04282267 0.14951908 0.82034538 0.64614463\n```\n:::\n:::\n\n\n::: callout-tip\n### What happened?\n\nWhen you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you 
are applying.\n\nIn the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.\n:::\n\nFunctions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.\n\n::: callout-note\n### Question\n\nIn the example above I used the default values for `min` and `max`.\n\n- How would you be able to specify different values for that in the context of `lapply()`?\n:::\n\nHere is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.\n\nHere, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif, min = 0, max = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 5.653385\n\n[[2]]\n[1] 8.325503 7.234466\n\n[[3]]\n[1] 5.968981 9.174316 7.920678\n\n[[4]]\n[1] 9.491500 3.023649 2.990945 8.757496\n```\n:::\n:::\n\n\nSo now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.\n\nThe `lapply()` function (and its friends) makes heavy use of *anonymous* functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated \"on the fly\" as you are using `lapply()`. 
Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.\n\n::: callout-tip\n### Example\n\nHere I am creating a list that contains two matrices.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n [,1] [,2]\n[1,] 1 3\n[2,] 2 4\n\n$b\n [,1] [,2]\n[1,] 1 4\n[2,] 2 5\n[3,] 3 6\n```\n:::\n:::\n\n\nSuppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(x, function(elt) {\n elt[, 1]\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\nNotice that I put the `function()` definition right in the call to `lapply()`.\n:::\n\nThis is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.\n\nFor example, I could have done the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(elt) {\n elt[, 1]\n}\nlapply(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNow the function is no longer anonymous; its name is `f`.\n:::\n\nWhether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to `lapply()`, then it is probably simpler to use an anonymous function.\n\n## `sapply()`\n\nThe `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. 
Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:\n\n- If the result is a list where every element is length 1, then a vector is returned\n\n- If the result is a list where every element is a vector of the same length (\> 1), a matrix is returned.\n\n- If it can't figure things out, a list is returned\n\nHere's the result of calling `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] -0.1478465\n\n$c\n[1] 0.819794\n\n$d\n[1] 4.954484\n```\n:::\n:::\n\n\nNotice that `lapply()` returns a list (as usual), but that each element of the list has length 1.\n\nHere's the result of calling `sapply()` on the same list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n a b c d \n 2.5000000 -0.1478465 0.8197940 4.9544836 \n```\n:::\n:::\n\n\nBecause the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.\n\n## `split()`\n\nThe `split()` function takes a vector or other object and splits it into groups determined by a factor or list of factors.\n\nThe arguments to `split()` are\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(split)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, f, drop = FALSE, ...) \n```\n:::\n:::\n\n\nwhere\n\n- `x` is a vector (or list) or data frame\n- `f` is a factor (or coerced to one) or a list of factors\n- `drop` indicates whether empty factor levels should be dropped\n\nThe combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. 
The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as \"map-reduce\" in other contexts.\n\nHere we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to \"generate levels\" in a factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\nf <- gl(3, 10) # generate factor levels\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n [1] 0.78541247 -0.06267966 -0.89713180 0.11796725 0.66689447 -0.02523006\n [7] -0.19081948 0.44974528 -0.51005146 -0.08103298\n\n$`2`\n [1] 0.29977033 0.31873253 0.53182993 0.85507540 0.21585775 0.89867742\n [7] 0.78109747 0.06887742 0.79661568 0.60022565\n\n$`3`\n [1] -0.38262045 0.06294368 0.41768485 1.57972821 1.17555228 1.47374130\n [7] 1.79199913 2.25569283 1.55226509 -1.51811384\n```\n:::\n:::\n\n\nA common idiom is `split` followed by an `lapply`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(split(x, f), mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] 0.0253074\n\n$`2`\n[1] 0.536676\n\n$`3`\n[1] 0.8408873\n```\n:::\n:::\n\n\n### Splitting a Data Frame\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasets)\nhead(airquality)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Ozone Solar.R Wind Temp Month Day\n1 41 190 7.4 67 5 1\n2 36 118 8.0 72 5 2\n3 12 149 12.6 74 5 3\n4 18 313 11.5 62 5 4\n5 NA NA 14.3 56 5 5\n6 28 NA 14.9 66 5 6\n```\n:::\n:::\n\n\nWe can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- split(airquality, airquality$Month)\nstr(s)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\nList of 5\n $ 5:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...\n ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...\n ..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...\n ..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...\n ..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 6:'data.frame':\t30 obs. of 6 variables:\n ..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...\n ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...\n ..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...\n ..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...\n ..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n $ 7:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...\n ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...\n ..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...\n ..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...\n ..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 8:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...\n ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...\n ..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...\n ..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...\n ..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 9:'data.frame':\t30 obs. 
of 6 variables:\n ..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...\n ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...\n ..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...\n ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...\n ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n```\n:::\n:::\n\n\nThen we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`5`\n Ozone Solar.R Wind \n NA NA 11.62258 \n\n$`6`\n Ozone Solar.R Wind \n NA 190.16667 10.26667 \n\n$`7`\n Ozone Solar.R Wind \n NA 216.483871 8.941935 \n\n$`8`\n Ozone Solar.R Wind \n NA NA 8.793548 \n\n$`9`\n Ozone Solar.R Wind \n NA 167.4333 10.1800 \n```\n:::\n:::\n\n\nUsing `sapply()` might be better here for a more readable output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone NA NA NA NA NA\nSolar.R NA 190.16667 216.483871 NA 167.4333\nWind 11.62258 10.26667 8.941935 8.793548 10.1800\n```\n:::\n:::\n\n\nUnfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")],\n na.rm = TRUE\n )\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone 23.61538 29.44444 59.115385 59.961538 31.44828\nSolar.R 181.29630 190.16667 216.483871 171.857143 167.43333\nWind 11.62258 10.26667 8.941935 8.793548 10.18000\n```\n:::\n:::\n\n\n## `tapply()`\n\n`tapply()` is used to apply a function over subsets of a vector. 
It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the \"t\" in `tapply()` refers to \"table\", but that is unconfirmed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(tapply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `tapply()` are as follows:\n\n- `X` is a vector\n- `INDEX` is a factor or a list of factors (or else they are coerced to factors)\n- `FUN` is a function to be applied\n- `...` contains other arguments to be passed to `FUN`\n- `simplify`, should we simplify the result?\n\n::: callout-tip\n### Example\n\nGiven a vector of numbers, one simple operation is to take group means.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Simulate some data\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\n## Define some groups with a factor variable\nf <- gl(3, 10)\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n\n```{.r .cell-code}\ntapply(x, f, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 1 2 3 \n0.3554738 0.5195466 0.6764006 \n```\n:::\n:::\n\n:::\n\nWe can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the `range()` (min and max) of each sub-group.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntapply(x, f, range)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] -1.431912 2.695089\n\n$`2`\n[1] 0.1263379 0.8959040\n\n$`3`\n[1] -1.207741 1.696309\n```\n:::\n:::\n\n\n## `apply()`\n\nThe `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). 
However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(apply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, MARGIN, FUN, ..., simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `apply()` are\n\n- `X` is an array\n- `MARGIN` is an integer vector indicating which margins should be \"retained\".\n- `FUN` is a function to be applied\n- `...` is for other arguments to be passed to `FUN`\n\n::: callout-tip\n### Example\n\nHere I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 1.589728 0.7733454 -1.3311072 -0.77084025 -0.1947478 0.1748546\n[2,] 2.395088 0.3243910 -1.5133366 0.09199955 0.3850993 0.1851718\n[3,] 1.039643 -2.1721402 -0.9933217 -1.89261272 0.1748050 1.0563987\n[4,] -1.580978 -0.9884235 -1.4976744 -0.51011200 -2.7512079 0.5547477\n[5,] 1.264799 -2.0551874 0.4483417 -3.08561764 -0.1549359 -0.8384706\n[6,] 1.756973 0.9244522 0.2740854 -0.61441465 -1.0661350 1.4497808\n [,7] [,8] [,9] [,10]\n[1,] 0.7163086 -0.01817166 0.2193225 -0.3346788\n[2,] 0.7606851 0.42082416 0.1099027 0.2834439\n[3,] -1.1218204 -1.17000278 0.4302792 -0.5684986\n[4,] 0.6082452 0.46763465 -0.3481830 -0.1765517\n[5,] -0.7460224 -0.01123782 1.8116342 -0.1033175\n[6,] 1.0160202 -0.82361401 -0.1616471 -0.1628032\n```\n:::\n\n```{.r .cell-code}\napply(x, 2, mean) ## Take the mean of each column\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.083759441 -0.134507982 -0.246473461 -0.371270102 -0.078433882\n [6] -0.101665531 -0.007126106 -0.003193726 0.114767264 0.070612124\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nI can also compute the sum of 
each row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum) ## Take the sum of each row\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.82401382 3.44326903 -5.21727094 -6.22250299 -3.47001414 2.59269751\n [7] -1.76049948 -0.54534465 1.26993157 -0.05660623 1.89101638 2.60154094\n[13] -0.80804188 1.96321614 -2.68869045 0.56525640 0.44214056 -4.25890694\n[19] -3.02509115 -1.01075274\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nIn both calls to `apply()`, the return value was a vector of numbers.\n:::\n\nYou've probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?\n\nThe `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain.\n\nSo when taking the mean of each column, I specify\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 2, mean)\n```\n:::\n\n\nbecause I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum)\n```\n:::\n\n\nbecause I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).\n\n### Col/Row Sums and Means\n\n::: callout-tip\n### Pro-tip\n\nFor the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.\n\n- `rowSums` = `apply(x, 1, sum)`\n- `rowMeans` = `apply(x, 1, mean)`\n- `colSums` = `apply(x, 2, sum)`\n- `colMeans` = `apply(x, 2, mean)`\n:::\n\nThe shortcut functions are heavily optimized and hence are **much** faster, but you probably won't notice unless you're using a large matrix.\n\nAnother nice aspect of these functions is that they are a bit more descriptive. 
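\n\n::: callout-tip\n### Note\n\nA minimal sketch to convince yourself of the equivalence (assuming `x` is still the matrix from the example above): `all.equal()` compares the two forms up to numerical tolerance.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## The shortcut functions and the apply() forms compute the same values\nall.equal(colMeans(x), apply(x, 2, mean))\nall.equal(rowSums(x), apply(x, 1, sum))\n```\n:::\n\n:::\n\n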
It's arguably more clear to write `colMeans(x)` in your code than `apply(x, 2, mean)`.\n\n### Other Ways to Apply\n\nYou can do more than take sums and means with the `apply()` function.\n\n::: callout-tip\n### Example\n\nFor example, you can compute quantiles of the rows of a matrix using the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 0.58654399 -0.502546440 1.1493478 0.6257709 -0.02866237 1.490139530\n[2,] -0.14969248 0.327632870 0.0202589 0.2889600 -0.16552218 -0.829703298\n[3,] 1.12561766 0.707836011 0.6038607 -0.6722613 0.85092968 0.550785886\n[4,] -1.71719604 0.554424755 0.4229181 0.1484968 0.22134369 0.258853355\n[5,] 0.31827641 1.555568589 0.8971850 -0.7742244 0.45459793 -0.043814576\n[6,] -0.08429415 0.001737282 0.1906608 1.1145869 0.54156791 -0.004889302\n [,7] [,8] [,9] [,10]\n[1,] -0.7879713 1.02206400 -1.0420765 -1.2779945\n[2,] 1.7217146 0.06728039 0.6408182 -0.3551929\n[3,] -0.2439192 -0.71553120 -0.8273868 0.2559954\n[4,] -0.1085818 -0.28763268 1.9010457 1.7950971\n[5,] -1.4082747 -1.07621679 0.5428189 0.4538626\n[6,] -1.0644006 -0.04186614 -0.8150566 1.0490749\n```\n:::\n\n```{.r .cell-code}\n## Get row quantiles\napply(x, 1, quantile, probs = c(0.25, 0.75))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n25% -0.7166151 -0.1615648 -0.5651758 -0.04431213 -0.5916219 -0.07368714\n75% 0.9229907 0.3179646 0.6818422 0.52154809 0.5207637 0.45384114\n [,7] [,8] [,9] [,10] [,11] [,12]\n25% -0.4355993 -0.1313015 -0.8149658 -0.9260982 0.02077709 -0.1343613\n75% 1.5985929 0.8889319 0.2213238 0.3661333 0.82424899 0.4156328\n [,13] [,14] [,15] [,16] [,17] [,18]\n25% -0.1281593 -0.6691927 -0.2824997 -0.6574923 0.06421797 -0.7905708\n75% 1.3073689 1.2450340 0.5072401 0.5023885 1.08294108 0.4653062\n [,19] [,20]\n25% -0.5826196 -0.6965163\n75% 0.1313324 
0.6849689\n```\n:::\n:::\n\n\nNotice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.\n:::\n\n## Vectorizing a Function\n\nLet's talk about how we can **"vectorize" a function**.\n\nWhat this means is that we can take a function that typically only takes single arguments and create a new function that can take vector arguments.\n\nThis is often needed when you want to plot functions.\n\n::: callout-tip\n### Example\n\nHere's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\\sum_{i=1}^n(x_i-\\mu)^2/\\sigma^2$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq <- function(mu, sigma, x) {\n sum(((x - mu) / sigma)^2)\n}\n```\n:::\n\n\nThis function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.\n\nIn many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- rnorm(100) ## Generate some data\nsumsq(mu = 1, sigma = 1, x) ## This works (returns one value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 248.8765\n```\n:::\n:::\n\n\nHowever, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq(1:10, 1:10, x) ## This is not what we want\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 119.3071\n```\n:::\n:::\n\n:::\n\nThere's even a function in R called `Vectorize()` that **can automatically create a vectorized version of your function**.\n\nSo we could create a `vsumsq()` function that is fully vectorized as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvsumsq <- Vectorize(sumsq, c(\"mu\", \"sigma\"))\nvsumsq(1:10, 1:10, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 248.8765 
146.5055 124.7964 116.2695 111.8983 109.2945 107.5867 106.3890\n [9] 105.5067 104.8318\n```\n:::\n:::\n\n\nPretty cool, right?\n\n# Summary\n\n- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form\n\n- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning them.\n\n- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere\n\n- The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write a function `compute_s_n()` that for any given `n` computes the sum\n\n$$\nS_n = 1^2 + 2^2 + 3^2 + \\ldots + n^2\n$$\n\nReport the value of the sum when $n$ = 10.\n\n2. Define an empty numerical vector `s_n` of size 25 using `s_n <- vector(\"numeric\", 25)` and store the results of $S_1, S_2, \\ldots, S_{25}$ using a for-loop.\n\n3. Repeat Q2, but this time use `sapply()`.\n\n4. Plot `s_n` versus `n`. 
Use points defined by $n= 1, \\ldots, 25$\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/18-debugging-r-code/index/execute-results/html.json b/_freeze/posts/18-debugging-r-code/index/execute-results/html.json index eee8e21..7537551 100644 --- 
a/_freeze/posts/18-debugging-r-code/index/execute-results/html.json +++ b/_freeze/posts/18-debugging-r-code/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7466f680cc707a58b3cb6baa83e2368a", + "hash": "8ab23c4f0460ca2e98ce61454db02a6d", "result": { - "markdown": "---\ntitle: \"18 - Debugging R Code\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Help! What's wrong with my code???\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/18-debugging-r-code/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Discuss an overall approach to debugging code in R\n- Recognize the three main indications of a problem/condition (`message`, `warning`, `error`) and a fatal problem (`error`)\n- Understand the importance of reproducing the problem when debugging a function or piece of code\n- Learn how the interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n:::\n\n# Debugging R Code\n\n## Overall approach\n\nFinding the **root cause of a problem is always challenging**.\n\nMost bugs are subtle and hard to find because if they were obvious, you would have avoided them in the first place.\n\nA good strategy helps. Below I outline a four-step process that I have found useful:\n\n### 1. Google!\n\nWhenever you see an error message, **start by googling it**.\n\nIf you are lucky, you will discover that it's a common error with a known solution.\n\n::: callout-tip\n### Pro-tip\n\nWhen googling, improve your chances of a good match by removing any variable names or values that are specific to your problem.\n:::\n\n### 2. 
Make it repeatable\n\nTo find the root cause of an error, you are going to need to execute the code many times as you consider and reject hypotheses.\n\n**To make that iteration as quick as possible**, it's worth some upfront investment to **make the problem both easy and fast to reproduce**.\n\nStart by creating a **rep**roducible **ex**ample (reprex).\n\n- This will help others help you, and **often leads to a solution without asking others**, because in the course of making the problem reproducible you often figure out the root cause.\n\nMake the **example minimal by removing code and simplifying data**.\n\n- As you do this, you may discover inputs that do not trigger the error.\n- Make note of them: they will be helpful when diagnosing the root cause.\n\n::: callout-tip\n### Example\n\nLet's try making a **reprex** [using the `reprex` package](https://www.tidyverse.org/help) (installed with the `tidyverse`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(reprex)\n```\n:::\n\n\nWrite a bit of code and copy it to the clipboard:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(y <- 1:4)\nmean(y)\n```\n:::\n\n\nEnter `reprex()` in the R Console. In RStudio, you'll see a preview of your rendered reprex.\n\nIt is now ready and waiting on your clipboard, so you can paste it into, say, a GitHub issue.\n\nOne last step. Let's go here and open up an issue on the course website:\n\n- \n\nWe will paste in the code from our reprex.\n:::\n\nIn RStudio, you can access reprex from the addins menu, which makes it even easier to point out your code and select the output format.\n\n### 3. 
Figure out where it is\n\nIt's a great idea to adopt the scientific method here.\n\n- Generate hypotheses\n- Design experiments to test them\n- Record your results\n\nThis may seem like a lot of work, but **a systematic approach** will end up saving you time.\n\nOften **a lot of time can be wasted relying on your intuition to solve a bug** (\"oh, it must be an off-by-one error, so I'll just subtract 1 here\"), when you would have been better off taking a systematic approach.\n\nIf this fails, you **might need to ask for help from someone else**.\n\nIf you have followed the previous step, you will have a small example that is easy to share with others. That makes it much easier for other people to look at the problem, and more likely to help you find a solution.\n\n### 4. Fix it and test it\n\nOnce you have found the bug, you need to **figure out how to fix it** and to **check that the fix actually worked**.\n\nAgain, it is very useful to have automated tests in place.\n\n- Not only does this help to ensure that you **have actually fixed the bug**, it also **helps to ensure you have not introduced any new bugs** in the process.\n- In the absence of automated tests, make sure to **carefully record the correct output**, and check against the inputs that previously failed.\n\n## Something's Wrong!\n\nOnce you have made the error repeatable, the next step is to figure out where it comes from.\n\nR has a number of **ways to indicate to you that something is not right**.\n\nThere are **different levels of indication** that can be used, ranging from mere notification to fatal error. Executing any function in R may result in the following **conditions**.\n\n- `message`: A **generic notification/diagnostic message** produced by the `message()` function; execution of the function continues\n- `warning`: An indication that **something is wrong but not necessarily fatal**; execution of the function continues. 
Warnings are generated by the `warning()` function\n- `error`: An indication that **a fatal problem has occurred** and execution of the function stops. Errors are produced by the `stop()` function.\n- `condition`: A generic concept for indicating that **something unexpected has occurred**; programmers can create their own custom conditions if they want.\n\n::: callout-tip\n### Example\n\nHere is an example of a warning that you might receive in the course of using R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(-1)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(-1): NaNs produced\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN\n```\n:::\n:::\n\n\nThis warning lets you know that taking the log of a negative number results in a `NaN` value because you **can't take the log of negative numbers**.\n:::\n\nNevertheless, R doesn't give an error, because it has a useful value that it can return, the **`NaN` value**.\n\nThe **warning is just there** to let you know that **something unexpected happened**.\n\nDepending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code.\n\n::: callout-tip\n### Example\n\nHere is another function that is designed to print a message to the console depending on the nature of its input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message <- function(x) {\n if(x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n } \n invisible(x) \n}\n```\n:::\n\n\nThis function is simple:\n\n- It **prints a message** telling you whether `x` is greater than zero or less than or equal to zero.\n- It also returns its input **invisibly**, which is a common practice with \"print\" functions.\n\n**Returning an object invisibly** means that the **return value does not get auto-printed** when the function is called.\n\nTake a hard look at the function above and see if you can identify any bugs or 
problems.\n\nWe can execute the function as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nThe function seems to work fine at this point. No errors, warnings, or messages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(NA)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (x > 0) {: missing value where TRUE/FALSE needed\n```\n:::\n:::\n\n:::\n\nWhat happened?\n\n- Well, the first thing the function does is test if `x > 0`.\n- But you can't do that test if `x` is a `NA` or `NaN` value.\n- R **doesn't know what to do in this case** so it **stops with a fatal error**.\n\nWe can **fix this problem** by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2 <- function(x) {\n if(is.na(x))\n print(\"x is a missing value!\")\n else if(x > 0)\n print(\"x is greater than zero\")\n else\n print(\"x is less than or equal to zero\")\n invisible(x)\n}\n```\n:::\n\n\nNow we can run the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2(NA)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is a missing value!\"\n```\n:::\n:::\n\n\nAnd all is fine.\n\nNow what about the following situation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- log(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(c(-1, 2)): NaNs produced\n```\n:::\n\n```{.r .cell-code}\nprint_message2(x)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (is.na(x)) print(\"x is a missing value!\") else if (x > 0) print(\"x is greater than zero\") else print(\"x is less than or equal to zero\"): the condition has length > 1\n```\n:::\n:::\n\n\nNow what?? 
Why are we getting this error?\n\nThe **error** says \"the condition has length \\> 1\". (In versions of R before 4.2.0, this was only a warning: \"the condition has length \\> 1 and only the first element will be used\".)\n\nThe **problem here** is that I passed `print_message2()` a vector `x` that was of length 2 rather than length 1.\n\nInside the body of `print_message2()` the expression `is.na(x)` returns a vector that is tested in the `if` statement.\n\nHowever, `if` cannot take vector arguments, so you get an error.\n\nThe fundamental problem here is that `print_message2()` is not **vectorized**.\n\nWe can **solve this problem** in two ways.\n\n1. Simply **not allow vector arguments**.\n2. The other way is to **vectorize** the `print_message2()` function to allow it to take vector arguments.\n\nFor the **first way**, we simply need to check the length of the input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3 <- function(x) {\n if(length(x) > 1L)\n stop(\"'x' has length > 1\")\n if(is.na(x))\n print(\"x is a missing value!\")\n else if(x > 0)\n print(\"x is greater than zero\")\n else\n print(\"x is less than or equal to zero\")\n invisible(x)\n}\n```\n:::\n\n\nNow when we pass `print_message3()` a vector, we should get an **error**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3(1:2)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in print_message3(1:2): 'x' has length > 1\n```\n:::\n:::\n\n\nVectorizing the function can be accomplished easily with the `Vectorize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message4 <- Vectorize(print_message2)\nout <- print_message4(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is less than or equal to zero\"\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nYou can see now that the **correct messages are printed without any warning or error**.\n\n::: callout-tip\n### Note\n\nI stored the return value of `print_message4()` in a separate R object called `out`.\n\nThis is because when I use the `Vectorize()` function it no longer preserves the 
invisibility of the return value.\n:::\n\n::: callout-tip\n### Helpful tips\n\nThe **primary task of debugging** any R code is **correctly diagnosing what the problem is**.\n\nWhen diagnosing a problem with your code (or somebody else's), it's important to **first understand what you were expecting to occur**.\n\nThen you need to **identify what did occur** and **how it deviated from your expectations**.\n\nSome basic questions you need to ask are\n\n- What was your input? How did you call the function?\n- What were you expecting? Output, messages, other results?\n- What did you get?\n- How does what you get differ from what you were expecting?\n- Were your expectations correct in the first place?\n- Can you reproduce the problem (exactly)?\n:::\n\nBeing able to answer these questions is important not just for your own sake, but in situations where you may need to ask someone else for help with debugging the problem.\n\nSeasoned programmers will be asking you these exact questions.\n\n# Debugging Tools in R\n\nR provides a number of tools to help you with debugging your code. The primary tools for debugging functions in R are\n\n- `traceback()`: **prints out the function call stack** after an error occurs; does nothing if there's no error\n- `debug()`: **flags a function for \"debug\" mode** which allows you to step through execution of a function one line at a time\n- `browser()`: **suspends the execution of a function** wherever it is called and puts the function in debug mode\n- `trace()`: allows you to **insert debugging code into a function** at specific places\n- `recover()`: allows you to **modify the error behavior** so that you can browse the function call stack\n\nThese functions are interactive tools specifically designed to allow you to pick through a function. 
There is also the more blunt technique of inserting `print()` or `cat()` statements in the function.\n\n## Using `traceback()`\n\nThe `traceback()` function **prints out the function call stack** after an error has occurred.\n\nThe **function call stack** is the **sequence of functions that was called before the error occurred**.\n\nFor example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`.\n\nIf an error occurs, it may not be immediately clear in which function the error occurred.\n\nThe `traceback()` function **shows you how many levels deep** you were when the error occurred.\n\n::: callout-tip\n### Example\n\nLet's use the `mean()` function on a vector `z` that does not exist in our R environment.\n\n``` r\n> mean(z)\nError in mean(z) : object 'z' not found\n> traceback()\n1: mean(z)\n```\n\nHere, it's **clear that the error occurred** inside the `mean()` function because the object `z` does not exist.\n:::\n\nThe `traceback()` function **must be called immediately after an error** occurs. Once another function is called, you lose the traceback.\n\n::: callout-tip\n### Example\n\nHere is a slightly more complicated example using the `lm()` function for linear modeling.\n\n``` r\n> lm(y ~ x)\nError in eval(expr, envir, enclos) : object 'y' not found\n> traceback()\n7: eval(expr, envir, enclos)\n6: eval(predvars, data, env)\n5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)\n4: model.frame(formula = y ~ x, drop.unused.levels = TRUE)\n3: eval(expr, envir, enclos)\n2: eval(mf, parent.frame())\n1: lm(y ~ x)\n```\n\nYou can see now that the **error did not get thrown until the 7th level of the function call stack**, in which case the `eval()` function tried to evaluate the formula `y ~ x` and **realized the object `y` did not exist**.\n:::\n\nLooking at the traceback is useful for figuring out roughly where an error occurred but it's not useful for more detailed debugging. 
For that you might turn to the `debug()` function.\n\n## Using `debug()`\n\n
\n\nClick here for how to use `debug()` with an interactive browser.\n\nThe `debug()` function initiates an interactive debugger (also known as the \"browser\" in R) for a function. With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs.\n\nThe `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function.\n\n``` r\n> debug(lm) ## Flag the 'lm()' function for interactive debugging\n> lm(y ~ x)\ndebugging in: lm(y ~ x)\ndebug: {\n ret.x <- x\n ret.y <- y\n cl <- match.call()\n ...\n if (!qr)\n z$qr <- NULL \n z\n} \nBrowse[2]>\n```\n\nNow, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function.\n\nThe debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser:\n\n- `n` executes the current expression and moves to the next expression\n- `c` continues execution of the function and does not stop until either an error occurs or the function exits\n- `Q` quits the browser\n\nHere's an example of a browser session with the `lm()` function.\n\n``` r\nBrowse[2]> n ## Evaluate this expression and move to the next one\ndebug: ret.x <- x\nBrowse[2]> n\ndebug: ret.y <- y\nBrowse[2]> n\ndebug: cl <- match.call()\nBrowse[2]> n\ndebug: mf <- match.call(expand.dots = FALSE)\nBrowse[2]> n\ndebug: m <- match(c(\"formula\", \"data\", \"subset\", \"weights\", \"na.action\",\n \"offset\"), names(mf), 0L)\n```\n\nWhile you are in the browser you can execute any other R function that might be available to you in a regular session. 
In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment.\n\nYou can turn off interactive debugging with the `undebug()` function.\n\n``` r\nundebug(lm) ## Unflag the 'lm()' function for debugging\n```\n\n
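The `browser()` function from the tools list above deserves a quick illustration of its own: instead of flagging a whole function with `debug()`, you can pause execution at a specific line by inserting a call to `browser()` yourself. This is an illustrative sketch, not part of the original lecture code; the function `ratio_mean()` is made up for this example.

``` r
ratio_mean <- function(x) {
    total <- sum(x)
    browser() ## Execution pauses here; try ls() or print(total), then n, c, or Q
    total / length(x)
}
ratio_mean(1:10) ## Drops into the browser before returning 5.5
```

Unlike `debug()`, there is no flag to unset afterwards: you simply delete the `browser()` call when you are done.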
\n\n## Using `recover()`\n\n
\n\nClick here for how to use `recover()` with an interactive browser.\n\nThe `recover()` function can be used to modify the error behavior of R when an error occurs. Normally, when an error occurs in a function, R will print out an error message, exit out of the function, and return you to your workspace to await further commands.\n\nWith `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.\n\n``` r\n> options(error = recover) ## Change default R error behavior\n> read.csv(\"nosuchfile\") ## This code doesn't work\nError in file(file, \"rt\") : cannot open the connection\nIn addition: Warning message:\nIn file(file, \"rt\") :\n cannot open file 'nosuchfile': No such file or directory\n \nEnter a frame number, or 0 to exit\n\n1: read.csv(\"nosuchfile\")\n2: read.table(file = file, header = header, sep = sep, quote = quote, dec =\n3: file(file, \"rt\")\n\nSelection:\n```\n\nThe `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.\n\n
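Two small follow-ups, sketched under the assumption that the `print_message2()` function from earlier in this lecture is still defined: you can restore R's default error behavior with `options(error = NULL)`, and the `trace()` function from the tools list, which does not get its own section here, can inject a call to `browser()` into a function without editing its source.

``` r
options(error = NULL)          ## Restore R's default error behavior
trace(print_message2, browser) ## Enter the browser whenever print_message2() is called
print_message2(2)
untrace(print_message2)        ## Remove the injected debugging code
```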
\n\n# Summary\n\n- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal\n- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation\n- Interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n- Debugging tools are not a substitute for thinking!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try using `traceback()` to debug this piece of code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) g(a)\ng <- function(b) h(b)\nh <- function(c) i(c)\ni <- function(d) {\n if (!is.numeric(d)) {\n stop(\"`d` must be numeric\", call. = FALSE)\n }\n d + 10\n}\nf(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: `d` must be numeric\n```\n:::\n:::\n\n\nDescribe in words what is happening above?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 
0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n reprex * 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"18 - Debugging R Code\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Help! What's wrong with my code???\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/18-debugging-r-code/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Discuss an overall approach to debugging code in R\n- Recognize the three main indications of a problem/condition (`message`, `warning`, `error`) and a fatal problem (`error`)\n- Understand the importance of reproducing the problem when debugging a function or piece of code\n- Learn how the interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n:::\n\n# Debugging R Code\n\n## Overall approach\n\nFinding the **root cause of a problem is always challenging**.\n\nMost bugs are subtle and hard to find because if they were obvious, you would have avoided them in the first place.\n\nA good strategy helps. Below I outline a four-step process that I have found useful:\n\n### 1. Google!\n\nWhenever you see an error message, **start by googling it**.\n\nIf you are lucky, you will discover that it's a common error with a known solution.\n\n::: callout-tip\n### Pro-tip\n\nWhen googling, improve your chances of a good match by removing any variable names or values that are specific to your problem.\n:::\n\n### 2. 
Make it repeatable\n\nTo find the root cause of an error, you are going to need to execute the code many times as you consider and reject hypotheses.\n\n**To make that iteration as quick as possible**, it's worth some upfront investment to **make the problem both easy and fast to reproduce**.\n\nStart by creating a **rep**roducible **ex**ample (reprex).\n\n- This will help others help you, and **often leads to a solution without asking others**, because in the course of making the problem reproducible you often figure out the root cause.\n\nMake the **example minimal by removing code and simplifying data**.\n\n- As you do this, you may discover inputs that do not trigger the error.\n- Make note of them: they will be helpful when diagnosing the root cause.\n\n::: callout-tip\n### Example\n\nLet's try making a **reprex** [using the `reprex` package](https://www.tidyverse.org/help) (installed with the `tidyverse`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(reprex)\n```\n:::\n\n\nWrite a bit of code and copy it to the clipboard:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(y <- 1:4)\nmean(y)\n```\n:::\n\n\nEnter `reprex()` in the R Console. In RStudio, you'll see a preview of your rendered reprex.\n\nIt is now ready and waiting on your clipboard, so you can paste it into, say, a GitHub issue.\n\nOne last step. Let's go here and open up an issue on the course website:\n\n- \n\nWe will paste in the code from our reprex.\n:::\n\nIn RStudio, you can access reprex from the addins menu, which makes it even easier to point out your code and select the output format.\n\n### 3. 
Figure out where it is\n\nIt's a great idea to adopt the scientific method here.\n\n- Generate hypotheses\n- Design experiments to test them\n- Record your results\n\nThis may seem like a lot of work, but **a systematic approach** will end up saving you time.\n\nOften **a lot of time can be wasted relying on your intuition to solve a bug** (\"oh, it must be an off-by-one error, so I'll just subtract 1 here\"), when you would have been better off taking a systematic approach.\n\nIf this fails, you **might need to ask for help from someone else**.\n\nIf you have followed the previous step, you will have a small example that is easy to share with others. That makes it much easier for other people to look at the problem, and more likely to help you find a solution.\n\n### 4. Fix it and test it\n\nOnce you have found the bug, you need to **figure out how to fix it** and to **check that the fix actually worked**.\n\nAgain, it is very useful to have automated tests in place.\n\n- Not only does this help to ensure that you **have actually fixed the bug**, it also **helps to ensure you have not introduced any new bugs** in the process.\n- In the absence of automated tests, make sure to **carefully record the correct output**, and check against the inputs that previously failed.\n\n## Something's Wrong!\n\nOnce you have made the error repeatable, the next step is to figure out where it comes from.\n\nR has a number of **ways to indicate to you that something is not right**.\n\nThere are **different levels of indication** that can be used, ranging from mere notification to fatal error. Executing any function in R may result in the following **conditions**.\n\n- `message`: A **generic notification/diagnostic message** produced by the `message()` function; execution of the function continues\n- `warning`: An indication that **something is wrong but not necessarily fatal**; execution of the function continues. 
Warnings are generated by the `warning()` function\n- `error`: An indication that **a fatal problem has occurred** and execution of the function stops. Errors are produced by the `stop()` function.\n- `condition`: A generic concept for indicating that **something unexpected has occurred**; programmers can create their own custom conditions if they want.\n\n::: callout-tip\n### Example\n\nHere is an example of a warning that you might receive in the course of using R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(-1)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(-1): NaNs produced\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN\n```\n:::\n:::\n\n\nThis warning lets you know that taking the log of a negative number results in a `NaN` value because you **can't take the log of negative numbers**.\n:::\n\nNevertheless, R doesn't give an error, because it has a useful value that it can return, the **`NaN` value**.\n\nThe **warning is just there** to let you know that **something unexpected happened**.\n\nDepending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code.\n\n::: callout-tip\n### Example\n\nHere is another function that is designed to print a message to the console depending on the nature of its input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message <- function(x) {\n if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nThis function is simple:\n\n- It **prints a message** telling you whether `x` is greater than zero or less than or equal to zero.\n- It also returns its input **invisibly**, which is a common practice with \"print\" functions.\n\n**Returning an object invisibly** means that the **return value does not get auto-printed** when the function is called.\n\nTake a hard look at the function above and see if you can identify any bugs or 
problems.\n\nWe can execute the function as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nThe function seems to work fine at this point. No errors, warnings, or messages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(NA)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (x > 0) {: missing value where TRUE/FALSE needed\n```\n:::\n:::\n\n:::\n\nWhat happened?\n\n- Well, the first thing the function does is test if `x > 0`.\n- But you can't do that test if `x` is a `NA` or `NaN` value.\n- R **doesn't know what to do in this case** so it **stops with a fatal error**.\n\nWe can **fix this problem** by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2 <- function(x) {\n if (is.na(x)) {\n print(\"x is a missing value!\")\n } else if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nNow we can run the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2(NA)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is a missing value!\"\n```\n:::\n:::\n\n\nAnd all is fine.\n\nNow what about the following situation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- log(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(c(-1, 2)): NaNs produced\n```\n:::\n\n```{.r .cell-code}\nprint_message2(x)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (is.na(x)) {: the condition has length > 1\n```\n:::\n:::\n\n\nNow what?? 
Why are we getting this error?\n\nThe **error** says \"the condition has length \\> 1\". (In versions of R before 4.2, this was only a warning, saying \"the condition has length \\> 1 and only the first element will be used\".)\n\nThe **problem here** is that I passed `print_message2()` a vector `x` that was of length 2 rather than length 1.\n\nInside the body of `print_message2()` the expression `is.na(x)` returns a vector that is tested in the `if` statement.\n\nHowever, `if` cannot take vector arguments, so you get an error.\n\nThe fundamental problem here is that `print_message2()` is not **vectorized**.\n\nWe can **solve this problem** in two ways.\n\n1. Simply **not allow vector arguments**.\n2. The other way is to **vectorize** the `print_message2()` function to allow it to take vector arguments.\n\nFor the **first way**, we simply need to check the length of the input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3 <- function(x) {\n if (length(x) > 1L) {\n stop(\"'x' has length > 1\")\n }\n if (is.na(x)) {\n print(\"x is a missing value!\")\n } else if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nNow when we pass `print_message3()` a vector, we should get an **error**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3(1:2)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in print_message3(1:2): 'x' has length > 1\n```\n:::\n:::\n\n\nVectorizing the function can be accomplished easily with the `Vectorize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message4 <- Vectorize(print_message2)\nout <- print_message4(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is less than or equal to zero\"\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nYou can see now that the **correct messages are printed without any warning or error**.\n\n::: callout-tip\n### Note\n\nI stored the return value of `print_message4()` in a separate R object called `out`.\n\nThis is because when I use the `Vectorize()` function it no 
longer preserves the invisibility of the return value.\n:::\n\n::: callout-tip\n### Helpful tips\n\nThe **primary task of debugging** any R code is **correctly diagnosing what the problem is**.\n\nWhen diagnosing a problem with your code (or somebody else's), it's important to **first understand what you were expecting to occur**.\n\nThen you need to **identify what did occur** and **how it deviated from your expectations**.\n\nSome basic questions you need to ask are\n\n- What was your input? How did you call the function?\n- What were you expecting? Output, messages, other results?\n- What did you get?\n- How does what you get differ from what you were expecting?\n- Were your expectations correct in the first place?\n- Can you reproduce the problem (exactly)?\n:::\n\nBeing able to answer these questions is important not just for your own sake, but in situations where you may need to ask someone else for help with debugging the problem.\n\nSeasoned programmers will be asking you these exact questions.\n\n# Debugging Tools in R\n\nR provides a number of tools to help you with debugging your code. The primary tools for debugging functions in R are\n\n- `traceback()`: **prints out the function call stack** after an error occurs; does nothing if there's no error\n- `debug()`: **flags a function for \"debug\" mode** which allows you to step through execution of a function one line at a time\n- `browser()`: **suspends the execution of a function** wherever it is called and puts the function in debug mode\n- `trace()`: allows you to **insert debugging code into a function** at specific places\n- `recover()`: allows you to **modify the error behavior** so that you can browse the function call stack\n\nThese functions are interactive tools specifically designed to allow you to pick through a function. 
There is also the more blunt technique of inserting `print()` or `cat()` statements in the function.\n\n## Using `traceback()`\n\nThe `traceback()` function **prints out the function call stack** after an error has occurred.\n\nThe **function call stack** is the **sequence of functions that was called before the error occurred**.\n\nFor example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`.\n\nIf an error occurs, it may not be immediately clear in which function the error occurred.\n\nThe `traceback()` function **shows you how many levels deep** you were when the error occurred.\n\n::: callout-tip\n### Example\n\nLet's use the `mean()` function on a vector `z` that does not exist in our R environment.\n\n``` r\n> mean(z)\nError in mean(z) : object 'z' not found\n> traceback()\n1: mean(z)\n```\n\nHere, it's **clear that the error occurred** inside the `mean()` function because the object `z` does not exist.\n:::\n\nThe `traceback()` function **must be called immediately after an error** occurs. Once another function is called, you lose the traceback.\n\n::: callout-tip\n### Example\n\nHere is a slightly more complicated example using the `lm()` function for linear modeling.\n\n``` r\n> lm(y ~ x)\nError in eval(expr, envir, enclos) : object 'y' not found\n> traceback()\n7: eval(expr, envir, enclos)\n6: eval(predvars, data, env)\n5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)\n4: model.frame(formula = y ~ x, drop.unused.levels = TRUE)\n3: eval(expr, envir, enclos)\n2: eval(mf, parent.frame())\n1: lm(y ~ x)\n```\n\nYou can see now that the **error did not get thrown until the 7th level of the function call stack**, at which point the `eval()` function tried to evaluate the formula `y ~ x` and **realized the object `y` did not exist**.\n:::\n\nLooking at the traceback is useful for figuring out roughly where an error occurred but it's not useful for more detailed debugging. 
For that you might turn to the `debug()` function.\n\n## Using `debug()`\n\n
\n\nHere is how to use `debug()` with an interactive browser.\n\nThe `debug()` function initiates an interactive debugger (also known as the \"browser\" in R) for a function. With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs.\n\nThe `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function.\n\n``` r\n> debug(lm) ## Flag the 'lm()' function for interactive debugging\n> lm(y ~ x)\ndebugging in: lm(y ~ x)\ndebug: {\n ret.x <- x\n ret.y <- y\n cl <- match.call()\n ...\n if (!qr)\n z$qr <- NULL \n z\n} \nBrowse[2]>\n```\n\nNow, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function.\n\nThe debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser:\n\n- `n` executes the current expression and moves to the next expression\n- `c` continues execution of the function and does not stop until either an error occurs or the function exits\n- `Q` quits the browser\n\nHere's an example of a browser session with the `lm()` function.\n\n``` r\nBrowse[2]> n ## Evaluate this expression and move to the next one\ndebug: ret.x <- x\nBrowse[2]> n\ndebug: ret.y <- y\nBrowse[2]> n\ndebug: cl <- match.call()\nBrowse[2]> n\ndebug: mf <- match.call(expand.dots = FALSE)\nBrowse[2]> n\ndebug: m <- match(c(\"formula\", \"data\", \"subset\", \"weights\", \"na.action\",\n \"offset\"), names(mf), 0L)\n```\n\nWhile you are in the browser you can execute any other R function that might be available to you in a regular session. 
In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment.\n\nYou can turn off interactive debugging with the `undebug()` function.\n\n``` r\nundebug(lm) ## Unflag the 'lm()' function for debugging\n```\n\n
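## Using `browser()`\n\nInstead of flagging a whole function with `debug()`, you can pause execution at a specific point by calling `browser()` directly inside your own code. When R reaches that call, it suspends execution and drops you into the same interactive browser described above, where `n`, `c`, and `Q` work as before. Here is a minimal sketch (the function `first_over()` below is a made-up example for illustration, not from base R):\n\n``` r\nfirst_over <- function(x, threshold) {\n idx <- which(x > threshold) ## candidate positions\n browser() ## execution pauses here; inspect the environment with ls() and print(idx)\n x[idx[1]]\n}\n\nfirst_over(c(1, 5, 3), threshold = 2)\n```\n\nBecause `browser()` only fires where you placed it, remember to delete the call once you have fixed the bug.\n\n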
\n\n## Using `recover()`\n\n
\n\nHere is how to use `recover()` with an interactive browser.\n\nThe `recover()` function can be used to modify the error behavior of R when an error occurs. Normally, when an error occurs in a function, R will print out an error message, exit out of the function, and return you to your workspace to await further commands.\n\nWith `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.\n\n``` r\n> options(error = recover) ## Change default R error behavior\n> read.csv(\"nosuchfile\") ## This code doesn't work\nError in file(file, \"rt\") : cannot open the connection\nIn addition: Warning message:\nIn file(file, \"rt\") :\n cannot open file 'nosuchfile': No such file or directory\n \nEnter a frame number, or 0 to exit\n\n1: read.csv(\"nosuchfile\")\n2: read.table(file = file, header = header, sep = sep, quote = quote, dec =\n3: file(file, \"rt\")\n\nSelection:\n```\n\nThe `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.\n\n
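When you are done debugging, reset the error option so that R returns to its default behavior:\n\n``` r\noptions(error = NULL) ## restore the default error behavior\n```\n\nOtherwise, every subsequent error in your session will keep dropping you into the frame-selection menu.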
\n\n# Summary\n\n- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal\n- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation\n- Interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n- Debugging tools are not a substitute for thinking!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try using `traceback()` to debug this piece of code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) g(a)\ng <- function(b) h(b)\nh <- function(c) i(c)\ni <- function(d) {\n if (!is.numeric(d)) {\n stop(\"`d` must be numeric\", call. = FALSE)\n }\n d + 10\n}\nf(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: `d` must be numeric\n```\n:::\n:::\n\n\nDescribe in words what is happening above?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 
0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n reprex * 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json b/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json index bc1a96c..3bd7d55 100644 --- a/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json +++ b/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "0526af525b38b7e0548e2c7611c85749", + "hash": "09a27c1c17c10264f75e75c5203f76e9", "result": { - "markdown": "---\ntitle: \"19 - Error Handling and Generation\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Implement exception handling routines in R functions\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/19-error-handling-and-generation/index.qmd).*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Create errors, warnings, and messages in R functions using the functions `stop`, `stopifnot`, `warning`, and `message`.\n- Understand the importance of providing useful error messaging to improve user experience with functions. 
However, these can also slow down code substantially.\n:::\n\n# Error Handling and Generation\n\n## What is an error?\n\n**Errors most often occur** when code is used in a way that **it is not intended to be used**.\n\n::: callout-tip\n### Example\n\nFor example, adding two strings together produces the following error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\"hello\" + \"world\"\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"hello\" + \"world\": non-numeric argument to binary operator\n```\n:::\n:::\n\n:::\n\nThe `+` operator is essentially a **function** that takes two numbers as arguments and finds their sum.\n\nSince neither `\"hello\"` nor `\"world\"` is a number, the R interpreter produces an error.\n\n**Errors will stop the execution of your program**, and they will (hopefully) print an error message to the R console.\n\nIn R there are two other constructs which are related to errors:\n\n1. Warnings\n2. Messages\n\n**Warnings** are meant to indicate that **something seems to have gone wrong** in your program that should be inspected.\n\n::: callout-tip\n### Example\n\nHere's a simple example of a warning being generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe `as.numeric()` function attempts to **convert each string** in `c(\"5\", \"6\", \"seven\")` into a number; however, it is impossible to convert `\"seven\"`, so a warning is generated.\n\nExecution of the code is not halted, and an `NA` is produced for `\"seven\"` instead of a number.\n:::\n\n**Messages** simply **print to the R console**, though they are generated by an underlying mechanism that is similar to how errors and warnings are generated.\n\n::: callout-tip\n### Example\n\nHere's a small function that will generate a message:\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nf <- function(){\n message(\"This is a message.\")\n}\n\nf()\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nThis is a message.\n```\n:::\n:::\n\n:::\n\n## Generating Errors\n\nThere are a few essential functions for **generating** errors, warnings, and messages in R.\n\nThe `stop()` function will generate an error.\n\n::: callout-tip\n### Example\n\nLet's generate an error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstop(\"Something erroneous has occurred!\")\n```\n:::\n\n\n``` r\nError: Something erroneous has occurred!\n```\n:::\n\nIf an error occurs inside of a function, then the **name of that function will appear in the error message**:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname_of_function <- function(){\n stop(\"Something bad happened.\")\n}\n\nname_of_function()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in name_of_function(): Something bad happened.\n```\n:::\n:::\n\n\nThe `stopifnot()` function takes a series of logical expressions as arguments and if any of them are false an error is generated specifying which expression is false.\n\n::: callout-tip\n### Example\n\nLet's take a look at an example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_if_n_is_greater_than_zero <- function(n){\n stopifnot(n <= 0)\n n\n}\n\nerror_if_n_is_greater_than_zero(5)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in error_if_n_is_greater_than_zero(5): n <= 0 is not TRUE\n```\n:::\n:::\n\n:::\n\nThe `warning()` function creates a warning, and the function itself is very similar to the `stop()` function. 
Remember that a warning does not stop the execution of a program (unlike an error).\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwarning(\"Consider yourself warned!\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Consider yourself warned!\n```\n:::\n:::\n\n:::\n\nJust like errors, a warning generated inside of a function will include the name of the function in which it was generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_NA <- function(x){\n warning(\"Generating an NA.\")\n NA\n}\n\nmake_NA(\"Sodium\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in make_NA(\"Sodium\"): Generating an NA.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\nMessages are simpler than errors or warnings; they just print strings to the R console.\n\nYou can issue a message with the `message()` function:\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmessage(\"In a bottle.\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nIn a bottle.\n```\n:::\n:::\n\n:::\n\n## When to generate errors or warnings\n\nStopping the execution of your program with `stop()` should only happen in the event of a catastrophe - meaning only if it is impossible for your program to continue.\n\n- If there are **conditions that you can anticipate** that would cause your program to create an error, then you **should document those conditions** so whoever uses your software is aware.\n\nOne example:\n\n- Providing invalid arguments to a function. 
You could check this at the beginning of your program using `stopifnot()` so that the user can quickly realize something has gone wrong.\n\nYou can think of a function as kind of contract between you and the user:\n\n- if the user provides specified arguments, your program will provide predictable results.\n\nOf course it's **impossible for you to anticipate** all of the potential uses of your program.\n\nIt's **appropriate to create a warning** when this contract between you and the user is violated.\n\nA perfect example of this situation is the result of\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe user expects a vector of numbers to be returned as the result of `as.numeric()` but `\"seven\"` is coerced into being NA, which is not completely intuitive.\n\nR has largely been developed according to the [Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which generally **discourages printing text to the console unless something unexpected has occurred**.\n\nLanguages that commonly run on Unix systems like C and C++ are rarely used interactively, meaning that they usually underpin computer infrastructure (computers \"talking\" to other computers).\n\n**Messages printed to the console** are therefore not very useful since nobody will ever read them and it's not straightforward for other programs to capture and interpret them.\n\nIn contrast, R code is frequently executed by human beings in the R console, which serves as an interactive environment between the computer and person at the keyboard.\n\nIf you **think your program should produce a message**, make sure that the **output of the message is primarily meant for a human to read**.\n\nYou should avoid signaling a condition or the result of your program to another program by creating a 
message.\n\n## How should errors be handled?\n\nImagine writing a program that will take a long time to complete because of a complex calculation or because you're handling a large amount of data. If an error occurs during this computation then you're liable to lose all of the results that were calculated before the error, or your program may not finish a critical task that a program further down your pipeline is depending on. If you anticipate the possibility of errors occurring during the execution of your program, then you can design your program to handle them appropriately.\n\nThe `tryCatch()` function is the workhorse of handling errors and warnings in R. The first argument of this function is any R expression, followed by conditions which specify how to handle an error or a warning. The last argument, `finally`, specifies a function or expression that will be executed after the expression no matter what, even in the event of an error or a warning.\n\nLet's construct a simple function I'm going to call [`beera`](https://en.wikipedia.org/wiki/Yogi_Berra) that catches errors and warnings gracefully.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera <- function(expr){\n tryCatch(expr,\n error = function(e){\n message(\"An error occurred:\\n\", e)\n },\n warning = function(w){\n message(\"A warning occurred:\\n\", w)\n },\n finally = {\n message(\"Finally done!\")\n })\n}\n```\n:::\n\n\nThis function takes an expression as an argument and tries to evaluate it. If the expression can be evaluated without any errors or warnings then the result of the expression is returned and the message `Finally done!` is printed to the R console. If an error or warning is generated, then the functions that are provided to the `error` or `warning` arguments are executed. Let's try this function out with a few examples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera({\n 2 + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFinally done!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\nbeera({\n \"two\" + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nAn error occurred:\nError in \"two\" + 2: non-numeric argument to binary operator\n\nFinally done!\n```\n:::\n\n```{.r .cell-code}\nbeera({\n as.numeric(c(1, \"two\", 3))\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nA warning occurred:\nsimpleWarning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced by coercion\n\nFinally done!\n```\n:::\n:::\n\n\nNotice that we've effectively transformed errors and warnings into messages.\n\nNow that you know the basics of generating and catching errors you'll need to decide when your program should generate an error. My advice to you is to limit the number of errors your program generates as much as possible. Even if you design your program so that it's able to catch and handle errors, the error handling process slows down your program substantially. Imagine you wanted to write a simple function that checks if an argument is an even number. You might write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even <- function(n){\n n %% 2 == 0\n}\n\nis_even(768)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even(\"two\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in n%%2: non-numeric argument to binary operator\n```\n:::\n:::\n\n\nYou can see that providing a string causes this function to raise an error. You could imagine though that you want to use this function across a list of different data types, and you only want to know which elements of that list are even numbers. 
You might think to write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_error <- function(n){\n tryCatch(n %% 2 == 0,\n error = function(e){\n FALSE\n })\n}\n\nis_even_error(714)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_error(\"eight\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\nThis appears to be working the way you intended, however when applied to more data this function will be seriously slow compared to alternatives. For example I could check that `n` is numeric before treating `n` like a number:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_check <- function(n){\n is.numeric(n) && n %% 2 == 0\n}\n\nis_even_check(1876)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_check(\"twelve\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n::: keyideas\nNotice that by using `is.numeric()` before the \"AND\" operator (`&&`), the expression `n %% 2 == 0` is never evaluated. 
This is a programming language design feature called \"short circuiting.\" The expression can never evaluate to `TRUE` if the left hand side of `&&` evaluates to `FALSE`, so the right hand side is ignored.\n:::\n\nTo demonstrate the difference in the speed of the code, we will use the `microbenchmark` package to measure how long it takes for each function to be applied to the same data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(microbenchmark)\nmicrobenchmark(sapply(letters, is_even_check))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_check) 46.224 47.7975 61.43616 48.6445 58.4755 167.091 100\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmicrobenchmark(sapply(letters, is_even_error))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_error) 640.067 678.0285 906.3037 784.4315 1044.501 2308.931 100\n```\n\nThe error catching approach is nearly 15 times slower!\n\nProper error handling is an essential tool for any software developer so that you can design programs that are error tolerant. Creating clear and informative error messages is essential for building quality software.\n\n::: callout-tip\n### Pro-tip\n\nOne closing tip I recommend is to put documentation for your software online, including the meaning of the errors that your software can potentially throw. 
Often a user's first instinct when encountering an error is to search online for that error message, which should lead them to your documentation!\n:::\n\n# Summary\n\n- Errors, warnings, and messages can be generated within R code using the functions `stop`, `stopifnot`, `warning`, and `message`.\n\n- Catching errors, and providing useful error messaging, can improve user experience with functions but can also slow down code substantially.\n\n# Post-lecture materials\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"19 - Error Handling and Generation\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Implement exception handling routines in R functions\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/19-error-handling-and-generation/index.qmd).*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Create errors, warnings, and messages in R functions using the functions `stop`, `stopifnot`, `warning`, and `message`.\n- Understand the importance of providing useful error messaging to improve user experience with functions. 
However, these can also slow down code substantially.\n:::\n\n# Error Handling and Generation\n\n## What is an error?\n\n**Errors most often occur** when code is used in a way that **it is not intended to be used**.\n\n::: callout-tip\n### Example\n\nFor example, adding two strings together produces the following error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\"hello\" + \"world\"\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"hello\" + \"world\": non-numeric argument to binary operator\n```\n:::\n:::\n\n:::\n\nThe `+` operator is essentially a **function** that takes two numbers as arguments and finds their sum.\n\nSince neither `\"hello\"` nor `\"world\"` is a number, the R interpreter produces an error.\n\n**Errors will stop the execution of your program**, and they will (hopefully) print an error message to the R console.\n\nIn R, there are two other constructs which are related to errors:\n\n1. Warnings\n2. Messages\n\n**Warnings** are meant to indicate that **something seems to have gone wrong** in your program that should be inspected.\n\n::: callout-tip\n### Example\n\nHere's a simple example of a warning being generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe `as.numeric()` function attempts to **convert each string** in `c(\"5\", \"6\", \"seven\")` into a number; however, it is impossible to convert `\"seven\"`, so a warning is generated.\n\nExecution of the code is not halted, and an `NA` is produced for `\"seven\"` instead of a number.\n:::\n\n**Messages** simply **print to the R console**, though they are generated by an underlying mechanism that is similar to how errors and warnings are generated.\n\n::: callout-tip\n### Example\n\nHere's a small function that will generate a message:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n message(\"This is a message.\")\n}\n\nf()\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nThis is a message.\n```\n:::\n:::\n\n:::\n\n## Generating Errors\n\nThere are a few essential functions for **generating** errors, warnings, and messages in R.\n\nThe `stop()` function will generate an error.\n\n::: callout-tip\n### Example\n\nLet's generate an error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstop(\"Something erroneous has occurred!\")\n```\n:::\n\n\n``` r\nError: Something erroneous has occurred!\n```\n:::\n\nIf an error occurs inside of a function, then the **name of that function will appear in the error message**:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname_of_function <- function() {\n stop(\"Something bad happened.\")\n}\n\nname_of_function()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in name_of_function(): Something bad happened.\n```\n:::\n:::\n\n\nThe `stopifnot()` function takes a series of logical expressions as arguments, and if any of them are false, an error is generated specifying which expression is false.\n\n::: callout-tip\n### Example\n\nLet's take a look at an example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_if_n_is_greater_than_zero <- function(n) {\n stopifnot(n <= 0)\n n\n}\n\nerror_if_n_is_greater_than_zero(5)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in error_if_n_is_greater_than_zero(5): n <= 0 is not TRUE\n```\n:::\n:::\n\n:::\n\nThe `warning()` function creates a warning, and the function itself is very similar to the `stop()` function. 
Remember that a warning does not stop the execution of a program (unlike an error).\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwarning(\"Consider yourself warned!\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Consider yourself warned!\n```\n:::\n:::\n\n:::\n\nJust like errors, a warning generated inside of a function will include the name of the function in which it was generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_NA <- function(x) {\n warning(\"Generating an NA.\")\n NA\n}\n\nmake_NA(\"Sodium\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in make_NA(\"Sodium\"): Generating an NA.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\nMessages are simpler than errors or warnings; they just print strings to the R console.\n\nYou can issue a message with the `message()` function:\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmessage(\"In a bottle.\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nIn a bottle.\n```\n:::\n:::\n\n:::\n\n## When to generate errors or warnings\n\nStopping the execution of your program with `stop()` should only happen in the event of a catastrophe, meaning only if it is impossible for your program to continue.\n\n- If there are **conditions that you can anticipate** that would cause your program to create an error, then you **should document those conditions** so whoever uses your software is aware.\n\nAn example includes:\n\n- Providing invalid arguments to a function. 
You could check this at the beginning of your program using `stopifnot()` so that the user can quickly realize something has gone wrong.\n\nYou can think of a function as a kind of contract between you and the user:\n\n- if the user provides the specified arguments, your program will provide predictable results.\n\nOf course, it's **impossible for you to anticipate** all of the potential uses of your program.\n\nIt's **appropriate to create a warning** when this contract between you and the user is violated.\n\nA perfect example of this situation is the result of\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe user expects a vector of numbers to be returned as the result of `as.numeric()`, but `\"seven\"` is coerced to `NA`, which is not completely intuitive.\n\nR has largely been developed according to the [Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which generally **discourages printing text to the console unless something unexpected has occurred**.\n\nLanguages that commonly run on Unix systems like C and C++ are rarely used interactively, meaning that they usually underpin computer infrastructure (computers \"talking\" to other computers).\n\n**Messages printed to the console** are therefore not very useful since nobody will ever read them and it's not straightforward for other programs to capture and interpret them.\n\nIn contrast, R code is frequently executed by human beings in the R console, which serves as an interactive environment between the computer and the person at the keyboard.\n\nIf you **think your program should produce a message**, make sure that the **output of the message is primarily meant for a human to read**.\n\nYou should avoid signaling a condition or the result of your program to another program by creating a 
message.\n\n## How should errors be handled?\n\nImagine writing a program that will take a long time to complete because of a complex calculation or because you're handling a large amount of data. If an error occurs during this computation, then you're liable to lose all of the results that were calculated before the error, or your program may not finish a critical task that a program further down your pipeline is depending on. If you anticipate the possibility of errors occurring during the execution of your program, then you can design your program to handle them appropriately.\n\nThe `tryCatch()` function is the workhorse of handling errors and warnings in R. The first argument of this function is any R expression, followed by conditions which specify how to handle an error or a warning. The last argument, `finally`, specifies a function or expression that will be executed after the expression no matter what, even in the event of an error or a warning.\n\nLet's construct a simple function I'm going to call [`beera`](https://en.wikipedia.org/wiki/Yogi_Berra) that catches errors and warnings gracefully.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera <- function(expr) {\n tryCatch(expr,\n error = function(e) {\n message(\"An error occurred:\\n\", e)\n },\n warning = function(w) {\n message(\"A warning occurred:\\n\", w)\n },\n finally = {\n message(\"Finally done!\")\n }\n )\n}\n```\n:::\n\n\nThis function takes an expression as an argument and tries to evaluate it. If the expression can be evaluated without any errors or warnings, then the result of the expression is returned and the message `Finally done!` is printed to the R console. If an error or warning is generated, then the function provided to the `error` or `warning` argument is executed instead. Let's try this function out with a few examples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera({\n 2 + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFinally done!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\nbeera({\n \"two\" + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nAn error occurred:\nError in \"two\" + 2: non-numeric argument to binary operator\n\nFinally done!\n```\n:::\n\n```{.r .cell-code}\nbeera({\n as.numeric(c(1, \"two\", 3))\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nA warning occurred:\nsimpleWarning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced by coercion\n\nFinally done!\n```\n:::\n:::\n\n\nNotice that we've effectively transformed errors and warnings into messages.\n\nNow that you know the basics of generating and catching errors, you'll need to decide when your program should generate an error. My advice to you is to limit the number of errors your program generates as much as possible. Even if you design your program so that it's able to catch and handle errors, the error handling process can slow down your program substantially. Imagine you wanted to write a simple function that checks if an argument is an even number. You might write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even <- function(n) {\n n %% 2 == 0\n}\n\nis_even(768)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even(\"two\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in n%%2: non-numeric argument to binary operator\n```\n:::\n:::\n\n\nYou can see that providing a string causes this function to raise an error. You could imagine, though, that you want to use this function across a list of different data types, and you only want to know which elements of that list are even numbers. 
You might think to write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_error <- function(n) {\n tryCatch(n %% 2 == 0,\n error = function(e) {\n FALSE\n }\n )\n}\n\nis_even_error(714)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_error(\"eight\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\nThis appears to work the way you intended; however, when applied to more data, this function will be seriously slow compared to alternatives. For example, I could check that `n` is numeric before treating `n` like a number:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_check <- function(n) {\n is.numeric(n) && n %% 2 == 0\n}\n\nis_even_check(1876)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_check(\"twelve\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n::: keyideas\nNotice that by using `is.numeric()` before the \"AND\" operator (`&&`), the expression `n %% 2 == 0` is never evaluated. 
This is a programming language design feature called \"short circuiting.\" The expression can never evaluate to `TRUE` if the left hand side of `&&` evaluates to `FALSE`, so the right hand side is ignored.\n:::\n\nTo demonstrate the difference in the speed of the code, we will use the `microbenchmark` package to measure how long it takes for each function to be applied to the same data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(microbenchmark)\nmicrobenchmark(sapply(letters, is_even_check))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_check) 46.224 47.7975 61.43616 48.6445 58.4755 167.091 100\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmicrobenchmark(sapply(letters, is_even_error))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_error) 640.067 678.0285 906.3037 784.4315 1044.501 2308.931 100\n```\n\nThe error catching approach is nearly 15 times slower!\n\nProper error handling is an essential tool for any software developer so that you can design programs that are error tolerant. Creating clear and informative error messages is essential for building quality software.\n\n::: callout-tip\n### Pro-tip\n\nOne closing tip I recommend is to put documentation for your software online, including the meaning of the errors that your software can potentially throw. 
Often a user's first instinct when encountering an error is to search online for that error message, which should lead them to your documentation!\n:::\n\n# Summary\n\n- Errors, warnings, and messages can be generated within R code using the functions `stop`, `stopifnot`, `warning`, and `message`.\n\n- Catching errors, and providing useful error messaging, can improve user experience with functions but can also slow down code substantially.\n\n# Post-lecture materials\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json b/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json index 95b68c9..b82baf4 100644 --- a/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json +++ b/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "f3b1703648e41b35e71a956912954ed5", + "hash": "9fcf3d6b99d63f4cc32e29c25e524a43", "result": { - "markdown": "---\ntitle: \"20 - Working with dates and times\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to lubridate for dates and times in R\"\ncategories: [module 5, week 6, tidyverse, R, programming, dates and times]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/20-working-with-dates-and-times/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the `Date`, `POSIXct` and `POSIXlt` class types in R to represent dates and times\n- Learn how to create date and time objects in R using functions from the `lubridate` package\n- Learn how dealing with time zones can be frustrating 🙀 but hopefully less so after today's lecture 😺\n- Learn how to perform arithmetic operations on dates and times\n- Learn how plotting systems in R \"know\" about dates and times to appropriately handle axis labels\n:::\n\n# Introduction\n\nIn this lesson, we will **learn how to work with dates and times** in R. These may seem simple as you use them all of the time in your day-to-day life, but the more you work with them, the more complicated they seem to get.\n\n**Dates and times are hard to work with** because they have to reconcile **two physical phenomena**\n\n1. The rotation of the Earth and its orbit around the sun AND\n2. 
A whole raft of geopolitical phenomena including months, time zones, and daylight savings time (DST)\n\nThis lesson will not teach you every last detail about dates and times, but it will give you a solid grounding of **practical skills** that will help you with common data analysis challenges.\n\n::: callout-tip\n### Classes for dates and times in R\n\nR has developed a special representation of dates and times\n\n- Dates are represented by the `Date` class\n- Times are represented by the `POSIXct` or the `POSIXlt` class\n:::\n\n::: callout-tip\n### Important point in time\n\n- Dates are stored internally as the number of days since 1970-01-01\n- Times are stored internally as the number of seconds since 1970-01-01\n\nIn computing, **Unix time** (also known as Epoch time, Posix time, seconds since the Epoch, Unix timestamp, or UNIX Epoch time) is a system for **describing a point in time**.\n\nIt is the number of seconds that have elapsed since the Unix epoch, excluding leap seconds. The Unix epoch is 00:00:00 UTC on 1 January 1970.\n\nUnix time originally appeared as the system time of Unix, but is now used widely in computing, for example by filesystems; some Python language library functions handle Unix time.\\[4\\]\n\n\n:::\n\n## The `lubridate` package\n\nHere, we will focus on the `lubridate` R package, which makes it easier to work with dates and times in R.\n\n::: callout-tip\n### Pro-tip\n\n**Check out the `lubridate` cheat sheet** at \n:::\n\nA few things to note about it:\n\n- It largely **replaces the default date/time functions in base R**\n- It contains **methods for date/time arithmetic**\n- It **handles time zones**, leap year, leap seconds, etc.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/lubridate_ymd.png){preview=\"TRUE\"} \\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n`lubridate` is installed when you install 
`tidyverse`, but it is not loaded when you load `tidyverse`. Alternatively, you can install it separately.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"lubridate\") \n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(lubridate) \n```\n:::\n\n\n# Creating date/times\n\nThere are three types of date/time data that refer to an instant in time:\n\n- A **date**. Tibbles print this as ``.\n- A **time** within a day. Tibbles print this as `