diff --git a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json index 083d7be..ce18255 100644 --- a/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json +++ b/_freeze/posts/02-introduction-to-r-and-rstudio/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "f1d609aa26ae665593d37bd091989900", + "hash": "27583eecf8d325ba0fbc85dc5d42cee3", "result": { - "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. ---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. 
[Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. [Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and Rstudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"Low-level\" programming languages (e.g. 
Fortran, C, etc) often have more efficient code, they can also be harder to learn because it is designed to be close to a machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R Community is very dynamic, helpful and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) on Twitter, [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicates active development of the software and ensures that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new feature, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like lattice and ggplot2 allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages and is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also many packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50 year old technology**, going back to the original S system developed at Bell Labs. There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). 
This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called a Rmarkdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following setting\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **workspace**\n - Uncheck restore .RData into workspace at startup\n - Save workspace to .RData on exit : \"Never\"\n- For **History**\n - Uncheck \"Always save history (even when not saving .RData)\n - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio,go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you tried to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which of the packages above `ggplot2` needs (and thus installs if it is not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. 
To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github()\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary('ggplot2')\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio are the [R Studio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. 
If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n::: {.cell-output-display}\n![](https://github.com/djnavarro/art/raw/master/static/gallery/water-colours/watercolour_splash.jpg)\n:::\n:::\n\n\n\\['Water Colours' from Danielle Navarro \\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n 
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"02 - Introduction to R and RStudio!\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Let's dig into the R programming language and the RStudio integrated developer environment\"\nimage: https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png\ncategories: [module 1, week 1, R, programming, RStudio]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/02-introduction-to-r-and-rstudio/index.qmd).*\n\n> There are only two kinds of languages: the ones people complain about and the ones nobody uses. 
---*Bjarne Stroustrup*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [An overview and history of R](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger Peng\n2. [Installing R and RStudio](https://rafalab.github.io/dsbook/installing-r-rstudio.html) from Rafael Irizarry\n3. [Getting Started in R and RStudio](https://rafalab.github.io/dsbook/getting-started.html) from Rafael Irizarry\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Learn about (some of) the history of R.\n- Identify some of the strengths and weaknesses of R.\n- Install R and RStudio on your computer.\n- Know how to install and load R packages.\n:::\n\n# Overview and history of R\n\nBelow is a very quick introduction to R, to get you set up and running. We'll go deeper into R and coding later.\n\n### tl;dr (R in a nutshell)\n\nLike every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:\n\n- R is open-source, freely accessible, and cross-platform (multiple OS).\n- R is a [\"high-level\" programming language](https://en.wikipedia.org/wiki/High-level_programming_language), relatively easy to learn.\n - While \"low-level\" programming languages (e.g. 
Fortran, C, etc.) often have more efficient code, they can also be harder to learn because they are designed to be close to machine language.\n - In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts, with a focus on usability over optimal program efficiency.\n- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!\n- R integrates easily with document preparation systems like $\\LaTeX$, but R files can also be used to create `.docx`, `.pdf`, `.html`, `.ppt` files with integrated R code output and graphics.\n- The R community is very dynamic, helpful, and welcoming.\n - Check out the [#rstats](https://twitter.com/search?q=%23rstats) or [#rtistry](https://twitter.com/search?q=%23rtistry) hashtags on Twitter, the [TidyTuesday](https://www.tidytuesday.com) podcast and community activity in the [R4DS Online Learning Community](https://www.rfordatasci.com), and the [r/rstats](https://www.reddit.com/r/rstats/) subreddit.\n - If you are looking for more local resources, check out [R-Ladies Baltimore](https://www.meetup.com/rladies-baltimore/).\n- Through R packages, it is easy to get lots of state-of-the-art algorithms.\n- Documentation and help files for R are generally good.\n\nWhile we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).\n\nDepending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. 
So once you understand a specific concept (e.g., variables, loops, branching statements, or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.\n\nWith the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.\n\n![Artwork by Allison Horst on learning R](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/r_first_then.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Basic Features of R\n\nToday R runs on almost any standard computing platform and operating system. Its open-source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.\n\nOne nice feature that R shares with many popular open-source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases are made as needed. The frequent releases and regular release cycle indicate active development of the software and ensure that bugs are addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new features, bug fixes, or both.\n\nAnother key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R's ability to create \"publication quality\" graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. 
R's base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other, newer graphics systems, like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data.\n\nR has maintained the original S philosophy (see box below), which is that **it provides a language that is useful for interactive work but also contains a powerful programming language for developing new tools**. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who creates new tools.\n\n::: callout-tip\nFor a great discussion on an overview and history of R and the S programming language, read through [this chapter](https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html) from Roger D. Peng.\n:::\n\nFinally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform, and thousands of people around the world have come together to make contributions to R, to develop packages, and to help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now, and there is considerable activity on websites like Stack Overflow, Twitter [#rstats](https://twitter.com/search?q=%23rstats), [#rtistry](https://twitter.com/search?q=%23rtistry), and [Reddit](https://www.reddit.com/r/rstats/).\n\n### Free Software\n\nA major advantage that R has over many other statistical packages is that it's free in the sense of free software (it's also free in the sense of free beer). 
The copyright for the primary source code for R is held by the [R Foundation](http://www.r-project.org/foundation/) and is published under the [GNU General Public License version 2.0](http://www.gnu.org/licenses/gpl-2.0.html).\n\nAccording to the Free Software Foundation, with *free software*, you are granted the following [four freedoms](http://www.gnu.org/philosophy/free-sw.html)\n\n- The freedom to run the program, for any purpose (freedom 0).\n\n- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.\n\n- The freedom to redistribute copies so you can help your neighbor (freedom 2).\n\n- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.\n\n::: callout-tip\nYou can visit the [Free Software Foundation's web site](http://www.fsf.org) to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and [Stallman's personal web site](https://stallman.org) is an interesting read if you happen to have some spare time.\n:::\n\n### Design of the R System\n\nThe primary R system is available from the [Comprehensive R Archive Network](http://cran.r-project.org), also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.\n\nThe R system is divided into 2 conceptual parts:\n\n1. The \"base\" R system that you download from CRAN:\n\n- [Linux](http://cran.r-project.org/bin/linux/)\n- [Windows](http://cran.r-project.org/bin/windows/)\n- [Mac](http://cran.r-project.org/bin/macosx/)\n\n2. 
Everything else.\n\nR functionality is divided into a number of *packages*.\n\n- The \"base\" R system contains, among other things, the `base` package, which is required to run R and contains the most fundamental functions.\n\n- The other packages contained in the \"base\" system include `utils`, `stats`, `datasets`, `graphics`, `grDevices`, `grid`, `methods`, `tools`, `parallel`, `compiler`, `splines`, `tcltk`, `stats4`.\n\n- There are also \"Recommended\" packages: `boot`, `class`, `cluster`, `codetools`, `foreign`, `KernSmooth`, `lattice`, `mgcv`, `nlme`, `rpart`, `survival`, `MASS`, `spatial`, `nnet`, `Matrix`.\n\nWhen you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:\n\n- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.\n\n- There are also many packages associated with the [Bioconductor project](http://bioconductor.org).\n\n- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.\n\n::: callout-note\n## Questions\n\n1. How many R packages are on CRAN today?\n2. How many R packages are on Bioconductor today?\n3. How many R packages are on GitHub today?\n:::\n\n### Limitations of R\n\nNo programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on **almost 50-year-old technology**, going back to the original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D graphics (but things have improved greatly since the \"old days\").\n\nAnother commonly cited limitation of R is that **objects must generally be stored in physical memory** (though this is increasingly not true anymore). 
This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity have continued to grow over time, and the amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.\n\nAt a higher level, one \"limitation\" of R is that **its functionality is based on consumer demand and (voluntary) user contributions**. If no one feels like implementing your favorite method, then it's *your* job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.\n\n# Using R and RStudio\n\n> If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is a super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. 
--- *Nicholas Tierney*\n\n\\[[Source](https://rmd4sci.njtierney.com)\\]\n\nThe RStudio layout has the following features:\n\n- On the upper left, something called an R Markdown script\n- On the lower left, the R console\n- On the lower right, the view for files, plots, packages, help, and viewer.\n- On the upper right, the environment / history pane\n\n![A screenshot of the RStudio integrated developer environment (IDE) -- aka the working environment](https://github.com/njtierney/rmd4sci/raw/master/figs/rstudio-screenshot.png)\n\nThe R console is the bit where you can run your code. This is where the R code in your R Markdown document gets sent to run (we'll learn about these files later).\n\nThe file/plot/package viewer is a handy browser: the Files tab works like Finder or File Explorer for your current files, the Plots tab is where your plots appear, and you can also browse installed packages and help files there. The environment / history pane contains the list of objects you have created and the past commands that you have run.\n\n### Installing R and RStudio\n\n- If you have not already, [install R first](http://cran.r-project.org). If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).\n- Once you have R installed, install the free version of [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/). Again, make sure it's a recent version.\n\n::: callout-tip\nInstalling R and RStudio should be fairly straightforward. 
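If you are unsure which R version you currently have, you can check from within R itself. A minimal sketch (assuming only base R, nothing extra needs to be installed):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Print the version of R you are running\nR.version.string\n\n## Compare against the recommended minimum; TRUE means you are recent enough\ngetRversion() >= \"4.0.0\"\n```\n:::\n\n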
However, a great set of detailed instructions is in Rafael Irizarry's `dsbook`\n\n- \n:::\n\nIf things don't work, ask for help in the Courseplus discussion board.\n\nI personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).\n\n### RStudio default options\n\nTo first get set up, I highly recommend changing the following settings\n\nTools \\> Global Options (or `Cmd + ,` on macOS)\n\nUnder the **General** tab:\n\n- For **Workspace**\n    - Uncheck \"Restore .RData into workspace at startup\"\n    - Set \"Save workspace to .RData on exit\" to \"Never\"\n- For **History**\n    - Uncheck \"Always save history (even when not saving .RData)\"\n    - Uncheck \"Remove duplicate entries in history\"\n\nThis means that you won't save the objects and other things that you create in your R session and reload them. This is important for two reasons:\n\n1. **Reproducibility**: you don't want to have objects from last week cluttering your session\n2. **Privacy**: you don't want to save private data or other things to your session. You only want to read these in.\n\nYour \"history\" is the commands that you have entered into R.\n\nAdditionally, not saving your history means that you won't be relying on things that you typed in the last session, which is a good habit to get into!\n\n### Installing and loading R packages\n\nAs we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.\n\nThe \"official\" place for packages is the [CRAN website](https://cran.r-project.org/web/packages/available_packages_by_name.html). 
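One of the questions above asked how many R packages are on CRAN today; you can get the current count from R itself. A small sketch (it requires internet access, and the number changes daily, so no fixed answer is shown):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## List every package currently available on CRAN and count the rows\ncran_pkgs <- available.packages(repos = \"https://cran.r-project.org\")\nnrow(cran_pkgs)\n```\n:::\n\n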
If you are interested in packages on a specific topic, the [CRAN task views](http://cran.r-project.org/web/views/) provide curated descriptions of packages sorted by topic.\n\nTo install an R package from CRAN, one can simply call the `install.packages()` function and pass the name of the package as an argument. For example, to install the `ggplot2` package from CRAN: open RStudio, go to the R prompt (the `>` symbol) in the lower-left corner and type\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand the appropriate version of the package will be installed.\n\nOften, a package needs other packages to work (these are called dependencies), and they are installed automatically. It usually does not matter whether you use single or double quotation marks around the name of the package.\n\n::: callout-note\n## Questions\n\n1. As you installed the `ggplot2` package, what other packages were installed?\n2. What happens if you try to install `GGplot2`?\n:::\n\nIt could be that you already have all packages required by `ggplot2` installed. In that case, you will not see any other packages installed. To see which packages `ggplot2` needs (and thus installs if they are not present), type into the R console:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntools::package_dependencies(\"ggplot2\")\n```\n:::\n\n\nIn RStudio, you can also install (and update/remove) packages by clicking on the 'Packages' tab in the bottom right window.\n\nIt is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest version. The downside is that packages under development can often be buggy and may not work right. 
To install packages from GitHub, you need to install the `remotes` package and then use the following function\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Hypothetical placeholder: replace \"user/repo\" with the GitHub\n## \"owner/repository\" of the package you want to install\nremotes::install_github(\"user/repo\")\n```\n:::\n\n\nWe will not do that now, but it is quite likely that at one point later in this course we will.\n\nYou only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the `library()` command. For instance, to load the `ggplot2` package, type\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"ggplot2\")\n```\n:::\n\n\nYou may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.\n\nThis was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.\n\n### Getting started in RStudio\n\nWhile one can use R without RStudio to do pretty much every task, including all the ones we cover in this class, RStudio is very useful: it has lots of features that make your R coding life easier and it has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio is the [RStudio Essentials](https://resources.rstudio.com/) collection of videos.\n\n::: callout-tip\nFor more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the `dsbook`:\n\n- \n\nThis chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n## Questions\n\n1. 
If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the \"Four Freedoms\" of Free Software?\n\n2. What is an R package and what is it used for?\n\n3. What function in R can be used to install packages from CRAN?\n\n4. What is a limitation of the current R system?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [R for Data Science](https://r4ds.had.co.nz) by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.\n\n- [Advanced R](https://adv-r.hadley.nz) by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.\n\n- [RStudio IDE cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n::: {.cell-output-display}\n![](https://github.com/djnavarro/art/raw/master/static/gallery/water-colours/watercolour_splash.jpg)\n:::\n:::\n\n\n\\['Water Colours' from Danielle Navarro \\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n 
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/05-literate-programming/index/execute-results/html.json b/_freeze/posts/05-literate-programming/index/execute-results/html.json index a5f3c3b..658affc 100644 --- a/_freeze/posts/05-literate-programming/index/execute-results/html.json +++ b/_freeze/posts/05-literate-programming/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4e89eeba2590cab5734e048c12e3697e", + "hash": "781f56af33f440fc1c8df4db96a05562", "result": { - "markdown": "---\ntitle: \"05 - Literate Statistical Programming\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to literate statistical programming tools including R Markdown\"\nimage: https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This 
lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/05-literate-programming/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define literate programming\n- Recognize differences between available tools for literate programming\n- Know how to work efficiently within RStudio for literate programming\n- Create an R Markdown document\n:::\n\n# Introduction\n\nOne basic idea to make writing reproducible reports easier is what's known as *literate statistical programming* (or sometimes called [literate statistical practice](http://www.r-project.org/conferences/DSC-2001/Proceedings/Rossini.pdf)). This comes from the idea of [literate programming](https://en.wikipedia.org/wiki/Literate_programming) in the area of writing computer programs.\n\nThe idea is to **think of a report or a publication as a stream of text and code**.\n\n- The text is readable by people and the code is readable by computers.\n\n- The analysis is described in a series of text and code chunks.\n\n- Each kind of code chunk will do something like load some data or compute some results.\n\n- Each text chunk will relay something in a human readable language.\n\nThere might also be **presentation code** that formats tables and figures and there's article text that explains what's going on around all this code. 
This stream of text and code is a literate statistical program or a literate statistical analysis.\n\n### Weaving and Tangling\n\nLiterate programs by themselves are a bit difficult to work with, but they can be processed in two important ways.\n\nLiterate programs can be **weaved** to produce human readable documents like PDFs or HTML web pages, and they can be **tangled** to produce machine-readable \"documents\", or in other words, machine readable code.\n\nThe basic idea behind literate programming is that, in order to generate the different kinds of output you might need, **you only need a single source document**---you can weave and tangle to get the rest.\n\nIn order to use a system like this, you need a documentation language that's human readable, and you need a programming language that's machine readable (or can be compiled/interpreted into something that's machine readable).\n\n### Sweave\n\nOne of the original literate programming systems in R that was designed to do this was called Sweave. Sweave enables users to combine R code with a document markup language called LaTeX.\n\n**Sweave files end in `.Rnw`** and have R code weaved through the document:\n\n``` \n<<>>=\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n@\n```\n\nOnce you have created your `.Rnw` file, Sweave will process the file, executing the R chunks and replacing them with output as appropriate before creating the PDF document.\n\nIt was originally developed by Friedrich Leisch, who is a member of the R Core team, and the code base is still maintained by R Core. 
The Sweave system comes with any installation of R.\n\nThere are many limitations to the original Sweave system.\n\n- One of the limitations is that it is **focused primarily on LaTeX**, which is not a documentation language that many people are familiar with.\n- Therefore, it **can be difficult to learn this type of markup language** if you're not already in a field that uses it regularly.\n- Sweave also **lacks a lot of features that people find useful** like caching, multiple plots per page, and mixing programming languages.\n\nInstead, folks have **moved towards using something called knitr**, which offers everything Sweave does and extends it further.\n\n- With Sweave, additional tools are required for advanced operations, whereas knitr supports more internally. We'll discuss knitr below.\n\n### rmarkdown\n\nAnother choice for literate programming is to build documents based on the [Markdown](https://en.wikipedia.org/wiki/Markdown) language. A markdown file is a plain text file that is typically given the extension `.md`. The [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) R package takes an R Markdown file (`.Rmd`) and weaves together R code chunks like this:\n\n```` \n```{r plot1, height=4, width=5, eval=FALSE, echo=TRUE}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n````\n\n::: callout-tip\nThe best resource for learning about R Markdown is this book by Yihui Xie, J. J. 
Allaire, and Garrett Grolemund:\n\n- \n\nThe R Markdown Cookbook by Yihui Xie, Christophe Dervieux, and Emily Riederer is really good too:\n\n- \n\nThe authors of the 2nd book describe the motivation for the 2nd book as:\n\n> \"However, we have received comments from our readers and publisher that it would be beneficial to provide more practical and relatively short examples to show the interesting and useful usage of R Markdown, because it can be daunting to find out how to achieve a certain task from the aforementioned reference book (put another way, that book is too dry to read). As a result, this cookbook was born.\"\n:::\n\nBecause this lecture is built in a `.qmd` file (which is very similar to a `.Rmd` file), let's demonstrate how this works. I am going to change `eval=FALSE` to `eval=TRUE`.\n\n\n::: {.cell height='4' width='5'}\n\n```{.r .cell-code}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/plot2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Questions\n\n1. Why do we not see the back ticks \\`\\`\\` anymore in the code chunk above that made the plot?\n2. What do you think we should do if we want to have the code executed, but we want to hide the code that made it?\n:::\n\nBefore we leave this section, I find that there is quite a bit of terminology to understand the magic behind `rmarkdown` that can be confusing, so let's break it down:\n\n- [Pandoc](https://pandoc.org). Pandoc is a command line tool with no GUI that converts documents (e.g. from a number of different markup formats to many other formats, such as .doc, .pdf, etc.). It is completely independent from R (but does come bundled with RStudio).\n- [Markdown](https://en.wikipedia.org/wiki/Markdown) (**markup language**). Markdown is a lightweight [markup language](https://en.wikipedia.org/wiki/Markup_language) with plain text formatting syntax designed so that it can be converted to HTML and many other formats. 
A markdown file is a plain text file that is typically given the extension `.md`. It is completely independent from R.\n- [`markdown`](https://CRAN.R-project.org/package=markdown) (**R package**). `markdown` is an R package which converts `.md` files into HTML. It is no longer recommended for use, as it has been surpassed by [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (discussed below).\n- R Markdown (**markup language**). R Markdown is an extension of the markdown syntax. R Markdown files are plain text files that typically have the file extension `.Rmd`.\n- [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (**R package**). The R package `rmarkdown` is a library that uses pandoc to process and convert `.Rmd` files into a number of different formats. Its core function is `rmarkdown::render()`. **Note**: this package only deals with the markdown language. If the input file is e.g. `.Rhtml` or `.Rnw`, then you need to use `knitr` prior to calling pandoc (see below).\n\n::: callout-tip\nCheck out the R Markdown Quick Tour for more:\n\n- \n:::\n\n![Artwork by Allison Horst on RMarkdown](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png){width=\"80%\"}\n\n### knitr\n\nOne of the alternatives that has come up in recent times is something called `knitr`.\n\n- The `knitr` package for R takes a lot of these ideas of literate programming and updates and improves upon them.\n- `knitr` still uses R as its programming language, but it allows you to mix other programming languages in.\n- You can also use a variety of documentation languages now, such as LaTeX, markdown and HTML.\n- `knitr` was developed by Yihui Xie while he was a graduate student at Iowa State and it has become a very popular package for writing literate statistical programs.\n\nKnitr takes a plain text document with embedded code, executes the code and 'knits' the results back into the document.\n\nFor example, it converts\n\n- An R 
Markdown (`.Rmd`) file into a standard markdown file (`.md`)\n- An `.Rnw` (Sweave) file into `.tex` format.\n- An `.Rhtml` file into `.html`.\n\nThe core function is `knitr::knit()` and by default this will look at the input document and try to guess what type it is, e.g. `Rnw`, `Rmd`, etc.\n\nThis core function performs three roles:\n\n- A **source parser**, which looks at the input document and detects which parts are code that the user wants to be evaluated.\n- A **code evaluator**, which evaluates this code.\n- An **output renderer**, which writes the results of evaluation back to the document in a format which is interpretable by the raw output type. For instance, if the input file is an `.Rmd`, the output renderer marks up the output of code evaluation in `.md` format.\n\n\n::: {.cell layout-align=\"center\" preview='true'}\n::: {.cell-output-display}\n![Converting an Rmd file to many outputs using knitr and pandoc](https://d33wubrfki0l68.cloudfront.net/61d189fd9cdf955058415d3e1b28dd60e1bd7c9b/9791d/images/rmarkdownflow.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[[Source](https://rmarkdown.rstudio.com/authoring_quick_tour.html)\\]\n\nAs seen in the figure above, from there pandoc is used to convert e.g. a `.md` file into many other file formats, such as `.html`, etc.\n\nSo in summary:\n\n> \"R Markdown stands on the shoulders of knitr and Pandoc. The former executes the computer code embedded in Markdown, and converts R Markdown to Markdown. The latter renders Markdown to the output format you want (such as PDF, HTML, Word, and so on).\"\n\n\\[[Source](https://bookdown.org/yihui/rmarkdown/)\\]\n\n# Create and Knit Your First R Markdown Document\n\n\n\nWhen creating your first R Markdown document, in RStudio you can\n\n1. Go to File \\> New File \\> R Markdown...\n\n2. Feel free to edit the Title\n\n3. Make sure to select \"Default Output Format\" to be HTML\n\n4. Click \"OK\". 
RStudio creates the R Markdown document and places some boilerplate text in there just so you can see how things are set up.\n\n5. Click the \"Knit\" button (or go to File \\> Knit Document) to make sure you can create the HTML output\n\nIf you successfully knit your first R Markdown document, then congratulations!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Mission accomplished!](https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif){width=60%}\n:::\n:::\n\n\n# Websites and Books in R Markdown\n\nNow that you are on the road to using R Markdown documents, it is important to know about other wonderful things you can do with these documents. For example, let's say you have multiple `.Rmd` documents that you want to put together into a website, blog, book, etc.\n\nThere are primarily two ways to build multiple `.Rmd` documents together:\n\n1. [**blogdown**](https://bookdown.org/yihui/blogdown/) for building websites\n2. [**bookdown**](https://bookdown.org/yihui/bookdown/) for authoring books\n\nIn this section, we briefly introduce both packages, but it's worth mentioning that the [**rmarkdown** package also has a built-in site generator](https://bookdown.org/yihui/rmarkdown/rmarkdown-site.html) to build websites.\n\n### blogdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![blogdown logo](https://bookdown.org/yihui/blogdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/blogdown/images/logo.png)\\]\n\nThe `blogdown` R package is built on top of R Markdown and supports multi-page HTML output; you can write a blog post or a general page in an Rmd document or a plain Markdown document.\n\n- These source documents (e.g. `.Rmd` or `.md`) are built into a static website (i.e. 
a bunch of static HTML files, images and CSS files).\n- Using this folder of files, it is very easy to publish it to any web server as a website.\n- Also, it is easy to maintain because it is only a single folder.\n\n::: callout-tip\nFor example, my personal website was built in blogdown:\n\n- \n\nOther really great examples can be found here:\n\n- \n:::\n\nOther advantages include the content likely being reproducible, easier to maintain, and easy to convert pages to e.g. PDF or other formats in the future if you do not want to convert to HTML files.\n\nBecause it is based on the Markdown syntax, it is easy to write technical documents, including math equations, insert figures or tables with captions, cross-reference with figure or table numbers, add citations, and present theorems or proofs.\n\nHere's a video you can watch of someone making a blogdown website.\n\n

\n\n\n\n

\n\n\\[[Source](https://www.youtube.com/watch?v=AADnslLpzJ4) on YouTube\\]\n\n### bookdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![book logo](https://bookdown.org/yihui/bookdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nSimilar to `blogdown`, the `bookdown` R package is built on top of R Markdown, but also offers features like multi-page HTML output, numbering and cross-referencing figures/tables/sections/equations, inserting parts/appendices, and imported the GitBook style () to create elegant and appealing HTML book pages. Share\n\n::: callout-tip\nFor example, the previous version of this course was built in bookdown:\n\n- \n\nAnother example is the [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) book that the JHU Data Science Lab wrote. The github repo that contains all the `.Rmd` files can be found [here](https://github.com/jhudsl/tidyversecourse).\n\n- \n- \n:::\n\n**Note**: Even though the word \"book\" is in \"bookdown\", this package is not only for books. 
It really can be anything that consists of multiple `.Rmd` documents meant to be read in a linear sequence, such as course handouts, study notes, a software manual, a dissertation/thesis, or even a diary.\n\n- https://bookdown.org/yihui/rmarkdown/basics-examples.html#examples-books\n\n### distill\n\nThere is another great way to build blogs or websites using the [distill for R Markdown](https://rstudio.github.io/distill/).\n\n- \n\nDistill for R Markdown combines the technical authoring features of the [Distill web framework](https://github.com/distillpub/template) (optimized for scientific and technical communication) with [R Markdown](https://rmarkdown.rstudio.com), enabling a fully reproducible workflow based on literate programming [@knuth1984].\n\nDistill articles include:\n\n- Reader-friendly typography that adapts well to mobile devices.\n- Features essential to technical writing like LaTeX math, citations, and footnotes.\n- Flexible figure layout options (e.g. displaying figures at a larger width than the article text).\n- Attractively rendered tables with optional support for pagination.\n- Support for a wide variety of diagramming tools for illustrating concepts.\n- The ability to incorporate JavaScript and D3-based interactive visualizations.\n- A variety of ways to publish articles, including support for publishing sets of articles as a Distill website or as a Distill blog.\n\nThe course website from last year was built in Distill for R Markdown:\n\n- Website: \n- Github: \n\nSome other cool things about distill are the use of footnotes and asides.\n\nFor example [^1]. The number of the footnote will be automatically generated.\n\n[^1]: This will become a hover-able footnote\n\nYou can also optionally include notes in the gutter of the article (immediately to the right of the article text). To do this, use the `<aside>` tag.\n\n\n\nYou can also include figures in the gutter. 
Just enclose the code chunk which generates the figure in an aside tag.\n\n# Tips and tricks in R Markdown in RStudio\n\nHere are shortcuts and tips on efficiently using RStudio to improve how you write code.\n\n### Run code\n\nIf you want to run a code chunk:\n\n``` \ncommand + Enter on Mac\nCtrl + Enter on Windows\n```\n\n### Insert a comment in R and R Markdown\n\nTo insert a comment:\n\n``` \ncommand + Shift + C on Mac\nCtrl + Shift + C on Windows\n```\n\nThis shortcut can be used both for:\n\n- R code when you want to comment your code. It will add a `#` at the beginning of the line\n- for text in R Markdown. It will add `<!-- -->` around the text\n\nNote that if you want to comment more than one line, select all the lines you want to comment, then use the shortcut. If you want to uncomment a comment, apply the same shortcut.\n\n### Knit an R Markdown document\n\nYou can knit R Markdown documents by using this shortcut:\n\n``` \ncommand + Shift + K on Mac\nCtrl + Shift + K on Windows\n```\n\n### Code snippets\n\nA code snippet is usually a few characters long and is used as a shortcut to insert a common piece of code. You simply type a few characters, then press `Tab`, and it will expand them into a larger piece of code. `Tab` is then used again to navigate through the code where customization is required. For instance, if you type `fun` then press `Tab`, it will insert the skeleton code required to create a function:\n\n``` \nname <- function(variables) {\n \n}\n```\n\nPressing `Tab` again will jump through the placeholders for you to edit them. So you can first edit the name of the function, then the variables, and finally the code inside the function (try it yourself!).\n\nThere are many code snippets by default in RStudio. 
Here are the code snippets I use most often:\n\n- `lib` to call `library()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package)\n```\n:::\n\n\n- `mat` to create a matrix\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data, nrow = rows, ncol = cols)\n```\n:::\n\n\n- `if`, `el`, and `ei` to create conditional expressions such as `if() {}`, `else {}` and `else if () {}`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (condition) {\n \n}\n\nelse {\n \n}\n\nelse if (condition) {\n \n}\n```\n:::\n\n\n- `fun` to create a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname <- function(variables) {\n \n}\n```\n:::\n\n\n- `for` to create for loops\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (variable in vector) {\n \n}\n```\n:::\n\n\n- `ts` to insert a comment with the current date and time (useful if you have very long code and share it with others so they see when it has been edited)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Tue Jan 21 20:20:14 2020 ------------------------------\n```\n:::\n\n\nYou can see all default code snippets and add yours by clicking on Tools \\> Global Options... \\> Code (left sidebar) \\> Edit Snippets...\n\n### Ordered list in R Markdown\n\nIn R Markdown, when creating an ordered list such as this one:\n\n1. Item 1\n2. Item 2\n3. Item 3\n\nInstead of bothering with the numbers and typing\n\n``` \n1. Item 1\n2. Item 2\n3. Item 3\n```\n\nyou can simply type\n\n``` \n1. Item 1\n1. Item 2\n1. Item 3\n```\n\nfor the exact same result (try it yourself or check the code of this article!). This way you do not need to bother which number is next when creating a new item.\n\nTo go even further, any numeric will actually render the same result as long as the first item is the number you want to start from. For example, you could type:\n\n``` \n1. Item 1\n7. Item 2\n3. Item 3\n```\n\nwhich renders\n\n1. Item 1\n2. Item 2\n3. 
Item 3\n\nHowever, I suggest using the number you want to start from (e.g. `1.`) for all items, because if you move an item with a different number to the top, the list will start from that new number. For instance, if we move `7. Item 2` from the previous list to the top, the list becomes:\n\n``` \n7. Item 2\n1. Item 1\n3. Item 3\n```\n\nwhich incorrectly renders\n\n7. Item 2\n8. Item 1\n9. Item 3\n\n### New code chunk in R Markdown\n\nWhen editing R Markdown documents, you will need to insert a new R code chunk many times. The following shortcuts will make your life easier:\n\n``` \ncommand + option + I on Mac (or command + alt + I depending on your keyboard)\nCtrl + ALT + I on Windows\n```\n\n### Reformat code\n\nClear and readable code is always easier and faster to read (and looks more professional when you share it with collaborators). To automatically apply the most common coding guidelines, such as white spaces, indents, etc., use:\n\n``` \ncmd + Shift + A on Mac\nCtrl + Shift + A on Windows\n```\n\nSo, for example, the following code, which does not respect the guidelines (and which is not easy to read):\n\n``` \n1+1\n for(i in 1:10){if(!i%%2){next}\nprint(i)\n }\n```\n\nbecomes much more neat and readable:\n\n``` \n1 + 1\nfor (i in 1:10) {\n if (!i %% 2) {\n next\n }\n print(i)\n}\n```\n\n### RStudio addins\n\nRStudio addins are extensions which provide a simple mechanism for executing advanced R functions from within RStudio. In simpler words, when executing an addin (by clicking a button in the Addins menu), the corresponding code is executed without you having to write the code. 
RStudio addins have the advantage that they allow you to execute complex and advanced code much more easily than if you had to write it yourself.\n\n::: callout-tip\n**For more information about RStudio addins, check out**:\n\n- \n- \n:::\n\n### Others\n\nSimilar to many other programs, you can also use:\n\n- `command + Shift + N` on Mac and `Ctrl + Shift + N` on Windows to open a new R Script\n- `command + S` on Mac and `Ctrl + S` on Windows to save your current script or R Markdown document\n\nCheck out Tools --\\> Keyboard Shortcuts Help to see a long list of these shortcuts.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: questions\n### Questions\n\n1. What is literate programming?\n\n2. What was the first literate statistical programming tool to weave together a statistical language (R) with a markup language (LaTeX)?\n\n3. What is `knitr` and how is it different from other literate statistical programming tools?\n\n4. 
Where can you find a list of other commands that help make your code writing more efficient in RStudio?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [RMarkdown Tips and Tricks](https://indrajeetpatil.github.io/RmarkdownTips/) by Indrajeet Patil\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"05 - Literate Statistical Programming\"\nauthor:\n - 
name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to literate statistical programming tools including R Markdown\"\nimage: https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/05-literate-programming/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define literate programming\n- Recognize differences between available tools for literate programming\n- Know how to work efficiently within RStudio for literate programming\n- Create an R Markdown document\n:::\n\n# Introduction\n\nOne basic idea to make writing reproducible reports easier is what's known as *literate statistical programming* (or sometimes called [literate statistical practice](http://www.r-project.org/conferences/DSC-2001/Proceedings/Rossini.pdf)). 
This comes from the idea of [literate programming](https://en.wikipedia.org/wiki/Literate_programming) in the area of writing computer programs.\n\nThe idea is to **think of a report or a publication as a stream of text and code**.\n\n- The text is readable by people and the code is readable by computers.\n\n- The analysis is described in a series of text and code chunks.\n\n- Each code chunk will do something like load some data or compute some results.\n\n- Each text chunk will relay something in a human readable language.\n\nThere might also be **presentation code** that formats tables and figures, and there's article text that explains what's going on around all this code. This stream of text and code is a literate statistical program or a literate statistical analysis.\n\n### Weaving and Tangling\n\nLiterate programs by themselves are a bit difficult to work with, but they can be processed in two important ways.\n\nLiterate programs can be **weaved** to produce human readable documents like PDFs or HTML web pages, and they can be **tangled** to produce machine-readable \"documents\", or in other words, machine readable code.\n\nThe basic idea behind literate programming is that, in order to generate the different kinds of output you might need, **you only need a single source document**---you can weave and tangle to get the rest.\n\nIn order to use a system like this, you need a documentation language that's human readable, and you need a programming language that's machine readable (or can be compiled/interpreted into something that's machine readable).\n\n### Sweave\n\nOne of the original literate programming systems in R that was designed to do this was called Sweave. 
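\n\nSweave implements both operations described above: weaving an `.Rnw` source file into a `.tex` document, and tangling it into a plain R script. A minimal sketch using the `Sweave()` and `Stangle()` functions that ship with base R (in the `utils` package), assuming a literate source file named `analysis.Rnw` exists in your working directory (the file name here is just an example):\n\n``` \n## Weave: run the R chunks and write analysis.tex\nutils::Sweave(\"analysis.Rnw\")\n\n## Tangle: extract only the R code and write analysis.R\nutils::Stangle(\"analysis.Rnw\")\n```\n\n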
Sweave enables users to combine R code with a documentation language called LaTeX.\n\n**Sweave files end in `.Rnw`** and have R code weaved through the document:\n\n``` \n<<>>=\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n@\n```\n\nOnce you have created your `.Rnw` file, Sweave will process the file, executing the R chunks and replacing them with output as appropriate before creating the PDF document.\n\nIt was originally developed by Friedrich Leisch, who is a member of R Core, and the code base is still maintained by R Core. The Sweave system comes with any installation of R.\n\nThere are many limitations to the original Sweave system.\n\n- One of the limitations is that it is **focused primarily on LaTeX**, which is not a documentation language that many people are familiar with.\n- Therefore, it **can be difficult to learn this type of markup language** if you're not already in a field that uses it regularly.\n- Sweave also **lacks a lot of features that people find useful** like caching, multiple plots per page, and mixing programming languages.\n\nInstead, folks have **moved towards using something called knitr**, which offers everything Sweave does and extends it further.\n\n- With Sweave, additional tools are required for advanced operations, whereas knitr supports more internally. We'll discuss knitr below.\n\n### rmarkdown\n\nAnother choice for literate programming is to build documents based on the [Markdown](https://en.wikipedia.org/wiki/Markdown) language. A markdown file is a plain text file that is typically given the extension `.md`. The [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) R package takes an R Markdown file (`.Rmd`) and weaves together R code chunks like this:\n\n```` \n```{r plot1, height=4, width=5, eval=FALSE, echo=TRUE}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n````\n\n::: callout-tip\nThe best resource for learning about R Markdown is this book by Yihui Xie, J. J. 
Allaire, and Garrett Grolemund:\n\n- \n\nThe R Markdown Cookbook by Yihui Xie, Christophe Dervieux, and Emily Riederer is really good too:\n\n- \n\nThe authors of the second book describe its motivation as:\n\n> \"However, we have received comments from our readers and publisher that it would be beneficial to provide more practical and relatively short examples to show the interesting and useful usage of R Markdown, because it can be daunting to find out how to achieve a certain task from the aforementioned reference book (put another way, that book is too dry to read). As a result, this cookbook was born.\"\n:::\n\nBecause this lecture is built in a `.qmd` file (which is very similar to a `.Rmd` file), let's demonstrate how this works. I am going to change `eval=FALSE` to `eval=TRUE`.\n\n\n::: {.cell height='4' width='5'}\n\n```{.r .cell-code}\ndata(airquality)\nplot(airquality$Ozone ~ airquality$Wind)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/plot2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Questions\n\n1. Why do we not see the back ticks \\`\\`\\` anymore in the code chunk above that made the plot?\n2. What do you think we should do if we want to have the code executed, but we want to hide the code that made it?\n:::\n\nBefore we leave this section, I find that there is quite a bit of confusing terminology behind the magic of `rmarkdown`, so let's break it down:\n\n- [Pandoc](https://pandoc.org). Pandoc is a command line tool with no GUI that converts documents (e.g. from a number of different markup formats to many other formats, such as .doc, .pdf, etc.). It is completely independent from R (but does come bundled with RStudio).\n- [Markdown](https://en.wikipedia.org/wiki/Markdown) (**markup language**). Markdown is a lightweight [markup language](https://en.wikipedia.org/wiki/Markup_language) with plain text formatting syntax designed so that it can be converted to HTML and many other formats. 
A markdown file is a plain text file that is typically given the extension `.md`. It is completely independent from R.\n- [`markdown`](https://CRAN.R-project.org/package=markdown) (**R package**). `markdown` is an R package which converts `.md` files into HTML. It is no longer recommended for use, as it has been surpassed by [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (discussed below).\n- R Markdown (**markup language**). R Markdown is an extension of the markdown syntax. R Markdown files are plain text files that typically have the file extension `.Rmd`.\n- [`rmarkdown`](https://CRAN.R-project.org/package=rmarkdown) (**R package**). The R package `rmarkdown` is a library that uses pandoc to process and convert `.Rmd` files into a number of different formats. Its core function is `rmarkdown::render()`. **Note**: this package only deals with the markdown language. If the input file is e.g. `.Rhtml` or `.Rnw`, then you need to use `knitr` prior to calling pandoc (see below).\n\n::: callout-tip\nCheck out the R Markdown Quick Tour for more:\n\n- \n:::\n\n![Artwork by Allison Horst on RMarkdown](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/rmarkdown_rockstar.png){width=\"80%\"}\n\n### knitr\n\nOne of the alternatives that has come up in recent times is `knitr`.\n\n- The `knitr` package for R takes a lot of these ideas of literate programming and updates and improves upon them.\n- `knitr` still uses R as its programming language, but it allows you to mix other programming languages in.\n- You can also use a variety of documentation languages now, such as LaTeX, markdown and HTML.\n- `knitr` was developed by Yihui Xie while he was a graduate student at Iowa State and it has become a very popular package for writing literate statistical programs.\n\nKnitr takes a plain text document with embedded code, executes the code and 'knits' the results back into the document.\n\nFor example, it converts\n\n- An R 
Markdown (`.Rmd`) file into a standard markdown file (`.md`)\n- An `.Rnw` (Sweave) file into `.tex` format.\n- An `.Rhtml` file into `.html`.\n\nThe core function is `knitr::knit()` and by default this will look at the input document and try to guess what type it is, e.g. `Rnw`, `Rmd`, etc.\n\nThis core function performs three roles:\n\n- A **source parser**, which looks at the input document and detects which parts are code that the user wants to be evaluated.\n- A **code evaluator**, which evaluates this code.\n- An **output renderer**, which writes the results of evaluation back to the document in a format which is interpretable by the raw output type. For instance, if the input file is an `.Rmd`, the output renderer marks up the output of code evaluation in `.md` format.\n\n\n::: {.cell layout-align=\"center\" preview='true'}\n::: {.cell-output-display}\n![Converting an Rmd file to many outputs using knitr and pandoc](https://d33wubrfki0l68.cloudfront.net/61d189fd9cdf955058415d3e1b28dd60e1bd7c9b/9791d/images/rmarkdownflow.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[[Source](https://rmarkdown.rstudio.com/authoring_quick_tour.html)\\]\n\nAs seen in the figure above, from there pandoc is used to convert, e.g., a `.md` file into many other file formats, such as `.html`, etc.\n\nSo in summary:\n\n> \"R Markdown stands on the shoulders of knitr and Pandoc. The former executes the computer code embedded in Markdown, and converts R Markdown to Markdown. The latter renders Markdown to the output format you want (such as PDF, HTML, Word, and so on).\"\n\n\\[[Source](https://bookdown.org/yihui/rmarkdown/)\\]\n\n# Create and Knit Your First R Markdown Document\n\n\n\nWhen creating your first R Markdown document, in RStudio you can\n\n1. Go to File \\> New File \\> R Markdown...\n\n2. Feel free to edit the Title\n\n3. Make sure to select \"Default Output Format\" to be HTML\n\n4. Click \"OK\". 
RStudio creates the R Markdown document and places some boilerplate text in there just so you can see how things are set up.\n\n5. Click the \"Knit\" button (or go to File \\> Knit Document) to make sure you can create the HTML output\n\nIf you successfully knit your first R Markdown document, then congratulations!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Mission accomplished!](https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif){width=60%}\n:::\n:::\n\n\n# Websites and Books in R Markdown\n\nNow that you are on the road to using R Markdown documents, it is important to know about other wonderful things you can do with these documents. For example, let's say you have multiple `.Rmd` documents that you want to put together into a website, blog, book, etc.\n\nThere are primarily two ways to build multiple `.Rmd` documents together:\n\n1. [**blogdown**](https://bookdown.org/yihui/blogdown/) for building websites\n2. [**bookdown**](https://bookdown.org/yihui/bookdown/) for authoring books\n\nIn this section, we briefly introduce both packages, but it's worth mentioning that the [**rmarkdown** package also has a built-in site generator](https://bookdown.org/yihui/rmarkdown/rmarkdown-site.html) to build websites.\n\n### blogdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![blogdown logo](https://bookdown.org/yihui/blogdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nThe `blogdown` R package is built on top of R Markdown and supports multi-page HTML output, letting you write a blog post or a general page as an `.Rmd` document or a plain Markdown document.\n\n- These source documents (e.g. `.Rmd` or `.md`) are built into a static website (i.e. 
a bunch of static HTML files, images and CSS files).\n- Using this folder of files, it is very easy to publish it to any web server as a website.\n- Also, it is easy to maintain because it is only a single folder.\n\n::: callout-tip\nFor example, my personal website was built in blogdown:\n\n- \n\nOther really great examples can be found here:\n\n- \n:::\n\nOther advantages include the content likely being reproducible, easier to maintain, and easy to convert pages to e.g. PDF or other formats in the future if you do not want to convert to HTML files.\n\nBecause it is based on the Markdown syntax, it is easy to write technical documents: you can include math equations, insert figures or tables with captions, cross-reference by figure or table numbers, add citations, and present theorems or proofs.\n\nHere's a video you can watch of someone making a blogdown website.\n\n

\n\n\n\n

\n\n\\[[Source](https://www.youtube.com/watch?v=AADnslLpzJ4) on YouTube\\]\n\n### bookdown\n\n\n::: {.cell}\n::: {.cell-output-display}\n![book logo](https://bookdown.org/yihui/bookdown/images/logo.png){width=30%}\n:::\n:::\n\n\n\\[[Source](https://bookdown.org/yihui/bookdown/images/logo.png)\\]\n\nSimilar to `blogdown`, the `bookdown` R package is built on top of R Markdown, but also offers features like multi-page HTML output, numbering and cross-referencing figures/tables/sections/equations, inserting parts/appendices, and importing the GitBook style () to create elegant and appealing HTML book pages.\n\n::: callout-tip\nFor example, the previous version of this course was built in bookdown:\n\n- \n\nAnother example is the [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) book that the JHU Data Science Lab wrote. The GitHub repo that contains all the `.Rmd` files can be found [here](https://github.com/jhudsl/tidyversecourse).\n\n- \n- \n:::\n\n**Note**: Even though the word \"book\" is in \"bookdown\", this package is not only for books. 
It really can be anything that consists of multiple `.Rmd` documents meant to be read in a linear sequence, such as course handouts, study notes, a software manual, a thesis or dissertation, or even a diary.\n\n- https://bookdown.org/yihui/rmarkdown/basics-examples.html#examples-books\n\n### distill\n\nThere is another great way to build blogs or websites using [distill for R Markdown](https://rstudio.github.io/distill/).\n\n- \n\nDistill for R Markdown combines the technical authoring features of the [Distill web framework](https://github.com/distillpub/template) (optimized for scientific and technical communication) with [R Markdown](https://rmarkdown.rstudio.com), enabling a fully reproducible workflow based on literate programming [@knuth1984].\n\nDistill articles include:\n\n- Reader-friendly typography that adapts well to mobile devices.\n- Features essential to technical writing like LaTeX math, citations, and footnotes.\n- Flexible figure layout options (e.g. displaying figures at a larger width than the article text).\n- Attractively rendered tables with optional support for pagination.\n- Support for a wide variety of diagramming tools for illustrating concepts.\n- The ability to incorporate JavaScript and D3-based interactive visualizations.\n- A variety of ways to publish articles, including support for publishing sets of articles as a Distill website or as a Distill blog.\n\nThe course website from last year was built in Distill for R Markdown:\n\n- Website: \n- Github: \n\nSome other cool things about distill are the use of footnotes and asides.\n\nFor example [^1]. The number of the footnote will be automatically generated.\n\n[^1]: This will become a hover-able footnote\n\nYou can also optionally include notes in the gutter of the article (immediately to the right of the article text). To do this, use the aside tag.\n\n\n\nYou can also include figures in the gutter. 
Just enclose the code chunk which generates the figure in an aside tag.\n\n# Tips and tricks in R Markdown in RStudio\n\nHere are shortcuts and tips on efficiently using RStudio to improve how you write code.\n\n### Run code\n\nIf you want to run a code chunk:\n\n``` \ncommand + Enter on Mac\nCtrl + Enter on Windows\n```\n\n### Insert a comment in R and R Markdown\n\nTo insert a comment:\n\n``` \ncommand + Shift + C on Mac\nCtrl + Shift + C on Windows\n```\n\nThis shortcut can be used both for:\n\n- R code when you want to comment your code. It will add a `#` at the beginning of the line\n- for text in R Markdown. It will add `<!-- -->` around the text\n\nNote that if you want to comment more than one line, select all the lines you want to comment, then use the shortcut. If you want to uncomment a comment, apply the same shortcut.\n\n### Knit an R Markdown document\n\nYou can knit R Markdown documents by using this shortcut:\n\n``` \ncommand + Shift + K on Mac\nCtrl + Shift + K on Windows\n```\n\n### Code snippets\n\nA code snippet is usually a few characters long and is used as a shortcut to insert a common piece of code. You simply type a few characters, then press `Tab`, and it will expand them into a larger piece of code. `Tab` is then used again to navigate through the code where customization is required. For instance, if you type `fun` then press `Tab`, it will insert the skeleton code required to create a function:\n\n``` \nname <- function(variables) {\n \n}\n```\n\nPressing `Tab` again will jump through the placeholders for you to edit them. So you can first edit the name of the function, then the variables, and finally the code inside the function (try it yourself!).\n\nThere are many code snippets by default in RStudio. 
Here are the code snippets I use most often:\n\n- `lib` to call `library()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package)\n```\n:::\n\n\n- `mat` to create a matrix\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data, nrow = rows, ncol = cols)\n```\n:::\n\n\n- `if`, `el`, and `ei` to create conditional expressions such as `if() {}`, `else {}` and `else if () {}`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (condition) {\n ## Case 1\n} else if (condition) {\n ## Case 2\n} else if (condition) {\n ## Case 3\n}\n```\n:::\n\n\n- `fun` to create a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname <- function(variables) {\n\n}\n```\n:::\n\n\n- `for` to create for loops\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (variable in vector) {\n\n}\n```\n:::\n\n\n- `ts` to insert a comment with the current date and time (useful if you have very long code and share it with others so they see when it has been edited)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Tue Jan 21 20:20:14 2020 ------------------------------\n```\n:::\n\n\nYou can see all default code snippets and add yours by clicking on Tools \\> Global Options... \\> Code (left sidebar) \\> Edit Snippets...\n\n### Ordered list in R Markdown\n\nIn R Markdown, when creating an ordered list such as this one:\n\n1. Item 1\n2. Item 2\n3. Item 3\n\nInstead of bothering with the numbers and typing\n\n``` \n1. Item 1\n2. Item 2\n3. Item 3\n```\n\nyou can simply type\n\n``` \n1. Item 1\n1. Item 2\n1. Item 3\n```\n\nfor the exact same result (try it yourself or check the code of this article!). This way you do not need to bother which number is next when creating a new item.\n\nTo go even further, any numeric will actually render the same result as long as the first item is the number you want to start from. For example, you could type:\n\n``` \n1. Item 1\n7. Item 2\n3. Item 3\n```\n\nwhich renders\n\n1. Item 1\n2. Item 2\n3. 
Item 3\n\nHowever, I suggest using the number you want to start from (e.g. `1.`) for all items, because if you move an item with a different number to the top, the list will start from that new number. For instance, if we move `7. Item 2` from the previous list to the top, the list becomes:\n\n``` \n7. Item 2\n1. Item 1\n3. Item 3\n```\n\nwhich incorrectly renders\n\n7. Item 2\n8. Item 1\n9. Item 3\n\n### New code chunk in R Markdown\n\nWhen editing R Markdown documents, you will need to insert a new R code chunk many times. The following shortcuts will make your life easier:\n\n``` \ncommand + option + I on Mac (or command + alt + I depending on your keyboard)\nCtrl + ALT + I on Windows\n```\n\n### Reformat code\n\nClear and readable code is always easier and faster to read (and looks more professional when you share it with collaborators). To automatically apply the most common coding guidelines, such as white spaces, indents, etc., use:\n\n``` \ncmd + Shift + A on Mac\nCtrl + Shift + A on Windows\n```\n\nSo, for example, the following code, which does not respect the guidelines (and which is not easy to read):\n\n``` \n1+1\n for(i in 1:10){if(!i%%2){next}\nprint(i)\n }\n```\n\nbecomes much more neat and readable:\n\n``` \n1 + 1\nfor (i in 1:10) {\n if (!i %% 2) {\n next\n }\n print(i)\n}\n```\n\n### RStudio addins\n\nRStudio addins are extensions which provide a simple mechanism for executing advanced R functions from within RStudio. In simpler words, when executing an addin (by clicking a button in the Addins menu), the corresponding code is executed without you having to write the code. 
RStudio addins have the advantage that they allow you to execute complex and advanced code much more easily than if you had to write it yourself.\n\n::: callout-tip\n**For more information about RStudio addins, check out**:\n\n- \n- \n:::\n\n### Others\n\nSimilar to many other programs, you can also use:\n\n- `command + Shift + N` on Mac and `Ctrl + Shift + N` on Windows to open a new R Script\n- `command + S` on Mac and `Ctrl + S` on Windows to save your current script or R Markdown document\n\nCheck out Tools --\\> Keyboard Shortcuts Help to see a long list of these shortcuts.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: questions\n### Questions\n\n1. What is literate programming?\n\n2. What was the first literate statistical programming tool to weave together a statistical language (R) with a markup language (LaTeX)?\n\n3. What is `knitr` and how is it different from other literate statistical programming tools?\n\n4. 
Where can you find a list of other commands that help make your code writing more efficient in RStudio?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [RMarkdown Tips and Tricks](https://indrajeetpatil.github.io/RmarkdownTips/) by Indrajeet Patil\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git 
a/_freeze/posts/06-reference-management/index/execute-results/html.json b/_freeze/posts/06-reference-management/index/execute-results/html.json index 60fbf6e..bf3f70d 100644 --- a/_freeze/posts/06-reference-management/index/execute-results/html.json +++ b/_freeze/posts/06-reference-management/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "16b0cad09428b521aecb026163f42472", + "hash": "7d57048944742a579c008b905e80a842", "result": { - "markdown": "---\ntitle: \"06 - Reference management\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to use citations and incorporate references from a bibliography in R Markdown.\"\nimage: https://www.bibtex.com/img/bibtex-format-700x402.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/06-reference-management/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. Authoring in [R Markdown from RStudio](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)\n2. Citations from [Reproducible Research in R](https://monashdatafluency.github.io/r-rep-res/citations.html) from the [Monash Data Fluency](https://monashdatafluency.github.io) initiative\n3. 
Bibliography from [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/bibliography.html)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know what types of bibliography file formats can be used in a R Markdown file\n- Learn how to add citations to a R Markdown file\n- Know how to change the citation style (e.g. APA, Chicago, etc)\n:::\n\n# Introduction\n\nFor almost any data analysis, especially if it is meant for publication in the academic literature, you will have to cite other people's work and include the references (bibliographies or citations) in your work. In this class, you are likely to need to include references and cite other people's work like in a regular research paper.\n\nR provides nice function `citation()` that helps us generating citation blob for R packages that we have used. Let's try generating citation text for rmarkdown package by using the following command\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncitation(\"rmarkdown\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTo cite package 'rmarkdown' in publications use:\n\n Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K,\n Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). _rmarkdown:\n Dynamic Documents for R_. R package version 2.24,\n .\n\n Xie Y, Allaire J, Grolemund G (2018). _R Markdown: The Definitive\n Guide_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN\n 9781138359338, .\n\n Xie Y, Dervieux C, Riederer E (2020). _R Markdown Cookbook_. Chapman\n and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837,\n .\n\nTo see these entries in BibTeX format, use 'print(,\nbibtex=TRUE)', 'toBibtex(.)', or set\n'options(citation.bibtex.max=999)'.\n```\n:::\n:::\n\n\nI assume you are familiar with how citing references works, and hopefully, you are already using a reference manager. 
If not, let me know in the discussion boards.\n\nTo have something that plays well with R Markdown, you need file format that stores all the references. Click here to learn more other possible file formats available to you to use within a R Markdown file:\n\n- \n\n### Citation management software\n\nAs you can see, there are ton of file formats including `.medline` (MEDLINE), `.bib` (BibTeX), `.ris` (RIS), `.enl` (EndNote).\n\nI will not discuss underlying citational management software itself, but I will talk briefly how you might create one of these file formats.\n\nIf you recall the output from `citation(\"rmarkdown\")` above, we might consider manually copying and pasting the output into a citation management software, but instead we can use `write_bib()` function from `knitr` package to create a bibliography file ending in `.bib`.\n\nLet's run the following code in order to generate a `my-refs.bib` file\n\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::write_bib(\"rmarkdown\", file = \"my-refs.bib\")\n```\n:::\n\n\nNow we can see we have the file saved locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\" \"index.rmarkdown\" \"my-refs.bib\" \n```\n:::\n:::\n\n\nIf you open up the `my-refs.bib` file, you will see\n\n``` \n@Manual{R-rmarkdown,\n title = {rmarkdown: Dynamic Documents for R},\n author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},\n year = {2021},\n note = {R package version 2.8},\n url = {https://CRAN.R-project.org/package=rmarkdown},\n}\n\n@Book{rmarkdown2018,\n title = {R Markdown: The Definitive Guide},\n author = {Yihui Xie and J.J. 
Allaire and Garrett Grolemund},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2018},\n note = {ISBN 9781138359338},\n url = {https://bookdown.org/yihui/rmarkdown},\n}\n\n@Book{rmarkdown2020,\n title = {R Markdown Cookbook},\n author = {Yihui Xie and Christophe Dervieux and Emily Riederer},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2020},\n note = {ISBN 9780367563837},\n url = {https://bookdown.org/yihui/rmarkdown-cookbook},\n}\n```\n\n::: resources\n**Note there are three keys that we will use later on**:\n\n- `R-rmarkdown`\n- `rmarkdown2018`\n- `rmarkdown2020`\n:::\n\n### Linking `.bib` file with `.rmd` (and `.qmd`) files\n\nIn order to use references within a R Markdown file, you will need to specify the name and a location of a bibliography file using the bibliography metadata field in a YAML metadata section. For example:\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\n---\n```\n\nYou can include multiple reference files using the following syntax, alternatively you can concatenate two bib files into one.\n\n``` yaml\n---\nbibliography: [\"my-refs1.bib\", \"my-refs2.bib\"]\n---\n```\n\n### Inline citation\n\nNow we can start using those bib keys that we have learned just before, using the following syntax\n\n- `[@key]` for single citation\n- `[@key1; @key2]` multiple citation can be separated by semi-colon\n- `[-@key]` in order to suppress author name, and just display the year\n- `[see @key1 p 12; also this ref @key2]` is also a valid syntax\n\nLet's start by citing the `rmarkdown` package using the following code and press `Knit` button:\n\n------------------------------------------------------------------------\n\nI have been using the amazing Rmarkdown package [@R-rmarkdown]! 
I should also go and read [@rmarkdown2018; and @rmarkdown2020] books.\n\n------------------------------------------------------------------------\n\nPretty cool, eh??\n\n### Citation styles\n\nBy default, Pandoc will use a Chicago author-date format for citations and references.\n\nTo use another style, you will need to specify a CSL (Citation Style Language) file in the `csl` metadata field, e.g.,\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\ncsl: biomed-central.csl\n---\n```\n\n::: resources\nTo find your required formats, we recommend using the [Zotero Style Repository](https://www.zotero.org/styles), which makes it easy to search for and download your desired style.\n:::\n\nCSL files can be tweaked to meet custom formatting requirements. For example, we can change the number of authors required before \"et al.\" is used to abbreviate them. This can be simplified through the use of visual editors such as the one available at https://editor.citationstyles.org.\n\n### Other cool features\n\n#### Add an item to a bibliography without using it\n\nBy default, the bibliography will only display items that are directly referenced in the document. If you want to include items in the bibliography without actually citing them in the body text, you can define a dummy nocite metadata field and put the citations there.\n\n``` yaml\n---\nnocite: |\n @item1, @item2\n---\n```\n\n#### Add all items to the bibliography\n\nIf we do not wish to explicitly state all of the items within the bibliography but would still like to show them in our references, we can use the following syntax:\n\n``` yaml\n---\nnocite: '@*'\n---\n```\n\nThis will force all items to be displayed in the bibliography.\n\n::: resources\nYou can also have an appendix appear after bibliography. For more on this, see:\n\n- \n:::\n\n# Other useful tips\n\nWe have learned that inside your file that contains all your references (e.g. 
`my-refs.bib`), typically each reference gets a key, which is a shorthand that is generated by the reference manager or you can create yourself.\n\nFor instance, I use a format of lower-case first author last name followed by 4 digit year for each reference followed by a keyword (e.g name of a software package). Alternatively, you can omit the keyword. But note that if I cite a paper by the same first author that was published in the same year, then a lower case letter is added to the end. For instance, for a paper that I wrote as 1st author in 2010, my bibtex key might be `hicks2022` or `hicks2022a`. You can decide what scheme to use, just pick one and use it *forever*.\n\nIn your R Markdown document, you can then cite the reference by adding the key, such as `...in the paper by Hicks et al. [@hicks2022]...`.\n\n# Post-lecture materials\n\n### Practice\n\nHere are some post-lecture tasks to practice some of the material discussed.\n\n::: callout-note\n### Questions\n\n**Try out the following:**\n\n1. What do you notice that's different when you run `citation(\"tidyverse\")` (compared to `citation(\"rmarkdown\")`)?\n\n2. Install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(c(\"bibtex\", \"RefManageR\")\n```\n:::\n\n\nWhat do they do? How might they be helpful to you in terms of reference management?\n\n3. Instead of using a `.bib` file, try using a different bibliography file format in an R Markdown document.\n\n4. 
Practice using a different CSL file to change the citation style.\n:::\n\n### Additional Resources\n\n::: callout-tip\n- Add here.\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n\n:::\n\n\n\\[Add here.\\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"06 - Reference management\"\nauthor:\n - name: Leonardo Collado Torres\n url: 
http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to use citations and incorporate references from a bibliography in R Markdown.\"\nimage: https://www.bibtex.com/img/bibtex-format-700x402.png\ncategories: [module 1, week 1, R Markdown, programming]\nbibliography: my-refs.bib\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/06-reference-management/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. Authoring in [R Markdown from RStudio](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)\n2. Citations from [Reproducible Research in R](https://monashdatafluency.github.io/r-rep-res/citations.html) from the [Monash Data Fluency](https://monashdatafluency.github.io) initiative\n3. Bibliography from [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/bibliography.html)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know what types of bibliography file formats can be used in an R Markdown file\n- Learn how to add citations to an R Markdown file\n- Know how to change the citation style (e.g. 
APA, Chicago, etc)\n:::\n\n# Introduction\n\nFor almost any data analysis, especially if it is meant for publication in the academic literature, you will have to cite other people's work and include the references (bibliographies or citations) in your work. In this class, you are likely to need to include references and cite other people's work like in a regular research paper.\n\nR provides a nice function, `citation()`, that helps us generate a citation blob for the R packages that we have used. Let's try generating the citation text for the `rmarkdown` package by using the following command\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncitation(\"rmarkdown\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTo cite package 'rmarkdown' in publications use:\n\n Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K,\n Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). _rmarkdown:\n Dynamic Documents for R_. R package version 2.24,\n .\n\n Xie Y, Allaire J, Grolemund G (2018). _R Markdown: The Definitive\n Guide_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN\n 9781138359338, .\n\n Xie Y, Dervieux C, Riederer E (2020). _R Markdown Cookbook_. Chapman\n and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837,\n .\n\nTo see these entries in BibTeX format, use 'print(,\nbibtex=TRUE)', 'toBibtex(.)', or set\n'options(citation.bibtex.max=999)'.\n```\n:::\n:::\n\n\nI assume you are familiar with how citing references works, and hopefully, you are already using a reference manager. If not, let me know in the discussion boards.\n\nTo have something that plays well with R Markdown, you need a file format that stores all the references. 
Click here to learn more about other possible file formats available to you for use within an R Markdown file:\n\n- \n\n### Citation management software\n\nAs you can see, there are a ton of file formats, including `.medline` (MEDLINE), `.bib` (BibTeX), `.ris` (RIS), `.enl` (EndNote).\n\nI will not discuss the underlying citation management software itself, but I will briefly talk about how you might create one of these file formats.\n\nIf you recall the output from `citation(\"rmarkdown\")` above, we might consider manually copying and pasting the output into citation management software, but instead we can use the `write_bib()` function from the `knitr` package to create a bibliography file ending in `.bib`.\n\nLet's run the following code in order to generate a `my-refs.bib` file\n\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::write_bib(\"rmarkdown\", file = \"my-refs.bib\")\n```\n:::\n\n\nNow we can see we have the file saved locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\"       \"index.rmarkdown\" \"my-refs.bib\"    \n```\n:::\n:::\n\n\nIf you open up the `my-refs.bib` file, you will see\n\n``` \n@Manual{R-rmarkdown,\n title = {rmarkdown: Dynamic Documents for R},\n author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},\n year = {2021},\n note = {R package version 2.8},\n url = {https://CRAN.R-project.org/package=rmarkdown},\n}\n\n@Book{rmarkdown2018,\n title = {R Markdown: The Definitive Guide},\n author = {Yihui Xie and J.J. 
Allaire and Garrett Grolemund},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2018},\n note = {ISBN 9781138359338},\n url = {https://bookdown.org/yihui/rmarkdown},\n}\n\n@Book{rmarkdown2020,\n title = {R Markdown Cookbook},\n author = {Yihui Xie and Christophe Dervieux and Emily Riederer},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2020},\n note = {ISBN 9780367563837},\n url = {https://bookdown.org/yihui/rmarkdown-cookbook},\n}\n```\n\n::: resources\n**Note there are three keys that we will use later on**:\n\n- `R-rmarkdown`\n- `rmarkdown2018`\n- `rmarkdown2020`\n:::\n\n### Linking `.bib` file with `.rmd` (and `.qmd`) files\n\nIn order to use references within an R Markdown file, you will need to specify the name and location of a bibliography file using the bibliography metadata field in a YAML metadata section. For example:\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\n---\n```\n\nYou can include multiple reference files using the following syntax; alternatively, you can concatenate two `.bib` files into one.\n\n``` yaml\n---\nbibliography: [\"my-refs1.bib\", \"my-refs2.bib\"]\n---\n```\n\n### Inline citation\n\nNow we can start using those bib keys that we just created, with the following syntax\n\n- `[@key]` for a single citation\n- `[@key1; @key2]` for multiple citations, separated by semicolons\n- `[-@key]` to suppress the author name and just display the year\n- `[see @key1 p 12; also this ref @key2]` is also a valid syntax\n\nLet's start by citing the `rmarkdown` package using the following code and pressing the `Knit` button:\n\n------------------------------------------------------------------------\n\nI have been using the amazing Rmarkdown package [@R-rmarkdown]! 
I should also go and read [@rmarkdown2018; and @rmarkdown2020] books.\n\n------------------------------------------------------------------------\n\nPretty cool, eh??\n\n### Citation styles\n\nBy default, Pandoc will use a Chicago author-date format for citations and references.\n\nTo use another style, you will need to specify a CSL (Citation Style Language) file in the `csl` metadata field, e.g.,\n\n``` yaml\n---\ntitle: \"My top ten favorite R packages\"\noutput: html_document\nbibliography: my-refs.bib\ncsl: biomed-central.csl\n---\n```\n\n::: resources\nTo find your required formats, we recommend using the [Zotero Style Repository](https://www.zotero.org/styles), which makes it easy to search for and download your desired style.\n:::\n\nCSL files can be tweaked to meet custom formatting requirements. For example, we can change the number of authors required before \"et al.\" is used to abbreviate them. This can be simplified through the use of visual editors such as the one available at https://editor.citationstyles.org.\n\n### Other cool features\n\n#### Add an item to a bibliography without using it\n\nBy default, the bibliography will only display items that are directly referenced in the document. If you want to include items in the bibliography without actually citing them in the body text, you can define a dummy nocite metadata field and put the citations there.\n\n``` yaml\n---\nnocite: |\n @item1, @item2\n---\n```\n\n#### Add all items to the bibliography\n\nIf we do not wish to explicitly state all of the items within the bibliography but would still like to show them in our references, we can use the following syntax:\n\n``` yaml\n---\nnocite: '@*'\n---\n```\n\nThis will force all items to be displayed in the bibliography.\n\n::: resources\nYou can also have an appendix appear after bibliography. For more on this, see:\n\n- \n:::\n\n# Other useful tips\n\nWe have learned that inside your file that contains all your references (e.g. 
`my-refs.bib`), typically each reference gets a key, which is a shorthand that is generated by the reference manager or that you can create yourself.\n\nFor instance, I use a format of the lower-case first-author last name followed by the 4-digit year for each reference, followed by a keyword (e.g. the name of a software package). Alternatively, you can omit the keyword. But note that if I cite a paper by the same first author that was published in the same year, then a lower-case letter is added to the end. For instance, for a paper that I wrote as first author in 2022, my BibTeX key might be `hicks2022` or `hicks2022a`. You can decide what scheme to use, just pick one and use it *forever*.\n\nIn your R Markdown document, you can then cite the reference by adding the key, such as `...in the paper by Hicks et al. [@hicks2022]...`.\n\n# Post-lecture materials\n\n### Practice\n\nHere are some post-lecture tasks to practice some of the material discussed.\n\n::: callout-note\n### Questions\n\n**Try out the following:**\n\n1. What do you notice that's different when you run `citation(\"tidyverse\")` (compared to `citation(\"rmarkdown\")`)?\n\n2. Install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(c(\"bibtex\", \"RefManageR\"))\n```\n:::\n\n\nWhat do they do? How might they be helpful to you in terms of reference management?\n\n3. Instead of using a `.bib` file, try using a different bibliography file format in an R Markdown document.\n\n4. 
Practice using a different CSL file to change the citation style.\n:::\n\n### Additional Resources\n\n::: callout-tip\n- Add here.\n:::\n\n## rtistry\n\n\n::: {.cell .fig-cap-location-top}\n\n:::\n\n\n\\[Add here.\\]\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git 
a/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json b/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json index 75acc3f..4d72a5a 100644 --- a/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json +++ b/_freeze/posts/07-reading-and-writing-data/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "da2ed2dbc83c00a905134a7a4a7fecbb", + "hash": "53f6a7572e4a213383e32cd28c924565", "result": { - "markdown": "---\ntitle: \"07 - Reading and Writing data\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to get data in and out of R using relative paths\"\ncategories: [module 2, week 2, R, programming, readr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/07-reading-and-writing-data/index.qmd).*\n\n\n::: {.cell}\n\n:::\n\n\n\n\n> \"When writing code, you're always collaborating with future-you; and past-you doesn't respond to emails\". ---*Hadley Wickham*\n\n\\[[Source](https://fivebooks.com/best-books/computer-science-data-science-hadley-wickham/)\\]\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know the difference between relative and absolute paths\n- Be able to read and write text / csv files in R\n- Be able to read and write R data objects in R\n- Be able to calculate memory requirements for R objects\n- Use modern R packages for reading and writing data\n:::\n\n# Introduction\n\nThis lesson introduces **ways to read and write data** (e.g. `.txt` and `.csv` files) using base R functions as well as more modern R packages, such as `readr`, which is typically [10x faster than base R](https://r4ds.had.co.nz/data-import.html#compared-to-base-r).\n\nWe will also briefly describe different ways of reading and writing other data types, such as Excel files, Google spreadsheets, or SQL databases.\n\n# Relative versus absolute paths\n\nWhen you are starting a data analysis, you can create a new `.Rproj` file that asks RStudio to change the path (location on your computer) to the `.Rproj` location.\n\nLet's try this out. 
In RStudio, click `Project: (None)` in the top right corner and `New Project`.\n\nAfter opening up a `.Rproj` file, you can test this by\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n:::\n\n\nWhen you open up someone else's R code or analysis, you might also see the\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd()\n```\n:::\n\n\nfunction being used, which explicitly tells R to change to an absolute path, that is, the absolute location of the directory to move into.\n\nFor example, say I want to clone a GitHub repo from my colleague Brian, which has 100 R script files, and in every one of those files at the top is:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"C:\\Users\\Brian\\path\\only\\that\\Brian\\has\")\n```\n:::\n\n\nThe problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (`C:\\Users\\Brian\\path\\only\\that\\Brian\\has`) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. `/Users/Stephanie/Documents/path/only/I/have`).\n\n1. This is an unsustainable practice.\n2. I can go in and manually edit the path, but this assumes I know how to set a working directory. Not everyone does.\n\nSo instead of absolute paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"/Users/bcaffo/data\")\nsetwd(\"~/Desktop/files/data\")\nsetwd(\"C:\\\\Users\\\\Michelle\\\\Downloads\")\n```\n:::\n\n\nA better idea is to use relative paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"../data\")\nsetwd(\"../files\")\nsetwd(\"..\\tmp\")\n```\n:::\n\n\nWithin R, an even better idea is to use the [here](https://github.com/r-lib/here) R package, which will recognize the top-level directory of a Git repo and supports building all paths relative to it. 
For more on project-oriented workflow suggestions, read [this post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) from Jenny Bryan.\n\n![Artwork by Allison Horst on setwd() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/cracked_setwd.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### The `here` package\n\nIn her post, Jenny Bryan writes\n\n> \"I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work.\"\n\nInstead of using `setwd()` at the top of your `.R` or `.Rmd` file, she suggests:\n\n- Organize each logical project into a folder on your computer.\n- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named `.here`. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.\n- Use the `here()` function from the `here` package to build the path when you read or write a file. Create paths relative to the top-level directory.\n- Whenever you work on this project, launch the R process from the project's top-level directory. If you launch R from the shell, `cd` to the correct folder first.\n\nLet's test this out. We can use `getwd()` to see our current working directory path and the files available using `list.files()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/posts/07-reading-and-writing-data\"\n```\n:::\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\"       \"index.rmarkdown\"\n```\n:::\n:::\n\n\nOK so our current location is in the reading and writing lectures sub-folder of the `jhustatcomputing2023` course repository. 
Let's try using the `here` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n\nlist.files(here::here())\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"_freeze\"                    \"_post_template.qmd\"        \n [3] \"_quarto.yml\"                \"_site\"                     \n [5] \"data\"                       \"gh-pages\"                  \n [7] \"icon_32.png\"                \"images\"                    \n [9] \"index.qmd\"                  \"jhustatcomputing2023.Rproj\"\n[11] \"lectures.qmd\"               \"posts\"                     \n[13] \"profile.jpg\"                \"projects\"                  \n[15] \"projects.qmd\"               \"README.md\"                 \n[17] \"resources.qmd\"              \"schedule.qmd\"              \n[19] \"scripts\"                    \"site_libs\"                 \n[21] \"styles.css\"                 \"syllabus.qmd\"              \n[23] \"videos\"                    \n```\n:::\n\n```{.r .cell-code}\nlist.files(here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\"       \"b_lyrics.RDS\"            \n [3] \"bmi_pm25_no2_sim.csv\"     \"chicago.rds\"             \n [5] \"chocolate.RDS\"            \"flights.csv\"             \n [7] \"maacs_sim.csv\"            \"sales.RDS\"               \n [9] \"storms_2004.csv.gz\"       \"team_standings.csv\"      \n[11] \"ts_lyrics.RDS\"            \"tuesdata_rainfall.RDS\"   \n[13] \"tuesdata_temperature.RDS\"\n```\n:::\n:::\n\n\nNow we see that using the `here::here()` function builds a *relative* path (relative to the `.Rproj` file in our `jhustatcomputing2023` repository). We also see that there are several `.csv` files in the `data` folder. We will learn how to read those files into R in the next section.\n\n![Artwork by Allison Horst on here package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/here.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Finding and creating files locally\n\nOne last thing. 
If you want to download a file, one way is to use the `file.exists()`, `dir.create()` and `list.files()` functions.\n\n- `file.exists(here(\"my\", \"relative\", \"path\"))`: logical test if the file exists\n- `dir.create(here(\"my\", \"relative\", \"path\"))`: create a folder\n- `list.files(here(\"my\", \"relative\", \"path\"))`: list contents of folder\n- `file.create(here(\"my\", \"relative\", \"path\"))`: create a file\n- `file.remove(here(\"my\", \"relative\", \"path\"))`: delete a file\n\nFor example, I can put all this together by\n\n1. Checking to see if a file exists in my path. If not, then\n2. Creating a directory in that path.\n3. Listing the files in the path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif(!file.exists(here(\"my\", \"relative\", \"path\"))){\n    dir.create(here(\"my\", \"relative\", \"path\"))\n}\nlist.files(here(\"my\", \"relative\", \"path\"))\n```\n:::\n\n\nLet's put relative paths to use while reading and writing data.\n\n# Reading data in base R\n\nIn this section, we're going to demonstrate the essential functions you need to know to read and write (or save) data in R.\n\n## txt or csv\n\nThere are a few primary functions for reading data in base R.\n\n- `read.table()`, `read.csv()`: for reading tabular data\n- `readLines()`: for reading lines of a text file\n\nThere are analogous functions for writing data to files\n\n- `write.table()`: for writing tabular data to text files (i.e. 
CSV) or connections\n- `writeLines()`: for writing character data line-by-line to a file or connection\n\nLet's try reading some data into R with the `read.csv()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(here(\"data\", \"team_standings.csv\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Standing Team\n1 1 Spain\n2 2 Netherlands\n3 3 Germany\n4 4 Uruguay\n5 5 Argentina\n6 6 Brazil\n7 7 Ghana\n8 8 Paraguay\n9 9 Japan\n10 10 Chile\n11 11 Portugal\n12 12 USA\n13 13 England\n14 14 Mexico\n15 15 South Korea\n16 16 Slovakia\n17 17 Ivory Coast\n18 18 Slovenia\n19 19 Switzerland\n20 20 South Africa\n21 21 Australia\n22 22 New Zealand\n23 23 Serbia\n24 24 Denmark\n25 25 Greece\n26 26 Italy\n27 27 Nigeria\n28 28 Algeria\n29 29 France\n30 30 Honduras\n31 31 Cameroon\n32 32 North Korea\n```\n:::\n:::\n\n\nWe can use the `$` symbol to pick out a specific column:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$Team\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"Spain\" \"Netherlands\" \"Germany\" \"Uruguay\" \"Argentina\" \n [6] \"Brazil\" \"Ghana\" \"Paraguay\" \"Japan\" \"Chile\" \n[11] \"Portugal\" \"USA\" \"England\" \"Mexico\" \"South Korea\" \n[16] \"Slovakia\" \"Ivory Coast\" \"Slovenia\" \"Switzerland\" \"South Africa\"\n[21] \"Australia\" \"New Zealand\" \"Serbia\" \"Denmark\" \"Greece\" \n[26] \"Italy\" \"Nigeria\" \"Algeria\" \"France\" \"Honduras\" \n[31] \"Cameroon\" \"North Korea\" \n```\n:::\n:::\n\n\nWe can also ask for the full paths for specific files\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhere(\"data\", \"team_standings.csv\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/data/team_standings.csv\"\n```\n:::\n:::\n\n\n::: callout-note\n### Questions\n\n- What happens when you use `readLines()` function with the `team_standings.csv` data?\n- How would you only read in the first 5 lines?\n:::\n\n## R code\n\nSometimes, someone will give you a file that 
ends in `.R`.\n\nThis is what's called an **R script file**. It may contain code someone has written (maybe even you!), for example, a function that you can use with your data. In this case, you want the function available for you to use.\n\nTo use the function, **you first have to read the function from the R script file into R**.\n\nYou can check to see if the function is already loaded in R by looking at the Environment tab.\n\nThe function you want to use is\n\n- `source()`: for reading in R code files\n\nFor example, it might be something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(here::here('functions.R'))\n```\n:::\n\n\n## R objects\n\nAlternatively, you might be interested in reading and writing R objects.\n\nWriting data in e.g. `.txt`, `.csv` or Excel file formats is good if you want to open these files with other analysis software, such as Excel. However, these formats do not preserve data structures, such as column data types (numeric, character or factor). In order to do that, the data should be written out in an R data format.\n\nThere are several types of R data file formats to be aware of:\n\n- `.RData`: Stores **multiple** R objects\n- `.Rda`: This is short for `.RData` and is equivalent.\n- `.Rds`: Stores a **single** R object\n\n::: callout-note\n### Question\n\n**Why is saving data as an R object useful?**\n\nSaving data into R data formats can **typically** considerably reduce the size of large files by compression.\n:::\n\nNext, we will learn how to read and save\n\n1. A single R object\n2. Multiple R objects\n3. 
Your entire workspace in a specified file\n\n### Reading in data from files\n\n- `load()`: for reading in single or multiple R objects (opposite of `save()`) with a `.Rda` or `.RData` file format (objects keep the names they were saved with)\n- `readRDS()`: for reading in a single object with a `.Rds` file format (can rename objects)\n- `unserialize()`: for reading single R objects in binary form\n\n### Writing data to files\n\n- `save()`: for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.\n- `saveRDS()`: for saving a single object\n- `serialize()`: for converting an R object into a binary format for outputting to a connection (or file).\n- `save.image()`: short for 'save my current workspace'; while this **sounds** nice, it's not terribly useful for reproducibility (hence not suggested); it's also what happens when you try to quit R and it asks if you want to save your workspace.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Save data into R data file formats: RDS and RDATA](http://www.sthda.com/sthda/RDoc/images/save-data-into-r-data-formats.png)\n:::\n:::\n\n\n\\[[Source](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata)\\]\n\n### Example\n\nLet's try an example. 
Let's save a vector of length 5 into the two file formats.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\nsave(x, file=here(\"data\", \"x.Rda\"))\nsaveRDS(x, file=here(\"data\", \"x.Rds\"))\nlist.files(path=here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\" \"x.Rda\" \n[15] \"x.Rds\" \n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `readRDS()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x1 <- readRDS(here(\"data\", \"x.Rds\"))\nnew_x1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4 5\n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `load()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\nnew_x2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x\"\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\n`load()` simply returns the name of the objects loaded. 
Not the values.\n:::\n\nLet's clean up our space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rds\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nrm(x)\n```\n:::\n\n\n::: callout-note\n### Question\n\nWhat do you think this code will do?\n\n**Hint**: set `eval=TRUE` to see the result\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\ny <- x^2\nsave(x, y, file = here(\"data\", \"x.Rda\"))\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\n```\n:::\n\n\nWhen you are done:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n:::\n\n:::\n\n## Other data types\n\nThere are, of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.\n\nFor example, check out\n\n- [`DBI`](https://github.com/r-dbi/DBI) for relational databases\n- [`haven`](https://haven.tidyverse.org) for SPSS, Stata, and SAS data\n- [`httr`](https://github.com/r-lib/httr) for web APIs\n- [`readxl`](https://readxl.tidyverse.org) for `.xls` and `.xlsx` sheets\n- [`googlesheets4`](https://googlesheets4.tidyverse.org) for Google Sheets\n- [`googledrive`](https://googledrive.tidyverse.org) for Google Drive files\n- [`rvest`](https://github.com/tidyverse/rvest) for web scraping\n- [`jsonlite`](https://github.com/jeroen/jsonlite#jsonlite) for JSON\n- [`xml2`](https://github.com/r-lib/xml2) for XML.\n\n## Reading data files with `read.table()`\n\n
\n\nThe `read.table()` function is one of the most commonly used functions for reading data. The help file for `read.table()` is worth reading in its entirety if only because the function gets used a lot (run `?read.table` in R).\n\n**I know, I know**, everyone always says to read the help file, but this one is actually worth reading.\n\nThe `read.table()` function has a few important arguments:\n\n- `file`, the name of a file, or a connection\n- `header`, logical indicating if the file has a header line\n- `sep`, a string indicating how the columns are separated\n- `colClasses`, a character vector indicating the class of each column in the dataset\n- `nrows`, the number of rows in the dataset. By default `read.table()` reads an entire file.\n- `comment.char`, a character string indicating the comment character. This defaults to `\"#\"`. If there are no commented lines in your file, it's worth setting this to be the empty string `\"\"`.\n- `skip`, the number of lines to skip from the beginning\n- `stringsAsFactors`, should character variables be coded as factors? This defaults to `FALSE`. However, back in the \"old days\", it defaulted to `TRUE`. The reason was that, if you had data stored as strings, those strings typically represented levels of a categorical variable. Now, we have lots of text data that does not always represent categorical variables. So you may want to set this to be `FALSE` in those cases. If you *always* want this to be `FALSE`, you can set a global option via `options(stringsAsFactors = FALSE)`.\n\nI've never seen more heat generated on discussion forums about an R function argument than about the `stringsAsFactors` argument. 
**Seriously**.\n\nFor small to moderately sized datasets, you can usually call `read.table()` without specifying any other arguments\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.table(\"foo.txt\")\n```\n:::\n\n\n::: callout-tip\n### Note\n\n`foo.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`\n:::\n\nIn this case, R will automatically:\n\n- skip lines that begin with a \\#\n- figure out how many rows there are (and how much memory needs to be allocated)\n- figure out what type of variable is in each column of the table.\n\nTelling R all these things directly makes R run faster and more efficiently.\n\n::: callout-tip\n### Note\n\nThe `read.csv()` function is identical to `read.table()` except that some of the defaults are set differently (like the `sep` argument).\n:::\n\n
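To see how these arguments fit together, here is a small self-contained sketch (it writes a throwaway file with `tempfile()`, so no real dataset is needed) that sets `header`, `sep`, `colClasses`, and `comment.char` explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Write a tiny space-separated file to a temporary location\ntmp <- tempfile(fileext = \".txt\")\nwriteLines(c(\"id score\", \"1 3.5\", \"2 4.2\"), tmp)\n\n## Spell out the arguments instead of relying on the defaults\ndat <- read.table(tmp,\n    header = TRUE,\n    sep = \" \",\n    colClasses = c(\"integer\", \"numeric\"),\n    comment.char = \"\"\n)\nstr(dat)\nfile.remove(tmp)\n```\n:::\n\n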
\n\n## Reading in larger datasets with `read.table()`\n\n
\n\nWith much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.\n\n- Read the help page for `read.table()`, which contains many hints\n- Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.\n- Set `comment.char = \"\"` if there are no commented lines in your file.\n- Use the `colClasses` argument. Specifying this option instead of using the default can make `read.table()` run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are \"numeric\", for example, then you can just set `colClasses = \"numeric\"`. A quick and dirty way to figure out the classes of each column is the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\", colClasses = classes)\n```\n:::\n\n\n**Note**: `datatable.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n\n- Set `nrows`. This does not make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool `wc` to calculate the number of lines in a file.\n\nIn general, when using R with larger datasets, it's also useful to know a few things about your system.\n\n- How much memory is available on your system?\n- What other applications are in use? Can you close any of them?\n- Are there other users logged into the same system?\n- What operating system are you using? Some operating systems can limit the amount of memory a single process can access.\n\n
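If you prefer to stay within R rather than shelling out to `wc`, one rough way to get a value for `nrows` is to count lines with `readLines()` first. Here is a sketch using a throwaway file made with `tempfile()` (counting every line, header included, gives the mild overestimate mentioned above):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Create a small stand-in file; a real dataset would be much larger\ntmp <- tempfile(fileext = \".txt\")\nwriteLines(c(\"a b\", \"1 2\", \"3 4\"), tmp)\n\n## Count the lines; the header makes this a mild overestimate of the data rows\nn_lines <- length(readLines(tmp))\n\ndat <- read.table(tmp, header = TRUE, nrows = n_lines)\nnrow(dat)\nfile.remove(tmp)\n```\n:::\n\n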
\n\n# Calculating Memory Requirements for R Objects\n\nBecause **R stores all of its objects in physical memory**, it is important to be cognizant of how much memory is being used up by all of the data objects residing in your workspace.\n\nOne situation where it is particularly important to understand memory requirements is when you are reading a new dataset into R. Fortunately, it is easy to make a back-of-the-envelope calculation of how much memory will be required by a new dataset.\n\nFor example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?\n\nWell, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that information, you can do the following calculation\n\n1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes\n\n= 1,440,000,000 / 2^20^ bytes/MB\n\n= 1,373.29 MB\n\n= 1.34 GB\n\nSo the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware of\n\n- what other programs might be running on your computer, using up RAM\n- what other R objects might already be taking up RAM in your workspace\n\nReading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. 
You'll thank me later.\n\n# Using the `readr` package\n\nThe `readr` package was developed by Posit (formerly known as RStudio) to deal with reading in large flat files quickly.\n\nThe package provides replacements for functions like `read.table()` and `read.csv()`. The analogous functions in `readr` are `read_table()` and `read_csv()`. These **functions are often much faster than their base R analogues** and provide a few other nice features such as progress meters.\n\nFor example, the package includes a variety of functions in the `read_*()` family that allow you to read in data from different formats of flat files. The following table gives a guide to several functions in the `read_*()` family.\n\n\n::: {.cell}\n::: {.cell-output-display}\n|`readr` function |Use |\n|:----------------|:--------------------------------------------|\n|`read_csv()` |Reads comma-separated file |\n|`read_csv2()` |Reads semicolon-separated file |\n|`read_tsv()` |Reads tab-separated file |\n|`read_delim()` |General function for reading delimited files |\n|`read_fwf()` |Reads fixed width files |\n|`read_log()` |Reads log files |\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nIn this code, I used the `kable()` function from the `knitr` package to create the summary table in a table format, rather than as basic R output.\n\nThis function is very useful **for formatting basic tables in R markdown documents**. 
For more complex tables, check out the `pander` and `xtable` packages.\n:::\n\nFor the most part, you can use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`.\n\nIn addition, if there are non-fatal problems that occur while reading in the data, you will **get a warning and the returned data frame will have some information about which rows/observations triggered the warning**.\n\nThis can be very helpful for \"debugging\" problems with your data before you get neck-deep in data analysis.\n\n## Advantages\n\nThe advantage of the `read_csv()` function is perhaps better understood from an historical perspective.\n\n- R's built-in `read.csv()` function similarly reads CSV files, but the `read_csv()` function in `readr` builds on that by **removing some of the quirks and \"gotchas\"** of `read.csv()` as well as **dramatically optimizing the speed** with which it can read data into R.\n- The `read_csv()` function also adds some nice user-oriented features like a progress meter and a compact method for specifying column types.\n\n## Example\n\nA typical call to `read_csv()` will look as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readr)\nteams <- read_csv(here(\"data\", \"team_standings.csv\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 32 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): Team\ndbl (1): Standing\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n```{.r .cell-code}\nteams\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 32 × 2\n Standing Team \n <dbl> <chr> \n 1 1 Spain \n 2 2 Netherlands\n 3 3 Germany \n 4 4 Uruguay \n 5 5 Argentina \n 6 6 Brazil \n 7 7 Ghana \n 8 8 Paraguay \n 9 9 Japan \n10 10 Chile \n# ℹ 22 more rows\n```\n:::\n:::\n\n\nBy default, `read_csv()` will open a CSV 
file and read it in line-by-line. Similar to `read.table()`, you can tell the function to `skip` lines or which lines are comments:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n \n1 1 2 3\n```\n:::\n:::\n\n\nAlternatively, you can use the `comment` argument:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n \n1 1 2 3\n```\n:::\n:::\n\n\nIt will also (**by default**), **read in the first few rows of the table** in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv()` help page:\n\n> If 'NULL', all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. 
If the imputation fails, you'll need to supply the correct types yourself.\n\nYou can specify the type of each column with the `col_types` argument.\n\n::: callout-tip\n### Note\n\nIn general, it is a good idea to **specify the column types explicitly**.\n\nThis rules out any possible guessing errors on the part of `read_csv()`.\n\nAlso, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.\n:::\n\nHere is an example of how to specify the column types explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nteams <- read_csv(here(\"data\", \"team_standings.csv\"), \n col_types = \"cc\")\n```\n:::\n\n\nNote that the `col_types` argument accepts a compact representation. Here `\"cc\"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function **will also read compressed files** automatically.\n\nThere is no need to decompress the file first or use the `gzfile` connection function.\n:::\n\nThe following call reads a bzip2-compressed CSV file containing download logs from the RStudio CRAN mirror.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n n_max = 10)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 10 Columns: 10\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (6): r_version, r_arch, r_os, package, version, country\ndbl (2): size, ip_id\ndate (1): date\ntime (1): time\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nNote that the 
warnings indicate that `read_csv()` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n col_types = \"ccicccccci\", \n n_max = 10)\nlogs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 10\n date time size r_version r_arch r_os package version country ip_id\n \n 1 2016-07-19 22:00… 1.89e6 3.3.0 x86_64 ming… data.t… 1.9.6 US 1\n 2 2016-07-19 22:00… 4.54e4 3.3.1 x86_64 ming… assert… 0.1 US 2\n 3 2016-07-19 22:00… 1.43e7 3.3.1 x86_64 ming… stringi 1.1.1 DE 3\n 4 2016-07-19 22:00… 1.89e6 3.3.1 x86_64 ming… data.t… 1.9.6 US 4\n 5 2016-07-19 22:00… 3.90e5 3.3.1 x86_64 ming… foreach 1.4.3 US 4\n 6 2016-07-19 22:00… 4.88e4 3.3.1 x86_64 linu… tree 1.0-37 CO 5\n 7 2016-07-19 22:00… 5.25e2 3.3.1 x86_64 darw… surviv… 2.39-5 US 6\n 8 2016-07-19 22:00… 3.23e6 3.3.1 x86_64 ming… Rcpp 0.12.5 US 2\n 9 2016-07-19 22:00… 5.56e5 3.3.1 x86_64 ming… tibble 1.1 US 2\n10 2016-07-19 22:00… 1.52e5 3.3.1 x86_64 ming… magrit… 1.5 US 2\n```\n:::\n:::\n\n\nYou can **specify the column type in a more detailed fashion** by using the various `col_*()` functions.\n\nFor example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a `Date` object.\n\nIf we wanted to just read in that first column, we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogdates <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"), \n col_types = cols_only(date = col_date()),\n n_max = 10)\nlogdates\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 1\n date \n \n 1 2016-07-19\n 2 2016-07-19\n 3 2016-07-19\n 4 2016-07-19\n 5 2016-07-19\n 6 2016-07-19\n 7 2016-07-19\n 8 2016-07-19\n 9 2016-07-19\n10 2016-07-19\n```\n:::\n:::\n\n\nNow the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the 
`lubridate` package).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function has a `progress` option that defaults to `TRUE`.\n\nThis option provides a nice progress meter while the CSV file is being read.\n\nHowever, if you are using `read_csv()` in a function, or perhaps embedding it in a loop, it is probably best to set `progress = FALSE`.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What is the point of reference for using relative paths with the `here::here()` function?\n\n2. Why was the argument `stringsAsFactors=TRUE` historically used?\n\n3. What is the difference between `.Rds` and `.Rda` file formats?\n\n4. What function in `readr` would you use to read a file where fields were separated with \"\\|\"?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 
2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"07 - Reading and Writing data\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"How to get data in and out of R using relative 
paths\"\ncategories: [module 2, week 2, R, programming, readr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/07-reading-and-writing-data/index.qmd).*\n\n\n::: {.cell}\n\n:::\n\n\n\n\n> \"When writing code, you're always collaborating with future-you; and past-you doesn't respond to emails.\" ---*Hadley Wickham*\n\n\\[[Source](https://fivebooks.com/best-books/computer-science-data-science-hadley-wickham/)\\]\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know the difference between relative and absolute paths\n- Be able to read and write text / csv files in R\n- Be able to read and write R data objects in R\n- Be able to calculate memory requirements for R objects\n- Use modern R packages for reading and writing data\n:::\n\n# Introduction\n\nThis lesson introduces **ways to read and write data** (e.g. 
`.txt` and `.csv` files) using base R functions as well as more modern R packages, such as `readr`, which is typically [10x faster than base R](https://r4ds.had.co.nz/data-import.html#compared-to-base-r).\n\nWe will also briefly describe different ways of reading and writing other data types, such as Excel files, Google spreadsheets, or SQL databases.\n\n# Relative versus absolute paths\n\nWhen you are starting a data analysis, you can create a new `.Rproj` file that asks RStudio to change the path (location on your computer) to the `.Rproj` location.\n\nLet's try this out. In RStudio, click `Project: (None)` in the top right corner and `New Project`.\n\nAfter opening up a `.Rproj` file, you can test this by\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n:::\n\n\nWhen you open up someone else's R code or analysis, you might also see the\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd()\n```\n:::\n\n\nfunction being used, which explicitly tells R to change to an absolute path (the absolute location of the directory to move into).\n\nFor example, say I want to clone a GitHub repo from my colleague Brian, which has 100 R script files, and in every one of those files at the top is:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"C:\\\\Users\\\\Brian\\\\path\\\\only\\\\that\\\\Brian\\\\has\")\n```\n:::\n\n\nThe problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (`C:\\Users\\Brian\\path\\only\\that\\Brian\\has`) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. `/Users/leocollado/Documents/path/only/I/have`).\n\n1. This is an unsustainable practice.\n2. I can go in and manually edit the path, but this assumes I know how to set a working directory. 
Not everyone does.\n\nSo instead of absolute paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"/Users/bcaffo/data\")\nsetwd(\"~/Desktop/files/data\")\nsetwd(\"C:\\\\Users\\\\Michelle\\\\Downloads\")\n```\n:::\n\n\nA better idea is to use relative paths:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"../data\")\nsetwd(\"../files\")\nsetwd(\"..\\\\tmp\")\n```\n:::\n\n\nWithin R, an even better idea is to use the [here](https://github.com/r-lib/here) R package, which will recognize the top-level directory of a Git repo and supports building all paths relative to it. For more on project-oriented workflow suggestions, read [this post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) from Jenny Bryan.\n\n![Artwork by Allison Horst on setwd() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/cracked_setwd.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### The `here` package\n\nIn her post, Jenny Bryan writes\n\n> \"I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work.\"\n\nInstead of using `setwd()` at the top of your `.R` or `.Rmd` file, she suggests:\n\n- Organize each logical project into a folder on your computer.\n- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named `.here`. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.\n- Use the `here()` function from the `here` package to build the path when you read or write a file. Create paths relative to the top-level directory.\n- Whenever you work on this project, launch the R process from the project's top-level directory. If you launch R from the shell, `cd` to the correct folder first.\n\nLet's test this out. 
We can use `getwd()` to see our current working directory path and `list.files()` to see the files available\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngetwd()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/posts/07-reading-and-writing-data\"\n```\n:::\n\n```{.r .cell-code}\nlist.files()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"index.qmd\" \"index.rmarkdown\"\n```\n:::\n:::\n\n\nOK so our current location is in the reading and writing lectures sub-folder of the `jhustatcomputing2023` course repository. Let's try using the `here` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n\nlist.files(here::here())\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"_freeze\" \"_post_template.qmd\" \n [3] \"_quarto.yml\" \"_site\" \n [5] \"data\" \"gh-pages\" \n [7] \"icon_32.png\" \"images\" \n [9] \"index.qmd\" \"jhustatcomputing2023.Rproj\"\n[11] \"lectures.qmd\" \"posts\" \n[13] \"profile.jpg\" \"projects\" \n[15] \"projects.qmd\" \"README.md\" \n[17] \"resources.qmd\" \"schedule.qmd\" \n[19] \"scripts\" \"site_libs\" \n[21] \"styles.css\" \"syllabus.qmd\" \n[23] \"videos\" \n```\n:::\n\n```{.r .cell-code}\nlist.files(here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\"\n```\n:::\n:::\n\n\nNow we see that the `here::here()` function builds a *relative* path (relative to the `.Rproj` file in our `jhustatcomputing2023` repository). We also see there are several `.csv` files in the `data` folder. 
We will learn how to read those files into R in the next section.\n\n![Artwork by Allison Horst on here package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/here.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### Finding and creating files locally\n\nOne last thing. If you want to download a file, one way is to use the `file.exists()`, `dir.create()`, and `list.files()` functions.\n\n- `file.exists(here(\"my\", \"relative\", \"path\"))`: logical test if the file exists\n- `dir.create(here(\"my\", \"relative\", \"path\"))`: create a folder\n- `list.files(here(\"my\", \"relative\", \"path\"))`: list contents of folder\n- `file.create(here(\"my\", \"relative\", \"path\"))`: create a file\n- `file.remove(here(\"my\", \"relative\", \"path\"))`: delete a file\n\nFor example, I can put all this together by\n\n1. Checking to see if a file exists in my path. If not, then\n2. Creating a directory in that path.\n3. Listing the files in the path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!file.exists(here(\"my\", \"relative\", \"path\"))) {\n dir.create(here(\"my\", \"relative\", \"path\"))\n}\nlist.files(here(\"my\", \"relative\", \"path\"))\n```\n:::\n\n\nLet's put relative paths to use while reading and writing data.\n\n# Reading data in base R\n\nIn this section, we're going to demonstrate the essential functions you need to know to read and write (or save) data in R.\n\n## txt or csv\n\nThere are a few primary functions in base R for reading data.\n\n- `read.table()`, `read.csv()`: for reading tabular data\n- `readLines()`: for reading lines of a text file\n\nThere are analogous functions for writing data to files\n\n- `write.table()`: for writing tabular data to text files (i.e. 
CSV) or connections\n- `writeLines()`: for writing character data line-by-line to a file or connection\n\nLet's try reading some data into R with the `read.csv()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(here(\"data\", \"team_standings.csv\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Standing Team\n1 1 Spain\n2 2 Netherlands\n3 3 Germany\n4 4 Uruguay\n5 5 Argentina\n6 6 Brazil\n7 7 Ghana\n8 8 Paraguay\n9 9 Japan\n10 10 Chile\n11 11 Portugal\n12 12 USA\n13 13 England\n14 14 Mexico\n15 15 South Korea\n16 16 Slovakia\n17 17 Ivory Coast\n18 18 Slovenia\n19 19 Switzerland\n20 20 South Africa\n21 21 Australia\n22 22 New Zealand\n23 23 Serbia\n24 24 Denmark\n25 25 Greece\n26 26 Italy\n27 27 Nigeria\n28 28 Algeria\n29 29 France\n30 30 Honduras\n31 31 Cameroon\n32 32 North Korea\n```\n:::\n:::\n\n\nWe can use the `$` symbol to pick out a specific column:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$Team\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"Spain\" \"Netherlands\" \"Germany\" \"Uruguay\" \"Argentina\" \n [6] \"Brazil\" \"Ghana\" \"Paraguay\" \"Japan\" \"Chile\" \n[11] \"Portugal\" \"USA\" \"England\" \"Mexico\" \"South Korea\" \n[16] \"Slovakia\" \"Ivory Coast\" \"Slovenia\" \"Switzerland\" \"South Africa\"\n[21] \"Australia\" \"New Zealand\" \"Serbia\" \"Denmark\" \"Greece\" \n[26] \"Italy\" \"Nigeria\" \"Algeria\" \"France\" \"Honduras\" \n[31] \"Cameroon\" \"North Korea\" \n```\n:::\n:::\n\n\nWe can also ask for the full paths for specific files\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhere(\"data\", \"team_standings.csv\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"/Users/leocollado/Dropbox/Code/jhustatcomputing2023/data/team_standings.csv\"\n```\n:::\n:::\n\n\n::: callout-note\n### Questions\n\n- What happens when you use `readLines()` function with the `team_standings.csv` data?\n- How would you only read in the first 5 lines?\n:::\n\n## R code\n\nSometimes, someone will give you a file that 
ends in `.R`.\n\nThis is what's called an **R script file**. It may contain code someone has written (maybe even you!), for example, a function that you can use with your data. In this case, you want the function available for you to use.\n\nTo use the function, **you first have to read the function from the R script file into R**.\n\nYou can check to see if the function is already loaded in R by looking at the Environment tab.\n\nThe function you want to use is\n\n- `source()`: for reading in R code files\n\nFor example, it might be something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(here::here(\"functions.R\"))\n```\n:::\n\n\n## R objects\n\nAlternatively, you might be interested in reading and writing R objects.\n\nWriting data in e.g. `.txt`, `.csv` or Excel file formats is good if you want to open these files with other analysis software, such as Excel. However, these formats do not preserve data structures, such as column data types (numeric, character or factor). In order to do that, the data should be written out in an R data format.\n\nThere are several types of R data file formats to be aware of:\n\n- `.RData`: Stores **multiple** R objects\n- `.Rda`: This is short for `.RData` and is equivalent.\n- `.Rds`: Stores a **single** R object\n\n::: callout-note\n### Question\n\n**Why is saving data as an R object useful?**\n\nSaving data into R data formats can **typically** reduce the size of large files considerably through compression.\n:::\n\nNext, we will learn how to read and save\n\n1. A single R object\n2. Multiple R objects\n3. 
Your entire workspace in a specified file\n\n### Reading in data from files\n\n- `load()`: for reading in single or multiple R objects (opposite of `save()`) with a `.Rda` or `.RData` file format (objects are restored with their original names)\n- `readRDS()`: for reading in a single object with a `.Rds` file format (you can assign the object to a new name)\n- `unserialize()`: for reading single R objects in binary form\n\n### Writing data to files\n\n- `save()`: for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.\n- `saveRDS()`: for saving a single object\n- `serialize()`: for converting an R object into a binary format for outputting to a connection (or file).\n- `save.image()`: short for 'save my current workspace'; while this **sounds** nice, it's not terribly useful for reproducibility (hence not suggested); it's also what happens when you try to quit R and it asks if you want to save your workspace.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Save data into R data file formats: RDS and RDATA](http://www.sthda.com/sthda/RDoc/images/save-data-into-r-data-formats.png)\n:::\n:::\n\n\n\\[[Source](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata)\\]\n\n### Example\n\nLet's try an example. 
Let's save a vector of length 5 into the two file formats.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\nsave(x, file = here(\"data\", \"x.Rda\"))\nsaveRDS(x, file = here(\"data\", \"x.Rds\"))\nlist.files(path = here(\"data\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"2016-07-19.csv.bz2\" \"b_lyrics.RDS\" \n [3] \"bmi_pm25_no2_sim.csv\" \"chicago.rds\" \n [5] \"chocolate.RDS\" \"flights.csv\" \n [7] \"maacs_sim.csv\" \"sales.RDS\" \n [9] \"storms_2004.csv.gz\" \"team_standings.csv\" \n[11] \"ts_lyrics.RDS\" \"tuesdata_rainfall.RDS\" \n[13] \"tuesdata_temperature.RDS\" \"x.Rda\" \n[15] \"x.Rds\" \n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `readRDS()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x1 <- readRDS(here(\"data\", \"x.Rds\"))\nnew_x1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4 5\n```\n:::\n:::\n\n\nHere we assign the imported data to an object using `load()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\nnew_x2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x\"\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\n`load()` simply returns the name of the objects loaded. 
Not the values.\n:::\n\nLet's clean up our workspace.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rds\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nrm(x)\n```\n:::\n\n\n::: callout-note\n### Question\n\nWhat do you think this code will do?\n\n**Hint**: set the chunk option `eval=TRUE` to see the result\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\ny <- x^2\nsave(x, y, file = here(\"data\", \"x.Rda\"))\nnew_x2 <- load(here(\"data\", \"x.Rda\"))\n```\n:::\n\n\nWhen you are done:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(here(\"data\", \"x.Rda\"))\n```\n:::\n\n:::\n\n## Other data types\n\nThere are, of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.\n\nFor example, check out\n\n- [`DBI`](https://github.com/r-dbi/DBI) for relational databases\n- [`haven`](https://haven.tidyverse.org) for SPSS, Stata, and SAS data\n- [`httr`](https://github.com/r-lib/httr) for web APIs\n- [`readxl`](https://readxl.tidyverse.org) for `.xls` and `.xlsx` sheets\n- [`googlesheets4`](https://googlesheets4.tidyverse.org) for Google Sheets\n- [`googledrive`](https://googledrive.tidyverse.org) for Google Drive files\n- [`rvest`](https://github.com/tidyverse/rvest) for web scraping\n- [`jsonlite`](https://github.com/jeroen/jsonlite#jsonlite) for JSON\n- [`xml2`](https://github.com/r-lib/xml2) for XML.\n\n## Reading data files with `read.table()`\n\n
\n\nFor details on reading data with `read.table()`, click here.\n\nThe `read.table()` function is one of the most commonly used functions for reading data. The help file for `read.table()` is worth reading in its entirety if only because the function gets used a lot (run `?read.table` in R).\n\n**I know, I know**, everyone always says to read the help file, but this one is actually worth reading.\n\nThe `read.table()` function has a few important arguments:\n\n- `file`, the name of a file, or a connection\n- `header`, logical indicating if the file has a header line\n- `sep`, a string indicating how the columns are separated\n- `colClasses`, a character vector indicating the class of each column in the dataset\n- `nrows`, the number of rows in the dataset. By default `read.table()` reads an entire file.\n- `comment.char`, a character string indicating the comment character. This defaults to `\"#\"`. If there are no commented lines in your file, it's worth setting this to be the empty string `\"\"`.\n- `skip`, the number of lines to skip from the beginning\n- `stringsAsFactors`, should character variables be coded as factors? This defaults to `FALSE`. However, back in the \"old days\", it defaulted to `TRUE`. The reasoning was that, if data were stored as strings, those strings usually represented levels of a categorical variable. Now we have lots of text data, and text does not always represent a categorical variable. So you may want to set this to be `FALSE` in those cases. If you *always* want this to be `FALSE`, you can set a global option via `options(stringsAsFactors = FALSE)` (note that this global option is deprecated as of R 4.0.0).\n\nI've never seen more heat generated on discussion forums over an R function argument than over the `stringsAsFactors` argument. 
**Seriously**.\n\nFor small to moderately sized datasets, you can usually call `read.table()` without specifying any other arguments\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.table(\"foo.txt\")\n```\n:::\n\n\n::: callout-tip\n### Note\n\n`foo.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n:::\n\nIn this case, R will automatically:\n\n- skip lines that begin with a \\#\n- figure out how many rows there are (and how much memory needs to be allocated)\n- figure out what type of variable is in each column of the table.\n\nTelling R all these things directly makes R run faster and more efficiently.\n\n::: callout-tip\n### Note\n\nThe `read.csv()` function is identical to `read.table()` except that some of the defaults are set differently (like the `sep` argument).\n:::\n\n
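For instance, the following two calls should be (roughly) equivalent; as in the note above, `foo.txt` is not a real dataset and is only used for illustration:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## read.csv() is essentially read.table() with different defaults,\n## e.g. header = TRUE and sep = \",\"\ndata1 <- read.csv(\"foo.txt\")\ndata2 <- read.table(\"foo.txt\", header = TRUE, sep = \",\")\n```\n:::\n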
\n\n## Reading in larger datasets with `read.table()`\n\n
\n\nFor details on reading larger datasets with `read.table()`, click here.\n\nWith much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.\n\n- Read the help page for `read.table()`, which contains many hints\n- Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.\n- Set `comment.char = \"\"` if there are no commented lines in your file.\n- Use the `colClasses` argument. Specifying this option instead of using the default can make `read.table()` run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are \"numeric\", for example, then you can just set `colClasses = \"numeric\"`. A quick and dirty way to figure out the classes of each column is the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\", colClasses = classes)\n```\n:::\n\n\n**Note**: `datatable.txt` is not a real dataset here. It is only used as an example for how to use `read.table()`.\n\n- Set `nrows`. This does not make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool `wc` to calculate the number of lines in a file.\n\nIn general, when using R with larger datasets, it's also useful to know a few things about your system.\n\n- How much memory is available on your system?\n- What other applications are in use? Can you close any of them?\n- Are there other users logged into the same system?\n- What operating system are you using? Some operating systems can limit the amount of memory a single process can access.\n\n
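Putting a few of these hints together, a sketch might look like the following (again, `datatable.txt` is not a real dataset; the line count from `wc -l` is a mild overestimate of the number of data rows, since it includes the header line):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Estimate the number of lines with the Unix tool wc,\n## then read with explicit colClasses, nrows, and comment.char\nn <- scan(text = system(\"wc -l datatable.txt\", intern = TRUE), n = 1)\ninitial <- read.table(\"datatable.txt\", nrows = 100)\nclasses <- sapply(initial, class)\ntabAll <- read.table(\"datatable.txt\",\n    colClasses = classes, nrows = n, comment.char = \"\"\n)\n```\n:::\n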
\n\n# Calculating Memory Requirements for R Objects\n\nBecause **R stores all of its objects in physical memory**, it is important to be cognizant of how much memory is being used up by all of the data objects residing in your workspace.\n\nOne situation where it is particularly important to understand memory requirements is when you are reading a new dataset into R. Fortunately, it is easy to make a back-of-the-envelope calculation of how much memory will be required by a new dataset.\n\nFor example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?\n\nWell, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that information, you can do the following calculation\n\n1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes\n\n= 1,440,000,000 / 2^20^ bytes/MB\n\n= 1,373.29 MB\n\n= 1.34 GB\n\nSo the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware of\n\n- what other programs might be running on your computer, using up RAM\n- what other R objects might already be taking up RAM in your workspace\n\nReading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. 
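\n\nThe same back-of-the-envelope arithmetic is easy to reproduce in R itself:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrows <- 1500000\ncols <- 120\nbytes <- rows * cols * 8 ## 8 bytes per double precision number\nbytes / 2^20 ## about 1373.29 MB\nbytes / 2^30 ## about 1.34 GB\n```\n:::\n\n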
You'll thank me later.\n\n# Using the `readr` package\n\nThe `readr` package was developed by Posit (formerly known as RStudio) to deal with reading in large flat files quickly.\n\nThe package provides replacements for functions like `read.table()` and `read.csv()`. The analogous functions in `readr` are `read_table()` and `read_csv()`. These **functions are often much faster than their base R analogues** and provide a few other nice features such as progress meters.\n\nFor example, the package includes a variety of functions in the `read_*()` family that allow you to read in data from different formats of flat files. The following table gives a guide to several functions in the `read_*()` family.\n\n\n::: {.cell}\n::: {.cell-output-display}\n|`readr` function |Use |\n|:----------------|:--------------------------------------------|\n|`read_csv()` |Reads comma-separated file |\n|`read_csv2()` |Reads semicolon-separated file |\n|`read_tsv()` |Reads tab-separated file |\n|`read_delim()` |General function for reading delimited files |\n|`read_fwf()` |Reads fixed width files |\n|`read_log()` |Reads log files |\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nIn this code, I used the `kable()` function from the `knitr` package to create the summary table in a table format, rather than as basic R output.\n\nThis function is very useful **for formatting basic tables in R markdown documents**. 
For more complex tables, check out the `pander` and `xtable` packages.\n:::\n\nFor the most part, you can use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`.\n\nIn addition, if there are non-fatal problems that occur while reading in the data, you will **get a warning and the returned data frame will have some information about which rows/observations triggered the warning**.\n\nThis can be very helpful for \"debugging\" problems with your data before you get neck-deep in data analysis.\n\n## Advantages\n\nThe advantage of the `read_csv()` function is perhaps better understood from a historical perspective.\n\n- R's built-in `read.csv()` function similarly reads CSV files, but the `read_csv()` function in `readr` builds on that by **removing some of the quirks and \"gotchas\"** of `read.csv()` as well as **dramatically optimizing the speed** with which it can read data into R.\n- The `read_csv()` function also adds some nice user-oriented features like a progress meter and a compact method for specifying column types.\n\n## Example\n\nA typical call to `read_csv()` will look as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readr)\nteams <- read_csv(here(\"data\", \"team_standings.csv\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 32 Columns: 2\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): Team\ndbl (1): Standing\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n```{.r .cell-code}\nteams\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 32 × 2\n Standing Team \n <dbl> <chr> \n 1 1 Spain \n 2 2 Netherlands\n 3 3 Germany \n 4 4 Uruguay \n 5 5 Argentina \n 6 6 Brazil \n 7 7 Ghana \n 8 8 Paraguay \n 9 9 Japan \n10 10 Chile \n# ℹ 22 more rows\n```\n:::\n:::\n\n\nBy default, `read_csv()` will open a CSV 
file and read it in line-by-line. Similar to `read.table()`, you can tell the function to `skip` lines or which lines are comments:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n <dbl> <dbl> <dbl>\n1 1 2 3\n```\n:::\n:::\n\n\nAlternatively, you can use the `comment` argument:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_csv(\"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 1 Columns: 3\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (3): x, y, z\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 3\n x y z\n <dbl> <dbl> <dbl>\n1 1 2 3\n```\n:::\n:::\n\n\nIt will also (**by default**) **read in the first few rows of the table** in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv()` help page:\n\n> If 'NULL', all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. 
If the imputation fails, you'll need to supply the correct types yourself.\n\nYou can specify the type of each column with the `col_types` argument.\n\n::: callout-tip\n### Note\n\nIn general, it is a good idea to **specify the column types explicitly**.\n\nThis rules out any possible guessing errors on the part of `read_csv()`.\n\nAlso, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it.\n:::\n\nHere is an example of how to specify the column types explicitly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nteams <- read_csv(here(\"data\", \"team_standings.csv\"),\n col_types = \"cc\"\n)\n```\n:::\n\n\nNote that the `col_types` argument accepts a compact representation. Here `\"cc\"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function **will also read compressed files** automatically.\n\nThere is no need to decompress the file first or use the `gzfile` connection function.\n:::\n\nThe following call reads a bzip2-compressed CSV file containing download logs from the RStudio CRAN mirror.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n n_max = 10\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 10 Columns: 10\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (6): r_version, r_arch, r_os, package, version, country\ndbl (2): size, ip_id\ndate (1): date\ntime (1): time\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nNote that the 
warnings indicate that `read_csv()` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogs <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n col_types = \"ccicccccci\",\n n_max = 10\n)\nlogs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 10\n date time size r_version r_arch r_os package version country ip_id\n \n 1 2016-07-19 22:00… 1.89e6 3.3.0 x86_64 ming… data.t… 1.9.6 US 1\n 2 2016-07-19 22:00… 4.54e4 3.3.1 x86_64 ming… assert… 0.1 US 2\n 3 2016-07-19 22:00… 1.43e7 3.3.1 x86_64 ming… stringi 1.1.1 DE 3\n 4 2016-07-19 22:00… 1.89e6 3.3.1 x86_64 ming… data.t… 1.9.6 US 4\n 5 2016-07-19 22:00… 3.90e5 3.3.1 x86_64 ming… foreach 1.4.3 US 4\n 6 2016-07-19 22:00… 4.88e4 3.3.1 x86_64 linu… tree 1.0-37 CO 5\n 7 2016-07-19 22:00… 5.25e2 3.3.1 x86_64 darw… surviv… 2.39-5 US 6\n 8 2016-07-19 22:00… 3.23e6 3.3.1 x86_64 ming… Rcpp 0.12.5 US 2\n 9 2016-07-19 22:00… 5.56e5 3.3.1 x86_64 ming… tibble 1.1 US 2\n10 2016-07-19 22:00… 1.52e5 3.3.1 x86_64 ming… magrit… 1.5 US 2\n```\n:::\n:::\n\n\nYou can **specify the column type in a more detailed fashion** by using the various `col_*()` functions.\n\nFor example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a `Date` object.\n\nIf we wanted to just read in that first column, we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogdates <- read_csv(here(\"data\", \"2016-07-19.csv.bz2\"),\n col_types = cols_only(date = col_date()),\n n_max = 10\n)\nlogdates\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 1\n date \n \n 1 2016-07-19\n 2 2016-07-19\n 3 2016-07-19\n 4 2016-07-19\n 5 2016-07-19\n 6 2016-07-19\n 7 2016-07-19\n 8 2016-07-19\n 9 2016-07-19\n10 2016-07-19\n```\n:::\n:::\n\n\nNow the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the 
`lubridate` package).\n\n::: callout-tip\n### Note\n\nThe `read_csv()` function has a `progress` option that defaults to `TRUE`.\n\nThis option provides a nice progress meter while the CSV file is being read.\n\nHowever, if you are using `read_csv()` in a function, or perhaps embedding it in a loop, it is probably best to set `progress = FALSE`.\n:::\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What is the point of reference for using relative paths with the `here::here()` function?\n\n2. Why was the argument `stringsAsFactors=TRUE` historically used?\n\n3. What is the difference between `.Rds` and `.Rda` file formats?\n\n4. What function in `readr` would you use to read a file where fields were separated with \"\\|\"?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 
2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json index fda21c2..9d27133 100644 --- a/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json +++ b/_freeze/posts/08-managing-data-frames-with-tidyverse/index/execute-results/html.json @@ -1,7 +1,7 @@ { - 
"hash": "692ae9c13383582e0de2ed88c813e47b", + "hash": "14e8698f5056b6c674e42b8c203edc6c", "result": { - "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An introduction to data frames in R and managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
[dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of `tibble` over `data.frame` data objects in R\n- Learn about the dplyr R package to manage data frames\n- Recognize the key verbs to manage data frames in dplyr\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns can often be tedious in R, with syntax that is not very intuitive. The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. 
Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from data.frame, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` was notorious for treating strings as factors (the default behavior before R 4.0.0); this will not happen with tibbles\n- **Variable names remain unchanged** - In base R, creating `data.frames` will convert spaces in names to periods and add \"X\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print the first ten rows and only the columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.\n\n## Creating a tibble\n\nThe tibble package is part of the `tidyverse` and can thus be loaded (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibble`.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor this example, we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` directory of the course repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6940 8\n```\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>% \n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column z of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you willl continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work with either a `data.frame` or a `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formerly RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package has a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
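In particular, `dplyr::filter()` and `dplyr::lag()` mask `stats::filter()` and `stats::lag()` from base R. If you ever need one of the masked functions, one option is to call it explicitly with its package prefix and the `::` operator, which is unambiguous no matter which packages are loaded. As a minimal sketch (the `tmpd > 80` condition is just an arbitrary illustration):\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## dplyr's filter(), called explicitly through its namespace\ndplyr::filter(chicago, tmpd > 80)\n\n## base R's time-series filter() is still reachable despite the masking\nstats::filter(1:10, rep(1 / 3, 3))\n```\n:::\n\n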
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also supports a special syntax that lets you specify variable names based on patterns. So, for example, if we wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame and the distribution of the `pm25tmean2` values is as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n <date> <dbl> <dbl>\n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == \"chic\"` |\n| `!=` | Does not equal | `city != \"chic\"` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n:::\n:::\n\n\nColumns can be arranged in descending order too by using the special `desc()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n <date> <dbl>\n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n <chr> <dbl> <dbl> <date> <dbl>\n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and should probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n <chr> <dbl> <dbl> <date> <dbl>\n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago, \n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n:::\n:::\n\n\nNote that there are only two columns in the 
transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is.\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a grouped data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years, pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n <dbl> <dbl> <dbl> <dbl>\n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quintiles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint, o3 = mean(o3tmean2, na.rm = TRUE), \n no2 = mean(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n <fct> <dbl> <dbl>\n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 <NA> 18.8 25.8\n```\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the operations get buried in a sequence of nested 
function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>% second %>% third\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>% \n mutate(year = as.POSIXlt(date)$year + 1900) %>% \n group_by(year) %>% \n summarize(pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>% \n group_by(month) %>% \n summarize(pm25 = mean(pm25, na.rm = TRUE), \n o3 = max(o3tmean2, na.rm = TRUE), \n no2 = median(no2tmean2, na.rm = TRUE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 45 25.7 2002-04-26 9.9 30.2 23.9 26.3\n 2 chic 52.5 41.4 1993-04-23 NA 63.5 24.6 40.1\n 3 chic 13 9.3 2005-01-27 10.8 22 20.7 27.3\n 4 chic 41 43.4 1993-11-12 NA 51.5 10.0 25.3\n 5 chic 52.5 39 1996-05-02 NA 38 22.5 32.1\n 6 chic 65.5 51.8 1990-09-27 NA 45.5 19.6 40.9\n 7 chic 46 31 2000-11-05 12.1 26 14.3 24.7\n 8 chic 86.5 73.4 1990-07-04 NA 60.6 52.2 12.8\n 9 chic 10 7.75 1992-12-24 NA 39 6.82 21.9\n10 chic 33 20 1992-10-19 NA 30 9.30 
33.6\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again, the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar, there are a few additional benefits:\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package.\n\n- `dplyr` can be integrated with the `data.table` package for large, fast tables.\n\nThe `dplyr` package is a handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 
2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"08 - Managing data frames with the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \" An introduction to data frames in R and the managing them with the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tibble, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/08-managing-data-frames-with-tidyverse/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
[dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand the advantages of `tibble` objects over `data.frame` objects in R\n- Learn about the `dplyr` R package to manage data frames\n- Recognize the key verbs to manage data frames in `dplyr`\n- Use the \"pipe\" operator to combine verbs together\n:::\n\n# Data Frames\n\nThe **data frame** (or `data.frame`) is a **key data structure** in statistics and in R.\n\nThe basic structure of a data frame is that there is **one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation**.\n\nR has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases that allow you to operate on very, very large data frames (but we will not discuss them here).\n\nGiven the importance of managing data frames, it is **important that we have good tools for dealing with them.**\n\nFor example, **operations** like filtering rows, re-ordering rows, and selecting columns can often be tedious in R, with syntax that is not very intuitive. The `dplyr` package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.\n\n## Tibbles\n\nAnother type of data structure that we need to discuss is called the **tibble**! It's best to think of tibbles as an updated and stylish version of the `data.frame`.\n\nTibbles are what tidyverse packages work with most seamlessly. 
Now, that **does not mean tidyverse packages *require* tibbles**.\n\nIn fact, they still work with `data.frames`, but the more you work with tidyverse and tidyverse-adjacent packages, the more you will see the advantages of using tibbles.\n\nBefore we go any further, tibbles *are* data frames, but they have some new bells and whistles to make your life easier.\n\n### How tibbles differ from `data.frame`\n\nThere are a number of differences between tibbles and `data.frames`.\n\n::: callout-tip\n### Note\n\nTo see a full vignette about tibbles and how they differ from `data.frame`, you will want to execute `vignette(\"tibble\")` and read through that vignette.\n:::\n\nWe will summarize some of the most important points here:\n\n- **Input type remains unchanged** - `data.frame` is notorious for treating strings as factors; this will not happen with tibbles\n- **Variable names remain unchanged** - In base R, creating `data.frames` will remove spaces from names, converting them to periods, or will add \"X\" before numeric column names. Creating tibbles will not change variable (column) names.\n- **There are no `row.names()` for a tibble** - Tidy data requires that variables be stored in a consistent way, removing the need for row names.\n- **Tibbles print first ten rows and columns that fit on one screen** - Printing a tibble to screen will never print the entire huge data frame out. 
By default, it just shows what fits to your screen.\n\n## Creating a tibble\n\nThe `tibble` package is part of the `tidyverse` and can thus be loaded in (once installed) using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### `as_tibble()`\n\nSince many packages use the historical `data.frame` from base R, you will often find yourself in the situation that you have a `data.frame` and want to convert that `data.frame` to a `tibble`.\n\nTo do so, the `as_tibble()` function is exactly what you are looking for.\n\nFor this example, we use a dataset (`chicago.rds`) containing air pollution and temperature data for the city of Chicago in the U.S.\n\nThe dataset is available in the `/data` directory of the course repository. You can load the data into R using the `readRDS()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nchicago <- readRDS(here(\"data\", \"chicago.rds\"))\n```\n:::\n\n\nYou can see some basic characteristics of the dataset with the `dim()` and `str()` functions.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6940 8\n```\n:::\n\n```{.r .cell-code}\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t6940 obs. 
of 8 variables:\n $ city : chr \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date, format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num 34 NA 34.2 47 NA ...\n $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nWe see this data structure is a `data.frame` with 6940 observations and 8 variables.\n\nTo convert this `data.frame` to a tibble you would use the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(as_tibble(chicago))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTibbles, by default, **only print the first ten rows to screen**.\n\nIf you were to print the `data.frame` `chicago` to screen, all 6940 rows would be displayed. When working with large `data.frames`, this **default behavior can be incredibly frustrating**.\n\nUsing tibbles removes this frustration because of the default settings for tibble printing.\n:::\n\nAdditionally, you will note that the **type of the variable is printed for each variable in the tibble**. This helpful feature is another added bonus of tibbles relative to `data.frame`.\n\n#### Want to see more of the tibble?\n\nIf you *do* want to see more rows from the tibble, there are a few options!\n\n1. 
The `View()` function in RStudio is incredibly helpful. The input to this function is the `data.frame` or tibble you would like to see.\n\nSpecifically, `View(chicago)` would provide you, the viewer, with a scrollable view (in a new tab) of the complete dataset.\n\n2. Use the fact that `print()` enables you to specify how many rows and columns you would like to display.\n\nHere, we again display the `chicago` data.frame as a tibble but specify that we would only like to see 5 rows. The `width = Inf` argument specifies that we would like to see all the possible columns. Here, there are only 8, but for larger datasets, this can be helpful to specify.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas_tibble(chicago) %>%\n print(n = 5, width = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6,940 × 8\n city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n2 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n5 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n# ℹ 6,935 more rows\n```\n:::\n:::\n\n\n### `tibble()`\n\nAlternatively, you can **create a tibble on the fly** by using `tibble()` and specifying the information you would like stored in each column.\n\n::: callout-tip\n### Note\n\nIf you provide a single value, this value will be repeated across all rows of the tibble. 
This is referred to as \"recycling inputs of length 1.\"\n\nIn the example here, we see that the column `c` will contain the value '1' across all rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n a b c z\n \n1 1 6 1 50\n2 2 7 1 82\n3 3 8 1 122\n4 4 9 1 170\n5 5 10 1 226\n```\n:::\n:::\n\n:::\n\nThe `tibble()` function allows you to quickly generate tibbles and even allows you to **reference columns within the tibble you are creating**, as seen in column `z` of the example above.\n\n::: callout-tip\n### Note\n\n**Tibbles can have column names that are not allowed** in `data.frame`.\n\nIn the example below, we see that to utilize a nontraditional variable name, you surround the column name with backticks.\n\nNote that to refer to such columns in other tidyverse packages, you will continue to use backticks surrounding the variable name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `two words` = 1:5,\n `12` = \"numeric\",\n `:)` = \"smile\",\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n `two words` `12` `:)` \n \n1 1 numeric smile\n2 2 numeric smile\n3 3 numeric smile\n4 4 numeric smile\n5 5 numeric smile\n```\n:::\n:::\n\n:::\n\n## Subsetting tibbles\n\nSubsetting tibbles also differs slightly from how subsetting occurs with `data.frame`.\n\nWhen it comes to tibbles,\n\n- `[[` can subset by name or position\n- `$` only subsets by name\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n a = 1:5,\n b = 6:10,\n c = 1,\n z = (a + b)^2 + c\n)\n\n# Extract by name using $ or [[]]\ndf$z\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\ndf[[\"z\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n\n```{.r .cell-code}\n# Extract by position requires [[]]\ndf[[4]]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] 50 82 122 170 226\n```\n:::\n:::\n\n\nHaving now discussed tibbles, which are the type of object most tidyverse and tidyverse-adjacent packages work best with, we now know the goal.\n\nIn many cases, **tibbles are ultimately what we want to work with in R**.\n\nHowever, **data are stored in many different formats outside of R**. We will spend the rest of this lesson discussing wrangling functions that work with either a `data.frame` or `tibble`.\n\n# The `dplyr` Package\n\nThe `dplyr` package was developed by Posit (formerly RStudio) and is **an optimized and distilled** version of the older `plyr` **package for data manipulation or wrangling**.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_wrangling.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe `dplyr` package does not provide any \"new\" functionality to R per se, in the sense that everything `dplyr` does could already be done with base R, but it **greatly** simplifies existing functionality in R.\n\nOne important contribution of the `dplyr` package is that it **provides a \"grammar\" (in particular, verbs) for data manipulation and for operating on data frames**.\n\nWith this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). 
This is useful because it **provides an abstraction for data manipulation that previously did not exist**.\n\nAnother useful contribution is that the `dplyr` functions are **very** fast, as many key operations are coded in C++.\n\n### `dplyr` grammar\n\nSome of the key \"verbs\" provided by the `dplyr` package are:\n\n- `select()`: return a subset of the columns of a data frame, using a flexible notation\n\n- `filter()`: extract a subset of rows from a data frame based on logical conditions\n\n- `arrange()`: reorder rows of a data frame\n\n- `rename()`: rename variables in a data frame\n\n- `mutate()`: add new variables/columns or transform existing variables\n\n- `summarise()` / `summarize()`: generate summary statistics of different variables in the data frame, possibly within strata\n\n- `%>%`: the \"pipe\" operator is used to connect multiple verb actions together into a pipeline\n\n::: callout-tip\n### Note\n\nThe `dplyr` package has a number of its own data types that it takes advantage of.\n\nFor example, there is a handy `print()` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.\n:::\n\n### `dplyr` functions\n\nAll of the functions that we will discuss here will have a few common characteristics. In particular,\n\n1. The **first argument** is a data frame type object.\n\n2. The **subsequent arguments** describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly (without using the `$` operator, just use the column names).\n\n3. The **return result** of a function is a new data frame.\n\n4. Data frames must be **properly formatted** and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). 
In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n### `dplyr` installation\n\nThe `dplyr` package can be installed from CRAN or from GitHub using the `devtools` package and the `install_github()` function. The GitHub repository will usually contain the latest updates to the package and the development version.\n\nTo install from CRAN, just run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"dplyr\")\n```\n:::\n\n\nThe `dplyr` package is also installed when you install the `tidyverse` meta-package.\n\nAfter installing the package it is important that you load it into your R session with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nYou may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. 
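If you ever need to be explicit about which package a function comes from, you can use the `package::function()` notation. As a quick, hypothetical sketch (using the `chicago` dataset loaded earlier):\n\n::: {.cell}\n\n```{.r .cell-code}\n## dplyr::filter() masks stats::filter() once dplyr is loaded;\n## the :: notation makes it unambiguous which function we mean\ndplyr::filter(chicago, tmpd > 80)\n```\n:::\n\n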
For now you can ignore the warnings.\n\n### `select()`\n\nWe will continue to use the `chicago` dataset containing air pollution and temperature data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- as_tibble(chicago)\nstr(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:6940] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:6940] 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...\n $ dptp : num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date : Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nThe `select()` function can be used to **select columns of a data frame** that you want to focus on.\n\n::: callout-tip\n### Example\n\nSuppose we wanted to take the first 3 columns only. There are a few ways to do this.\n\nWe could for example use numerical indices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(chicago)[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"city\" \"tmpd\" \"dptp\"\n```\n:::\n:::\n\n\nBut we can also use the names directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, city:dptp)\nhead(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n city tmpd dptp\n \n1 chic 31.5 31.5\n2 chic 33 29.9\n3 chic 33 27.4\n4 chic 29 28.6\n5 chic 32 28.9\n6 chic 40 35.1\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nThe `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.\n:::\n\nYou can also **omit** variables using the `select()` function by using the negative sign. 
With `select()` you can do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(chicago, -(city:dptp))\n```\n:::\n\n\nwhich indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be\n\n\n::: {.cell}\n\n```{.r .cell-code}\ni <- match(\"city\", names(chicago))\nj <- match(\"dptp\", names(chicago))\nhead(chicago[, -(i:j)])\n```\n:::\n\n\nNot super intuitive, right?\n\nThe `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a \"2\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, ends_with(\"2\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 4] (S3: tbl_df/tbl/data.frame)\n $ pm25tmean2: num [1:6940] NA NA NA NA NA NA NA NA NA NA ...\n $ pm10tmean2: num [1:6940] 34 NA 34.2 47 NA ...\n $ o3tmean2 : num [1:6940] 4.25 3.3 3.33 4.38 4.75 ...\n $ no2tmean2 : num [1:6940] 20 23.2 23.8 30.4 30.3 ...\n```\n:::\n:::\n\n\nOr if we wanted to keep every variable that starts with a \"d\", we could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubset <- select(chicago, starts_with(\"d\"))\nstr(subset)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [6,940 × 2] (S3: tbl_df/tbl/data.frame)\n $ dptp: num [1:6940] 31.5 29.9 27.4 28.6 28.9 ...\n $ date: Date[1:6940], format: \"1987-01-01\" \"1987-01-02\" ...\n```\n:::\n:::\n\n\nYou can also use more general regular expressions if necessary. See the help page (`?select`) for more details.\n\n### `filter()`\n\nThe `filter()` function is used to **extract subsets of rows** from a data frame. 
This function is similar to the existing `subset()` function in R but is quite a bit faster in my experience.\n\n![Artwork by Allison Horst on filter() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_filter.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n::: callout-tip\n### Example\n\nSuppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could do\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30)\nstr(chic.f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [194 × 8] (S3: tbl_df/tbl/data.frame)\n $ city : chr [1:194] \"chic\" \"chic\" \"chic\" \"chic\" ...\n $ tmpd : num [1:194] 23 28 55 59 57 57 75 61 73 78 ...\n $ dptp : num [1:194] 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...\n $ date : Date[1:194], format: \"1998-01-17\" \"1998-01-23\" ...\n $ pm25tmean2: num [1:194] 38.1 34 39.4 35.4 33.3 ...\n $ pm10tmean2: num [1:194] 32.5 38.7 34 28.5 35 ...\n $ o3tmean2 : num [1:194] 3.18 1.75 10.79 14.3 20.66 ...\n $ no2tmean2 : num [1:194] 25.3 29.4 25.3 31.4 26.8 ...\n```\n:::\n:::\n\n:::\n\nYou can see that there are now only 194 rows in the data frame, and the distribution of the `pm25tmean2` values is as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chic.f$pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
\n 30.05 32.12 35.04 36.63 39.53 61.50 \n```\n:::\n:::\n\n\nWe can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)\nselect(chic.f, date, tmpd, pm25tmean2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 17 × 3\n date tmpd pm25tmean2\n \n 1 1998-08-23 81 39.6\n 2 1998-09-06 81 31.5\n 3 2001-07-20 82 32.3\n 4 2001-08-01 84 43.7\n 5 2001-08-08 85 38.8\n 6 2001-08-09 84 38.2\n 7 2002-06-20 82 33 \n 8 2002-06-23 82 42.5\n 9 2002-07-08 81 33.1\n10 2002-07-18 82 38.8\n11 2003-06-25 82 33.9\n12 2003-07-04 84 32.9\n13 2005-06-24 86 31.9\n14 2005-06-27 82 51.5\n15 2005-06-28 85 31.2\n16 2005-07-17 84 32.7\n17 2005-08-03 84 37.9\n```\n:::\n:::\n\n\nNow there are only 17 observations where both of those conditions are met.\n\nOther logical operators you should be aware of include:\n\n| Operator | Meaning | Example |\n|----------:|-------------------------:|-------------------------------:|\n| `==` | Equals | `city == \"chic\"` |\n| `!=` | Does not equal | `city != \"chic\"` |\n| `>` | Greater than | `tmpd > 32.0` |\n| `>=` | Greater than or equal to | `tmpd >= 32.0` |\n| `<` | Less than | `tmpd < 32.0` |\n| `<=` | Less than or equal to | `tmpd <= 32.0` |\n| `%in%` | Included in | `city %in% c(\"chic\", \"bmore\")` |\n| `is.na()` | Is a missing value | `is.na(pm10tmean2)` |\n\n::: callout-tip\n### Note\n\nIf you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the `!` operator to negate the whole statement.\n\nA common use of this is to identify observations with non-missing data (e.g., `!(is.na(pm10tmean2))`).\n:::\n\n### `arrange()`\n\nThe `arrange()` function is used to **reorder rows** of a data frame according to one of the variables/columns. 
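For comparison, the base R idiom for this combines `order()` with bracket indexing. A minimal sketch (shown here without its output):\n\n::: {.cell}\n\n```{.r .cell-code}\n## base R equivalent of arrange(chicago, date):\n## order() returns the row permutation that sorts by date\nchicago[order(chicago$date), ]\n```\n:::\n\n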
Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The `arrange()` function simplifies the process quite a bit.\n\nHere we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, date)\n```\n:::\n\n\nWe can now check the first few rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-01 NA\n2 1987-01-02 NA\n3 1987-01-03 NA\n```\n:::\n:::\n\n\nand the last few rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-29 7.45\n2 2005-12-30 15.1 \n3 2005-12-31 15 \n```\n:::\n:::\n\n\nColumns can be arranged in descending order too by using the special `desc()` operator.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- arrange(chicago, desc(date))\n```\n:::\n\n\nLooking at the first three and last three rows shows the dates in descending order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 2005-12-31 15 \n2 2005-12-30 15.1 \n3 2005-12-29 7.45\n```\n:::\n\n```{.r .cell-code}\ntail(select(chicago, date, pm25tmean2), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n date pm25tmean2\n \n1 1987-01-03 NA\n2 1987-01-02 NA\n3 1987-01-01 NA\n```\n:::\n:::\n\n\n### `rename()`\n\n**Renaming a variable** in a data frame in R is surprisingly hard to do! 
The `rename()` function is designed to make this process easier.\n\nHere you can see the names of the first five variables in the `chicago` data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dptp date pm25tmean2\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data.\n\nHowever, these names are pretty obscure or awkward and should probably be renamed to something more sensible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)\nhead(chicago[, 1:5], 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n city tmpd dewpoint date pm25\n \n1 chic 35 30.1 2005-12-31 15 \n2 chic 36 31 2005-12-30 15.1 \n3 chic 35 29.4 2005-12-29 7.45\n```\n:::\n:::\n\n\nThe syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.\n\n::: callout-note\n### Question\n\nHow would you do the equivalent in base R without `dplyr`?\n:::\n\n### `mutate()`\n\nThe `mutate()` function exists to **compute transformations of variables** in a data frame. 
Often, you want to create new variables that are derived from existing variables and `mutate()` provides a clean interface for doing that.\n\n![Artwork by Allison Horst on mutate() function](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/dplyr_mutate.png){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nFor example, with air pollution data, we often want to *detrend* the data by subtracting the mean from the data.\n\n- That way we can look at whether a given day's air pollution level is higher than or less than average (as opposed to looking at its absolute level).\n\nHere, we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))\nhead(chicago)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 9\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n6 chic 35 29.6 2005-12-26 8.4 8.5 14.0 16.8\n# ℹ 1 more variable: pm25detrend \n```\n:::\n:::\n\n\nThere is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.\n\nHere, we de-trend the PM10 and ozone (O3) variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(transmute(chicago,\n pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),\n o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)\n))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n pm10detrend o3detrend\n \n1 -10.4 -16.9 \n2 -14.7 -16.4 \n3 -10.4 -12.6 \n4 -6.40 -16.2 \n5 -6.90 -15.0 \n6 -25.4 -5.39\n```\n:::\n:::\n\n\nNote that there are only two columns in the 
transmuted data frame.\n\n### `group_by()`\n\nThe `group_by()` function is used to **generate summary statistics** from the data frame within strata defined by a variable.\n\nFor example, in this air pollution dataset, you might want to know the average annual level of PM2.5.\n\nSo the stratum is the year, and that is something we can derive from the `date` variable.\n\n**In conjunction** with the `group_by()` function, we often use the `summarize()` function (or `summarise()` for some parts of the world).\n\n::: callout-tip\n### Note\n\nThe **general operation** here is a combination of\n\n1. Splitting a data frame into separate pieces defined by a variable or group of variables (`group_by()`)\n2. Then, applying a summary function across those subsets (`summarize()`)\n:::\n\n::: callout-tip\n### Example\n\nFirst, we can create a `year` variable using `as.POSIXlt()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)\n```\n:::\n\n\nNow we can create a separate data frame that splits the original data frame by year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nyears <- group_by(chicago, year)\n```\n:::\n\n\nFinally, we compute summary statistics for each year in the data frame with the `summarize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(years,\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 
22.6\n```\n:::\n:::\n\n:::\n\n`summarize()` returns a data frame with `year` as the first column, and then the annual summary statistics of `pm25`, `o3`, and `no2`.\n\n::: callout-tip\n### More complicated example\n\nIn a slightly more complicated example, we might want to know the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) within quintiles of `pm25`. A slicker way to do this would be through a regression model, but we can actually do this quickly with `group_by()` and `summarize()`.\n\nFirst, we can create a categorical variable of `pm25` divided into quintiles.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)\nchicago <- mutate(chicago, pm25.quint = cut(pm25, qq))\n```\n:::\n\n\nNow we can group the data frame by the `pm25.quint` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquint <- group_by(chicago, pm25.quint)\n```\n:::\n\n\nFinally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummarize(quint,\n o3 = mean(o3tmean2, na.rm = TRUE),\n no2 = mean(no2tmean2, na.rm = TRUE)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 3\n pm25.quint o3 no2\n \n1 (1.7,8.7] 21.7 18.0\n2 (8.7,12.4] 20.4 22.1\n3 (12.4,16.7] 20.7 24.4\n4 (16.7,22.6] 19.9 27.3\n5 (22.6,61.5] 20.3 29.6\n6 <NA> 18.8 25.8\n```\n:::\n:::\n\n:::\n\nFrom the table, it seems there is not a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`.\n\nMore sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.\n\n### `%>%`\n\nThe pipeline operator `%>%` is very handy for **stringing together multiple `dplyr` functions in a sequence of operations**.\n\nNotice above that every time we wanted to apply more than one function, the operations got buried in a sequence of nested 
function calls that is difficult to read, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nthird(second(first(x)))\n```\n:::\n\n\nThis **nesting is not a natural way** to think about a sequence of operations.\n\nThe `%>%` operator allows you to string operations in a left-to-right fashion, i.e.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfirst(x) %>%\n second() %>%\n third()\n```\n:::\n\n\n::: callout-tip\n### Example\n\nTake the example that we just did in the last section.\n\nThat can be done with the following sequence in a single R expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchicago %>%\n mutate(year = as.POSIXlt(date)$year + 1900) %>%\n group_by(year) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 19 × 4\n year pm25 o3 no2\n \n 1 1987 NaN 63.0 23.5\n 2 1988 NaN 61.7 24.5\n 3 1989 NaN 59.7 26.1\n 4 1990 NaN 52.2 22.6\n 5 1991 NaN 63.1 21.4\n 6 1992 NaN 50.8 24.8\n 7 1993 NaN 44.3 25.8\n 8 1994 NaN 52.2 28.5\n 9 1995 NaN 66.6 27.3\n10 1996 NaN 58.4 26.4\n11 1997 NaN 56.5 25.5\n12 1998 18.3 50.7 24.6\n13 1999 18.5 57.5 24.7\n14 2000 16.9 55.8 23.5\n15 2001 16.9 51.8 25.1\n16 2002 15.3 54.9 22.7\n17 2003 15.2 56.2 24.6\n18 2004 14.6 44.5 23.4\n19 2005 16.2 58.8 22.6\n```\n:::\n:::\n\n:::\n\nThis way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.\n\n::: callout-tip\n### Note\n\nIn the above code, I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`.\n\nOnce you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.\n:::\n\nAnother example might be computing the average pollutant level by month. 
This could be useful to see if there are any seasonal trends in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%\n group_by(month) %>%\n summarize(\n pm25 = mean(pm25, na.rm = TRUE),\n o3 = max(o3tmean2, na.rm = TRUE),\n no2 = median(no2tmean2, na.rm = TRUE)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 4\n month pm25 o3 no2\n \n 1 1 17.8 28.2 25.4\n 2 2 20.4 37.4 26.8\n 3 3 17.4 39.0 26.8\n 4 4 13.9 47.9 25.0\n 5 5 14.1 52.8 24.2\n 6 6 15.9 66.6 25.0\n 7 7 16.6 59.5 22.4\n 8 8 16.9 54.0 23.0\n 9 9 15.9 57.5 24.5\n10 10 14.2 47.1 24.2\n11 11 15.2 29.5 23.6\n12 12 17.5 27.7 24.5\n```\n:::\n:::\n\n\nHere, we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.\n\n### `slice_*()`\n\nThe `slice_sample()` function of the `dplyr` package will allow you to see a **sample of random rows** in random order.\n\nThe number of rows to show is specified by the `n` argument.\n\n- This can be useful if you **do not want to print the entire tibble**, but you want to get a greater sense of the values.\n- This is a **good option for data analysis reports**, where printing the entire tibble would not be appropriate if the tibble is quite large.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_sample(chicago, n = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n 1 chic 49 40.2 2000-09-25 6.6 7 17.2 15.5\n 2 chic 35 24.1 1989-11-02 NA 25 8.83 17.3\n 3 chic 63.5 54.4 1996-04-18 NA 54 30.5 26.7\n 4 chic 70 65.9 1997-06-19 NA 60.5 32.4 39.9\n 5 chic 54 50.6 2005-11-05 27.2 32 11.5 18.2\n 6 chic 86.5 73.4 1990-07-04 NA 60.6 52.2 12.8\n 7 chic 74 74.6 1987-08-14 NA 49.5 24.2 18.6\n 8 chic 34.5 29.1 1995-11-27 NA 25 6.57 29.3\n 9 chic 73 61.2 1995-09-13 NA 46 25.3 26.5\n10 chic 79 64.6 2005-07-31 20.8 29.5 
40.8 20.2\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n:::\n\nYou can also use `slice_head()` or `slice_tail()` to take a look at the top rows or bottom rows of your tibble. Again, the number of rows can be specified with the `n` argument.\n\nThis will show the first 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_head(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 35 30.1 2005-12-31 15 23.5 2.53 13.2\n2 chic 36 31 2005-12-30 15.1 19.2 3.03 22.8\n3 chic 35 29.4 2005-12-29 7.45 23.5 6.79 20.0\n4 chic 37 34.5 2005-12-28 17.8 27.5 3.26 19.3\n5 chic 40 33.6 2005-12-27 23.6 27 4.47 23.5\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\nThis will show the last 5 rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nslice_tail(chicago, n = 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 11\n city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2\n \n1 chic 32 28.9 1987-01-05 NA NA 4.75 30.3\n2 chic 29 28.6 1987-01-04 NA 47 4.38 30.4\n3 chic 33 27.4 1987-01-03 NA 34.2 3.33 23.8\n4 chic 33 29.9 1987-01-02 NA NA 3.30 23.2\n5 chic 31.5 31.5 1987-01-01 NA 34 4.25 20.0\n# ℹ 3 more variables: pm25detrend , year , pm25.quint \n```\n:::\n:::\n\n\n# Summary\n\nThe `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.\n\nOnce you learn the `dplyr` grammar, there are a few additional benefits:\n\n- `dplyr` can work with other data frame \"back ends\" such as SQL databases. 
There is an SQL interface for relational databases via the DBI package\n\n- `dplyr` can be integrated with the `data.table` package for large, fast tables\n\nThe `dplyr` package is a handy way to both simplify and speed up your data frame management code. It is rare that you get such a combination at the same time!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. How can you tell if an object is a tibble?\n2. What option controls how many additional column names are printed at the footer of a tibble?\n3. Using the `trees` dataset in base R (this dataset stores the girth, height, and volume for Black Cherry Trees) and using the pipe operator: (i) convert the `data.frame` to a tibble, (ii) filter for rows with a tree height of greater than 70, and (iii) order rows by `Volume` (smallest to largest).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(trees)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Girth Height Volume\n1 8.3 70 10.3\n2 8.6 65 10.3\n3 8.8 63 10.2\n4 10.5 72 16.4\n5 10.7 81 18.8\n6 10.8 83 19.7\n```\n:::\n:::\n\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- [dplyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages 
───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 
2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json b/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json index dd7cfe1..745ed89 100644 --- a/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json +++ b/_freeze/posts/09-tidy-data-and-the-tidyverse/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2d7b312f39fbff603576c0df9c26dee5", + "hash": "4988f121f5166e5d1440e31a42847a58", "result": { - "markdown": "---\ntitle: \"09 - Tidy data and the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to tidy data and how to convert between wide and long data with the tidyr R package\"\ncategories: [module 2, week 2, R, programming, tidyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/09-tidy-data-and-the-tidyverse/index.qmd).*\n\n\n\n> \"Happy families are all alike; every unhappy family is unhappy in its own way.\" ---- Leo Tolstoy\n\n> \"Tidy datasets are all alike, but every messy dataset is messy in its own way.\" ---- Hadley Wickham\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n2. \n3. [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Define tidy data\n- Be able to transform non-tidy data into tidy data\n- Be able to transform wide data into long data\n- Be able to separate character columns into multiple columns\n- Be able to unite multiple character columns into one column\n:::\n\n# Tidy data\n\nAs we learned in the last lesson, one unifying concept of the tidyverse is the notion of **tidy data**.\n\nAs defined by Hadley Wickham in his 2014 paper published in the *Journal of Statistical Software*, a [tidy dataset](https://www.jstatsoft.org/article/view/v059i10) has the following properties:\n\n1. Each variable forms a column.\n\n2. Each observation forms a row.\n\n3. 
Each type of observational unit forms a table.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe **purpose of defining tidy data** is to highlight the fact that **most data do not start out life as tidy**.\n\nIn fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).\n\n- Once a dataset is tidy, it **can be used as input into a variety of other functions** that may transform, model, or visualize the data.\n\n::: callout-tip\n### Example\n\nAs a quick example, consider the following data illustrating **religion and income survey data** with the number of respondents with income range in column name.\n\nThis is in a classic table format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\nrelig_income\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 18 × 11\n religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`\n \n 1 Agnostic 27 34 60 81 76 137 122\n 2 Atheist 12 27 37 52 35 70 73\n 3 Buddhist 27 21 30 34 33 58 62\n 4 Catholic 418 617 732 670 638 1116 949\n 5 Don’t k… 15 14 15 11 10 35 21\n 6 Evangel… 575 869 1064 982 881 1486 949\n 7 Hindu 1 9 7 9 11 34 47\n 8 Histori… 228 244 236 238 197 223 131\n 9 Jehovah… 20 27 24 24 21 30 15\n10 Jewish 19 19 25 25 30 95 69\n11 Mainlin… 289 495 619 655 651 1107 939\n12 Mormon 29 40 48 51 56 112 85\n13 Muslim 6 7 9 10 9 23 16\n14 Orthodox 13 17 23 32 32 47 38\n15 Other C… 9 7 11 13 13 14 18\n16 Other F… 20 33 40 46 49 63 46\n17 Other W… 5 2 3 4 2 7 3\n18 Unaffil… 217 299 374 365 341 528 407\n# ℹ 3 more variables: `$100-150k` , `>150k` ,\n# `Don't know/refused` \n```\n:::\n:::\n\n:::\n\nWhile this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not 
tidy.\n\n**This format violates the tidy form** because there are variables in the columns.\n\n- In this case, the variables are religion, income bracket, and number of respondents; the third variable (the number of respondents) appears inside the body of the table rather than in its own column.\n\nConverting this dataset to tidy format would give us\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n\nSome of these functions you have seen before; others might be new to you. Let's talk about each one in the context of the `tidyverse` R packages.\n\n# The \"Tidyverse\"\n\nThere are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. 
Many (but not all) of these packages are written by Hadley Wickham and **the collection of packages is often referred to as the \"tidyverse\"** because of their **dependence on and presumption of tidy data**.\n\n::: callout-tip\n### Note\n\nA subset of the \"Tidyverse\" packages includes:\n\n- [ggplot2](https://cran.r-project.org/package=ggplot2): a plotting system based on the grammar of graphics\n\n- [magrittr](https://cran.r-project.org/package=magrittr): defines the `%>%` operator for chaining functions together in a series of operations on data\n\n- [dplyr](https://cran.r-project.org/package=dplyr): a suite of (fast) functions for working with data frames\n\n- [tidyr](https://cran.r-project.org/package=tidyr): easily tidy data with `pivot_wider()` and `pivot_longer()` functions (also `separate()` and `unite()`)\n\nA complete list can be found here ().\n:::\n\nWe will be using these packages quite a bit.\n\nThe \"tidyverse\" package can be used to install all of the packages in the tidyverse at once.\n\nFor example, instead of starting an R script with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(readr)\nlibrary(ggplot2)\n```\n:::\n\n\nYou can start with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nIn the example above, let's talk about what we did using the `pivot_longer()` function.\n\nWe will also talk about `pivot_wider()`.\n\n### `pivot_longer()`\n\nThe `tidyr` package includes functions to convert a data frame between *long* and *wide* formats.\n\n- **Wide format** data tends to have different attributes or variables describing an observation placed in separate columns.\n- **Long format** data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.\n\n::: callout-tip\n### Example\n\nIn the section above, we showed an example that used `pivot_longer()` to convert data 
into a tidy format.\n\nThe **key problem** with the tidiness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.\n\nTo **fix this**, you can use the `pivot_longer()` function to **gather values spread across several columns into a single column**, here with the column names gathered into an `income` column.\n\n**Note**: when gathering, exclude any columns that you do not want \"gathered\" (`religion` in this case) by including the column names with a minus sign in the `pivot_longer()` function.\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Gather everything EXCEPT religion to tidy data\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n:::\n\nEven if your data is in a tidy format, `pivot_longer()` is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.\n\n### `pivot_wider()`\n\nThe `pivot_wider()` function is less commonly needed to tidy data. 
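\n\nAs a minimal sketch of the mechanics (using a hypothetical toy tibble, not evaluated here), `pivot_wider()` is the inverse of `pivot_longer()`: it spreads the entries of one column into new column names and fills those new columns with the matching values.\n\n::: {.cell}\n\n```{.r .cell-code}\n## Hypothetical toy data: spread \"key\" into columns \"a\" and \"b\"\ndf_long <- tibble(\n id = c(1, 1, 2, 2),\n key = c(\"a\", \"b\", \"a\", \"b\"),\n value = c(10, 20, 30, 40)\n)\npivot_wider(df_long, names_from = \"key\", values_from = \"value\")\n## yields one row per id, with columns id, a, and b\n```\n:::\n\n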
It can, however, be useful for creating summary tables.\n\n::: callout-tip\n### Example\n\nYou use the `summarize()` function in `dplyr` to summarize the total number of respondents per income category.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income)) %>% \n group_by(income) %>% \n summarize(total_respondents = sum(respondents)) %>%\n pivot_wider(names_from = \"income\", \n values_from = \"total_respondents\") %>%\n knitr::kable()\n```\n\n::: {.cell-output-display}\n| <$10k| >150k| $10-20k| $100-150k| $20-30k| $30-40k| $40-50k| $50-75k| $75-100k| Don't know/refused|\n|-----:|-----:|-------:|---------:|-------:|-------:|-------:|-------:|--------:|------------------:|\n| 1930| 2608| 2781| 3197| 3357| 3302| 3085| 5185| 3990| 6121|\n:::\n:::\n\n:::\n\nNotice in this example how `pivot_wider()` has been used at the **very end of the code sequence** to convert the summarized data into a shape that **offers a better tabular presentation for a report**.\n\n::: callout-tip\n### Note\n\nIn the `pivot_wider()` call, you first specify the name of the column to use for the new column names (`income` in this example) and then specify the column to use for the cell values (`total_respondents` here).\n:::\n\n::: callout-tip\n### Example of `pivot_longer()`\n\nLet's try another dataset. 
This dataset contains an excerpt of the [Gapminder data](https://cran.r-project.org/web/packages/gapminder/README.html#gapminder) on life expectancy, GDP per capita, and population by country.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nIf we wanted to make `lifeExp`, `pop` and `gdpPercap` (all measurements that we observe) go from a wide table into a long table, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nOne more! 
Try using `pivot_longer()` to convert the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.\n\nAfterward, use `group_by()` and `summarize()` to calculate the average revenue for each company across all years and all quarters.\n\n**Bonus**: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n \"company\" = rep(1:3, each=4), \n \"year\" = rep(2006:2009, 3),\n \"Q1\" = sample(x = 0:100, size = 12),\n \"Q2\" = sample(x = 0:100, size = 12),\n \"Q3\" = sample(x = 0:100, size = 12),\n \"Q4\" = sample(x = 0:100, size = 12),\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 6\n company year Q1 Q2 Q3 Q4\n \n 1 1 2006 8 55 99 17\n 2 1 2007 9 57 98 48\n 3 1 2008 20 40 77 24\n 4 1 2009 42 68 61 26\n 5 2 2006 100 84 13 3\n 6 2 2007 86 93 17 93\n 7 2 2008 97 83 62 62\n 8 2 2009 46 12 25 79\n 9 3 2006 5 48 81 41\n10 3 2007 53 73 73 34\n11 3 2008 81 39 49 84\n12 3 2009 90 69 30 56\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself \n```\n:::\n\n:::\n\n### `separate()` and `unite()`\n\nThe `tidyr` package also contains two useful functions:\n\n- `unite()`: combine contents of two or more columns into a single column\n- `separate()`: separate contents of a column into two or more columns\n\nFirst, we combine the first three columns into one new column using `unite()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>% \n unite(col=\"country_continent_year\", \n country:year, \n sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 4\n country_continent_year lifeExp pop gdpPercap\n \n 1 Afghanistan_Asia_1952 28.8 8425333 779.\n 2 Afghanistan_Asia_1957 30.3 9240934 821.\n 3 Afghanistan_Asia_1962 32.0 10267083 853.\n 4 Afghanistan_Asia_1967 34.0 11537966 836.\n 5 Afghanistan_Asia_1972 36.1 13079460 740.\n 6 Afghanistan_Asia_1977 38.4 14880372 786.\n 7 
Afghanistan_Asia_1982 39.9 12881816 978.\n 8 Afghanistan_Asia_1987 40.8 13867957 852.\n 9 Afghanistan_Asia_1992 41.7 16317921 649.\n10 Afghanistan_Asia_1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nNext, we show how to split that single column back into three columns with `separate()`, using the `col`, `into`, and `sep` arguments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>% \n unite(col=\"country_continent_year\", \n country:year, \n sep=\"_\") %>% \n separate(col=\"country_continent_year\", \n into=c(\"country\", \"continent\", \"year\"), \n sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Using prose, describe how the variables and observations are organised in a tidy dataset versus a non-tidy dataset.\n\n2. What do the `extra` and `fill` arguments do in `separate()`? Experiment with the various options for the following two toy datasets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(x = c(\"a,b,c\", \"d,e,f,g\", \"h,i,j\")) %>% \n separate(x, c(\"one\", \"two\", \"three\"))\n\ntibble(x = c(\"a,b,c\", \"d,e\", \"f,g,i\")) %>% \n separate(x, c(\"one\", \"two\", \"three\"))\n```\n:::\n\n\n3. Both `unite()` and `separate()` have a `remove` argument. What does it do? Why would you set it to `FALSE`?\n\n4. 
Compare and contrast `separate()` and `extract()`. Why are there three variations of separation (by position, by separator, and with groups), but only one `unite()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n- https://r4ds.had.co.nz/tidy-data.html\n- [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n gapminder * 1.0.0 2023-03-10 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 
2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"09 - Tidy data and the Tidyverse\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to tidy data and how to convert between wide and long data with the tidyr R package\"\ncategories: [module 2, week 2, R, programming, tidyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/09-tidy-data-and-the-tidyverse/index.qmd).*\n\n\n\n> \"Happy families are all alike; every unhappy family is unhappy in its own way.\" ---- Leo Tolstoy\n\n> \"Tidy datasets are all alike, but every messy dataset is messy in its own way.\" ---- Hadley Wickham\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n2. \n3. 
[tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Define tidy data\n- Be able to transform non-tidy data into tidy data\n- Be able to transform wide data into long data\n- Be able to separate character columns into multiple columns\n- Be able to unite multiple character columns into one column\n:::\n\n# Tidy data\n\nAs we learned in the last lesson, one unifying concept of the tidyverse is the notion of **tidy data**.\n\nAs defined by Hadley Wickham in his 2014 paper published in the *Journal of Statistical Software*, a [tidy dataset](https://www.jstatsoft.org/article/view/v059i10) has the following properties:\n\n1. Each variable forms a column.\n\n2. Each observation forms a row.\n\n3. Each type of observational unit forms a table.\n\n![Artwork by Allison Horst on tidy data](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg){width=\"80%\"}\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\nThe **purpose of defining tidy data** is to highlight the fact that **most data do not start out life as tidy**.\n\nIn fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).\n\n- Once a dataset is tidy, it **can be used as input into a variety of other functions** that may transform, model, or visualize the data.\n\n::: callout-tip\n### Example\n\nAs a quick example, consider the following data illustrating **religion and income survey data** with the number of respondents with income range in column name.\n\nThis is in a classic table format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\nrelig_income\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 18 × 11\n religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`\n \n 1 Agnostic 27 34 60 81 76 137 122\n 2 Atheist 12 27 37 52 35 70 73\n 3 Buddhist 27 21 30 34 33 58 62\n 4 Catholic 418 617 732 670 638 1116 949\n 5 Don’t k… 15 14 15 11 10 35 21\n 6 Evangel… 575 869 1064 982 881 1486 949\n 7 Hindu 1 9 7 9 11 34 47\n 8 Histori… 228 244 236 238 197 223 131\n 9 Jehovah… 20 27 24 24 21 30 15\n10 Jewish 19 19 25 25 30 95 69\n11 Mainlin… 289 495 619 655 651 1107 939\n12 Mormon 29 40 48 51 56 112 85\n13 Muslim 6 7 9 10 9 23 16\n14 Orthodox 13 17 23 32 32 47 38\n15 Other C… 9 7 11 13 13 14 18\n16 Other F… 20 33 40 46 49 63 46\n17 Other W… 5 2 3 4 2 7 3\n18 Unaffil… 217 299 374 365 341 528 407\n# ℹ 3 more variables: `$100-150k` , `>150k` ,\n# `Don't know/refused` \n```\n:::\n:::\n\n:::\n\nWhile this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy.\n\n**This format violates the tidy form** because there are variables in the columns.\n\n- In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.\n\nConverting this data to tidy format would give us\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n\nSome of these functions you have seen before, others might be new to you. 
Let's talk about each one in the context of the `tidyverse` R packages.\n\n# The \"Tidyverse\"\n\nThere are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and **the collection of packages is often referred to as the \"tidyverse\"** because of their **dependence on and presumption of tidy data**.\n\n::: callout-tip\n### Note\n\nA subset of the \"Tidyverse\" packages includes:\n\n- [ggplot2](https://cran.r-project.org/package=ggplot2): a plotting system based on the grammar of graphics\n\n- [magrittr](https://cran.r-project.org/package=magrittr): defines the `%>%` operator for chaining functions together in a series of operations on data\n\n- [dplyr](https://cran.r-project.org/package=dplyr): a suite of (fast) functions for working with data frames\n\n- [tidyr](https://cran.r-project.org/package=tidyr): easily tidy data with `pivot_wider()` and `pivot_longer()` functions (also `separate()` and `unite()`)\n\nA complete list can be found here ().\n:::\n\nWe will be using these packages quite a bit.\n\nThe \"tidyverse\" package can be used to install all of the packages in the tidyverse at once.\n\nFor example, instead of starting an R script with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(tidyr)\nlibrary(readr)\nlibrary(ggplot2)\n```\n:::\n\n\nYou can start with this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nIn the example above, let's talk about what we did using the `pivot_longer()` function.\n\nWe will also talk about `pivot_wider()`.\n\n### `pivot_longer()`\n\nThe `tidyr` package includes functions to convert a data frame between *long* and *wide* formats.\n\n- **Wide format** data tends to have different attributes or variables describing an observation placed in separate columns.\n- **Long format** data tends to have different attributes encoded as levels of a single 
variable, followed by another column that contains the values of the observation at those different levels.\n\n::: callout-tip\n### Example\n\nIn the section above, we showed an example that used `pivot_longer()` to convert data into a tidy format.\n\nThe **key problem** with the tidiness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.\n\nTo **fix this**, you can use the `pivot_longer()` function to **gather values spread across several columns into a single column**, here with the column names gathered into an `income` column.\n\n**Note**: when gathering, exclude any columns that you do not want \"gathered\" (`religion` in this case) by including the column names with a minus sign in the `pivot_longer()` function.\n\nFor example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Gather everything EXCEPT religion to tidy data\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 180 × 3\n religion income respondents\n \n 1 Agnostic <$10k 27\n 2 Agnostic $10-20k 34\n 3 Agnostic $20-30k 60\n 4 Agnostic $30-40k 81\n 5 Agnostic $40-50k 76\n 6 Agnostic $50-75k 137\n 7 Agnostic $75-100k 122\n 8 Agnostic $100-150k 109\n 9 Agnostic >150k 84\n10 Agnostic Don't know/refused 96\n# ℹ 170 more rows\n```\n:::\n:::\n\n:::\n\nEven if your data is in a tidy format, `pivot_longer()` is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.\n\n### `pivot_wider()`\n\nThe `pivot_wider()` function is less commonly needed to tidy data. 
It can, however, be useful for creating summary tables.\n\n::: callout-tip\n### Example\n\nYou use the `summarize()` function in `dplyr` to summarize the total number of respondents per income category.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrelig_income %>%\n pivot_longer(-religion, names_to = \"income\", values_to = \"respondents\") %>%\n mutate(religion = factor(religion), income = factor(income)) %>%\n group_by(income) %>%\n summarize(total_respondents = sum(respondents)) %>%\n pivot_wider(\n names_from = \"income\",\n values_from = \"total_respondents\"\n ) %>%\n knitr::kable()\n```\n\n::: {.cell-output-display}\n| <$10k| >150k| $10-20k| $100-150k| $20-30k| $30-40k| $40-50k| $50-75k| $75-100k| Don't know/refused|\n|-----:|-----:|-------:|---------:|-------:|-------:|-------:|-------:|--------:|------------------:|\n| 1930| 2608| 2781| 3197| 3357| 3302| 3085| 5185| 3990| 6121|\n:::\n:::\n\n:::\n\nNotice in this example how `pivot_wider()` has been used at the **very end of the code sequence** to convert the summarized data into a shape that **offers a better tabular presentation for a report**.\n\n::: callout-tip\n### Note\n\nIn the `pivot_wider()` call, you first specify the name of the column to use for the new column names (`income` in this example) and then specify the column to use for the cell values (`total_respondents` here).\n:::\n\n::: callout-tip\n### Example of `pivot_longer()`\n\nLet's try another dataset. 
This dataset contains an excerpt of the [Gapminder data](https://cran.r-project.org/web/packages/gapminder/README.html#gapminder) on life expectancy, GDP per capita, and population by country.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nIf we wanted to make `lifeExp`, `pop` and `gdpPercap` (all measurements that we observe) go from a wide table into a long table, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nOne more! 
Try using `pivot_longer()` to convert the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.\n\nAfterward, use `group_by()` and `summarize()` to calculate the average revenue for each company across all years and all quarters.\n\n**Bonus**: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble(\n \"company\" = rep(1:3, each = 4),\n \"year\" = rep(2006:2009, 3),\n \"Q1\" = sample(x = 0:100, size = 12),\n \"Q2\" = sample(x = 0:100, size = 12),\n \"Q3\" = sample(x = 0:100, size = 12),\n \"Q4\" = sample(x = 0:100, size = 12),\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 6\n company year Q1 Q2 Q3 Q4\n \n 1 1 2006 99 6 54 47\n 2 1 2007 28 79 90 9\n 3 1 2008 7 72 69 24\n 4 1 2009 16 56 6 100\n 5 2 2006 42 58 75 25\n 6 2 2007 64 1 100 6\n 7 2 2008 43 88 37 77\n 8 2 2009 95 74 17 44\n 9 3 2006 34 47 77 38\n10 3 2007 73 31 31 54\n11 3 2008 4 49 93 0\n12 3 2009 57 4 45 96\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### `separate()` and `unite()`\n\nThe same `tidyr` package also contains two useful functions:\n\n- `unite()`: combine contents of two or more columns into a single column\n- `separate()`: separate contents of a column into two or more columns\n\nFirst, we combine the first three columns into one new column using `unite()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>%\n unite(\n col = \"country_continent_year\",\n country:year,\n sep = \"_\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 4\n country_continent_year lifeExp pop gdpPercap\n \n 1 Afghanistan_Asia_1952 28.8 8425333 779.\n 2 Afghanistan_Asia_1957 30.3 9240934 821.\n 3 Afghanistan_Asia_1962 32.0 10267083 853.\n 4 Afghanistan_Asia_1967 34.0 11537966 836.\n 5 Afghanistan_Asia_1972 36.1 13079460 740.\n 6 Afghanistan_Asia_1977 38.4 14880372 786.\n 7 
Afghanistan_Asia_1982 39.9 12881816 978.\n 8 Afghanistan_Asia_1987 40.8 13867957 852.\n 9 Afghanistan_Asia_1992 41.7 16317921 649.\n10 Afghanistan_Asia_1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\nNext, we show how to split that single column back into three separate columns with `separate()`, using the `col`, `into`, and `sep` arguments.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngapminder %>%\n unite(\n col = \"country_continent_year\",\n country:year,\n sep = \"_\"\n ) %>%\n separate(\n col = \"country_continent_year\",\n into = c(\"country\", \"continent\", \"year\"),\n sep = \"_\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,704 × 6\n country continent year lifeExp pop gdpPercap\n \n 1 Afghanistan Asia 1952 28.8 8425333 779.\n 2 Afghanistan Asia 1957 30.3 9240934 821.\n 3 Afghanistan Asia 1962 32.0 10267083 853.\n 4 Afghanistan Asia 1967 34.0 11537966 836.\n 5 Afghanistan Asia 1972 36.1 13079460 740.\n 6 Afghanistan Asia 1977 38.4 14880372 786.\n 7 Afghanistan Asia 1982 39.9 12881816 978.\n 8 Afghanistan Asia 1987 40.8 13867957 852.\n 9 Afghanistan Asia 1992 41.7 16317921 649.\n10 Afghanistan Asia 1997 41.8 22227415 635.\n# ℹ 1,694 more rows\n```\n:::\n:::\n\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Using prose, describe how the variables and observations are organised in a tidy dataset versus a non-tidy dataset.\n\n2. What do the `extra` and `fill` arguments do in `separate()`? Experiment with the various options for the following two toy datasets.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(x = c(\"a,b,c\", \"d,e,f,g\", \"h,i,j\")) %>%\n separate(x, c(\"one\", \"two\", \"three\"))\n\ntibble(x = c(\"a,b,c\", \"d,e\", \"f,g,i\")) %>%\n separate(x, c(\"one\", \"two\", \"three\"))\n```\n:::\n\n\n3. Both `unite()` and `separate()` have a `remove` argument. What does it do? 
Why would you set it to FALSE?\n\n4. Compare and contrast `separate()` and `extract()`. Why are there three variations of separation (by position, by separator, and with groups), but only one `unite()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software\n- https://r4ds.had.co.nz/tidy-data.html\n- [tidyr cheat sheet from RStudio](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)\n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n gapminder * 1.0.0 2023-03-10 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] 
CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json b/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json index f61dc52..e81da47 100644 --- a/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json +++ 
b/_freeze/posts/10-joining-data-in-r/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "9484ca31dfe45edc4a74b72df287baad", + "hash": "c065e9bff64a00dbd0bc879eea567da8", "result": { - "markdown": "---\ntitle: \"10 - Joining data in R\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to relational data and join functions in the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/10-joining-data-in-r/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define relational data and keys\n- Be able to define the three types of join functions for relational data\n- Be able to implement mutational join functions\n:::\n\n# Relational data\n\nData analyses rarely involve only a single table of data.\n\nTypically you have many tables of data, and you **must combine the datasets** to answer the questions that you are interested in.\n\nCollectively, **multiple tables of data are called relational data** because it is the *relations*, not just the individual datasets, that are important.\n\nRelations are **always defined between a pair of tables**. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.\n\nSometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.\n\nTo work with relational data you **need verbs that work with pairs of tables**.\n\n::: callout-tip\n### Three important families of verbs\n\nThere are three families of verbs designed to work with relational data:\n\n- [**Mutating joins**](https://r4ds.had.co.nz/relational-data.html#mutating-joins): A mutating join allows you to **combine variables from two tables**. It first matches observations by their keys, then copies across variables from one table to the other on the right side of the table (similar to `mutate()`). We will discuss a few of these below.\n - See @sec-mutjoins for Table of mutating joins.\n- [**Filtering joins**](https://r4ds.had.co.nz/relational-data.html#filtering-joins): Filtering joins **match observations** in the same way as mutating joins, **but affect the observations, not the variables** (i.e. 
filter observations from one data frame based on whether or not they match an observation in the other).\n - Two types: `semi_join(x, y)` and `anti_join(x, y)`.\n- [**Set operations**](https://r4ds.had.co.nz/relational-data.html#set-operations): Treat **observations as if they were set elements**. Typically used less frequently, but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:\n - Examples of set operations: `intersect(x, y)`, `union(x, y)`, and `setdiff(x, y)`.\n:::\n\n## Keys\n\nThe **variables used to connect each pair of tables** are called **keys**. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation.\n\n::: callout-tip\n### Note\n\nThere are two types of keys:\n\n- A **primary key** uniquely identifies an observation in its own table.\n- A **foreign key** uniquely identifies an observation in another table.\n:::\n\nLet's consider an example to help us understand the difference between a **primary key** and **foreign key**.\n\n## Example of keys\n\nImagine you are conducting a study and **collecting data on subjects and a health outcome**.\n\nOften, subjects will **make multiple visits** (a so-called longitudinal study) and so we will record the outcome for each visit. 
Similarly, we may record other information about them, such as the kind of housing they live in.\n\n### The first table\n\nThis code creates a simple table with some made-up data about some hypothetical subjects' outcomes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\noutcomes <- tibble(\n id = rep(c(\"a\", \"b\", \"c\"), each = 3),\n visit = rep(0:2, 3),\n outcome = rnorm(3 * 3, 3)\n)\n\nprint(outcomes)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 5.04\n2 a 1 2.18\n3 a 2 4.33\n4 b 0 3.33\n5 b 1 2.32\n6 b 2 2.06\n7 c 0 1.98\n8 c 1 1.28\n9 c 2 2.29\n```\n:::\n:::\n\n\nNote that subjects are labeled by a unique identifier in the `id` column.\n\n### A second table\n\nHere is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects' housing situation by recording the type of house they live in.\n\n\n::: {.cell exercise='true'}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\")\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\n::: callout-note\n### Question\n\nWhat are the **primary key** and **foreign key**?\n\n- The `outcomes$id` is a **primary key** because it uniquely identifies each subject in the `outcomes` table.\n- The `subjects$id` is a **foreign key** because it appears in the `subjects` table where it matches each subject to a unique `id`.\n:::\n\n# Mutating joins {#sec-mutjoins}\n\nThe `dplyr` package provides a set of **functions for joining two data frames** into a single data frame based on a set of key columns.\n\nThere are several functions in the `*_join()` family.\n\n- These functions all merge together two data frames\n- They differ in how they handle observations that exist in one but not 
both data frames.\n\nHere, are the **four functions from this family** that you will likely use the most often:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n|Function |What it includes in merged data frame |\n|:--------------|:---------------------------------------------------------------------------------------------------------|\n|`left_join()` |Includes all observations in the left data frame, whether or not there is a match in the right data frame |\n|`right_join()` |Includes all observations in the right data frame, whether or not there is a match in the left data frame |\n|`inner_join()` |Includes only observations that are in both data frames |\n|`full_join()` |Includes all observations from both data frames |\n:::\n:::\n\n\n![](https://d33wubrfki0l68.cloudfront.net/aeab386461820b029b7e7606ccff1286f623bae1/ef0d4/diagrams/join-venn.png)\n\n\\[[Source from R for Data Science](https://r4ds.had.co.nz/relational-data#relational-data)\\]\n\n## Left Join\n\nRecall the `outcomes` and `subjects` datasets above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\noutcomes\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 5.04\n2 a 1 2.18\n3 a 2 4.33\n4 b 0 3.33\n5 b 1 2.32\n6 b 2 2.06\n7 c 0 1.98\n8 c 1 1.28\n9 c 2 2.29\n```\n:::\n\n```{.r .cell-code}\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\nSuppose we want to create a table that combines the information about houses (`subjects`) with the information about the outcomes (`outcomes`).\n\nWe can use the `left_join()` function to merge the `outcomes` and `subjects` tables and produce the output above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = \"id\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 detached\n2 a 1 2.18 detached\n3 a 2 4.33 
detached\n4 b 0 3.33 rowhouse\n5 b 1 2.32 rowhouse\n6 b 2 2.06 rowhouse\n7 c 0 1.98 rowhouse\n8 c 1 1.28 rowhouse\n9 c 2 2.29 rowhouse\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `by` argument indicates the column (or columns) that the two tables have in common.\n:::\n\n### Left Join with Incomplete Data\n\nIn the previous examples, the `subjects` table didn't have a `visit` column. But what if it did? Maybe people move around during the study. We could imagine a table like this one.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n visit = c(0, 1, 0),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\"),\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n id visit house \n \n1 a 0 detached\n2 b 1 rowhouse\n3 c 0 rowhouse\n```\n:::\n:::\n\n\nWhen we left join the tables now, we get:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(outcomes, subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 detached\n2 a 1 2.18 \n3 a 2 4.33 \n4 b 0 3.33 \n5 b 1 2.32 rowhouse\n6 b 2 2.06 \n7 c 0 1.98 rowhouse\n8 c 1 1.28 \n9 c 2 2.29 \n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTwo things to point out here:\n\n1. If we do not have information about a subject's housing in a given visit, the `left_join()` function automatically inserts an `NA` value to indicate that it is missing.\n\n2. We can \"join\" on multiple variables (e.g. here we joined on the `id` and the `visit` columns).\n:::\n\nWe may even have a situation where we are missing housing data for a subject completely. 
The following table has no information about subject `a`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"b\", \"c\"),\n visit = c(1, 0),\n house = c(\"rowhouse\", \"rowhouse\"),\n)\n\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 3\n id visit house \n \n1 b 1 rowhouse\n2 c 0 rowhouse\n```\n:::\n:::\n\n\nBut we can still join the tables together and the `house` values for subject `a` will all be `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 5.04 \n2 a 1 2.18 \n3 a 2 4.33 \n4 b 0 3.33 \n5 b 1 2.32 rowhouse\n6 b 2 2.06 \n7 c 0 1.98 rowhouse\n8 c 1 1.28 \n9 c 2 2.29 \n```\n:::\n:::\n\n\n::: callout-tip\n### Important\n\nThe bottom line for `left_join()` is that it **always retains the values in the \"left\" argument** (in this case the `outcomes` table).\n\n- If there are no corresponding values in the \"right\" argument, `NA` values will be filled in.\n:::\n\n## Inner Join\n\nThe `inner_join()` function only **retains the rows of both tables** that have corresponding values. 
Here we can see the difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninner_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.32 rowhouse\n2 c 0 1.98 rowhouse\n```\n:::\n:::\n\n\n## Right Join\n\nThe `right_join()` function is like the `left_join()` function except that it **gives priority to the \"right\" hand argument**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nright_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.32 rowhouse\n2 c 0 1.98 rowhouse\n```\n:::\n:::\n\n\n# Summary\n\n- `left_join()` is useful for merging a \"large\" data frame with a \"smaller\" one while retaining all the rows of the \"large\" data frame\n\n- `inner_join()` gives you the intersection of the rows between two data frames\n\n- `right_join()` is like `left_join()` with the arguments reversed (likely only useful at the end of a pipeline)\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. If you had three data frames to combine with a shared key, how would you join them using the verbs you now know?\n\n2. Using `df1` and `df2` below, what is the difference between `inner_join(df1, df2)`, `semi_join(df1, df2)` and `anti_join(df1, df2)`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create first example data frame\ndf1 <- data.frame(ID = 1:3,\n X1 = c(\"a1\", \"a2\", \"a3\"))\n# Create second example data frame\ndf2 <- data.frame(ID = 2:4, \n X2 = c(\"b1\", \"b2\", \"b3\"))\n```\n:::\n\n\n3. Try changing the order from the above e.g. `inner_join(df2, df1)`, `semi_join(df2, df1)` and `anti_join(df2, df1)`. What changed? 
What did not change?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr * 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 
2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"10 - Joining data in R\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to relational data and join functions in the dplyr R package\"\ncategories: [module 2, week 2, R, programming, dplyr, here, tidyverse]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/10-joining-data-in-r/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to define relational data and keys\n- Be able to define the three types of join functions for relational data\n- Be able to implement mutating join functions\n:::\n\n# Relational data\n\nData analyses rarely involve only a single table of data.\n\nTypically you have many tables of data, and you **must combine the datasets** to answer the questions that you are interested in.\n\nCollectively, **multiple tables of data are called relational data** because it is the *relations*, not just the individual datasets, that are important.\n\nRelations are **always defined between a pair of tables**. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.\n\nSometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.\n\nTo work with relational data you **need verbs that work with pairs of tables**.\n\n::: callout-tip\n### Three important families of verbs\n\nThere are three families of verbs designed to work with relational data:\n\n- [**Mutating joins**](https://r4ds.had.co.nz/relational-data.html#mutating-joins): A mutating join allows you to **combine variables from two tables**. 
It first matches observations by their keys, then copies variables from one table into the other, adding them as new columns on the right (similar to `mutate()`). We will discuss a few of these below.\n - See @sec-mutjoins for a table of mutating joins.\n- [**Filtering joins**](https://r4ds.had.co.nz/relational-data.html#filtering-joins): Filtering joins **match observations** in the same way as mutating joins, **but affect the observations, not the variables** (i.e. filter observations from one data frame based on whether or not they match an observation in the other).\n - Two types: `semi_join(x, y)` and `anti_join(x, y)`.\n- [**Set operations**](https://r4ds.had.co.nz/relational-data.html#set-operations): Treat **observations as if they were set elements**. Used less frequently, but occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:\n - Examples of set operations: `intersect(x, y)`, `union(x, y)`, and `setdiff(x, y)`.\n:::\n\n## Keys\n\nThe **variables used to connect each pair of tables** are called **keys**. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation.\n\n::: callout-tip\n### Note\n\nThere are two types of keys:\n\n- A **primary key** uniquely identifies an observation in its own table.\n- A **foreign key** uniquely identifies an observation in another table.\n:::\n\nLet's consider an example to help us understand the difference between a **primary key** and a **foreign key**.\n\n## Example of keys\n\nImagine you are conducting a study and **collecting data on subjects and a health outcome**.\n\nOften, subjects will **make multiple visits** (a so-called longitudinal study) and so we will record the outcome for each visit. 
Similarly, we may record other information about them, such as the kind of housing they live in.\n\n### The first table\n\nThis code creates a simple table with some made-up data about some hypothetical subjects' outcomes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\noutcomes <- tibble(\n id = rep(c(\"a\", \"b\", \"c\"), each = 3),\n visit = rep(0:2, 3),\n outcome = rnorm(3 * 3, 3)\n)\n\nprint(outcomes)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 3.07\n2 a 1 3.25\n3 a 2 3.93\n4 b 0 2.18\n5 b 1 2.91\n6 b 2 2.83\n7 c 0 1.49\n8 c 1 2.56\n9 c 2 1.46\n```\n:::\n:::\n\n\nNote that subjects are labeled by a unique identifier in the `id` column.\n\n### A second table\n\nHere is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects' housing situation by recording the type of house they live in.\n\n\n::: {.cell exercise='true'}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\")\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\n::: callout-note\n### Question\n\nWhat is the **primary key** and **foreign key**?\n\n- The `subjects$id` is a **primary key** because it uniquely identifies each subject in the `subjects` table.\n- The `outcomes$id` is a **foreign key** because it appears in the `outcomes` table, where it matches each observation to a unique subject in `subjects`.\n:::\n\n# Mutating joins {#sec-mutjoins}\n\nThe `dplyr` package provides a set of **functions for joining two data frames** into a single data frame based on a set of key columns.\n\nThere are several functions in the `*_join()` family.\n\n- These functions all merge together two data frames\n- They differ in how they handle observations that exist in one but not 
both data frames.\n\nHere are the **four functions from this family** that you will likely use most often:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n|Function |What it includes in merged data frame |\n|:--------------|:---------------------------------------------------------------------------------------------------------|\n|`left_join()` |Includes all observations in the left data frame, whether or not there is a match in the right data frame |\n|`right_join()` |Includes all observations in the right data frame, whether or not there is a match in the left data frame |\n|`inner_join()` |Includes only observations that are in both data frames |\n|`full_join()` |Includes all observations from both data frames |\n:::\n:::\n\n\n![](https://d33wubrfki0l68.cloudfront.net/aeab386461820b029b7e7606ccff1286f623bae1/ef0d4/diagrams/join-venn.png)\n\n\\[[Source from R for Data Science](https://r4ds.had.co.nz/relational-data#relational-data)\\]\n\n## Left Join\n\nRecall the `outcomes` and `subjects` datasets above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\noutcomes\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 3\n id visit outcome\n \n1 a 0 3.07\n2 a 1 3.25\n3 a 2 3.93\n4 b 0 2.18\n5 b 1 2.91\n6 b 2 2.83\n7 c 0 1.49\n8 c 1 2.56\n9 c 2 1.46\n```\n:::\n\n```{.r .cell-code}\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n id house \n \n1 a detached\n2 b rowhouse\n3 c rowhouse\n```\n:::\n:::\n\n\nSuppose we want to create a table that combines the information about houses (`subjects`) with the information about the outcomes (`outcomes`).\n\nWe can use the `left_join()` function to merge the `outcomes` and `subjects` tables and produce the table shown below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = \"id\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 detached\n2 a 1 3.25 detached\n3 a 2 3.93 
detached\n4 b 0 2.18 rowhouse\n5 b 1 2.91 rowhouse\n6 b 2 2.83 rowhouse\n7 c 0 1.49 rowhouse\n8 c 1 2.56 rowhouse\n9 c 2 1.46 rowhouse\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `by` argument indicates the column (or columns) that the two tables have in common.\n:::\n\n### Left Join with Incomplete Data\n\nIn the previous examples, the `subjects` table didn't have a `visit` column. But what if it did? Maybe people move around during the study. We could imagine a table like this one.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"a\", \"b\", \"c\"),\n visit = c(0, 1, 0),\n house = c(\"detached\", \"rowhouse\", \"rowhouse\"),\n)\n\nprint(subjects)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n id visit house \n \n1 a 0 detached\n2 b 1 rowhouse\n3 c 0 rowhouse\n```\n:::\n:::\n\n\nWhen we left join the tables now, we get:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(outcomes, subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 detached\n2 a 1 3.25 \n3 a 2 3.93 \n4 b 0 2.18 \n5 b 1 2.91 rowhouse\n6 b 2 2.83 \n7 c 0 1.49 rowhouse\n8 c 1 2.56 \n9 c 2 1.46 \n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nTwo things to point out here:\n\n1. If we do not have information about a subject's housing in a given visit, the `left_join()` function automatically inserts an `NA` value to indicate that it is missing.\n\n2. We can \"join\" on multiple variables (e.g. here we joined on the `id` and the `visit` columns).\n:::\n\nWe may even have a situation where we are missing housing data for a subject completely. 
The following table has no information about subject `a`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsubjects <- tibble(\n id = c(\"b\", \"c\"),\n visit = c(1, 0),\n house = c(\"rowhouse\", \"rowhouse\"),\n)\n\nsubjects\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 3\n id visit house \n \n1 b 1 rowhouse\n2 c 0 rowhouse\n```\n:::\n:::\n\n\nBut we can still join the tables together and the `house` values for subject `a` will all be `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleft_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n id visit outcome house \n \n1 a 0 3.07 \n2 a 1 3.25 \n3 a 2 3.93 \n4 b 0 2.18 \n5 b 1 2.91 rowhouse\n6 b 2 2.83 \n7 c 0 1.49 rowhouse\n8 c 1 2.56 \n9 c 2 1.46 \n```\n:::\n:::\n\n\n::: callout-tip\n### Important\n\nThe bottom line for `left_join()` is that it **always retains the values in the \"left\" argument** (in this case the `outcomes` table).\n\n- If there are no corresponding values in the \"right\" argument, `NA` values will be filled in.\n:::\n\n## Inner Join\n\nThe `inner_join()` function only **retains the rows of both tables** that have corresponding values. 
Here we can see the difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninner_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.91 rowhouse\n2 c 0 1.49 rowhouse\n```\n:::\n:::\n\n\n## Right Join\n\nThe `right_join()` function is like the `left_join()` function except that it **gives priority to the \"right\" hand argument**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nright_join(x = outcomes, y = subjects, by = c(\"id\", \"visit\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 4\n id visit outcome house \n \n1 b 1 2.91 rowhouse\n2 c 0 1.49 rowhouse\n```\n:::\n:::\n\n\n# Summary\n\n- `left_join()` is useful for merging a \"large\" data frame with a \"smaller\" one while retaining all the rows of the \"large\" data frame\n\n- `inner_join()` gives you the intersection of the rows between two data frames\n\n- `right_join()` is like `left_join()` with the arguments reversed (likely only useful at the end of a pipeline)\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. If you had three data frames to combine with a shared key, how would you join them using the verbs you now know?\n\n2. Using `df1` and `df2` below, what is the difference between `inner_join(df1, df2)`, `semi_join(df1, df2)` and `anti_join(df1, df2)`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create first example data frame\ndf1 <- data.frame(\n ID = 1:3,\n X1 = c(\"a1\", \"a2\", \"a3\")\n)\n# Create second example data frame\ndf2 <- data.frame(\n ID = 2:4,\n X2 = c(\"b1\", \"b2\", \"b3\")\n)\n```\n:::\n\n\n3. Try changing the order from the above e.g. `inner_join(df2, df1)`, `semi_join(df2, df1)` and `anti_join(df2, df1)`. What changed? 
What did not change?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr * 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 
2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/11-plotting-systems/index/execute-results/html.json b/_freeze/posts/11-plotting-systems/index/execute-results/html.json index e10a6cd..d091bad 100644 --- a/_freeze/posts/11-plotting-systems/index/execute-results/html.json +++ b/_freeze/posts/11-plotting-systems/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "33dd70dca1357e6e4f0495a8ce6313a1", + "hash": "6c6916c9091b2c6d74a4ff7c2b8b7db1", "result": { - "markdown": "---\ntitle: \"11 - Plotting Systems\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview of three plotting systems in R\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/11-plotting-systems/index.qmd).*\n\n> The data may not contain the answer. And, if you torture the data long enough, it will tell you anything. ---*John W. Tukey*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. Paul Murrell (2011). *R Graphics*, CRC Press.\n3. Hadley Wickham (2009). *ggplot2*, Springer.\n4. Deepayan Sarkar (2008). *Lattice: Multivariate Data Visualization with R*, Springer.\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to identify and describe the three plotting systems in R\n:::\n\n# Plotting Systems\n\nThere are **three different plotting systems in R** and they each have different characteristics and modes of operation.\n\n::: callout-tip\n### Important\n\nThe three systems are\n\n1. The base plotting system\n2. The lattice system\n3. The ggplot2 system\n\n**This course will focus primarily on the ggplot2 plotting system**. The other two systems are presented for context.\n:::\n\n## The Base Plotting System\n\nThe **base plotting system** is the original plotting system for R. 
The basic model is sometimes **referred to as the \"artist's palette\" model**.\n\nThe idea is that you start with a blank canvas and build up from there.\n\nIn more R-specific terms, you **typically start with the `plot()` function** (or a similar plot-creating function) to *initiate* a plot and then *annotate* the plot with various annotation functions (`text()`, `lines()`, `points()`, `axis()`).\n\nThe base plotting system is **often the most convenient plotting system** to use because it mirrors how we sometimes think of building plots and analyzing data.\n\nIf we do not have a completely well-formed idea of how we want to look at some data, often we will start by \"throwing some data on the page\" and then slowly add more information to it as our thought process evolves.\n\n::: callout-tip\n### Example\n\nWe might look at a simple scatterplot and then decide to add a linear regression line or a smoother to it to highlight the trends.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n:::\n\nIn the code above:\n\n- The `plot()` function creates the initial plot and draws the points (circles) on the canvas.\n- The `lines()` function is used to annotate or add to the plot (in this case it adds a loess smoother to the scatterplot).\n\nNext, we use the `plot()` function to draw the points on the scatterplot and then use the `main` argument to add a main title to the plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone, main = \"my plot\")\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-2-1.png){width=480}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nOne downside with constructing base plots is that you **cannot go backwards 
once the plot has started**.\n\nIt is possible that you could start down the road of constructing a plot and realize later (when it is too late) that you do not have enough room to add a y-axis label or something like that.\n:::\n\nIf you have a specific plot in mind, there is then a need to **plan in advance** to make sure, for example, that you have set your margins to be the right size to fit all of the annotations that you may want to include.\n\nWhile the base plotting system is nice in that it gives you the flexibility to specify these kinds of details to painstaking accuracy, **sometimes it would be nice if the system could just figure it out for you**.\n\n::: callout-tip\n### Note\n\nAnother downside of the base plotting system is that it is **difficult to describe or translate a plot to others because there is no clear graphical language or grammar** that can be used to communicate what you have done.\n\nThe only real way to describe what you have done in a base plot is to just list the series of commands/functions that you have executed, which is not a particularly compact way of communicating things.\n\nThis is one problem that the `ggplot2` package attempts to address.\n:::\n\n::: callout-tip\n### Example\n\nAnother typical base plot is constructed with the following code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(cars)\n\n## Create the plot / draw canvas\nwith(cars, plot(speed, dist))\n\n## Add annotation\ntitle(\"Speed vs. 
Stopping distance\")\n```\n\n::: {.cell-output-display}\n![Base plot with title](index_files/figure-html/unnamed-chunk-3-1.png){width=480}\n:::\n:::\n\n:::\n\nWe will go into more detail on what these functions do in later lessons.\n\n## The Lattice System\n\nThe **lattice plotting system** is implemented in the `lattice` R package, which comes with every installation of R (although it is not loaded by default).\n\nTo **use the lattice plotting functions**, you must first load the `lattice` package with the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lattice)\n```\n:::\n\n\nWith the lattice system, **plots are created with a single function call**, such as `xyplot()` or `bwplot()`.\n\nThere is **no real distinction between functions that create or initiate plots** and **functions that annotate plots** because it all happens at once.\n\nLattice plots tend to be **most useful for conditioning types of plots**, i.e. looking at how `y` changes with `x` across levels of `z`.\n\n- e.g. 
these types of plots are useful for looking at multi-dimensional data and often allow you to squeeze a lot of information into a single window or page.\n\nAnother aspect of lattice that makes it different from base plotting is that **things like margins and spacing are set automatically**.\n\nThis is possible because the entire plot is specified at once via a single function call, so all of the available information needed to figure out the spacing and margins is already there.\n\n::: callout-tip\n### Example\n\nHere is a lattice plot that looks at the relationship between life expectancy and income and how that relationship varies by region in the United States.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstate <- data.frame(state.x77, region = state.region)\nxyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))\n```\n\n::: {.cell-output-display}\n![Lattice plot](index_files/figure-html/unnamed-chunk-5-1.png){width=768}\n:::\n:::\n\n:::\n\nYou can see that the entire plot was generated by the call to `xyplot()` and all of the data for the plot were stored in the `state` data frame.\n\nThe **plot itself contains four panels**---one for each region---and **within each panel is a scatterplot** of life expectancy and income.\n\nThe notion of *panels* comes up a lot with lattice plots because you typically have many panels in a lattice plot (each panel typically represents a *condition*, like \"region\").\n\n::: callout-tip\n### Note\n\nDownsides with the lattice system:\n\n- It can sometimes be very **awkward to specify an entire plot** in a single function call (you end up with functions with many, many arguments).\n- **Annotation of panels in plots is not especially intuitive** and can be difficult to explain. 
In particular, the use of custom panel functions and subscripts can be difficult to wield and requires intense preparation.\n- Once a plot is created, **you cannot \"add\" to the plot** (but of course you can just make it again with modifications).\n:::\n\n## The ggplot2 System\n\nThe **ggplot2 plotting system** attempts to split the difference between base and lattice in a number of ways.\n\n::: callout-tip\n### Note\n\nTaking cues from lattice, the ggplot2 system automatically deals with spacings, text, titles but also allows you to annotate by \"adding\" to a plot.\n:::\n\nThe ggplot2 system is implemented in the `ggplot2` package (part of the `tidyverse` package), which is available from CRAN (it does not come with R).\n\nYou can install it from CRAN via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand then load it into R via the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n```\n:::\n\n\nSuperficially, the `ggplot2` functions are similar to `lattice`, but the system is generally easier and more intuitive to use.\n\nThe defaults used in `ggplot2` make many choices for you, but you can still customize plots to your heart's desire.\n\n::: callout-tip\n### Example\n\nA typical plot with the `ggplot2` package looks as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata(mpg)\nmpg %>%\n ggplot(aes(displ, hwy)) + \n geom_point()\n```\n\n::: {.cell-output-display}\n![ggplot2 plot](index_files/figure-html/unnamed-chunk-8-1.png){width=576}\n:::\n:::\n\n:::\n\nThere are additional functions in `ggplot2` that allow you to make arbitrarily sophisticated plots.\n\nWe will discuss more about this in the next lecture.\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info 
───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice * 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 
2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"11 - Plotting Systems\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Overview of three plotting systems in R\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/11-plotting-systems/index.qmd).*\n\n> The data may not contain the answer. And, if you torture the data long enough, it will tell you anything. ---*John W. 
Tukey*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. Paul Murrell (2011). *R Graphics*, CRC Press.\n3. Hadley Wickham (2009). *ggplot2*, Springer.\n4. Deepayan Sarkar (2008). *Lattice: Multivariate Data Visualization with R*, Springer.\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to identify and describe the three plotting systems in R\n:::\n\n# Plotting Systems\n\nThere are **three different plotting systems in R** and they each have different characteristics and modes of operation.\n\n::: callout-tip\n### Important\n\nThe three systems are\n\n1. The base plotting system\n2. The lattice system\n3. The ggplot2 system\n\n**This course will focus primarily on the ggplot2 plotting system**. The other two systems are presented for context.\n:::\n\n## The Base Plotting System\n\nThe **base plotting system** is the original plotting system for R. 
The basic model is sometimes **referred to as the \"artist's palette\" model**.\n\nThe idea is you start with a blank canvas and build up from there.\n\nIn more R-specific terms, you **typically start with the `plot()` function** (or a similar plot-creating function) to *initiate* a plot and then *annotate* the plot with various annotation functions (`text`, `lines`, `points`, `axis`).\n\nThe base plotting system is **often the most convenient plotting system** to use because it mirrors how we sometimes think of building plots and analyzing data.\n\nIf we do not have a completely well-formed idea of how we want to look at some data, often we will start by \"throwing some data on the page\" and then slowly add more information to it as our thought process evolves.\n\n::: callout-tip\n### Example\n\nWe might look at a simple scatterplot and then decide to add a linear regression line or a smoother to it to highlight the trends.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n:::\n\nIn the code above:\n\n- The `plot()` function creates the initial plot and draws the points (circles) on the canvas.\n- The `lines()` function is used to annotate or add to the plot (in this case it adds a loess smoother to the scatterplot).\n\nNext, we use the `plot()` function to draw the points on the scatterplot and then use the `main` argument to add a main title to the plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(airquality)\nwith(airquality, {\n plot(Temp, Ozone, main = \"my plot\")\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot with loess curve](index_files/figure-html/unnamed-chunk-2-1.png){width=480}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nOne downside with constructing base plots is that you **cannot go backwards 
once the plot has started**.\n\nIt is possible that you could start down the road of constructing a plot and realize later (when it is too late) that you do not have enough room to add a y-axis label or something like that.\n:::\n\nIf you have a specific plot in mind, there is then a need to **plan in advance** to make sure, for example, that you have set your margins to be the right size to fit all of the annotations that you may want to include.\n\nWhile the base plotting system is nice in that it gives you the flexibility to specify these kinds of details to painstaking accuracy, **sometimes it would be nice if the system could just figure it out for you**.\n\n::: callout-tip\n### Note\n\nAnother downside of the base plotting system is that it is **difficult to describe or translate a plot to others because there is no clear graphical language or grammar** that can be used to communicate what you have done.\n\nThe only real way to describe what you have done in a base plot is to just list the series of commands/functions that you have executed, which is not a particularly compact way of communicating things.\n\nThis is one problem that the `ggplot2` package attempts to address.\n:::\n\n::: callout-tip\n### Example\n\nAnother typical base plot is constructed with the following code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(cars)\n\n## Create the plot / draw canvas\nwith(cars, plot(speed, dist))\n\n## Add annotation\ntitle(\"Speed vs. 
Stopping distance\")\n```\n\n::: {.cell-output-display}\n![Base plot with title](index_files/figure-html/unnamed-chunk-3-1.png){width=480}\n:::\n:::\n\n:::\n\nWe will go into more detail on what these functions do in later lessons.\n\n## The Lattice System\n\nThe **lattice plotting system** is implemented in the `lattice` R package which comes with every installation of R (although it is not loaded by default).\n\nTo **use the lattice plotting functions**, you must first load the `lattice` package with the `library` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lattice)\n```\n:::\n\n\nWith the lattice system, **plots are created with a single function call**, such as `xyplot()` or `bwplot()`.\n\nThere is **no real distinction between functions that create or initiate plots** and **functions that annotate plots** because it all happens at once.\n\nLattice plots tend to be **most useful for conditioning types of plots**, i.e. looking at how `y` changes with `x` across levels of `z`.\n\n- e.g. 
these types of plots are useful for looking at multi-dimensional data and often allow you to squeeze a lot of information into a single window or page.\n\nAnother aspect of lattice that makes it different from base plotting is that **things like margins and spacing are set automatically**.\n\nThis is possible because the entire plot is specified at once via a single function call, so all of the available information needed to figure out the spacing and margins is already there.\n\n::: callout-tip\n### Example\n\nHere is a lattice plot that looks at the relationship between life expectancy and income and how that relationship varies by region in the United States.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstate <- data.frame(state.x77, region = state.region)\nxyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))\n```\n\n::: {.cell-output-display}\n![Lattice plot](index_files/figure-html/unnamed-chunk-5-1.png){width=768}\n:::\n:::\n\n:::\n\nYou can see that the entire plot was generated by the call to `xyplot()` and all of the data for the plot were stored in the `state` data frame.\n\nThe **plot itself contains four panels**---one for each region---and **within each panel is a scatterplot** of life expectancy and income.\n\nThe notion of *panels* comes up a lot with lattice plots because you typically have many panels in a lattice plot (each panel typically represents a *condition*, like \"region\").\n\n::: callout-tip\n### Note\n\nDownsides with the lattice system\n\n- It can sometimes be very **awkward to specify an entire plot** in a single function call (you end up with functions with many, many arguments).\n- **Annotation in panels in plots is not especially intuitive** and can be difficult to explain. 
In particular, the use of custom panel functions and subscripts can be difficult to wield and requires intense preparation.\n- Once a plot is created, **you cannot \"add\" to the plot** (but of course you can just make it again with modifications).\n:::\n\n## The ggplot2 System\n\nThe **ggplot2 plotting system** attempts to split the difference between base and lattice in a number of ways.\n\n::: callout-tip\n### Note\n\nTaking cues from lattice, the ggplot2 system automatically deals with spacings, text, titles but also allows you to annotate by \"adding\" to a plot.\n:::\n\nThe ggplot2 system is implemented in the `ggplot2` package (part of the `tidyverse` package), which is available from CRAN (it does not come with R).\n\nYou can install it from CRAN via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"ggplot2\")\n```\n:::\n\n\nand then load it into R via the `library()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggplot2)\n```\n:::\n\n\nSuperficially, the `ggplot2` functions are similar to `lattice`, but the system is generally easier and more intuitive to use.\n\nThe defaults used in `ggplot2` make many choices for you, but you can still customize plots to your heart's desire.\n\n::: callout-tip\n### Example\n\nA typical plot with the `ggplot2` package looks as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\ndata(mpg)\nmpg %>%\n ggplot(aes(displ, hwy)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![ggplot2 plot](index_files/figure-html/unnamed-chunk-8-1.png){width=576}\n:::\n:::\n\n:::\n\nThere are additional functions in `ggplot2` that allow you to make arbitrarily sophisticated plots.\n\nWe will discuss more about this in the next lecture.\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info 
───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice * 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 
2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json b/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json index a32e326..36e05af 100644 --- a/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json +++ b/_freeze/posts/12-ggplot2-plotting-system-part-1/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7a7a1e3161c0ff96ad2a3bef68a62f06", + "hash": "94c3dfcaf944f3b724e55205af55c601", "result": { - "markdown": "---\ntitle: \"12 - The ggplot2 plotting system: qplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R 
with qplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/12-ggplot2-plotting-system-part-1/index.qmd).*\n\n> \"The greatest value of a picture is when it forces us to notice what we never expected to see.\" ---John Tukey\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the difference between *aesthetics* and *geoms*\n- Become familiar with different types of plots (e.g. scatterplots, boxplots, and histograms)\n- Be able to facet plots into a grid\n:::\n\n# The ggplot2 Plotting System\n\nThe `ggplot2` package in R is **an implementation of *The Grammar of Graphics*** as described by Leland Wilkinson in his book. The **package was originally written by Hadley Wickham** while he was a graduate student at Iowa State University (he still actively maintains the package).\n\nThe package implements what might be considered a third graphics system for R (along with `base` graphics and `lattice`).\n\nThe package is available from [CRAN](http://cran.r-project.org/package=ggplot2) via `install.packages()`; the latest version of the source can be found on the package's [GitHub Repository](https://github.com/hadley/ggplot2). 
Documentation of the package can be found at [the tidyverse web site](https://ggplot2.tidyverse.org).\n\nThe **grammar of graphics** represents **an abstraction of graphics ideas and objects**.\n\nYou can think of this as **developing the verbs, nouns, and adjectives for data graphics**.\n\n::: callout-tip\n### Note\n\nDeveloping such a **grammar allows for a \"theory\" of graphics** on which to build new graphics and graphics objects.\n\nTo quote from Hadley Wickham's book on `ggplot2`, we want to \"shorten the distance from mind to page\". In summary,\n\n> \"...the grammar tells us that a statistical graphic is a **mapping** from data to **aesthetic** attributes (colour, shape, size) of **geometric** objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system\" -- from *ggplot2* book\n:::\n\nYou might ask yourself \"Why do we need a grammar of graphics?\".\n\nWell, for much the same reasons that **having a grammar is useful for spoken languages**. The grammar allows for\n\n- A more compact summary of the base components of a language\n- An extension of the language to handle situations that we have not before seen\n\nIf you think about making a plot with the base graphics system, the plot is **constructed by calling a series of functions that either create or annotate a plot**. 
There's **no convenient agreed-upon way to describe the plot**, except to just recite the series of R functions that were called to create the thing in the first place.\n\n::: callout-tip\n### Example\n\nConsider the following plot, made previously using base graphics.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(airquality, { \n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (base graphics)](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n\nHow would one **describe the creation of this plot**?\n\nWell, we could say that we called the `plot()` function and then added a loess smoother by calling the `lines()` function on the output of `loess.smooth()`.\n\nWhile the base plotting system is convenient and it often mirrors how we think of building plots and analyzing data, there are **drawbacks**:\n\n- You cannot go back once the plot has started (e.g. to adjust margins), so there is in fact a need to plan in advance.\n- It is difficult to \"translate\" a plot to others because there is no formal graphical language; each plot is just a series of R commands.\n:::\n\nHere is the same plot made using `ggplot2` in the `tidyverse`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nairquality %>%\n ggplot(aes(Temp, Ozone)) + \n geom_point() + \n geom_smooth(method = \"loess\", \n se = FALSE) + \n theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (ggplot2)](index_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe output is roughly equivalent and the amount of code is similar, but `ggplot2` allows for a more elegant way of expressing the components of the plot.\n\nIn this case, the plot is a **dataset** (`airquality`) with **aesthetic mappings** (visual properties of the objects in your plot) derived from the `Temp` and `Ozone` variables, a set of 
**points**, and a **smoother**.\n:::\n\nIn a sense, the `ggplot2` system takes many of the cues from the base plotting system and from the `lattice` plotting system, and formalizes the cues a bit.\n\nIt **automatically handles things like margins and spacing**, and also has the concept of \"themes\" which **provide a default set of plotting symbols and colors** (which are all customizable).\n\nWhile `ggplot2` bears a superficial similarity to `lattice`, `ggplot2` is generally easier and more intuitive to use.\n\n## The Basics: `qplot()`\n\nThe `qplot()` function in `ggplot2` is meant to get you going **q**uickly.\n\nIt works much like the `plot()` function in the base graphics system. It **looks for variables to plot within a data frame**, similar to lattice, or in the parent environment.\n\nIn general, it is good to get used to putting your data in a data frame and then passing it to `qplot()`.\n\n::: callout-tip\n### Pro tip\n\nThe `qplot()` function is **somewhat discouraged** in `ggplot2` now and new users are encouraged to use the more general `ggplot()` function (more details in the next lesson).\n\nHowever, the `qplot()` function is still useful and may be easier to use if transitioning from the base plotting system or a different statistical package.\n:::\n\nPlots are made up of\n\n- **aesthetics** (e.g. size, shape, color)\n- **geoms** (e.g. 
points, lines)\n\nFactors play an important role for indicating subsets of the data (if they are to have different properties) so they should be **labeled** properly.\n\nThe `qplot()` function hides much of what goes on underneath, which is okay for most operations, but `ggplot()` is the core function and is very flexible for doing things `qplot()` cannot do.\n\n## Before you start: label your data\n\nOne thing that is always true, but is particularly useful when using `ggplot2`, is that you should always **use informative and descriptive labels on your data**.\n\nMore generally, your data should have appropriate **metadata** so that you can quickly look at a dataset and know\n\n- what are the variables?\n- what do the values of each variable mean?\n\n::: callout-tip\n### Pro tip\n\n- **Each column** of a data frame should **have a meaningful (but concise) variable name** that accurately reflects the data stored in that column\n- Non-numeric or **categorical variables should be coded as factor variables** and have meaningful labels for each level of the factor.\n - It might be common to code a binary variable as a \"0\" or a \"1\", but the problem is that, from quickly looking at the data, it's impossible to know which level of that variable is represented by a \"0\" or a \"1\".\n - Much better to simply label each observation as what it is.\n - If a variable represents temperature categories, it might be better to use \"cold\", \"mild\", and \"hot\" rather than \"1\", \"2\", and \"3\".\n:::\n\nWhile it is sometimes a pain to make sure all of your data are properly labeled, this **investment in time can pay dividends down the road** when you're trying to figure out what you were plotting.\n\nIn other words, including the proper metadata can make your exploratory plots essentially self-documenting.\n\n## ggplot2 \"Hello, world!\"\n\nThis example dataset comes with the `ggplot2` package and contains data on the fuel economy of 38 popular car models from 1999 to 2008.\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse) # this loads the ggplot2 R package\n# library(ggplot2) # an alternative way to just load the ggplot2 R package\nglimpse(mpg)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 234\nColumns: 11\n$ manufacturer \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"…\n$ model \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4 quattro\", \"…\n$ displ 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…\n$ year 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…\n$ cyl 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …\n$ trans \"auto(l5)\", \"manual(m5)\", \"manual(m6)\", \"auto(av)\", \"auto…\n$ drv \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"4\", \"4\", \"4\", \"4\", \"4…\n$ cty 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…\n$ hwy 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…\n$ fl \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p…\n$ class \"compact\", \"compact\", \"compact\", \"compact\", \"compact\", \"c…\n```\n:::\n:::\n\n\nYou can see from the `glimpse()` (part of the `dplyr` package) output that all of the categorical variables (like \"manufacturer\" or \"class\") are **appropriately coded with meaningful labels**.\n\nThis will come in handy when `qplot()` has to label different aspects of a plot.\n\nAlso note that all of the **columns/variables have meaningful names** (if sometimes abbreviated), rather than names like \"X1\" and \"X2\", etc.\n\n::: callout-tip\n### Example\n\nWe can **make a quick scatterplot** using `qplot()` of the engine displacement (`displ`) and the highway miles per gallon (`hwy`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: `qplot()` was deprecated in ggplot2 3.4.0.\n```\n:::\n\n::: {.cell-output-display}\n![Plot of engine displacement and highway mileage using 
the mpg dataset](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n:::\n\nIt has a *very* similar feeling to `plot()` in base R.\n\n::: callout-tip\n### Note\n\nIn the call to `qplot()` you **must specify the `data` argument** so that `qplot()` knows where to look up the variables.\n\nYou must also specify `x` and `y`, but hopefully that part is obvious.\n:::\n\n## Modifying aesthetics\n\nWe can introduce a third variable into the plot by **modifying the color** of the points based on the value of that third variable.\n\nColor (or colour) is one type of **aesthetic** and using the `ggplot2` language:\n\n> \"the color of each point can be mapped to a variable\"\n\nThis sounds technical, but let's give an example.\n\n::: callout-tip\n### Example\n\nWe map the `color` argument to the `drv` variable, which indicates whether a car is front wheel drive, rear wheel drive, or 4-wheel drive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n:::\n\nNow we can see that the front wheel drive cars tend to have lower displacement relative to the 4-wheel or rear wheel drive cars.\n\nAlso, it's clear that the 4-wheel drive cars have the lowest highway gas mileage.\n\n::: callout-tip\n### Note\n\nThe `x` argument and `y` argument are aesthetics too, and they got mapped to the `displ` and `hwy` variables, respectively.\n:::\n\n::: callout-note\n### Question\n\nIn the above plot, I did not name the `x` and `y` arguments explicitly. What happens when you run these code chunks? 
What's the difference?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, displ, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(y = hwy, x = displ, data = mpg, color = drv)\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nLet's try mapping colors in another dataset, namely the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) dataset. These data contain observations for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Palmer penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 344\nColumns: 8\n$ species Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …\n$ bill_depth_mm 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …\n$ flipper_length_mm 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…\n$ body_mass_g 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …\n$ sex male, female, female, NA, female, male, female, male…\n$ year 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n:::\n:::\n\n\nIf we wanted to count the number of penguins for each of the three species, we can use the `count()` function in `dplyr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins 
%>% \n count(species)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n species n\n \n1 Adelie 152\n2 Chinstrap 68\n3 Gentoo 124\n```\n:::\n:::\n\n:::\n\nFor example, we see there are a total of 152 Adelie penguins in the `palmerpenguins` dataset.\n\n::: callout-note\n### Question\n\nIf we wanted to use `qplot()` to map `flipper_length_mm` and `bill_length_mm` to the x and y coordinates, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNow try mapping color to the `species` variable on top of the code you just wrote:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Adding a geom\n\nSometimes it is nice to **add a smoother** to a scatterplot to highlight any trends.\n\nTrends can be difficult to see if the data are very noisy or there are many data points obscuring the view.\n\nA smoother is a **type of \"geom\"** that you can add along with your data points.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n:::\n\nHere it seems that engine displacement and highway mileage have a nonlinear U-shaped relationship, but from the previous plot we know that this is largely due to confounding by the drive class of the car.\n\n::: callout-tip\n### Note\n\nPreviously, we did not have to specify `geom = \"point\"` because that was done automatically.\n\nBut if you want the smoother overlaid with the points, then you need to specify both explicitly.\n:::\n\nLook at what happens if we *do not* include the `point` geom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"smooth\"))\n```\n\n::: 
{.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nSometimes that is the plot you want to show, but in this case it might make more sense to show the data along with the smoother.\n\n::: callout-note\n### Question\n\nLet's **add a smoother** to our `palmerpenguins` dataset example.\n\nUsing the code we previously wrote mapping variables to points and color, add a \"point\" and \"smooth\" geom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Histograms and boxplots\n\nThe `qplot()` function can also be used to plot 1-dimensional data.\n\nBy **specifying a single variable**, `qplot()` will by default make a **histogram**.\n\n::: callout-tip\n### Example\n\nWe can make a histogram of the highway mileage data and stratify on the drive class. So technically this is three histograms on top of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, fill = drv, binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n:::\n\n::: callout-note\n### Question\n\nNotice that I used `fill` here to map color to the `drv` variable. Why is this? 
What happens when you use `color` instead?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\nHaving the different colors for each drive class is nice, but the three histograms can be a bit difficult to separate out.\n\n**Side-by-side boxplots** are one solution to this problem.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(drv, hwy, data = mpg, geom = \"boxplot\")\n```\n\n::: {.cell-output-display}\n![Boxplots of highway mileage by drive class](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\nAnother solution is to plot the histograms in separate panels using facets.\n\n## Facets\n\n**Facets** are a way to **create multiple panels of plots based on the levels of a categorical variable**.\n\nHere, we want to see a histogram of the highway mileages, and the categorical variable is the drive class. We can do that using the `facets` argument to `qplot()`.\n\n::: callout-tip\n### Note\n\nThe `facets` argument **expects a formula type of input**, with a `~` separating the left hand side variable and the right hand side variable.\n\n- The **left hand side** variable indicates how the rows of the panels should be divided\n- The **right hand side** variable indicates how the columns of the panels should be divided\n:::\n\n::: callout-tip\n### Example\n\nHere, we just want three rows of histograms (and just one column), one for each drive class, so we specify `drv` on the left hand side and `.` on the right hand side indicating that there's no variable there (it's empty).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-22-1.png){width=480}\n:::\n:::\n\n:::\n\nWe could also look at **more data using facets**, so instead of histograms we could look at scatter plots of engine displacement and highway mileage by drive class.\n\nHere, we put the `drv` 
variable on the right hand side to indicate that we want a column for each drive class (as opposed to splitting by rows like we did above).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-23-1.png){width=576}\n:::\n:::\n\n\nWhat if you wanted to **add a smoother to each one of those panels**? Simple: you just add the smoother as another **geom**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class w/smoother](index_files/figure-html/unnamed-chunk-24-1.png){width=576}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nWe used a different type of smoother above.\n\nHere, we add a **linear regression line** (a type of smoother) to each group to see if there's any difference.\n:::\n\n::: callout-note\n### Question\n\nLet's facet our `palmerpenguins` dataset example and explore different types of plots.\n\nBuilding off the code we previously wrote, perform the following tasks:\n\n- Facet the plot based on `species` with the three species along rows.\n- Add a linear regression line to each of the `species` types\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNext, make a histogram of the `body_mass_g` for each of the species colored by the three species.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Summary\n\nThe `qplot()` function in `ggplot2` is the analog of `plot()` in base graphics but with many built-in features that the traditional `plot()` does not provide. The syntax is somewhere in between the base and lattice graphics systems. 
The `qplot()` function is useful for quickly putting data on the page/screen, but for ultimate customization, it may make more sense to use some of the lower level functions that we discuss later in the next lesson.\n\n# Post-lecture materials\n\n### Case Study: MAACS Cohort\n\n
\n\nClick here for a case study practicing the `qplot()` function.\n\nThis case study will use data based on the Mouse Allergen and Asthma Cohort Study (MAACS). This study was aimed at characterizing the indoor (home) environment and its relationship with asthma morbidity among children aged 5--17 living in Baltimore, MD. The children all had persistent asthma, defined as having had an exacerbation in the past year. A representative publication of results from this study can be found in this paper by [Lu, et al.](https://pubmed.ncbi.nlm.nih.gov/23403052/)\n\n::: keyideas\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available. For the purposes of this lesson, we have simulated data that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\nHere is a snapshot of what the data look like.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nmaacs <- read_csv(here(\"data\", \"maacs_sim.csv\"), col_types = \"icnn\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 750 × 4\n id mopos pm25 eno\n \n 1 1 yes 6.01 28.8 \n 2 2 no 25.2 17.7 \n 3 3 yes 21.8 43.6 \n 4 4 no 13.4 288. \n 5 5 no 49.4 7.60\n 6 6 no 43.4 12.0 \n 7 7 yes 33.0 79.2 \n 8 8 yes 32.7 34.2 \n 9 9 yes 52.2 12.1 \n10 10 yes 51.9 65.0 \n# ℹ 740 more rows\n```\n:::\n:::\n\n\nThe key variables are:\n\n- `mopos`: an indicator of whether the subject is allergic to mouse allergen (yes/no)\n\n- `pm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter)\n\n- `eno`: exhaled nitric oxide\n\nThe outcome of interest for this analysis will be exhaled nitric oxide (eNO), which is a measure of pulmonary inflammation. 
We can get a sense of how eNO is distributed in this population by making a quick histogram of the variable. Here, we take the log of eNO because of some right-skew in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO](index_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nA quick glance suggests that the histogram is a bit \"fat\", hinting that there might be multiple groups of people being lumped together. We can stratify the histogram by whether they are allergic to mouse.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, fill = mopos)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWe can see from this plot that the non-allergic subjects are shifted slightly to the left, indicating a lower eNO and less pulmonary inflammation. That said, there is significant overlap between the two groups.\n\nAn alternative to histograms is a density smoother, which sometimes can be easier to visualize when there are multiple groups. Here is a density smooth of the entire study population.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\")\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO](index_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\nAnd here are the densities stratified by allergic status. 
We can map the color aesthetic to the `mopos` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\", color = mopos)\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-31-1.png){width=672}\n:::\n:::\n\n\nThese tell the same story as the stratified histograms, which should come as no surprise.\n\nNow we can examine the indoor environment and its relationship to eNO. Here, we use the level of indoor PM2.5 as a measure of indoor environment air quality. We can make a simple scatterplot of PM2.5 and eNO.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![eNO and PM2.5](index_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\nThe relationship appears modest at best, as there is substantial noise in the data. However, one question that we might be interested in is whether allergic individuals are perhaps more sensitive to PM2.5 inhalation than non-allergic individuals. To examine that question we can stratify the data into two groups.\n\nThis first plot uses different plot symbols for the two groups and overlays them on a single canvas. We can do this by mapping the `mopos` variable to the `shape` aesthetic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, shape = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\nBecause there is substantial overlap in the data it is a bit challenging to discern the circles from the triangles. 
Part of the reason might be that all of the symbols are the same color (black).\n\nWe can plot each group a different color to see if that helps.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-34-1.png){width=672}\n:::\n:::\n\n\nThis is slightly better but the substantial overlap makes it difficult to discern any trends in the data. For this we need to add a smoother of some sort. Here we add a linear regression line (a type of smoother) to each group to see if there's any difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-35-1.png){width=672}\n:::\n:::\n\n\nHere we see quite clearly that the red group and the green group exhibit rather different relationships between PM2.5 and eNO. For the non-allergic individuals, there appears to be a slightly negative relationship between PM2.5 and eNO and for the allergic individuals, there is a positive relationship. This suggests a strong interaction between PM2.5 and allergic status, an hypothesis perhaps worth following up on in greater detail than this brief exploratory analysis.\n\nAnother, and perhaps more clear, way to visualize this interaction is to use separate panels for the non-allergic and allergic individuals using the `facets` argument to `qplot()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, facets = . 
~ mopos) + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-36-1.png){width=864}\n:::\n:::\n\n\n
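To preview the next lesson, the same faceted plot can be written with the core `ggplot()` function directly. The chunk below is a sketch of the equivalent call (it assumes the simulated `maacs` data loaded above and has not been rendered into this document):\n\n```r\nlibrary(tidyverse)\n\n# scatterplot of log PM2.5 against log eNO, one panel per allergic status,\n# with a linear regression line fit within each panel\nmaacs %>%\n    ggplot(aes(log(pm25), log(eno))) +\n    geom_point() +\n    geom_smooth(method = \"lm\") +\n    facet_grid(. ~ mopos)\n```\n\nCompared with the `qplot()` call, the data, aesthetic mappings, geoms, and facets are spelled out as separate layers, which is the style covered in detail in the next lesson.\n\n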
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What has gone wrong with this code? Why are the points not blue?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = \"blue\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-37-1.png){width=672}\n:::\n:::\n\n\n2. Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`?\n\n3. Map a continuous variable to `color`, `size`, and `shape` aesthetics. How do these aesthetics behave differently for categorical vs. continuous variables?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 
1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n 
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"12 - The ggplot2 plotting system: qplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with qplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/12-ggplot2-plotting-system-part-1/index.qmd).*\n\n> \"The greatest value of a picture is when it forces us to notice what we never expected to see.\" ---John Tukey\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the difference between *aesthetics* and *geoms*\n- Become familiar with different types of plots (e.g. scatterplots, boxplots, and histograms)\n- Be able to facet plots into a grid\n:::\n\n# The ggplot2 Plotting System\n\nThe `ggplot2` package in R is **an implementation of *The Grammar of Graphics*** as described by Leland Wilkinson in his book. The **package was originally written by Hadley Wickham** while he was a graduate student at Iowa State University (he still actively maintains the package).\n\nThe package implements what might be considered a third graphics system for R (along with `base` graphics and `lattice`).\n\nThe package is available from [CRAN](http://cran.r-project.org/package=ggplot2) via `install.packages()`; the latest version of the source can be found on the package's [GitHub Repository](https://github.com/hadley/ggplot2). Documentation of the package can be found at [the tidyverse web site](https://ggplot2.tidyverse.org).\n\nThe **grammar of graphics** represents **an abstraction of graphics ideas and objects**.\n\nYou can think of this as **developing the verbs, nouns, and adjectives for data graphics**.\n\n::: callout-tip\n### Note\n\nDeveloping such a **grammar allows for a \"theory\" of graphics** on which to build new graphics and graphics objects.\n\nTo quote from Hadley Wickham's book on `ggplot2`, we want to \"shorten the distance from mind to page\". In summary,\n\n> \"...the grammar tells us that a statistical graphic is a **mapping** from data to **aesthetic** attributes (colour, shape, size) of **geometric** objects (points, lines, bars). 
The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system\" -- from *ggplot2* book\n:::\n\nYou might ask yourself \"Why do we need a grammar of graphics?\".\n\nWell, for much the same reasons that **having a grammar is useful for spoken languages**. The grammar allows for\n\n- A more compact summary of the base components of a language\n- An extension of the language to handle situations that we have not seen before\n\nIf you think about making a plot with the base graphics system, the plot is **constructed by calling a series of functions that either create or annotate a plot**. There's **no convenient agreed-upon way to describe the plot**, except to just recite the series of R functions that were called to create the thing in the first place.\n\n::: callout-tip\n### Example\n\nConsider the following plot made using base graphics previously.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(airquality, {\n plot(Temp, Ozone)\n lines(loess.smooth(Temp, Ozone))\n})\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (base graphics)](index_files/figure-html/unnamed-chunk-1-1.png){width=480}\n:::\n:::\n\n\nHow would one **describe the creation of this plot**?\n\nWell, we could say that we called the `plot()` function and then added a loess smoother by calling the `lines()` function on the output of `loess.smooth()`.\n\nWhile the base plotting system is convenient and it often mirrors how we think of building plots and analyzing data, there are **drawbacks**:\n\n- You cannot go back once the plot has started (e.g. 
to adjust margins), so there is in fact a need to plan in advance.\n- It is difficult to \"translate\" a plot to others because there is no formal graphical language; each plot is just a series of R commands.\n:::\n\nHere is the same plot made using `ggplot2` in the `tidyverse`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nairquality %>%\n ggplot(aes(Temp, Ozone)) +\n geom_point() +\n geom_smooth(\n method = \"loess\",\n se = FALSE\n ) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![Scatterplot of Temperature and Ozone in New York (ggplot2)](index_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe output is roughly equivalent and the amount of code is similar, but `ggplot2` allows for a more elegant way of expressing the components of the plot.\n\nIn this case, the plot is a **dataset** (`airquality`) with **aesthetic mappings** (visual properties of the objects in your plot) derived from the `Temp` and `Ozone` variables, a set of **points**, and a **smoother**.\n:::\n\nIn a sense, the `ggplot2` system takes many of the cues from the base plotting system and from the `lattice` plotting system, and formalizes the cues a bit.\n\nIt **automatically handles things like margins and spacing**, and also has the concept of \"themes\" which **provide a default set of plotting symbols and colors** (which are all customizable).\n\nWhile `ggplot2` bears a superficial similarity to `lattice`, `ggplot2` is generally easier and more intuitive to use.\n\n## The Basics: `qplot()`\n\nThe `qplot()` function in `ggplot2` is meant to get you going **q**uickly.\n\nIt works much like the `plot()` function in the base graphics system. 
It **looks for variables to plot within a data frame**, similar to lattice, or in the parent environment.\n\nIn general, it is good to get used to putting your data in a data frame and then passing it to `qplot()`.\n\n::: callout-tip\n### Pro tip\n\nThe `qplot()` function is **somewhat discouraged** in `ggplot2` now and new users are encouraged to use the more general `ggplot()` function (more details in the next lesson).\n\nHowever, the `qplot()` function is still useful and may be easier to use if transitioning from the base plotting system or a different statistical package.\n:::\n\nPlots are made up of\n\n- **aesthetics** (e.g. size, shape, color)\n- **geoms** (e.g. points, lines)\n\nFactors play an important role for indicating subsets of the data (if they are to have different properties) so they should be **labeled** properly.\n\nThe `qplot()` function hides much of what goes on underneath, which is okay for most operations; `ggplot()` is the core function and is very flexible for doing things `qplot()` cannot do.\n\n## Before you start: label your data\n\nOne thing that is always true, but is particularly useful when using `ggplot2`, is that you should always **use informative and descriptive labels on your data**.\n\nMore generally, your data should have appropriate **metadata** so that you can quickly look at a dataset and know\n\n- what are the variables?\n- what do the values of each variable mean?\n\n::: callout-tip\n### Pro tip\n\n- **Each column** of a data frame should **have a meaningful (but concise) variable name** that accurately reflects the data stored in that column\n- Non-numeric or **categorical variables should be coded as factor variables** and have meaningful labels for each level of the factor.\n - It might be common to code a binary variable as a \"0\" or a \"1\", but the problem is that from quickly looking at the data, it's impossible to know which level of that variable is represented by a \"0\" or a \"1\".\n - It is much better to simply label 
each observation as what they are.\n - If a variable represents temperature categories, it might be better to use \"cold\", \"mild\", and \"hot\" rather than \"1\", \"2\", and \"3\".\n:::\n\nWhile it is sometimes a pain to make sure all of your data are properly labeled, this **investment in time can pay dividends down the road** when you're trying to figure out what you were plotting.\n\nIn other words, including the proper metadata can make your exploratory plots essentially self-documenting.\n\n## ggplot2 \"Hello, world!\"\n\nThis example dataset comes with the `ggplot2` package and contains data on the fuel economy of 38 popular car models from 1999 to 2008.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse) # this loads the ggplot2 R package\n# library(ggplot2) # an alternative way to just load the ggplot2 R package\nglimpse(mpg)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 234\nColumns: 11\n$ manufacturer \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"audi\", \"…\n$ model \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4\", \"a4 quattro\", \"…\n$ displ 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…\n$ year 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…\n$ cyl 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …\n$ trans \"auto(l5)\", \"manual(m5)\", \"manual(m6)\", \"auto(av)\", \"auto…\n$ drv \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"f\", \"4\", \"4\", \"4\", \"4\", \"4…\n$ cty 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…\n$ hwy 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…\n$ fl \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p\", \"p…\n$ class \"compact\", \"compact\", \"compact\", \"compact\", \"compact\", \"c…\n```\n:::\n:::\n\n\nYou can see from the `glimpse()` (part of the `dplyr` package) output that all of the categorical variables (like \"manufacturer\" or \"class\") are **appropriately coded with meaningful labels**.\n\nThis will come in handy when `qplot()` has to label different aspects of a plot.\n\nAlso note that all of the **columns/variables have meaningful names** (if sometimes abbreviated), rather than names like \"X1\" and \"X2\", etc.\n\n::: callout-tip\n### Example\n\nWe can **make a quick scatterplot** using `qplot()` of the engine displacement (`displ`) and the highway miles per gallon (`hwy`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: `qplot()` was deprecated in ggplot2 3.4.0.\n```\n:::\n\n::: {.cell-output-display}\n![Plot of engine displacement and highway mileage using the mpg dataset](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n:::\n\nIt has a *very* similar feeling to `plot()` in base R.\n\n::: callout-tip\n### Note\n\nIn the call to `qplot()` you **must specify the `data` argument** so that `qplot()` knows where to look up the variables.\n\nYou must also specify `x` and `y`, but hopefully that part is obvious.\n:::\n\n## Modifying aesthetics\n\nWe can introduce a third variable into the plot by **modifying the color** of the points based on the value of that third variable.\n\nColor (or colour) is one type of **aesthetic** and using the `ggplot2` language:\n\n> \"the color of each point can be mapped to a variable\"\n\nThis sounds technical, but let's give an example.\n\n::: callout-tip\n### Example\n\nWe map the `color` argument to the `drv` variable, which indicates whether a car is front wheel drive, rear wheel drive, or 4-wheel drive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n:::\n\nNow we can see that the front wheel drive cars tend to have lower displacement relative to the 4-wheel or rear wheel drive 
cars.\n\nAlso, it's clear that the 4-wheel drive cars have the lowest highway gas mileage.\n\n::: callout-tip\n### Note\n\nThe `x` argument and `y` argument are aesthetics too, and they got mapped to the `displ` and `hwy` variables, respectively.\n:::\n\n::: callout-note\n### Question\n\nIn the above plot, I did not specify the `x` and `y` variables by name. What happens when you run these code chunks? What's the difference?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, displ, data = mpg, color = drv)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(y = hwy, x = displ, data = mpg, color = drv)\n```\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nLet's try mapping colors in another dataset, namely the [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) dataset. These data contain observations for 344 penguins. 
There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Palmer penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 344\nColumns: 8\n$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …\n$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …\n$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…\n$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …\n$ sex               <fct> male, female, female, NA, female, male, female, male…\n$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n:::\n:::\n\n\nIf we wanted to count the number of penguins for each of the three species, we can use the `count()` function in `dplyr`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins %>%\n count(species)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 2\n  species       n\n  <fct>     <int>\n1 Adelie      152\n2 Chinstrap    68\n3 Gentoo      124\n```\n:::\n:::\n\n:::\n\nFor example, we see there are a total of 152 Adelie penguins in the `palmerpenguins` dataset.\n\n::: callout-note\n### Question\n\nIf we wanted to use `qplot()` to map `flipper_length_mm` and `bill_length_mm` to the x and y coordinates, what would we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNow try mapping color to the `species` variable on top of the code you just wrote:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Adding a geom\n\nSometimes it is nice to **add a smoother** to a scatterplot to highlight any trends.\n\nTrends can be difficult to see if the data are very noisy or there are many data points obscuring the view.\n\nA smoother is a **type of \"geom\"** that you can add along with your data points.\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n:::\n\nHere it seems that engine displacement and highway mileage have a nonlinear U-shaped relationship, but from the previous plot we know that this is largely due to confounding by the drive class of the car.\n\n::: callout-tip\n### Note\n\nPreviously, we did not have to specify `geom = \"point\"` because that was done automatically.\n\nBut if you want the smoother overlaid with the points, then you need to specify both explicitly.\n:::\n\nLook at what happens if we *do not* include the `point` geom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, geom = c(\"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage w/smoother](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\nSometimes that is the plot you want to show, but in this case it might make more sense to show the data along with the smoother.\n\n::: callout-note\n### Question\n\nLet's **add a smoother** to our `palmerpenguins` dataset example.\n\nUsing the code we previously wrote mapping variables to points and color, add a \"point\" and \"smooth\" geom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Histograms and boxplots\n\nThe `qplot()` function can be used to plot 1-dimensional data too.\n\nBy **specifying a single variable**, `qplot()` will by default make a **histogram**.\n\n::: callout-tip\n### Example\n\nWe can make a histogram of the highway mileage data and stratify on the drive class. So technically this is three histograms on top of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, fill = drv, binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n:::\n\n::: callout-note\n### Question\n\nNotice, I used `fill` here to map color to the `drv` variable. Why is this? What happens when you use `color` instead?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\nHaving the different colors for each drive class is nice, but the three histograms can be a bit difficult to separate out.\n\n**Side-by-side boxplots** are one solution to this problem.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(drv, hwy, data = mpg, geom = \"boxplot\")\n```\n\n::: {.cell-output-display}\n![Boxplots of highway mileage by drive class](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\nAnother solution is to plot the histograms in separate panels using facets.\n\n## Facets\n\n**Facets** are a way to **create multiple panels of plots based on the levels of a categorical variable**.\n\nHere, we want to see a histogram of the highway mileages and the categorical variable is the drive class variable. 
We can do that using the `facets` argument to `qplot()`.\n\n::: callout-tip\n### Note\n\nThe `facets` argument **expects a formula type of input**, with a `~` separating the left hand side variable and the right hand side variable.\n\n- The **left hand side** variable indicates how the rows of the panels should be divided\n- The **right hand side** variable indicates how the columns of the panels should be divided\n:::\n\n::: callout-tip\n### Example\n\nHere, we just want three rows of histograms (and just one column), one for each drive class, so we specify `drv` on the left hand side and `.` on the right hand side indicating that there's no variable there (it's empty).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)\n```\n\n::: {.cell-output-display}\n![Histogram of highway mileage by drive class](index_files/figure-html/unnamed-chunk-22-1.png){width=480}\n:::\n:::\n\n:::\n\nWe could also look at **more data using facets**, so instead of histograms we could look at scatter plots of engine displacement and highway mileage by drive class.\n\nHere, we put the `drv` variable on the right hand side to indicate that we want a column for each drive class (as opposed to splitting by rows like we did above).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . ~ drv)\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class](index_files/figure-html/unnamed-chunk-23-1.png){width=576}\n:::\n:::\n\n\nWhat if you wanted to **add a smoother to each one of those panels**? Simple, you literally just add the smoother as another **geom**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(displ, hwy, data = mpg, facets = . 
~ drv) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Engine displacement and highway mileage by drive class w/smoother](index_files/figure-html/unnamed-chunk-24-1.png){width=576}\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nWe used a different type of smoother above.\n\nHere, we add a **linear regression line** (a type of smoother) to each group to see if there's any difference.\n:::\n\n::: callout-note\n### Question\n\nLet's facet our `palmerpenguins` dataset example and explore different types of plots.\n\nBuilding off the code we previously wrote, perform the following tasks:\n\n- Facet the plot based on `species` with the three species along rows.\n- Add a linear regression line to each of the `species` types\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n\nNext, make a histogram of the `body_mass_g` variable, colored by the three species.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n## Summary\n\nThe `qplot()` function in `ggplot2` is the analog of `plot()` in base graphics but with many built-in features that the traditional `plot()` does not provide. The syntax is somewhere in between the base and lattice graphics systems. The `qplot()` function is useful for quickly putting data on the page/screen, but for ultimate customization, it may make more sense to use some of the lower level functions that we discuss later in the next lesson.\n\n# Post-lecture materials\n\n### Case Study: MAACS Cohort\n\n
\n\nClick here for case study practicing the `qplot()` function.\n\nThis case study will use data based on the Mouse Allergen and Asthma Cohort Study (MAACS). This study was aimed at characterizing the indoor (home) environment and its relationship with asthma morbidity amongst children aged 5--17 living in Baltimore, MD. The children all had persistent asthma, defined as having had an exacerbation in the past year. A representative publication of results from this study can be found in this paper by [Lu, et al.](https://pubmed.ncbi.nlm.nih.gov/23403052/)\n\n::: keyideas\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available. For the purposes of this lesson, we have simulated data that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\nHere is a snapshot of what the data look like.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nhere() starts at /Users/leocollado/Dropbox/Code/jhustatcomputing2023\n```\n:::\n\n```{.r .cell-code}\nmaacs <- read_csv(here(\"data\", \"maacs_sim.csv\"), col_types = \"icnn\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 750 × 4\n      id mopos  pm25    eno\n   <int> <chr> <dbl>  <dbl>\n 1     1 yes    6.01  28.8 \n 2     2 no    25.2   17.7 \n 3     3 yes   21.8   43.6 \n 4     4 no    13.4  288.  \n 5     5 no    49.4    7.60\n 6     6 no    43.4   12.0 \n 7     7 yes   33.0   79.2 \n 8     8 yes   32.7   34.2 \n 9     9 yes   52.2   12.1 \n10    10 yes   51.9   65.0 \n# ℹ 740 more rows\n```\n:::\n:::\n\n\nThe key variables are:\n\n- `mopos`: an indicator of whether the subject is allergic to mouse allergen (yes/no)\n\n- `pm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter)\n\n- `eno`: exhaled nitric oxide\n\nThe outcome of interest for this analysis will be exhaled nitric oxide (eNO), which is a measure of pulmonary inflammation. We can get a sense of how eNO is distributed in this population by making a quick histogram of the variable. Here, we take the log of eNO because of some right-skew in the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO](index_files/figure-html/unnamed-chunk-28-1.png){width=672}\n:::\n:::\n\n\nA quick glance suggests that the histogram is a bit \"fat\", suggesting that there might be multiple groups of people being lumped together. We can stratify the histogram by whether they are allergic to mouse.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, fill = mopos)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n```\n:::\n\n::: {.cell-output-display}\n![Histogram of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWe can see from this plot that the non-allergic subjects are shifted slightly to the left, indicating a lower eNO and less pulmonary inflammation. That said, there is significant overlap between the two groups.\n\nAn alternative to histograms is a density smoother, which sometimes can be easier to visualize when there are multiple groups. Here is a density smooth of the entire study population.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\")\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO](index_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n\nAnd here are the densities stratified by allergic status. 
We can map the color aesthetic to the `mopos` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(eno), data = maacs, geom = \"density\", color = mopos)\n```\n\n::: {.cell-output-display}\n![Density smooth of log eNO by mouse allergic status](index_files/figure-html/unnamed-chunk-31-1.png){width=672}\n:::\n:::\n\n\nThese tell the same story as the stratified histograms, which should come as no surprise.\n\nNow we can examine the indoor environment and its relationship to eNO. Here, we use the level of indoor PM2.5 as a measure of indoor environment air quality. We can make a simple scatterplot of PM2.5 and eNO.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, geom = c(\"point\", \"smooth\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![eNO and PM2.5](index_files/figure-html/unnamed-chunk-32-1.png){width=672}\n:::\n:::\n\n\nThe relationship appears modest at best, as there is substantial noise in the data. However, one question that we might be interested in is whether allergic individuals are perhaps more sensitive to PM2.5 inhalation than non-allergic individuals. To examine that question we can stratify the data into two groups.\n\nThis first plot uses different plot symbols for the two groups and overlays them on a single canvas. We can do this by mapping the `mopos` variable to the `shape` aesthetic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, shape = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-33-1.png){width=672}\n:::\n:::\n\n\nBecause there is substantial overlap in the data it is a bit challenging to discern the circles from the triangles. 
Part of the reason might be that all of the symbols are the same color (black).\n\nWe can plot each group a different color to see if that helps.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos)\n```\n\n::: {.cell-output-display}\n![eNO and PM2.5 by mouse allergic status](index_files/figure-html/unnamed-chunk-34-1.png){width=672}\n:::\n:::\n\n\nThis is slightly better but the substantial overlap makes it difficult to discern any trends in the data. For this we need to add a smoother of some sort. Here we add a linear regression line (a type of smoother) to each group to see if there's any difference.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, color = mopos) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-35-1.png){width=672}\n:::\n:::\n\n\nHere we see quite clearly that the red group and the green group exhibit rather different relationships between PM2.5 and eNO. For the non-allergic individuals, there appears to be a slightly negative relationship between PM2.5 and eNO and for the allergic individuals, there is a positive relationship. This suggests a strong interaction between PM2.5 and allergic status, a hypothesis perhaps worth following up on in greater detail than this brief exploratory analysis.\n\nAnother, and perhaps more clear, way to visualize this interaction is to use separate panels for the non-allergic and allergic individuals using the `facets` argument to `qplot()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(log(pm25), log(eno), data = maacs, facets = . ~ mopos) +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-36-1.png){width=864}\n:::\n:::\n\n\n
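With `qplot()` deprecated as of ggplot2 3.4.0 (it emits the warning shown earlier in this lesson), the faceted plot above can also be written with the fuller `ggplot()` interface. This is a sketch only, assuming the `maacs` data frame read in earlier in this case study:\n\n```r\nlibrary(ggplot2)\n\n## Same plot as the qplot() call above: points, one panel per\n## allergic-status group, and a per-group linear regression line\nggplot(maacs, aes(log(pm25), log(eno))) +\n    geom_point() +\n    facet_grid(. ~ mopos) +\n    geom_smooth(method = \"lm\")\n```\n\nThe `ggplot()` syntax itself is covered in detail in the next lesson.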
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What has gone wrong with this code? Why are the points not blue?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqplot(x = displ, y = hwy, data = mpg, color = \"blue\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-37-1.png){width=672}\n:::\n:::\n\n\n2. Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`?\n\n3. Map a continuous variable to `color`, `size`, and `shape` aesthetics. How do these aesthetics behave differently for categorical vs. continuous variables?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 
1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n 
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json index 831e44b..d6c02cd 100644 --- a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json +++ b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "76ec5c219411d3408adcbd15aa8e66ab", + "hash": "d55dbad2db95c2293e2f5b900bdc8450", "result": { - "markdown": "---\ntitle: \"13 - The ggplot2 plotting system: ggplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with ggplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/13-ggplot2-plotting-system-part-2/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to build up layers of graphics using `ggplot()`\n- Be able to modify properties of a `ggplot()` including layers and labels\n:::\n\n# The ggplot2 Plotting System\n\nIn this lesson, we will get into a little more of the nitty gritty of **how `ggplot2` builds plots** and how you can customize various aspects of any plot.\n\nPreviously, we used the `qplot()` function to quickly put points on a page.\n\n- The `qplot()` function's syntax is very similar to that of the `plot()` function in base graphics so for those switching over, it makes for an easy transition.\n\nBut it is worth knowing the underlying details of how `ggplot2` works so that you can really exploit its power.\n\n## Basic components of a ggplot2 plot\n\n::: callout-tip\n### Key components\n\nA **`ggplot2` plot** consists of a number of **key components**.\n\n- A **data frame**: stores all of the data that will be displayed on the plot\n\n- **aesthetic mappings**: describe how data are mapped to color, size, shape, location\n\n- **geoms**: geometric objects like points, lines, shapes\n\n- **facets**: describes how conditional/panel plots should be constructed\n\n- **stats**: statistical transformations like binning, quantiles, smoothing\n\n- **scales**: what scale an aesthetic map uses (example: left-handed = red, right-handed = blue)\n\n- **coordinate system**: describes the system in which the locations of the geoms will be drawn\n:::\n\nIt 
is **essential to organize your data into a data frame** before you start with `ggplot2` (and all the **appropriate metadata** so that your data frame is self-describing and your plots will be self-documenting).\n\nWhen **building plots in `ggplot2`** (rather than using `qplot()`), the **\"artist's palette\" model may be the closest analogy**.\n\nEssentially, you start with some raw data, and then you **gradually add bits and pieces to it to create a plot**.\n\n::: callout-tip\n### Note\n\nPlots are built up in layers, with the typical ordering being\n\n1. Plot the data\n2. Overlay a summary\n3. Add metadata and annotation\n:::\n\nFor quick exploratory plots you may not get past step 1.\n\n## Example: BMI, PM2.5, Asthma\n\nTo demonstrate the various pieces of `ggplot2` we will use a running example from the **Mouse Allergen and Asthma Cohort Study (MAACS)**. Here, the question we are interested in is\n\n> \"Are overweight individuals, as measured by body mass index (BMI), more susceptible than normal weight individuals to the harmful effects of PM2.5 on asthma symptoms?\"\n\nThere is a suggestion that overweight individuals may be more susceptible to the negative effects of inhaling PM2.5.\n\nThis would suggest that increases in PM2.5 exposure in the home of an overweight child would be more deleterious to his/her asthma symptoms than they would be in the home of a normal weight child.\n\nWe want to see if we can see that difference in the data from MAACS.\n\n::: callout-tip\n### Note\n\nBecause the individual-level data for this study are protected by various U.S. privacy laws, we cannot make those data available.\n\nFor the purposes of this lesson, we have **simulated data** that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\n::: callout-tip\n### Example\n\nWe can look at the data quickly by reading it in as a tibble with `read_csv()` in the `tidyverse` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nmaacs <- read_csv(here(\"data\", \"bmi_pm25_no2_sim.csv\"),\n col_types = \"nnci\")\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 517 × 4\n   logpm25 logno2_new bmicat        NocturnalSympt\n     <dbl>      <dbl> <chr>                  <int>\n 1   1.25       1.18  normal weight              1\n 2   1.12       1.55  overweight                 0\n 3   1.93       1.43  normal weight              0\n 4   1.37       1.77  overweight                 2\n 5   0.775      0.765 normal weight              0\n 6   1.49       1.11  normal weight              0\n 7   2.16       1.43  normal weight              0\n 8   1.65       1.40  normal weight              0\n 9   1.55       1.81  normal weight              0\n10   2.04       1.35  overweight                 3\n# ℹ 507 more rows\n```\n:::\n:::\n\n:::\n\nThe outcome we will look at here (`NocturnalSympt`) is the number of days in the past 2 weeks where the child experienced asthma symptoms (e.g. coughing, wheezing) while sleeping.\n\nThe other key variables are:\n\n- `logpm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter) on the log scale\n\n- `logno2_new`: indoor nitrogen dioxide (NO2) level on the log scale\n\n- `bmicat`: categorical variable with BMI status\n\n# Building up in layers\n\nFirst, we can **create a `ggplot` object** that stores the dataset and the basic aesthetics for mapping the x- and y-coordinates for the plot.\n\n::: callout-tip\n### Example\n\nHere, we will eventually be plotting the log of PM2.5 and the `NocturnalSympt` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(maacs, aes(x = logpm25, \n y = NocturnalSympt))\nsummary(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ndata: logpm25, logno2_new, bmicat, NocturnalSympt [517x4]\nmapping: x = ~logpm25, y = ~NocturnalSympt\nfaceting: <ggproto object: Class FacetNull, Facet, gg>\n compute_layout: function\n draw_back: function\n draw_front: function\n draw_labels: function\n draw_panels: function\n finish_data: function\n init_scales: function\n map_data: function\n params: list\n setup_data: function\n setup_params: function\n shrink: TRUE\n train_scales: function\n vars: function\n super: <ggproto object: Class FacetNull, Facet, gg>\n```\n:::\n\n```{.r .cell-code}\nclass(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"gg\" \"ggplot\"\n```\n:::\n:::\n\n:::\n\nYou can see above that the object `g` contains the dataset `maacs` and the mappings.\n\nNow, normally if you were to `print()` a `ggplot` object, a plot would appear on the plot device. However, our object `g` actually does not contain enough information to make a plot yet.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\nprint(g)\n```\n\n::: {.cell-output-display}\n![Nothing to see here!](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n## First plot with point layer\n\nTo make a scatter plot, we need to add at least one **geom**, such as points.\n\nHere, we add the `geom_point()` function to create a 
traditional scatter plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\ng + geom_point()\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and days with nocturnal symptoms](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nHow does ggplot know what points to plot? In this case, it can grab them from the data frame `maacs` that served as the input into the `ggplot()` function.\n\n## Adding more layers\n\n### smooth\n\nBecause the data appear rather noisy, it might be better if we added a smoother on top of the points to see if there is a trend in the data with PM2.5.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth()\n```\n\n::: {.cell-output-display}\n![Scatterplot with smoother](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nThe default smoother is a loess smoother, which is flexible and nonparametric but might be too flexible for our purposes. Perhaps we'd prefer a simple linear regression line to highlight any first order trends. 
We can do this by specifying `method = \"lm\"` to `geom_smooth()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Scatterplot with linear regression line](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nHere, we can see there appears to be a slight increasing trend, suggesting that higher levels of PM2.5 are associated with increased days with nocturnal symptoms.\n\n::: callout-note\n### Question\n\nLet's use the `ggplot()` function with our `palmerpenguins` dataset example and make a scatter plot with `flipper_length_mm` on the x-axis, `bill_length_mm` on the y-axis, colored by `species`, and a smoother by adding a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>\n 1 Adelie  Torgersen           39.1          18.7               181        3750\n 2 Adelie  Torgersen           39.5          17.4               186        3800\n 3 Adelie  Torgersen           40.3          18                 195        3250\n 4 Adelie  Torgersen           NA            NA                  NA          NA\n 5 Adelie  Torgersen           36.7          19.3               193        3450\n 6 Adelie  Torgersen           39.3          20.6               190        3650\n 7 Adelie  Torgersen           38.9          17.8               181        3625\n 8 Adelie  Torgersen           39.2          19.6               195        4675\n 9 Adelie  Torgersen           34.1          18.1               193        3475\n10 Adelie  Torgersen           42            20.2               190        4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex <fct>, year <int>\n```\n:::\n:::\n\n:::\n\n### facets\n\nBecause our primary question involves comparing overweight individuals to normal weight individuals, we can **stratify the scatter plot** of PM2.5 and nocturnal symptoms by the BMI category (`bmicat`) variable, which indicates whether an individual is overweight or not.\n\nTo visualize this we can **add a `facet_grid()`**, which takes a formula argument.\n\n::: callout-tip\n### Example\n\nWe want one row and two columns, one column for each weight category. So we specify `bmicat` on the right hand side of the formula passed to `facet_grid()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point() + \n geom_smooth(method = \"lm\") +\n facet_grid(. ~ bmicat) \n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and nocturnal symptoms by BMI category](index_files/figure-html/unnamed-chunk-8-1.png){width=864}\n:::\n:::\n\n:::\n\nNow it seems clear that the relationship between PM2.5 and nocturnal symptoms is relatively flat among normal weight individuals, while the relationship is increasing among overweight individuals.\n\nThis plot suggests that overweight individuals may be more susceptible to the effects of PM2.5.\n\n# Modifying geom properties\n\nYou can **modify properties of geoms** by specifying options to their respective `geom_*()` functions.\n\n### map aesthetics to constants\n\n::: callout-tip\n### Example\n\nFor example, here we modify the points in the scatterplot to make the color \"steelblue\", the size larger, and the alpha transparency greater.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(color = \"steelblue\", size = 4, alpha = 1/2)\n```\n\n::: {.cell-output-display}\n![Modifying point color with a constant](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n:::\n\n### map aesthetics to variables\n\nIn addition to setting specific geom attributes to constant values, we can **map aesthetics to variables** in our dataset.\n\nFor example, we can map the aesthetic `color` to the variable `bmicat`, so the points will be colored according to the levels of `bmicat`.\n\nWe use the `aes()` function to indicate this difference from the plot above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(aes(color = bmicat), size = 4, alpha = 1/2)\n```\n\n::: {.cell-output-display}\n![Mapping color to a variable](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Customizing the smooth\n\nWe can also **customize aspects of the geoms**.\n\nFor 
example, we can customize the smoother that we overlay on the points with `geom_smooth()`.\n\nHere we change the line type and increase the size from the default. We also remove the shaded standard error from the line.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat), \n size = 2, \n alpha = 1/2) + \n geom_smooth(size = 4, \n linetype = 3, \n method = \"lm\", \n se = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n:::\n\n::: {.cell-output-display}\n![Customizing a smoother](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\n# Other important stuff\n\n## Changing the theme\n\nThe **default theme for `ggplot2` uses the gray background** with white grid lines.\n\nIf you don't find this suitable, you can use the black and white theme by using the `theme_bw()` function.\n\nThe `theme_bw()` function also allows you to set the typeface for the plot, in case you don't want the default Helvetica. Here we change the typeface to Times.\n\n::: callout-tip\n### Note\n\nFor things that only make sense globally, use `theme()`, i.e. `theme(legend.position = \"none\")`. 
Two standard appearance themes are included\n\n- `theme_gray()`: The default theme (gray background)\n- `theme_bw()`: More stark/plain\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat)) + \n theme_bw(base_family = \"Times\")\n```\n\n::: {.cell-output-display}\n![Modifying the theme for a plot](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n::: callout-note\n### Question\n\nLet's take our `palmerpenguins` scatterplot from above and change out the theme to use `theme_dark()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## Modifying labels\n\n::: callout-tip\n### Note\n\nThere are a variety of **annotations** you can add to a plot, including **different kinds of labels**.\n\n- `xlab()` for x-axis labels\n- `ylab()` for y-axis labels\n- `ggtitle()` for specifying plot titles\n\n`labs()` function is generic and can be used to modify multiple types of labels at once\n:::\n\nHere is an example of modifying the title and the `x` and `y` labels to make the plot a bit more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_point(aes(color = bmicat)) + \n labs(title = \"MAACS Cohort\") + \n labs(x = expression(\"log \" * PM[2.5]), \n y = \"Nocturnal Symptoms\")\n```\n\n::: {.cell-output-display}\n![Modifying plot 
labels](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## A quick aside about axis limits\n\nOne quick **quirk about `ggplot2`** that caught me up when I first started using the package can be displayed in the following example.\n\nIf you make a lot of time series plots, you often **want to restrict the range of the y-axis** while still plotting all the data.\n\nIn the base graphics system you can do that as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntestdat <- data.frame(x = 1:100, \n y = rnorm(100))\ntestdat[50,2] <- 100 ## Outlier!\nplot(testdat$x, \n testdat$y,\n type = \"l\", \n ylim = c(-3,3))\n```\n\n::: {.cell-output-display}\n![Time series plot with base graphics](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nHere, we have restricted the y-axis range to be between -3 and 3, even though there is a clear outlier in the data.\n\n::: callout-tip\n### Example\n\nWith `ggplot2` the default settings will give you this.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(testdat, aes(x = x, y = y))\ng + geom_line()\n```\n\n::: {.cell-output-display}\n![Time series plot with default settings](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nOne might think that modifying the `ylim()` attribute would give you the same thing as the base plot, but it doesn't (?????)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_line() + \n ylim(-3, 3)\n```\n\n::: {.cell-output-display}\n![Time series plot with modified ylim](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n:::\n\nEffectively, what this does is subset the data so that only observations between -3 and 3 are included, then plot the data.\n\nTo plot the data without subsetting it first and still get the restricted range, you have to do the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + \n geom_line() + \n coord_cartesian(ylim = c(-3, 3))\n```\n\n::: {.cell-output-display}\n![Time series plot with 
restricted y-axis range](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nAnd now you know!\n\n# Post-lecture materials\n\n### Resources\n\n- The *ggplot2* book by Hadley Wickham\n- The *R Graphics Cookbook* by Winston Chang (examples in base plots and in `ggplot2`)\n- [tidyverse web site](http://ggplot2.tidyverse.org)\n\n### More complex example with `ggplot2`\n\nNow you get the sense that plots in the `ggplot2` system are constructed by successively adding components to the plot, starting with the base dataset and maybe a scatterplot. In this section bleow, you can see a slightly more complicated example with an additional variable.\n\n
\n\nClick here for a slightly more complicated example with `ggplot()`.\n\nNow, we will ask the question\n\n> How does the relationship between PM2.5 and nocturnal symptoms vary by BMI category and nitrogen dioxide (NO2)?\n\nUnlike our previous BMI variable, NO2 is continuous, and so we need to make NO2 categorical so we can condition on it in the plotting. We can use the `cut()` function for this purpose. We will divide the NO2 variable into tertiles.\n\nFirst we need to calculate the tertiles with the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)\n```\n:::\n\n\nThen we need to divide the original `logno2_new` variable into the ranges defined by the cut points computed above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmaacs$no2tert <- cut(maacs$logno2_new, cutpoints)\n```\n:::\n\n\nThe `not2tert` variable is now a categorical factor variable containing 3 levels, indicating the ranges of NO2 (on the log scale).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## See the levels of the newly created factor variable\nlevels(maacs$no2tert)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"(0.342,1.23]\" \"(1.23,1.47]\" \"(1.47,2.17]\" \n```\n:::\n:::\n\n\nThe final plot shows the relationship between PM2.5 and nocturnal symptoms by BMI category and NO2 tertile.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Setup ggplot with data frame\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\n\n## Add layers\ng + geom_point(alpha = 1/3) + \n facet_grid(bmicat ~ no2tert) + \n geom_smooth(method=\"lm\", se=FALSE, col=\"steelblue\") + \n theme_bw(base_family = \"Avenir\", base_size = 10) + \n labs(x = expression(\"log \" * PM[2.5])) + \n labs(y = \"Nocturnal Symptoms\") + \n labs(title = \"MAACS Cohort\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![PM2.5 and nocturnal symptoms by BMI category and 
NO2 tertile](index_files/figure-html/unnamed-chunk-22-1.png){width=864}\n:::\n:::\n\n\n
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What happens if you facet on a continuous variable?\n\n2. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?\n\n3. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\n\n4. What does `geom_col()` do? How is it different to `geom_bar()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 
4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n 
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"13 - The ggplot2 plotting system: ggplot()\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"An overview of the ggplot2 plotting system in R with ggplot()\"\ncategories: [module 3, week 3, R, programming, ggplot2, data viz]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/13-ggplot2-plotting-system-part-2/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to build up layers of graphics using `ggplot()`\n- Be able to modify properties of a `ggplot()` including layers and labels\n:::\n\n# The ggplot2 Plotting System\n\nIn this lesson, we will get into a little more of the nitty gritty of **how `ggplot2` builds plots** and how you can customize various aspects of any plot.\n\nPreviously, we used the `qplot()` function to quickly put points on a page.\n\n- The `qplot()` function's syntax is very similar to that of the `plot()` function in base graphics so for those switching over, it makes for an easy transition.\n\nBut it is worth knowing the underlying details of how `ggplot2` works so that you can really exploit its power.\n\n## Basic components of a ggplot2 plot\n\n::: callout-tip\n### Key components\n\nA **`ggplot2` plot** consists of a number of **key components**.\n\n- A **data frame**: stores all of the data that will be displayed on the plot\n\n- **aesthetic mappings**: describe how data are mapped to color, size, shape, location\n\n- **geoms**: geometric objects like points, lines, shapes\n\n- **facets**: describes how conditional/panel plots should be constructed\n\n- **stats**: statistical transformations like binning, quantiles, smoothing\n\n- **scales**: what scale an aesthetic map uses (example: left-handed = red, right-handed = blue)\n\n- **coordinate system**: describes the system in which the locations of the geoms will be drawn\n:::\n\nIt is **essential to organize your data into a data frame** before you start with `ggplot2` (and all the **appropriate metadata** so that your data frame is self-describing and your plots will be self-documenting).\n\nWhen **building plots in `ggplot2`** (rather than using `qplot()`), the **\"artist's palette\" model may be the closest 
analogy**.\n\nEssentially, you start with some raw data, and then you **gradually add bits and pieces to it to create a plot**.\n\n::: callout-tip\n### Note\n\nPlots are built up in layers, with the typical ordering being\n\n1. Plot the data\n2. Overlay a summary\n3. Add metadata and annotation\n:::\n\nFor quick exploratory plots you may not get past step 1.\n\n## Example: BMI, PM2.5, Asthma\n\nTo demonstrate the various pieces of `ggplot2`, we will use a running example from the **Mouse Allergen and Asthma Cohort Study (MAACS)**. Here, the question we are interested in is\n\n> \"Are overweight individuals, as measured by body mass index (BMI), more susceptible than normal weight individuals to the harmful effects of PM2.5 on asthma symptoms?\"\n\nThere is a suggestion that overweight individuals may be more susceptible to the negative effects of inhaling PM2.5.\n\nThis would suggest that increases in PM2.5 exposure in the home of an overweight child would be more deleterious to his/her asthma symptoms than they would be in the home of a normal weight child.\n\nWe want to see whether that difference shows up in the data from MAACS.\n\n::: callout-tip\n### Note\n\nBecause the individual-level data for this study are protected by various U.S. 
privacy laws, we cannot make those data available.\n\nFor the purposes of this lesson, we have **simulated data** that share many of the same features of the original data, but do not contain any of the actual measurements or values contained in the original dataset.\n:::\n\n::: callout-tip\n### Example\n\nWe can look at the data quickly by reading it in as a tibble with `read_csv()` in the `tidyverse` package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(here)\nmaacs <- read_csv(here(\"data\", \"bmi_pm25_no2_sim.csv\"),\n col_types = \"nnci\"\n)\nmaacs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 517 × 4\n logpm25 logno2_new bmicat NocturnalSympt\n \n 1 1.25 1.18 normal weight 1\n 2 1.12 1.55 overweight 0\n 3 1.93 1.43 normal weight 0\n 4 1.37 1.77 overweight 2\n 5 0.775 0.765 normal weight 0\n 6 1.49 1.11 normal weight 0\n 7 2.16 1.43 normal weight 0\n 8 1.65 1.40 normal weight 0\n 9 1.55 1.81 normal weight 0\n10 2.04 1.35 overweight 3\n# ℹ 507 more rows\n```\n:::\n:::\n\n:::\n\nThe outcome we will look at here (`NocturnalSympt`) is the number of days in the past 2 weeks where the child experienced asthma symptoms (e.g. 
coughing, wheezing) while sleeping.\n\nThe other key variables are:\n\n- `logpm25`: average level of PM2.5 over the course of 7 days (micrograms per cubic meter) on the log scale\n\n- `logno2_new`: level of nitrogen dioxide (NO2) on the log scale\n\n- `bmicat`: categorical variable with BMI status\n\n# Building up in layers\n\nFirst, we can **create a `ggplot` object** that stores the dataset and the basic aesthetics for mapping the x- and y-coordinates for the plot.\n\n::: callout-tip\n### Example\n\nHere, we will eventually be plotting the log of PM2.5 and the `NocturnalSympt` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(maacs, aes(\n x = logpm25,\n y = NocturnalSympt\n))\nsummary(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ndata: logpm25, logno2_new, bmicat, NocturnalSympt [517x4]\nmapping: x = ~logpm25, y = ~NocturnalSympt\nfaceting: \n compute_layout: function\n draw_back: function\n draw_front: function\n draw_labels: function\n draw_panels: function\n finish_data: function\n init_scales: function\n map_data: function\n params: list\n setup_data: function\n setup_params: function\n shrink: TRUE\n train_scales: function\n vars: function\n super: \n```\n:::\n\n```{.r .cell-code}\nclass(g)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"gg\" \"ggplot\"\n```\n:::\n:::\n\n:::\n\nYou can see above that the object `g` contains the dataset `maacs` and the mappings.\n\nNow, normally if you were to `print()` a `ggplot` object, a plot would appear on the plot device. However, our object `g` does not yet contain enough information to make a plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\nprint(g)\n```\n\n::: {.cell-output-display}\n![Nothing to see here!](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\n## First plot with point layer\n\nTo make a scatter plot, we need to add at least one **geom**, such as points.\n\nHere, we add the `geom_point()` function to create a 
traditional scatter plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\ng + geom_point()\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and days with nocturnal symptoms](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nHow does ggplot know what points to plot? In this case, it can grab them from the data frame `maacs` that served as the input into the `ggplot()` function.\n\n## Adding more layers\n\n### smooth\n\nBecause the data appear rather noisy, it might be better if we added a smoother on top of the points to see if there is a trend in the data with PM2.5.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth()\n```\n\n::: {.cell-output-display}\n![Scatterplot with smoother](index_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nThe default smoother is a loess smoother, which is flexible and nonparametric but might be too flexible for our purposes. Perhaps we'd prefer a simple linear regression line to highlight any first order trends. 
We can do this by specifying `method = \"lm\"` to `geom_smooth()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth(method = \"lm\")\n```\n\n::: {.cell-output-display}\n![Scatterplot with linear regression line](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nHere, we can see there appears to be a slight increasing trend, suggesting that higher levels of PM2.5 are associated with increased days with nocturnal symptoms.\n\n::: callout-note\n### Question\n\nLet's use the `ggplot()` function with our `palmerpenguins` dataset example and make a scatter plot with `flipper_length_mm` on the x-axis, `bill_length_mm` on the y-axis, colored by `species`, and a smoother by adding a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n### facets\n\nBecause our primary question involves comparing overweight individuals to normal weight individuals, we can **stratify the scatter plot** of PM2.5 and nocturnal symptoms by the BMI category (`bmicat`) variable, which indicates whether an individual is overweight or not.\n\nTo visualize this, we can **add a `facet_grid()`**, which takes a formula argument.\n\n::: callout-tip\n### Example\n\nWe want one row and two columns, one column for each weight category. 
So we specify `bmicat` on the right-hand side of the formula passed to `facet_grid()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point() +\n geom_smooth(method = \"lm\") +\n facet_grid(. ~ bmicat)\n```\n\n::: {.cell-output-display}\n![Scatterplot of PM2.5 and nocturnal symptoms by BMI category](index_files/figure-html/unnamed-chunk-8-1.png){width=864}\n:::\n:::\n\n:::\n\nNow it seems clear that the relationship between PM2.5 and nocturnal symptoms is relatively flat among normal weight individuals, while the relationship is increasing among overweight individuals.\n\nThis plot suggests that overweight individuals may be more susceptible to the effects of PM2.5.\n\n# Modifying geom properties\n\nYou can **modify properties of geoms** by specifying options to their respective `geom_*()` functions.\n\n### map aesthetics to constants\n\n::: callout-tip\n### Example\n\nFor example, here we modify the points in the scatterplot to make the color \"steelblue\", the size larger, and the alpha transparency greater.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(color = \"steelblue\", size = 4, alpha = 1 / 2)\n```\n\n::: {.cell-output-display}\n![Modifying point color with a constant](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n:::\n\n### map aesthetics to variables\n\nIn addition to setting specific geom attributes to constant values, we can **map aesthetics to variables** in our dataset.\n\nFor example, we can map the aesthetic `color` to the variable `bmicat`, so the points will be colored according to the levels of `bmicat`.\n\nWe use the `aes()` function to indicate this difference from the plot above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng + geom_point(aes(color = bmicat), size = 4, alpha = 1 / 2)\n```\n\n::: {.cell-output-display}\n![Mapping color to a variable](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Customizing the smooth\n\nWe can also **customize aspects of the geoms**.\n\nFor 
example, we can customize the smoother that we overlay on the points with `geom_smooth()`.\n\nHere we change the line type and increase the size from the default. We also remove the shaded standard error from the line.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat),\n size = 2,\n alpha = 1 / 2\n ) +\n geom_smooth(\n size = 4,\n linetype = 3,\n method = \"lm\",\n se = FALSE\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.\nℹ Please use `linewidth` instead.\n```\n:::\n\n::: {.cell-output-display}\n![Customizing a smoother](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\n# Other important stuff\n\n## Changing the theme\n\nThe **default theme for `ggplot2` uses the gray background** with white grid lines.\n\nIf you don't find this suitable, you can use the black and white theme by using the `theme_bw()` function.\n\nThe `theme_bw()` function also allows you to set the typeface for the plot, in case you don't want the default Helvetica. Here we change the typeface to Times.\n\n::: callout-tip\n### Note\n\nFor things that only make sense globally, use `theme()`, i.e. `theme(legend.position = \"none\")`. 
Two standard appearance themes are included\n\n- `theme_gray()`: The default theme (gray background)\n- `theme_bw()`: More stark/plain\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat)) +\n theme_bw(base_family = \"Times\")\n```\n\n::: {.cell-output-display}\n![Modifying the theme for a plot](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n::: callout-note\n### Question\n\nLet's take our `palmerpenguins` scatterplot from above and change out the theme to use `theme_dark()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## Modifying labels\n\n::: callout-tip\n### Note\n\nThere are a variety of **annotations** you can add to a plot, including **different kinds of labels**.\n\n- `xlab()` for x-axis labels\n- `ylab()` for y-axis labels\n- `ggtitle()` for specifying plot titles\n\n`labs()` function is generic and can be used to modify multiple types of labels at once\n:::\n\nHere is an example of modifying the title and the `x` and `y` labels to make the plot a bit more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_point(aes(color = bmicat)) +\n labs(title = \"MAACS Cohort\") +\n labs(\n x = expression(\"log \" * PM[2.5]),\n y = \"Nocturnal Symptoms\"\n )\n```\n\n::: {.cell-output-display}\n![Modifying plot 
labels](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\n## A quick aside about axis limits\n\nOne quick **quirk about `ggplot2`** that caught me up when I first started using the package can be displayed in the following example.\n\nIf you make a lot of time series plots, you often **want to restrict the range of the y-axis** while still plotting all the data.\n\nIn the base graphics system you can do that as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntestdat <- data.frame(\n x = 1:100,\n y = rnorm(100)\n)\ntestdat[50, 2] <- 100 ## Outlier!\nplot(testdat$x,\n testdat$y,\n type = \"l\",\n ylim = c(-3, 3)\n)\n```\n\n::: {.cell-output-display}\n![Time series plot with base graphics](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n\nHere, we have restricted the y-axis range to be between -3 and 3, even though there is a clear outlier in the data.\n\n::: callout-tip\n### Example\n\nWith `ggplot2` the default settings will give you this.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(testdat, aes(x = x, y = y))\ng + geom_line()\n```\n\n::: {.cell-output-display}\n![Time series plot with default settings](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\nOne might think that modifying the `ylim()` attribute would give you the same thing as the base plot, but it doesn't (?????)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_line() +\n ylim(-3, 3)\n```\n\n::: {.cell-output-display}\n![Time series plot with modified ylim](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n:::\n\nEffectively, what this does is subset the data so that only observations between -3 and 3 are included, then plot the data.\n\nTo plot the data without subsetting it first and still get the restricted range, you have to do the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng +\n geom_line() +\n coord_cartesian(ylim = c(-3, 3))\n```\n\n::: {.cell-output-display}\n![Time series plot with 
restricted y-axis range](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nAnd now you know!\n\n# Post-lecture materials\n\n### Resources\n\n- The *ggplot2* book by Hadley Wickham\n- The *R Graphics Cookbook* by Winston Chang (examples in base plots and in `ggplot2`)\n- [tidyverse web site](http://ggplot2.tidyverse.org)\n\n### More complex example with `ggplot2`\n\nNow you get the sense that plots in the `ggplot2` system are constructed by successively adding components to the plot, starting with the base dataset and maybe a scatterplot. In this section below, you can see a slightly more complicated example with an additional variable.\n\n
\n\nClick here for a slightly more complicated example with `ggplot()`.\n\nNow, we will ask the question\n\n> How does the relationship between PM2.5 and nocturnal symptoms vary by BMI category and nitrogen dioxide (NO2)?\n\nUnlike our previous BMI variable, NO2 is continuous, and so we need to make NO2 categorical so we can condition on it in the plotting. We can use the `cut()` function for this purpose. We will divide the NO2 variable into tertiles.\n\nFirst we need to calculate the tertiles with the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)\n```\n:::\n\n\nThen we need to divide the original `logno2_new` variable into the ranges defined by the cut points computed above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmaacs$no2tert <- cut(maacs$logno2_new, cutpoints)\n```\n:::\n\n\nThe `no2tert` variable is now a categorical factor variable containing 3 levels, indicating the ranges of NO2 (on the log scale).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## See the levels of the newly created factor variable\nlevels(maacs$no2tert)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"(0.342,1.23]\" \"(1.23,1.47]\" \"(1.47,2.17]\" \n```\n:::\n:::\n\n\nThe final plot shows the relationship between PM2.5 and nocturnal symptoms by BMI category and NO2 tertile.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Setup ggplot with data frame\ng <- maacs %>%\n ggplot(aes(logpm25, NocturnalSympt))\n\n## Add layers\ng + geom_point(alpha = 1 / 3) +\n facet_grid(bmicat ~ no2tert) +\n geom_smooth(method = \"lm\", se = FALSE, col = \"steelblue\") +\n theme_bw(base_family = \"Avenir\", base_size = 10) +\n labs(x = expression(\"log \" * PM[2.5])) +\n labs(y = \"Nocturnal Symptoms\") +\n labs(title = \"MAACS Cohort\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n`geom_smooth()` using formula = 'y ~ x'\n```\n:::\n\n::: {.cell-output-display}\n![PM2.5 and nocturnal symptoms by BMI category and 
NO2 tertile](index_files/figure-html/unnamed-chunk-22-1.png){width=864}\n:::\n:::\n\n\n
\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. What happens if you facet on a continuous variable?\n\n2. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?\n\n3. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\n\n4. What does `geom_col()` do? How is it different to `geom_bar()`?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 
4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)\n lattice 0.21-8 2023-04-05 [1] CRAN (R 4.3.1)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)\n mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n 
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png index d15ae2e..807f126 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-15-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png index cd87fdf..ed79c8a 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-16-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png index 0c3c0d5..126a383 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png and b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png index a455f86..1f3afa1 100644 Binary files a/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png and 
b/_freeze/posts/13-ggplot2-plotting-system-part-2/index/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/posts/15-control-structures/index/execute-results/html.json b/_freeze/posts/15-control-structures/index/execute-results/html.json index 326fd20..ff844d6 100644 --- a/_freeze/posts/15-control-structures/index/execute-results/html.json +++ b/_freeze/posts/15-control-structures/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4468daea3a3229a9227b455c6194b86c", + "hash": "5879f7961c587861a8d64dc17e5f9fdf", "result": { - "markdown": "---\ntitle: \"15 - Control Structures\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to control the flow of execution of a series of R expressions\"\ncategories: [module 4, week 4, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/15-control-structures/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to use commonly used control structures including `if`, `while`, `repeat`, and `for`\n- Be able to skip an iteration of a loop using `next`\n- Be able to exit a loop immediately using `break`\n:::\n\n# Control Structures\n\n**Control structures** in R allow you to **control the flow of execution of a series of R expressions**.\n\nBasically, control structures allow you to put some \"logic\" into your R code, rather than just always executing the same R code every time.\n\nControl structures **allow you to respond to inputs or to features of the data** and execute different R expressions accordingly.\n\nCommonly used control structures are\n\n- `if` and `else`: testing a condition and acting on it\n\n- `for`: execute a loop a fixed number of times\n\n- `while`: execute a loop *while* a condition is true\n\n- `repeat`: execute an infinite loop (must `break` out of it to stop)\n\n- `break`: break the execution of a loop\n\n- `next`: skip an interation of a loop\n\n::: callout-tip\n### Pro-tip\n\nMost control structures are not used in interactive sessions, but rather when writing functions or longer expressions.\n\nHowever, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions.\n:::\n\n## `if`-`else`\n\nThe `if`-`else` combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it's true or false.\n\nFor starters, you can just use the `if` statement.\n\n``` r\nif() {\n ## do something\n} \n## Continue with rest of code\n```\n\nThe above code does nothing if the condition is false. 
If you have an action you want to execute when the condition is false, then you need an `else` clause.\n\n``` r\nif() {\n ## do something\n} \nelse {\n ## do something else\n}\n```\n\nYou can have a series of tests by following the initial `if` with any number of `else if`s.\n\n``` r\nif() {\n ## do something\n} else if() {\n ## do something different\n} else {\n ## do something different\n}\n```\n\nHere is an example of a valid if/else structure.\n\nLet's use the `runif(n, min=0, max=1)` function which draws a random value between a min and max value with the default being between 0 and 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(n=1, min=0, max=10) \nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4.495949\n```\n:::\n:::\n\n\nThen, we can write and `if`-`else` statement that tests whethere `x` is greater than 3 or not.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx > 3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\nIf `x` is greater than 3, then the first condition occurs. If `x` is not greater than 3, then the second condition occurs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif(x > 3) {\n y <- 10\n } else {\n y <- 0\n }\n```\n:::\n\n\nFinally, we can auto print `y` to see what the value is.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\nThis expression can also be written a different (but equivalent!) way in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- if(x > 3) {\n 10\n } else { \n 0\n }\n\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNeither way of writing this expression is more correct than the other.\n\nWhich one you use will **depend on your preference** and perhaps those of the team you may be working with.\n:::\n\nOf course, the `else` clause is not necessary. 
You could have a series of if clauses that always get executed if their respective conditions are true.\n\n``` r\nif() {\n\n}\n\nif() {\n\n}\n```\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset and write a if-else statement that\n\n1. Randomly samples a value from a standard normal distribution (**Hint**: check out the `rnorm(n, mean = 0, sd = 1)` function in base R).\n2. If the value is larger than 0, use `dplyr` functions to keep only the `Chinstrap` penguins.\n3. Otherwise, keep only the `Gentoo` penguins.\n4. Re-run the code 10 times and look at output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(tidyverse)\nlibrary(palmerpenguins)\npenguins \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n \n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex , year \n```\n:::\n:::\n\n:::\n\n## `for` Loops\n\n**For loops** are pretty much the only looping construct that you will need in R. 
While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop was not sufficient.\n\nIn R, for loops take an iterator variable and assign it successive values from a sequence or vector.\n\nFor loops are most commonly used for **iterating over the elements of an object** (list, vector, etc.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:10) {\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n[1] 10\n```\n:::\n:::\n\n\nThis **loop takes the `i` variable** and in **each iteration of the loop** gives it values 1, 2, 3, ..., 10, then **executes the code** within the curly braces, and then the loop exits.\n\nThe following three loops all have the same behavior.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor(i in 1:4) {\n ## Print out each element of 'x'\n print(x[i]) \n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nWe can also print just the iteration value (`i`) itself\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor(i in 1:4) {\n ## Print out just 'i'\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n```\n:::\n:::\n\n\n### `seq_along()`\n\nThe `seq_along()` function is **commonly used in conjunction with `for` loops** in order to generate an integer sequence based on the length of an object (or `ncol()` of an R object) (in this case, the object `x`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\" \"c\" \"d\"\n```\n:::\n\n```{.r .cell-code}\nseq_along(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4\n```\n:::\n:::\n\n\nThe 
`seq_along()` function takes in a vector and then **returns a sequence of integers** that is the same length as the input vector. It doesn't matter what class the vector is.\n\nLet's put `seq_along()` and `for` loops together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate a sequence based on length of 'x'\nfor(i in seq_along(x)) { \n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nIt is not necessary to use an index-type variable (i.e. `i`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(babyshark in x) {\n print(babyshark)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(candyisgreat in x) {\n print(candyisgreat)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(RememberToVote in x) {\n print(RememberToVote)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nYou can use any character index you want (but not with symbols or numbers).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(1999 in x) {\n print(1999)\n}\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: :1:5: unexpected numeric constant\n1: for(1999\n ^\n```\n:::\n:::\n\n\nFor one line loops, the curly braces are not strictly necessary.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:4) print(x[i])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nHowever, I like to use curly braces even for one-line loops, because that way if you decide to expand the loop to multiple lines, you won't be burned because you forgot to add curly braces (and you **will** be burned by this).\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset. Here are the tasks:\n\n1. 
Start a `for` loop\n2. Iterate over the columns of `penguins`\n3. For each column, extract the values of that column (**Hint**: check out the `pull()` function in `dplyr`).\n4. Using a `if`-`else` statement, test whether or not the values in the column are numeric or not (**Hint**: remember the `is.numeric()` function to test if a value is numeric).\n5. If they are numeric, compute the column mean. Otherwise, report a `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### Nested `for` loops\n\n`for` loops can be **nested** inside of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(1:6, nrow = 2, ncol = 3)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in seq_len(nrow(x))) {\n for(j in seq_len(ncol(x))) {\n print(x[i, j])\n } \n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 3\n[1] 5\n[1] 2\n[1] 4\n[1] 6\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `j` index goes across the columns. That's why we values 1, 3, etc.\n:::\n\nNested loops are commonly needed for **multidimensional or hierarchical data structures** (e.g. matrices, lists). 
Be careful with nesting though.\n\nNesting beyond 2 to 3 levels often makes it **difficult to read/understand the code**.\n\nIf you find yourself in need of a large number of nested loops, you may want to **break up the loops by using functions** (discussed later).\n\n## `while` Loops\n\n**`while` loops** begin by **testing a condition**.\n\nIf it is true, then they execute the loop body.\n\nOnce the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount <- 0\nwhile(count < 10) {\n print(count)\n count <- count + 1\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n```\n:::\n:::\n\n\n`while` loops can potentially result in infinite loops if not written properly. **Use with care!**\n\nSometimes there will be more than one condition in the test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- 5\nset.seed(1)\n\nwhile(z >= 3 && z <= 10) {\n coin <- rbinom(1, 1, 0.5)\n \n if(coin == 1) { ## random walk\n z <- z + 1\n } else {\n z <- z - 1\n } \n}\nprint(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\nWhat's the difference between using one `&` or two `&&` ?\n\nIf you use only one `&`, these are vectorized operations, meaning they can **return a vector**, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n-2:2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -2 -1 0 1 2\n```\n:::\n\n```{.r .cell-code}\n((-2:2) >= 0) & ((-2:2) <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIf you use two `&&` (as above), then these **conditions are evaluated left to right**. 
For example, in the above code, if `z` were less than 3, the second test would not have been evaluated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\n(-2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n:::\n\n## `repeat` Loops\n\n**`repeat` initiates an infinite loop** right from the start. These are **not commonly used** in statistical or data analysis applications, but they do have their uses.\n\n::: callout-tip\n### IMPORTANT (READ THIS AND DON'T FORGET... I'M SERIOUS... YOU WANT TO REMEMBER THIS.. FOR REALZ PLZ REMEMBER THIS)\n\nThe only way to exit a `repeat` loop is to call `break`.\n:::\n\nOne possible paradigm might be in an iterative algorithm where you may be searching for a solution and you do not want to stop until you are close enough to the solution.\n\nIn this kind of situation, you often don't know in advance how many iterations it's going to take to get \"close enough\" to the solution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx0 <- 1\ntol <- 1e-8\n\nrepeat {\n x1 <- computeEstimate()\n \n if(abs(x1 - x0) < tol) { ## Close enough?\n break\n } else {\n x0 <- x1\n } \n}\n```\n:::\n\n\n::: callout-tip\n### Note\n\nThe above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this demonstration).\n:::\n\n::: callout-tip\n### Pro-tip\n\nThe loop above is a bit **dangerous** because there is no guarantee it will stop.\n\nYou could get in a situation where the values of `x0` and `x1` oscillate back and forth and never converge.\n\nBetter to set a hard limit on the number of iterations by using a `for` loop and then report whether convergence was achieved or not.\n:::\n\n## `next`, `break`\n\n`next` is used to skip an iteration of a loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:100) {\n if(i <= 20) {\n ## Skip the first 20 
iterations\n next \n }\n ## Do something here\n}\n```\n:::\n\n\n`break` is used to exit a loop immediately, regardless of what iteration the loop may be on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor(i in 1:100) {\n print(i)\n\n if(i > 20) {\n ## Stop loop after 20 iterations\n break \n }\t\t\n}\n```\n:::\n\n\n# Summary\n\n- Control structures like `if`, `while`, and `for` allow you to control the flow of an R program\n- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct.\n- Control structures mentioned here are primarily useful for writing programs; for command-line interactive work, the \"apply\" functions are more useful.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write for loops to compute the mean of every column in `mtcars`.\n\n2. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, `files <- dir(\"data/\", pattern = \"\\\\.csv$\", full.names = TRUE)`, and now want to read each one with `read_csv()`. Write the for loop that will load them into a single data frame.\n\n3. What happens if you use `for (nm in names(x))` and `x` has no names? What if only some of the elements are named? 
What if the names are not unique?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 
4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"15 - Control Structures\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to control the flow of execution of a series of R expressions\"\ncategories: [module 4, week 4, R, programming]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/15-control-structures/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Be able to use commonly used control structures including `if`, `while`, `repeat`, and `for`\n- Be able to skip an iteration of a loop using `next`\n- Be able to exit a loop immediately using `break`\n:::\n\n# Control Structures\n\n**Control structures** in R allow you to **control the flow of execution of a series of R expressions**.\n\nBasically, control structures allow you to put some \"logic\" into your R code, rather than just always executing the same R code every time.\n\nControl structures **allow you to respond to inputs or to features of the data** and execute different R expressions accordingly.\n\nCommonly used control structures are\n\n- `if` and `else`: testing a condition and acting on it\n\n- `for`: execute a loop a fixed number of times\n\n- `while`: execute a loop *while* a condition is true\n\n- `repeat`: execute an infinite loop (must `break` out of it to stop)\n\n- `break`: break the execution of a loop\n\n- `next`: skip an iteration of a loop\n\n::: callout-tip\n### Pro-tip\n\nMost control structures are not used in interactive sessions, but rather when writing functions or longer expressions.\n\nHowever, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions.\n:::\n\n## `if`-`else`\n\nThe `if`-`else` combination is probably the most commonly used control structure in R (or perhaps any language). 
This structure allows you to test a condition and act on it depending on whether it's true or false.\n\nFor starters, you can just use the `if` statement.\n\n``` r\nif() {\n ## do something\n} \n## Continue with rest of code\n```\n\nThe above code does nothing if the condition is false. If you have an action you want to execute when the condition is false, then you need an `else` clause.\n\n``` r\nif() {\n ## do something\n} else {\n ## do something else\n}\n```\n\nYou can have a series of tests by following the initial `if` with any number of `else if`s.\n\n``` r\nif() {\n ## do something\n} else if() {\n ## do something different\n} else {\n ## do something different\n}\n```\n\nHere is an example of a valid if/else structure.\n\nLet's use the `runif(n, min=0, max=1)` function which draws a random value between a min and max value with the default being between 0 and 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(n = 1, min = 0, max = 10)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.521267\n```\n:::\n:::\n\n\nThen, we can write an `if`-`else` statement that tests whether `x` is greater than 3 or not.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx > 3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\nIf `x` is greater than 3, then the first condition occurs. If `x` is not greater than 3, then the second condition occurs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (x > 3) {\n y <- 10\n} else {\n y <- 0\n}\n```\n:::\n\n\nFinally, we can auto print `y` to see what the value is.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\nThis expression can also be written a different (but equivalent!) 
way in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- if (x > 3) {\n 10\n} else {\n 0\n}\n\ny\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 10\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNeither way of writing this expression is more correct than the other.\n\nWhich one you use will **depend on your preference** and perhaps those of the team you may be working with.\n:::\n\nOf course, the `else` clause is not necessary. You could have a series of `if` clauses that always get executed if their respective conditions are true.\n\n``` r\nif (<condition1>) {\n\n}\n\nif (<condition2>) {\n\n}\n```\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset and write an `if`-`else` statement that\n\n1. Randomly samples a value from a standard normal distribution (**Hint**: check out the `rnorm(n, mean = 0, sd = 1)` function in base R).\n2. If the value is larger than 0, use `dplyr` functions to keep only the `Chinstrap` penguins.\n3. Otherwise, keep only the `Gentoo` penguins.\n4. Re-run the code 10 times and look at the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n\nlibrary(tidyverse)\nlibrary(palmerpenguins)\npenguins\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 344 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n <fct> <fct> <dbl> <dbl> <int> <int>\n 1 Adelie Torgersen 39.1 18.7 181 3750\n 2 Adelie Torgersen 39.5 17.4 186 3800\n 3 Adelie Torgersen 40.3 18 195 3250\n 4 Adelie Torgersen NA NA NA NA\n 5 Adelie Torgersen 36.7 19.3 193 3450\n 6 Adelie Torgersen 39.3 20.6 190 3650\n 7 Adelie Torgersen 38.9 17.8 181 3625\n 8 Adelie Torgersen 39.2 19.6 195 4675\n 9 Adelie Torgersen 34.1 18.1 193 3475\n10 Adelie Torgersen 42 20.2 190 4250\n# ℹ 334 more rows\n# ℹ 2 more variables: sex <fct>, year <int>\n```\n:::\n:::\n\n:::\n\n## `for` Loops\n\n**For loops** are pretty much the only looping construct that you will need in R. 
While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop was not sufficient.\n\nIn R, for loops take an iterator variable and assign it successive values from a sequence or vector.\n\nFor loops are most commonly used for **iterating over the elements of an object** (list, vector, etc.)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n[1] 10\n```\n:::\n:::\n\n\nThis **loop takes the `i` variable** and in **each iteration of the loop** gives it values 1, 2, 3, ..., 10, then **executes the code** within the curly braces, and then the loop exits.\n\nThe following three loops all have the same behavior.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor (i in 1:4) {\n ## Print out each element of 'x'\n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nWe can also print just the iteration value (`i`) itself\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## define the loop to iterate over\nx <- c(\"a\", \"b\", \"c\", \"d\")\n\n## create for loop\nfor (i in 1:4) {\n ## Print out just 'i'\n print(i)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n```\n:::\n:::\n\n\n### `seq_along()`\n\nThe `seq_along()` function is **commonly used in conjunction with `for` loops** in order to generate an integer sequence based on the length of an object (or `ncol()` of an R object) (in this case, the object `x`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\" \"c\" \"d\"\n```\n:::\n\n```{.r .cell-code}\nseq_along(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 2 3 4\n```\n:::\n:::\n\n\nThe 
`seq_along()` function takes in a vector and then **returns a sequence of integers** that is the same length as the input vector. It doesn't matter what class the vector is.\n\nLet's put `seq_along()` and `for` loops together.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate a sequence based on length of 'x'\nfor (i in seq_along(x)) {\n print(x[i])\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nIt is not necessary to use an index-type variable (i.e. `i`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (babyshark in x) {\n print(babyshark)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (candyisgreat in x) {\n print(candyisgreat)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (RememberToVote in x) {\n print(RememberToVote)\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nYou can use almost any name you want for the loop variable, as long as it is a valid R object name (it cannot be a number or start with a symbol).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (1999 in x) {\n print(1999)\n}\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: <text>:1:6: unexpected numeric constant\n1: for (1999\n ^\n```\n:::\n:::\n\n\nFor one-line loops, the curly braces are not strictly necessary.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:4) print(x[i])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\"\n[1] \"b\"\n[1] \"c\"\n[1] \"d\"\n```\n:::\n:::\n\n\nHowever, I like to use curly braces even for one-line loops, because that way if you decide to expand the loop to multiple lines, you won't be burned because you forgot to add curly braces (and you **will** be burned by this).\n\n::: callout-note\n### Question\n\nLet's use the `palmerpenguins` dataset. Here are the tasks:\n\n1. 
Start a `for` loop\n2. Iterate over the columns of `penguins`\n3. For each column, extract the values of that column (**Hint**: check out the `pull()` function in `dplyr`).\n4. Using an `if`-`else` statement, test whether the values in the column are numeric or not (**Hint**: remember the `is.numeric()` function to test if a value is numeric).\n5. If they are numeric, compute the column mean. Otherwise, report `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# try it yourself\n```\n:::\n\n:::\n\n### Nested `for` loops\n\n`for` loops can be **nested** inside of each other.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(1:6, nrow = 2, ncol = 3)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 1 3 5\n[2,] 2 4 6\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in seq_len(nrow(x))) {\n for (j in seq_len(ncol(x))) {\n print(x[i, j])\n }\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n[1] 3\n[1] 5\n[1] 2\n[1] 4\n[1] 6\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe `j` index goes across the columns. That's why we see the values 1, 3, 5 first (the first row), followed by 2, 4, 6 (the second row).\n:::\n\nNested loops are commonly needed for **multidimensional or hierarchical data structures** (e.g. matrices, lists). 
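For example, here is a minimal sketch (the list `dat` is invented purely for illustration) of a nested loop walking a list of vectors, a simple hierarchical structure:\n\n``` r\n## 'dat' is a hypothetical list of numeric vectors of different lengths\ndat <- list(a = c(1, 2), b = c(10, 20, 30))\n\nfor (i in seq_along(dat)) {\n for (j in seq_along(dat[[i]])) {\n ## print the vector name, the position, and the value\n cat(names(dat)[i], j, dat[[i]][j], \"\\n\")\n }\n}\n```\n\nThe outer loop picks a vector and the inner loop visits each of its elements.\n\n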
Be careful with nesting though.\n\nNesting beyond 2 to 3 levels often makes it **difficult to read/understand the code**.\n\nIf you find yourself in need of a large number of nested loops, you may want to **break up the loops by using functions** (discussed later).\n\n## `while` Loops\n\n**`while` loops** begin by **testing a condition**.\n\nIf it is true, then they execute the loop body.\n\nOnce the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncount <- 0\nwhile (count < 10) {\n print(count)\n count <- count + 1\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0\n[1] 1\n[1] 2\n[1] 3\n[1] 4\n[1] 5\n[1] 6\n[1] 7\n[1] 8\n[1] 9\n```\n:::\n:::\n\n\n`while` loops can potentially result in infinite loops if not written properly. **Use with care!**\n\nSometimes there will be more than one condition in the test.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- 5\nset.seed(1)\n\nwhile (z >= 3 && z <= 10) {\n coin <- rbinom(1, 1, 0.5)\n\n if (coin == 1) { ## random walk\n z <- z + 1\n } else {\n z <- z - 1\n }\n}\nprint(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\nWhat's the difference between using one `&` or two `&&` ?\n\nIf you use only one `&`, these are vectorized operations, meaning they can **return a vector**, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n-2:2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -2 -1 0 1 2\n```\n:::\n\n```{.r .cell-code}\n((-2:2) >= 0) & ((-2:2) <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIf you use two `&&` (as above), then these **conditions are evaluated left to right**. 
For example, in the above code, if `z` were less than 3, the second test would not have been evaluated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\n(-2 >= 0) && (-2 <= 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n:::\n\n## `repeat` Loops\n\n**`repeat` initiates an infinite loop** right from the start. These are **not commonly used** in statistical or data analysis applications, but they do have their uses.\n\n::: callout-tip\n### IMPORTANT (READ THIS AND DON'T FORGET... I'M SERIOUS... YOU WANT TO REMEMBER THIS.. FOR REALZ PLZ REMEMBER THIS)\n\nThe only way to exit a `repeat` loop is to call `break`.\n:::\n\nOne possible paradigm might be in an iterative algorithm where you may be searching for a solution and you do not want to stop until you are close enough to the solution.\n\nIn this kind of situation, you often don't know in advance how many iterations it's going to take to get \"close enough\" to the solution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx0 <- 1\ntol <- 1e-8\n\nrepeat {\n x1 <- computeEstimate()\n\n if (abs(x1 - x0) < tol) { ## Close enough?\n break\n } else {\n x0 <- x1\n }\n}\n```\n:::\n\n\n::: callout-tip\n### Note\n\nThe above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this demonstration).\n:::\n\n::: callout-tip\n### Pro-tip\n\nThe loop above is a bit **dangerous** because there is no guarantee it will stop.\n\nYou could get in a situation where the values of `x0` and `x1` oscillate back and forth and never converge.\n\nBetter to set a hard limit on the number of iterations by using a `for` loop and then report whether convergence was achieved or not.\n:::\n\n## `next`, `break`\n\n`next` is used to skip an iteration of a loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:100) {\n if (i <= 20) {\n ## Skip the first 20 
iterations\n next\n }\n ## Do something here\n}\n```\n:::\n\n\n`break` is used to exit a loop immediately, regardless of what iteration the loop may be on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:100) {\n print(i)\n\n if (i > 20) {\n ## Stop loop after 20 iterations\n break\n }\n}\n```\n:::\n\n\n# Summary\n\n- Control structures like `if`, `while`, and `for` allow you to control the flow of an R program\n- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct.\n- Control structures mentioned here are primarily useful for writing programs; for command-line interactive work, the \"apply\" functions are more useful.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write for loops to compute the mean of every column in `mtcars`.\n\n2. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, `files <- dir(\"data/\", pattern = \"\\\\.csv$\", full.names = TRUE)`, and now want to read each one with `read_csv()`. Write the for loop that will load them into a single data frame.\n\n3. What happens if you use `for (nm in names(x))` and `x` has no names? What if only some of the elements are named? 
What if the names are not unique?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 
4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/16-functions/index/execute-results/html.json b/_freeze/posts/16-functions/index/execute-results/html.json index 21be540..d99d8f2 100644 --- a/_freeze/posts/16-functions/index/execute-results/html.json +++ b/_freeze/posts/16-functions/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2bcbe1b7adebc05021228f054529b187", + "hash": "4e89da80ee98fc8abb67c9e1b35b137b", "result": { - "markdown": "---\ntitle: \"16 - Functions\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to writing functions in R\"\ncategories: [module 4, week 4, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/16-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know how to create a **function** using `function()` in R\n- Know how to define **named arguments** inside a function with default values\n- Be able to use named matching or **positional matching** in the argument list\n- Understand what **lazy evaluation** is\n- Understand the special `...` argument in a function definition\n:::\n\n# Introduction\n\nWriting functions is a **core activity** of an R programmer. It represents the key step of the transition from a mere \"user\" to a developer who creates new functionality for R.\n\n**Functions** are often used to **encapsulate a sequence of expressions that need to be executed numerous times**, perhaps under slightly different conditions.\n\nFunctions are also often written **when code must be shared with others or the public**.\n\nWriting a function allows a developer to create an interface to the code that is explicitly specified with a set of **arguments** (or parameters).\n\nThis interface provides an **abstraction of the code** to potential users. 
This abstraction simplifies the users' lives because it relieves them from having to know every detail of how the code operates.\n\nIn addition, the creation of an interface allows the developer to **communicate to the user the aspects of the code that are important** or are most relevant.\n\n## Functions in R\n\nFunctions in R are \"first class objects\", which means that they can be treated much like any other R object.\n\n::: callout-tip\n### Important facts about R functions\n\n- Functions can be passed as arguments to other functions.\n - This is very handy for the various apply functions, like `lapply()` and `sapply()`.\n- Functions can be nested, so that you can define a function inside of another function.\n:::\n\nIf you are familiar with a common language like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.\n\n## Your First Function\n\nFunctions are defined using the `function()` directive and are **stored as R objects** just like anything else.\n\n::: callout-tip\n### Important\n\nIn particular, functions are R objects of class `function`.\n\nHere's a simple function that takes no arguments and does nothing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n ## This is an empty function\n}\n## Functions have their own class\nclass(f) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"function\"\n```\n:::\n\n```{.r .cell-code}\n## Execute this function\nf() \n```\n\n::: {.cell-output .cell-output-stdout}\n```\nNULL\n```\n:::\n:::\n\n:::\n\nNot very interesting, but it is a start!\n\nThe next thing we can do is **create a function** that actually has a non-trivial **function body**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n # this is the function body\n hello <- \"Hello, world!\\n\"\n cat(hello) \n}\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\n`cat()` is useful and 
preferable to `print()` in several settings. One reason is that it interprets escape characters such as `\\n`, printing an actual new line instead of the literal characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhello <- \"Hello, world!\\n\"\n\nprint(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"Hello, world!\\n\"\n```\n:::\n\n```{.r .cell-code}\ncat(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n:::\n\nThe last aspect of a basic function is the **function arguments**.\n\nThese are **the options that you can specify to the user** that the user may explicitly set.\n\nFor this basic function, we can add an argument that determines how many times \"Hello, world!\" is printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n for(i in seq_len(num)) {\n hello <- \"Hello, world!\\n\"\n cat(hello) \n }\n}\nf(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n:::\n\n\nObviously, we **could have just cut-and-pasted** the `cat(\"Hello, world!\\n\")` code three times to achieve the same effect, but then we wouldn't be programming, would we?\n\nAlso, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see \"Hello, world!\".\n\n::: callout-tip\n### Pro-tip\n\nIf you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function.\n:::\n\nFinally, the function above doesn't **return** anything.\n\nIt just prints \"Hello, world!\" to the console `num` number of times and then exits.\n\nBut often it is useful **if a function returns something** that perhaps can be fed into another section of code.\n\nThis next function returns the total number of characters printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n hello <- \"Hello, world!\\n\"\n for(i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n 
chars\n}\nmeaningoflife <- f(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n\n```{.r .cell-code}\nprint(meaningoflife)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 42\n```\n:::\n:::\n\n\nIn the above function, we did not have to indicate anything special in order for the function to return the number of characters.\n\nIn R, the **return value of a function** is always the very **last expression that is evaluated**.\n\nBecause the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function.\n\n::: callout-tip\n### Note\n\nThere is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this lesson).\n:::\n\nFinally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in f(): argument \"num\" is missing, with no default\n```\n:::\n:::\n\n\nWe can modify this behavior by setting a **default value** for the argument `num`.\n\n**Any function argument can have a default value**, if you wish to specify it.\n\nSometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. 
This relieves the user from having to specify the value of that argument every single time the function is called.\n\nHere, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print \"Hello, world!\" to the console once.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num = 1) {\n hello <- \"Hello, world!\\n\"\n for(i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n chars\n}\n\n\nf() ## Use default value for 'num'\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n\n```{.r .cell-code}\nf(2) ## Use user-specified value\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nRemember that the function still returns the number of characters printed to the console.\n\n::: callout-tip\n### Pro-tip\n\nThe `formals()` function returns a list of all the formal arguments of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nformals(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$num\n[1] 1\n```\n:::\n:::\n\n:::\n\n## Summary\n\nWe have written a function that\n\n- has one *formal argument* named `num` with a *default value* of 1. 
The *formal arguments* are the arguments included in the function definition.\n\n- prints the message \"Hello, world!\" to the console a number of times indicated by the argument `num`\n\n- *returns* the number of characters printed to the console\n\n# Arguments\n\n## Named arguments\n\nAbove, we have learned that functions have **named arguments**, which can optionally have default values.\n\nBecause all function arguments have names, they can be specified using their name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf(num = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nSpecifying an argument by its name is sometimes useful **if a function has many arguments** and it may not always be clear which argument is being specified.\n\nHere, our function only has one argument so there's no confusion.\n\n## Argument matching\n\nCalling an **R function with multiple arguments** can be done in a variety of ways.\n\nThis may be confusing at first, but it's really handy when doing interactive work at the command line. 
R function arguments can be matched **positionally** or **by name**.\n\n- **Positional matching** just means that R assigns the first value to the first argument, the second value to the second argument, etc.\n\nSo, in the following call to `rnorm()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (n, mean = 0, sd = 1) \n```\n:::\n\n```{.r .cell-code}\nmydata <- rnorm(100, 2, 1) ## Generate some data\n```\n:::\n\n\n100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching.\n\nThe following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent.\n\n::: callout-tip\n### Note\n\n`sd(x, na.rm = FALSE)` has two arguments:\n\n- `x` indicates the vector of numbers\n- `na.rm` is a logical indicating whether missing values should be removed or not (default is `FALSE`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Positional match first argument, default for 'na.rm'\nsd(mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n\n```{.r .cell-code}\n## Specify 'x' argument by name, default for 'na.rm'\nsd(x = mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(x = mydata, na.rm = FALSE) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n:::\n\nWhen **specifying the function arguments by name**, it **doesn't matter in what order** you specify them.\n\nIn the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(na.rm = FALSE, x = mydata) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n\nYou **can mix positional 
matching with matching by name**.\n\nWhen an argument is matched by name, **it is \"taken out\" of the argument list** and the remaining unnamed arguments are matched in the order that they are listed in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsd(na.rm = FALSE, mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.9974986\n```\n:::\n:::\n\n\nHere, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified.\n\n::: callout-tip\n### Pro-tip\n\nThe `args()` function displays the argument names and corresponding default values of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (num = 1) \nNULL\n```\n:::\n:::\n\n:::\n\nBelow is the argument list for the `lm()` function, which fits linear models to a dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (formula, data, subset, weights, na.action, method = \"qr\", \n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, \n contrasts = NULL, offset, ...) 
\nNULL\n```\n:::\n:::\n\n\nThe following two calls are equivalent.\n\n``` r\nlm(data = mydata, y ~ x, model = FALSE, 1:100)\nlm(y ~ x, mydata, 1:100, model = FALSE)\n```\n\n::: callout-tip\n### Pro-tip\n\nEven though it's legal, I don't recommend messing around with the order of the arguments too much, since it can lead to some confusion.\n:::\n\nMost of the time, **named arguments are helpful**:\n\n- On the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list\n- If you can remember the name of the argument and not its position on the argument list\n\nFor example, **plotting functions** often have a lot of options to allow for customization, but this makes it difficult to remember exactly the position of every argument on the argument list.\n\nFunction arguments can also be **partially matched**, which is useful for interactive work.\n\n::: callout-tip\n### Pro-tip\n\nThe order of operations when given an argument is\n\n1. Check for exact match for a named argument\n2. Check for a partial match\n3. Check for a positional match\n:::\n\n**Partial matching should be avoided when writing longer code or programs**, because it may lead to confusion if someone is reading the code. However, partial matching is very useful when calling functions interactively that have very long argument names.\n\n## Lazy Evaluation\n\nArguments to functions are **evaluated lazily**, so they are evaluated only as needed in the body of the function.\n\nIn this example, the function `f()` has two arguments: `a` and `b`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n a^2\n} \nf(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n:::\n\n\nThis **function never actually uses the argument `b`**, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`.\n\nThis behavior can be good or bad. 
It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error.\n\nThis example also shows lazy evaluation at work, but does eventually result in an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n print(a)\n print(b)\n}\nf(45)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 45\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError in f(45): argument \"b\" is missing, with no default\n```\n:::\n:::\n\n\nNotice that \"45\" got printed before the error was triggered! This is because `b` did not have to be evaluated until after `print(a)`.\n\nOnce the function tried to evaluate `print(b)`, it had to throw an error.\n\n## The `...` Argument\n\nThere is a **special argument in R known as the `...` argument**, which indicates **a variable number of arguments** that are usually passed on to other functions.\n\nThe `...` argument is **often used when extending another function** and you do not want to copy the entire argument list of the original function.\n\nFor example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = \"l\"` (the original default was `type = \"p\"`).\n\n``` r\nmyplot <- function(x, y, type = \"l\", ...) {\n plot(x, y, type = type, ...) ## Pass '...' to 'plot' function\n}\n```\n\nGeneric functions use `...` so that extra arguments can be passed to methods.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, ...) \nUseMethod(\"mean\")\n\n\n```\n:::\n:::\n\n\nThe `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. 
This is clear in functions like `paste()` and `cat()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one two three\"\n```\n:::\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\", \"four\", \"five\", sep=\"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one_two_three_four_five\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nBecause `paste()` prints out text to the console by combining multiple character vectors together, it is impossible for this function to know in advance how many character vectors will be passed to the function by the user.\n\nSo the first argument in the function is `...`.\n\n## Arguments Coming After the `...` Argument\n\nOne catch with `...` is that any **arguments that appear after** `...` on the argument list **must be named explicitly and cannot be partially matched or matched positionally**.\n\nTake a look at the arguments to the `paste()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nWith the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used.\n\nHere, I specify that I want \"a\" and \"b\" to be pasted together and separated by a colon.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", sep = \":\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a:b\"\n```\n:::\n:::\n\n\nIf I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", se = \":\")\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"a b :\"\n```\n:::\n:::\n\n\n# Functions are for humans and computers\n\nAs you start to write your own functions, it's important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for **human readers**.\n\nThis section discusses some things that you should bear in mind when writing functions that humans can understand.\n\n## The name of a function is important\n\nIn an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.\n\nThe **function names** should be **verbs**, and **arguments** should be **nouns**.\n\nThere are some exceptions:\n\n- nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`).\n- A good sign that a noun might be a better choice is if you are using a very broad verb like \"get\", \"compute\", \"calculate\", or \"determine\". Use your best judgement and do not be afraid to rename a function if you figure out a better name later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n```\n:::\n\n\n## snake_case vs camelCase\n\nIf your function name is composed of multiple words, **use \"snake_case\"**, where each lowercase word is separated by an underscore.\n\n\"camelCase\" is a popular alternative. It does not really matter which one you pick, the important thing is to be consistent: **pick one or the other and stick with it**.\n\nR itself is not very consistent, but there is nothing you can do about that. 
Make sure you do not fall into the same trap by making your code as consistent as possible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Never do this!\ncol_mins <- function(x, y) {}\nrowMaxes <- function(x, y) {}\n```\n:::\n\n\n## Use a common prefix\n\nIf you have a family of functions that do similar things, make sure they have consistent names and arguments.\n\nIt's a good idea to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Good\ninput_select()\ninput_checkbox()\ninput_text()\n\n# Not so good\nselect_input()\ncheckbox_input()\ntext_input()\n```\n:::\n\n\n## Avoid overriding exisiting functions\n\nWhere possible, avoid overriding existing functions and variables.\n\nIt is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Don't do this!\nT <- FALSE\nc <- 10\nmean <- function(x) sum(x)\n```\n:::\n\n\n## Use comments\n\nUse **comments** are lines starting with #. They can explain the \"why\" of your code.\n\nYou generally should avoid comments that explain the \"what\" or the \"how\". 
If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear.\n\n- Do you need to add some intermediate variables with useful names?\n- Do you need to break out a subcomponent of a large function so you can name it?\n\nHowever, your code can never capture the reasoning behind your decisions:\n\n- Why did you choose this approach instead of an alternative?\n- What else did you try that didn't work?\n\nIt's a great idea to capture that sort of thinking in a comment.\n\n# Environment\n\nThe last component of a function is its **environment**.\n\nThis is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work.\n\nThe **environment of a function** controls how R finds the value associated with a name.\n\nFor example, take this function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(x) {\n x + y\n} \n```\n:::\n\n\nIn many programming languages, this would be an error, because `y` is not defined inside the function.\n\nIn R, this is valid code because R uses rules called **lexical scoping** to find the value associated with a name.\n\nSince `y` is not defined inside the function, R will look in the environment where the function was defined:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 100\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 110\n```\n:::\n\n```{.r .cell-code}\ny <- 1000\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1010\n```\n:::\n:::\n\n\nThis behavior seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it does not cause too many problems (especially if you regularly restart R to get to a clean slate).\n\nThe **advantage of this behavior** is that from a language standpoint **it allows R to be very consistent**.\n\n- Every name is looked up using the same set 
of rules.\n\nFor `f()` that includes the behavior of two things that you might not expect: `{` and `+`. This allows you to do devious things like:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`+` <- function(x, y) {\n if (runif(1) < 0.1) {\n sum(x, y)\n } else {\n sum(x, y) * 1.1\n }\n}\ntable(replicate(1000, 1 + 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n 3 3.3 \n100 900 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(`+`)\n```\n:::\n\n\nThis is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like `ggplot2` and `dplyr` possible.\n\n::: callout-tip\n### More resources\n\nIf you are interested in learning more about scoping, check out\n\n- \n- \n:::\n\n# Summary\n\n- Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object\n\n- Functions have can be defined with named arguments; these function arguments can have default values\n\n- Functions arguments can be specified by name or by position in the argument list\n\n- Functions always return the last expression evaluated in the function body\n\n- A variable number of arguments can be specified using the special `...` argument in a function definition.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(is.na(x))\n\nx / sum(x, na.rm = TRUE)\n```\n:::\n\n\n2. 
Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to \"Little Bunny Foo Foo\". There is a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.\n\n3. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.\n\n4. What does the `trim` argument to `mean()` do? When might you use it?\n\n5. The default value for the method argument to `cor()` is `c(\"pearson\", \"kendall\", \"spearman\")`. What does that mean? What value is used by default?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n 
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"16 - Functions\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to writing functions in R\"\ncategories: [module 4, week 4, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/16-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n3. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Know how to create a **function** using `function()` in R\n- Know how to define **named arguments** inside a function with default values\n- Be able to use named matching or **positional matching** in the argument list\n- Understand what **lazy evaluation** is\n- Understand the special `...` argument in a function definition\n:::\n\n# Introduction\n\nWriting functions is a **core activity** of an R programmer. It represents the key step of the transition from a mere \"user\" to a developer who creates new functionality for R.\n\n**Functions** are often used to **encapsulate a sequence of expressions that need to be executed numerous times**, perhaps under slightly different conditions.\n\nFunctions are also often written **when code must be shared with others or the public**.\n\nWriting a function allows a developer to create an interface to the code that is explicitly specified with a set of **arguments** (or parameters).\n\nThis interface provides an **abstraction of the code** to potential users. 
This abstraction simplifies the users' lives because it relieves them from having to know every detail of how the code operates.\n\nIn addition, the creation of an interface allows the developer to **communicate to the user the aspects of the code that are important** or are most relevant.\n\n## Functions in R\n\nFunctions in R are \"first class objects\", which means that they can be treated much like any other R object.\n\n::: callout-tip\n### Important facts about R functions\n\n- Functions can be passed as arguments to other functions.\n - This is very handy for the various apply functions, like `lapply()` and `sapply()`.\n- Functions can be nested, so that you can define a function inside of another function.\n:::\n\nIf you are familiar with common languages like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.\n\n## Your First Function\n\nFunctions are defined using the `function()` directive and are **stored as R objects** just like anything else.\n\n::: callout-tip\n### Important\n\nIn particular, functions are R objects of class `function`.\n\nHere's a simple function that takes no arguments and does nothing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n ## This is an empty function\n}\n## Functions have their own class\nclass(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"function\"\n```\n:::\n\n```{.r .cell-code}\n## Execute this function\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nNULL\n```\n:::\n:::\n\n:::\n\nNot very interesting, but it is a start!\n\nThe next thing we can do is **create a function** that actually has a non-trivial **function body**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n # this is the function body\n hello <- \"Hello, world!\\n\"\n cat(hello)\n}\nf()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n\n::: callout-tip\n### Pro-tip\n\n`cat()` is useful and preferable 
to `print()` in several settings. One reason is that `cat()` interprets escape sequences such as `\n` as new lines, while `print()` displays them literally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhello <- \"Hello, world!\\n\"\n\nprint(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"Hello, world!\\n\"\n```\n:::\n\n```{.r .cell-code}\ncat(hello)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n:::\n\n:::\n\nThe last aspect of a basic function is the **function arguments**.\n\nThese are **the options that you offer to the user**, which the user may explicitly set.\n\nFor this basic function, we can add an argument that determines how many times \"Hello, world!\" is printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n for (i in seq_len(num)) {\n hello <- \"Hello, world!\\n\"\n cat(hello)\n }\n}\nf(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n:::\n\n\nObviously, we **could have just cut-and-pasted** the `cat(\"Hello, world!\\n\")` code three times to achieve the same effect, but then we wouldn't be programming, would we?\n\nAlso, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see \"Hello, world!\".\n\n::: callout-tip\n### Pro-tip\n\nIf you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function.\n:::\n\nFinally, the function above doesn't **return** anything.\n\nIt just prints \"Hello, world!\" to the console `num` times and then exits.\n\nBut often it is useful **if a function returns something** that perhaps can be fed into another section of code.\n\nThis next function returns the total number of characters printed to the console.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num) {\n hello <- \"Hello, world!\\n\"\n for (i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n 
chars\n}\nmeaningoflife <- f(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\nHello, world!\n```\n:::\n\n```{.r .cell-code}\nprint(meaningoflife)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 42\n```\n:::\n:::\n\n\nIn the above function, we did not have to indicate anything special in order for the function to return the number of characters.\n\nIn R, the **return value of a function** is always the very **last expression that is evaluated**.\n\nBecause the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function.\n\n::: callout-tip\n### Note\n\nThere is a `return()` function that can be used to return a value explicitly from a function, but it is rarely used in R (we will discuss it a bit later in this lesson).\n:::\n\nFinally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in f(): argument \"num\" is missing, with no default\n```\n:::\n:::\n\n\nWe can modify this behavior by setting a **default value** for the argument `num`.\n\n**Any function argument can have a default value**, if you wish to specify it.\n\nSometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. 
This relieves the user from having to specify the value of that argument every single time the function is called.\n\nHere, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print \"Hello, world!\" to the console once.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(num = 1) {\n hello <- \"Hello, world!\\n\"\n for (i in seq_len(num)) {\n cat(hello)\n }\n chars <- nchar(hello) * num\n chars\n}\n\n\nf() ## Use default value for 'num'\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n\n```{.r .cell-code}\nf(2) ## Use user-specified value\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nRemember that the function still returns the number of characters printed to the console.\n\n::: callout-tip\n### Pro-tip\n\nThe `formals()` function returns a list of all the formal arguments of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nformals(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$num\n[1] 1\n```\n:::\n:::\n\n:::\n\n## Summary\n\nWe have written a function that\n\n- has one *formal argument* named `num` with a *default value* of 1. 
The *formal arguments* are the arguments included in the function definition.\n\n- prints the message \"Hello, world!\" to the console a number of times indicated by the argument `num`\n\n- *returns* the number of characters printed to the console\n\n# Arguments\n\n## Named arguments\n\nAbove, we have learned that functions have **named arguments**, which can optionally have default values.\n\nBecause all function arguments have names, they can be specified using their name.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf(num = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nHello, world!\nHello, world!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 28\n```\n:::\n:::\n\n\nSpecifying an argument by its name is sometimes useful **if a function has many arguments** and it may not always be clear which argument is being specified.\n\nHere, our function only has one argument so there's no confusion.\n\n## Argument matching\n\nCalling an **R function with multiple arguments** can be done in a variety of ways.\n\nThis may be confusing at first, but it's really handy when doing interactive work at the command line. 
R function arguments can be matched **positionally** or **by name**.\n\n- **Positional matching** just means that R assigns the first value to the first argument, the second value to the second argument, etc.\n\nSo, in the following call to `rnorm()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (n, mean = 0, sd = 1) \n```\n:::\n\n```{.r .cell-code}\nmydata <- rnorm(100, 2, 1) ## Generate some data\n```\n:::\n\n\n100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching.\n\nThe following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent.\n\n::: callout-tip\n### Note\n\n`sd(x, na.rm = FALSE)` has two arguments:\n\n- `x` indicates the vector of numbers\n- `na.rm` is a logical indicating whether missing values should be removed or not (default is `FALSE`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Positional match first argument, default for 'na.rm'\nsd(mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n\n```{.r .cell-code}\n## Specify 'x' argument by name, default for 'na.rm'\nsd(x = mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(x = mydata, na.rm = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n:::\n\nWhen **specifying the function arguments by name**, it **doesn't matter in what order** you specify them.\n\nIn the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Specify both arguments by name\nsd(na.rm = FALSE, x = mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n\nYou **can mix positional matching 
with matching by name**.\n\nWhen an argument is matched by name, **it is \"taken out\" of the argument list** and the remaining unnamed arguments are matched in the order that they are listed in the function definition.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsd(na.rm = FALSE, mydata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.014286\n```\n:::\n:::\n\n\nHere, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified.\n\n::: callout-tip\n### Pro-tip\n\nThe `args()` function displays the argument names and corresponding default values of a function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (num = 1) \nNULL\n```\n:::\n:::\n\n:::\n\nBelow is the argument list for the `lm()` function, which fits linear models to a dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (formula, data, subset, weights, na.action, method = \"qr\", \n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, \n contrasts = NULL, offset, ...) 
\nNULL\n```\n:::\n:::\n\n\nThe following two calls are equivalent.\n\n``` r\nlm(data = mydata, y ~ x, model = FALSE, 1:100)\nlm(y ~ x, mydata, 1:100, model = FALSE)\n```\n\n::: callout-tip\n### Pro-tip\n\nEven though it's legal, I don't recommend messing around with the order of the arguments too much, since it can lead to some confusion.\n:::\n\nMost of the time, **named arguments are helpful**:\n\n- On the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list\n- If you can remember the name of the argument and not its position on the argument list\n\nFor example, **plotting functions** often have a lot of options to allow for customization, but this makes it difficult to remember exactly the position of every argument on the argument list.\n\nFunction arguments can also be **partially matched**, which is useful for interactive work.\n\n::: callout-tip\n### Pro-tip\n\nThe order of operations when given an argument is\n\n1. Check for exact match for a named argument\n2. Check for a partial match\n3. Check for a positional match\n:::\n\n**Partial matching should be avoided when writing longer code or programs**, because it may lead to confusion if someone is reading the code. However, partial matching is very useful when calling functions interactively that have very long argument names.\n\n## Lazy Evaluation\n\nArguments to functions are **evaluated lazily**, so they are evaluated only as needed in the body of the function.\n\nIn this example, the function `f()` has two arguments: `a` and `b`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n a^2\n}\nf(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n:::\n\n\nThis **function never actually uses the argument `b`**, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`.\n\nThis behavior can be good or bad. 
It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error.\n\nThis example also shows lazy evaluation at work, but does eventually result in an error.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a, b) {\n print(a)\n print(b)\n}\nf(45)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 45\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError in f(45): argument \"b\" is missing, with no default\n```\n:::\n:::\n\n\nNotice that \"45\" got printed first before the error was triggered! This is because `b` did not have to be evaluated until after `print(a)`.\n\nOnce the function tried to evaluate `print(b)`, the function had to throw an error.\n\n## The `...` Argument\n\nThere is a **special argument in R known as the `...` argument**, which indicates **a variable number of arguments** that are usually passed on to other functions.\n\nThe `...` argument is **often used when extending another function** and you do not want to copy the entire argument list of the original function.\n\nFor example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = \"l\"` (the original default was `type = \"p\"`).\n\n``` r\nmyplot <- function(x, y, type = \"l\", ...) {\n plot(x, y, type = type, ...) ## Pass '...' to 'plot' function\n}\n```\n\nGeneric functions use `...` so that extra arguments can be passed to methods.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, ...) \nUseMethod(\"mean\")\n\n\n```\n:::\n:::\n\n\nThe `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. 
This is clear in functions like `paste()` and `cat()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one two three\"\n```\n:::\n\n```{.r .cell-code}\npaste(\"one\", \"two\", \"three\", \"four\", \"five\", sep = \"_\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"one_two_three_four_five\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nBecause `paste()` builds its result by combining an arbitrary number of character vectors together, it is impossible for this function to know in advance how many character vectors will be passed to the function by the user.\n\nSo the first argument in the function is `...`.\n\n## Arguments Coming After the `...` Argument\n\nOne catch with `...` is that any **arguments that appear after** `...` on the argument list **must be named explicitly and cannot be partially matched or matched positionally**.\n\nTake a look at the arguments to the `paste()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nargs(paste)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (..., sep = \" \", collapse = NULL, recycle0 = FALSE) \nNULL\n```\n:::\n:::\n\n\nWith the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used.\n\nHere, I specify that I want \"a\" and \"b\" to be pasted together and separated by a colon.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", sep = \":\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a:b\"\n```\n:::\n:::\n\n\nIf I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaste(\"a\", \"b\", se = \":\")\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"a b :\"\n```\n:::\n:::\n\n\n# Functions are for humans and computers\n\nAs you start to write your own functions, it's important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for **human readers**.\n\nThis section discusses some things that you should bear in mind when writing functions that humans can understand.\n\n## The name of a function is important\n\nIn an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.\n\n**Function names** should be **verbs**, and **arguments** should be **nouns**.\n\nThere are some exceptions:\n\n- Nouns are ok if the function computes a very well-known noun (e.g. `mean()` is better than `compute_mean()`).\n- A good sign that a noun might be a better choice is if you are using a very broad verb like \"get\", \"compute\", \"calculate\", or \"determine\". Use your best judgement and do not be afraid to rename a function if you figure out a better name later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n```\n:::\n\n\n## snake_case vs camelCase\n\nIf your function name is composed of multiple words, **use \"snake_case\"**, where each lowercase word is separated by an underscore.\n\n\"camelCase\" is a popular alternative. It does not really matter which one you pick; the important thing is to be consistent: **pick one or the other and stick with it**.\n\nR itself is not very consistent, but there is nothing you can do about that. 
Make sure you do not fall into the same trap: make your own code as consistent as possible.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Never do this!\ncol_mins <- function(x, y) {}\nrowMaxes <- function(x, y) {}\n```\n:::\n\n\n## Use a common prefix\n\nIf you have a family of functions that do similar things, make sure they have consistent names and arguments.\n\nIt's a good idea to use a common prefix to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Good\ninput_select()\ninput_checkbox()\ninput_text()\n\n# Not so good\nselect_input()\ncheckbox_input()\ntext_input()\n```\n:::\n\n\n## Avoid overriding existing functions\n\nWhere possible, avoid overriding existing functions and variables.\n\nThis is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Don't do this!\nT <- FALSE\nc <- 10\nmean <- function(x) sum(x)\n```\n:::\n\n\n## Use comments\n\n**Comments** are lines starting with `#`. They can explain the \"why\" of your code.\n\nYou generally should avoid comments that explain the \"what\" or the \"how\". 
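\n\nAs a small sketch (with hypothetical variable names), compare a comment that merely restates the \"what\" with one that records the \"why\":\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Redundant: restates what the code already says\nx <- x[!is.na(x)] # remove missing values from x\n\n## Better: records the reasoning behind the decision\n## The modeling step below cannot handle missing values,\n## so we drop them here instead of imputing them.\nx <- x[!is.na(x)]\n```\n:::\n\n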
If you can't understand what the code does from reading it, you should think about how to rewrite it to be more clear.\n\n- Do you need to add some intermediate variables with useful names?\n- Do you need to break out a subcomponent of a large function so you can name it?\n\nHowever, your code can never capture the reasoning behind your decisions:\n\n- Why did you choose this approach instead of an alternative?\n- What else did you try that didn't work?\n\nIt's a great idea to capture that sort of thinking in a comment.\n\n# Environment\n\nThe last component of a function is its **environment**.\n\nThis is not something you need to understand deeply when you first start writing functions. However, it's important to know a little bit about environments because they are crucial to how functions work.\n\nThe **environment of a function** controls how R finds the value associated with a name.\n\nFor example, take this function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(x) {\n x + y\n}\n```\n:::\n\n\nIn many programming languages, this would be an error, because `y` is not defined inside the function.\n\nIn R, this is valid code because R uses rules called **lexical scoping** to find the value associated with a name.\n\nSince `y` is not defined inside the function, R will look in the environment where the function was defined:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 100\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 110\n```\n:::\n\n```{.r .cell-code}\ny <- 1000\nf(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1010\n```\n:::\n:::\n\n\nThis behavior seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it does not cause too many problems (especially if you regularly restart R to get to a clean slate).\n\nThe **advantage of this behavior** is that from a language standpoint **it allows R to be very consistent**.\n\n- Every name is looked up using the same set 
of rules.\n\nFor `f()`, that includes the behavior of two things that you might not expect: `{` and `+`. This allows you to do devious things like:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`+` <- function(x, y) {\n if (runif(1) < 0.1) {\n sum(x, y)\n } else {\n sum(x, y) * 1.1\n }\n}\ntable(replicate(1000, 1 + 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n 3 3.3 \n 82 918 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nrm(`+`)\n```\n:::\n\n\nThis is a common phenomenon in R. R places few limits on your power. You can do many things that you can't do in other programming languages. You can do many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like `ggplot2` and `dplyr` possible.\n\n::: callout-tip\n### More resources\n\nIf you are interested in learning more about scoping, check out\n\n- \n- \n:::\n\n# Summary\n\n- Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object\n\n- Functions can be defined with named arguments; these function arguments can have default values\n\n- Function arguments can be specified by name or by position in the argument list\n\n- Functions always return the last expression evaluated in the function body\n\n- A variable number of arguments can be specified using the special `...` argument in a function definition.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(is.na(x))\n\nx / sum(x, na.rm = TRUE)\n```\n:::\n\n\n2. 
Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to \"Little Bunny Foo Foo\". There is a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.\n\n3. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.\n\n4. What does the `trim` argument to `mean()` do? When might you use it?\n\n5. The default value for the method argument to `cor()` is `c(\"pearson\", \"kendall\", \"spearman\")`. What does that mean? What value is used by default?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n 
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/17-loop-functions/index/execute-results/html.json b/_freeze/posts/17-loop-functions/index/execute-results/html.json index 07a66d5..9941616 100644 --- a/_freeze/posts/17-loop-functions/index/execute-results/html.json +++ b/_freeze/posts/17-loop-functions/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "d6faa0c6c51529d5dc808757f7036a0a", + "hash": "d7a61d6652db2a9bb07ed8e7a7941d8f", "result": { - "markdown": "---\ntitle: \"17 - Vectorization and loop functionals\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to vectorization and loop functionals\"\ncategories: [module 4, week 5, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/17-loop-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand how to perform vector arithmetics in R\n- Implement the five functional loops in R (as alternatives to, e.g., `for` loops)\n:::\n\n# Vectorization\n\nWriting `for` and `while` loops is useful and easy to understand, but in R we rarely use them.\n\nAs you learn more R, you will realize that **vectorization** is preferred over for-loops since it results in shorter and clearer code.\n\n## Vector arithmetics\n\n### Rescaling a vector\n\nIn R, arithmetic operations on **vectors occur element-wise**. For a quick example, suppose we have height in inches:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)\n```\n:::\n\n\nand want to convert to centimeters.\n\nNotice what happens when we multiply inches by 2.54:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches * 2.54\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80\n```\n:::\n:::\n\n\nIn the line above, we **multiplied each element** by 2.54.\n\nSimilarly, if we want to compute, for each entry, how many inches taller or shorter it is than 69 inches (the average height for males), we can subtract 69 from every entry like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches - 69\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0 -7 -3 1 1 4 -2 4 -2 1\n```\n:::\n:::\n\n\n### Two vectors\n\nIf we have **two vectors of the same length**, and we sum them in R, they will be **added entry by entry** as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\ny <- 1:10 \nx + y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 2 4 6 8 10 12 14 16 18 20\n```\n:::\n:::\n\n\nThe same holds for other mathematical operations, such as `-`, `*` and `/`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 
1:10\nsqrt(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427\n [9] 3.000000 3.162278\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 1:10\nx*y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n\n# Functional loops\n\nWhile `for` loops are perfectly valid, when you operate in an element-wise fashion there is often no need for them, because we can apply what are called functional loops.\n\n**Functional loops** are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here is a list of them:\n\n- `lapply()`: Loop over a list and evaluate a function on each element\n\n- `sapply()`: Same as `lapply` but try to simplify the result\n\n- `apply()`: Apply a function over the margins of an array\n\n- `tapply()`: Apply a function over subsets of a vector\n\n- `mapply()`: Multivariate version of `lapply` (won't cover)\n\nAn auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.\n\n## `lapply()`\n\nThe `lapply()` function does the following simple series of operations:\n\n1. it loops over a list, iterating over each element in that list\n2. it applies a *function* to each element of the list (a function that you specify)\n3. and returns a list (the `l` in `lapply()` is for \"list\").\n\nThis function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.\n\nThe body of the `lapply()` function can be seen here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, FUN, ...) 
\n{\n FUN <- match.fun(FUN)\n if (!is.vector(X) || is.object(X)) \n X <- as.list(X)\n .Internal(lapply(X, FUN))\n}\n\n\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe actual looping is done internally in C code for efficiency reasons.\n:::\n\nIt is important to remember that `lapply()` always returns a list, regardless of the class of the input.\n\n::: callout-tip\n### Example\n\nHere's an example of applying the `mean()` function to all elements of a list. If the original list has names, the names will be preserved in the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:5, b = rnorm(10))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2 3 4 5\n\n$b\n [1] 1.2229485 0.1878172 -0.7560246 -0.8520380 1.7012165 -0.6487224\n [7] -0.6863177 0.2483112 -0.4892715 -1.3013705\n```\n:::\n\n```{.r .cell-code}\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 3\n\n$b\n[1] -0.1373451\n```\n:::\n:::\n\n\nNotice that here we are passing the `mean()` function as an argument to the `lapply()` function.\n:::\n\n**Functions in R can be** used this way and can be **passed back and forth as arguments** just like any other object in R.\n\nWhen you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are **calling** a function.\n\n::: callout-tip\n### Example\n\nHere is another example of using `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.08347435\n\n$c\n[1] 0.7113253\n\n$d\n[1] 5.115736\n```\n:::\n:::\n\n:::\n\nYou can use `lapply()` to evaluate a function multiple times, each time with a different argument.\n\nNext is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random 
numbers.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 0.5243483\n\n[[2]]\n[1] 0.6741892 0.2195450\n\n[[3]]\n[1] 0.5812777 0.9505693 0.7417778\n\n[[4]]\n[1] 0.7255782 0.5353819 0.2978647 0.7711454\n```\n:::\n:::\n\n\n::: callout-tip\n### What happened?\n\nWhen you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you are applying.\n\nIn the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.\n:::\n\nFunctions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.\n\n::: callout-note\n### Question\n\nIn the example above I used the default values for `min` and `max`.\n\n- How would you be able to specify different values for them in the context of `lapply()`?\n:::\n\nHere is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.\n\nHere, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif, min = 0, max = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 3.788686\n\n[[2]]\n[1] 7.370226 2.895551\n\n[[3]]\n[1] 4.742728 5.781041 7.590877\n\n[[4]]\n[1] 9.5078106 4.9129254 0.9233773 1.9651523\n```\n:::\n:::\n\n\nSo now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.\n\nThe `lapply()` function (and its friends) makes heavy use of *anonymous* functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated \"on the fly\" as you are using `lapply()`. 
Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.\n\n::: callout-tip\n### Example\n\nHere I am creating a list that contains two matrices.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2)) \nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n [,1] [,2]\n[1,] 1 3\n[2,] 2 4\n\n$b\n [,1] [,2]\n[1,] 1 4\n[2,] 2 5\n[3,] 3 6\n```\n:::\n:::\n\n\nSuppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(x, function(elt) { elt[,1] })\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\nNotice that I put the `function()` definition right in the call to `lapply()`.\n:::\n\nThis is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.\n\nFor example, I could have done the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(elt) {\n elt[, 1]\n}\nlapply(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNow the function is no longer anonymous; its name is `f`.\n:::\n\nWhether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to `lapply()`, then it is probably simpler to use an anonymous function.\n\n## `sapply()`\n\nThe `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. 
Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:\n\n- If the result is a list where every element is length 1, then a vector is returned\n\n- If the result is a list where every element is a vector of the same length (\\> 1), a matrix is returned.\n\n- If it can't figure things out, a list is returned\n\nHere's the result of calling `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.1887374\n\n$c\n[1] 1.304012\n\n$d\n[1] 5.012149\n```\n:::\n:::\n\n\nNotice that `lapply()` returns a list (as usual), but that each element of the list has length 1.\n\nHere's the result of calling `sapply()` on the same list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(x, mean) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n a b c d \n2.5000000 0.1887374 1.3040116 5.0121492 \n```\n:::\n:::\n\n\nBecause the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.\n\n## `split()`\n\nThe `split()` function takes a vector or other objects and splits it into groups determined by a factor or list of factors.\n\nThe arguments to `split()` are\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(split)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, f, drop = FALSE, ...) \n```\n:::\n:::\n\n\nwhere\n\n- `x` is a vector (or list) or data frame\n- `f` is a factor (or coerced to one) or a list of factors\n- `drop` indicates whether empty factors levels should be dropped\n\nThe combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. 
The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as \"map-reduce\" in other contexts.\n\nHere we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to \"generate levels\" in a factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\nf <- gl(3, 10) # generate factor levels\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n [1] 0.3894649 -0.9253587 1.4289026 -0.4602414 -0.6726225 3.0199478\n [7] 1.2191391 0.3484649 0.8080023 0.3774965\n\n$`2`\n [1] 0.8782538 0.4346257 0.5970352 0.1635630 0.3839225 0.8525622 0.4988508\n [8] 0.8624590 0.2599047 0.1006897\n\n$`3`\n [1] 0.20936028 -0.17011167 0.34303862 1.04290024 0.80930785 3.20756177\n [7] 0.55197354 0.49465007 0.06888936 -0.41865555\n```\n:::\n:::\n\n\nA common idiom is `split` followed by an `lapply`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(split(x, f), mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] 0.5533195\n\n$`2`\n[1] 0.5031867\n\n$`3`\n[1] 0.6138915\n```\n:::\n:::\n\n\n### Splitting a Data Frame\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasets)\nhead(airquality)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Ozone Solar.R Wind Temp Month Day\n1 41 190 7.4 67 5 1\n2 36 118 8.0 72 5 2\n3 12 149 12.6 74 5 3\n4 18 313 11.5 62 5 4\n5 NA NA 14.3 56 5 5\n6 28 NA 14.9 66 5 6\n```\n:::\n:::\n\n\nWe can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- split(airquality, airquality$Month)\nstr(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 5\n $ 
5:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...\n ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...\n ..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...\n ..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...\n ..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 6:'data.frame':\t30 obs. of 6 variables:\n ..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...\n ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...\n ..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...\n ..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...\n ..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n $ 7:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...\n ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...\n ..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...\n ..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...\n ..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 8:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...\n ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...\n ..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...\n ..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...\n ..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 9:'data.frame':\t30 obs. 
of 6 variables:\n ..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...\n ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...\n ..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...\n ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...\n ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n```\n:::\n:::\n\n\nThen we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`5`\n Ozone Solar.R Wind \n NA NA 11.62258 \n\n$`6`\n Ozone Solar.R Wind \n NA 190.16667 10.26667 \n\n$`7`\n Ozone Solar.R Wind \n NA 216.483871 8.941935 \n\n$`8`\n Ozone Solar.R Wind \n NA NA 8.793548 \n\n$`9`\n Ozone Solar.R Wind \n NA 167.4333 10.1800 \n```\n:::\n:::\n\n\nUsing `sapply()` might be better here for a more readable output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone NA NA NA NA NA\nSolar.R NA 190.16667 216.483871 NA 167.4333\nWind 11.62258 10.26667 8.941935 8.793548 10.1800\n```\n:::\n:::\n\n\nUnfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")], \n na.rm = TRUE)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone 23.61538 29.44444 59.115385 59.961538 31.44828\nSolar.R 181.29630 190.16667 216.483871 171.857143 167.43333\nWind 11.62258 10.26667 8.941935 8.793548 10.18000\n```\n:::\n:::\n\n\n## tapply\n\n`tapply()` is used to apply a function over subsets of a vector. 
It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the \"t\" in `tapply()` refers to \"table\", but that is unconfirmed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(tapply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `tapply()` are as follows:\n\n- `X` is a vector\n- `INDEX` is a factor or a list of factors (or else they are coerced to factors)\n- `FUN` is a function to be applied\n- `...` contains other arguments to be passed to `FUN`\n- `simplify`, should we simplify the result?\n\n::: callout-tip\n### Example\n\nGiven a vector of numbers, one simple operation is to take group means.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Simulate some data\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\n## Define some groups with a factor variable\nf <- gl(3, 10) \nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n\n```{.r .cell-code}\ntapply(x, f, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 1 2 3 \n0.1087117 0.5112950 1.2957052 \n```\n:::\n:::\n\n:::\n\nWe can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the `range()` (min and max) of each sub-group.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntapply(x, f, range)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] -0.8435817 1.0096577\n\n$`2`\n[1] 0.05537228 0.83863420\n\n$`3`\n[1] -0.8717398 2.0814386\n```\n:::\n:::\n\n\n## `apply()`\n\nThe `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). 
However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(apply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, MARGIN, FUN, ..., simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `apply()` are\n\n- `X` is an array\n- `MARGIN` is an integer vector indicating which margins should be \"retained\".\n- `FUN` is a function to be applied\n- `...` is for other arguments to be passed to `FUN`\n\n::: callout-tip\n### Example\n\nHere I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 0.84236865 -0.1124551 -0.7392402 -0.1509353 0.9928648 0.6976390\n[2,] 0.02815865 1.6436612 -1.1889939 -0.4767309 -1.1571802 0.6857492\n[3,] 0.01051357 -0.2242306 0.6610240 0.4123025 -0.2424516 1.3800597\n[4,] 0.72109478 0.2949731 0.9722644 0.7911567 0.5616605 -0.6849934\n[5,] 1.15863939 -0.8584817 0.5789411 1.2627121 -0.5413249 -0.4740383\n[6,] 1.32952051 0.4994495 -0.2406128 -1.2990888 1.1405130 -0.4864257\n [,7] [,8] [,9] [,10]\n[1,] 0.9033576 -0.29090608 0.54385628 -1.5146458\n[2,] 0.6788826 0.92735898 -0.16479486 -2.7966950\n[3,] -2.0091145 -0.01351013 0.60310429 1.4103354\n[4,] 0.2323455 0.83083908 -1.08912322 0.6458769\n[5,] -1.5478440 0.05488507 0.03434319 0.1834132\n[6,] -0.1445403 -0.62788083 0.44295763 -0.2051684\n```\n:::\n\n```{.r .cell-code}\napply(x, 2, mean) ## Take the mean of each column\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.15369716 0.23398541 -0.18962509 -0.05729131 0.01771256 -0.16540884\n [7] -0.16674585 0.40803348 -0.13061682 -0.09736721\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nI can also compute the sum of each 
row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum) ## Take the sum of each row\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.1719039 -1.8205843 1.9880325 3.2760942 -0.1487549 0.4087238\n [7] -4.0732499 2.1231502 0.4270388 -0.9871043 -0.9797849 -3.0278088\n[13] 0.6285069 2.3373183 1.6973524 -3.8844200 4.2352140 -0.3202212\n[19] -1.7755148 -1.1484223\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nIn both calls to `apply()`, the return value was a vector of numbers.\n:::\n\nYou've probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?\n\nThe `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain.\n\nSo when taking the mean of each column, I specify\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 2, mean)\n```\n:::\n\n\nbecause I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum)\n```\n:::\n\n\nbecause I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).\n\n### Col/Row Sums and Means\n\n::: callout-tip\n### Pro-tip\n\nFor the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.\n\n- `rowSums` = `apply(x, 1, sum)`\n- `rowMeans` = `apply(x, 1, mean)`\n- `colSums` = `apply(x, 2, sum)`\n- `colMeans` = `apply(x, 2, mean)`\n:::\n\nThe shortcut functions are heavily optimized and hence are **much** faster, but you probably won't notice unless you're using a large matrix.\n\nAnother nice aspect of these functions is that they are a bit more descriptive. 
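\n\nAs a quick sketch (creating a fresh numeric matrix `x` for the check), you can verify that the shortcuts agree with their `apply()` equivalents:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## The optimized shortcuts return the same values as apply()\nx <- matrix(rnorm(200), 20, 10)\nall.equal(rowSums(x), apply(x, 1, sum))\nall.equal(colMeans(x), apply(x, 2, mean))\n```\n:::\n\n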
It's arguably more clear to write `colMeans(x)` in your code than `apply(x, 2, mean)`.\n\n### Other Ways to Apply\n\nYou can do more than take sums and means with the `apply()` function.\n\n::: callout-tip\n### Example\n\nFor example, you can compute quantiles of the rows of a matrix using the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] -0.8222286 -0.06264826 0.4138949 -1.4608268 0.67442318 -0.24230780\n[2,] -0.1836086 -0.35550379 1.0199434 -0.2441630 -0.46562697 0.09719075\n[3,] 1.6292807 -1.49980763 -2.3034693 -0.6154384 -0.04040846 0.21809278\n[4,] -0.5413514 0.58643377 0.8135796 1.7305934 0.69119103 0.33314754\n[5,] -0.9640885 0.38238569 0.2424066 -0.4919602 0.90386972 1.13194597\n[6,] -0.1440487 1.68226290 2.2330038 -0.4490778 0.08801544 -1.29481321\n [,7] [,8] [,9] [,10]\n[1,] -0.38935245 -0.77293702 0.07677024 0.12423212\n[2,] 0.06646333 0.28363323 -0.11256258 1.18545766\n[3,] 0.22432215 -0.09068388 0.71109748 -0.09385239\n[4,] 0.89742043 -0.70635739 1.58637686 -1.89408900\n[5,] 1.28360089 0.08283728 -0.43722835 -0.60018278\n[6,] -1.15045126 -2.19382879 -0.20375594 -0.32560297\n```\n:::\n\n```{.r .cell-code}\n## Get row quantiles\napply(x, 1, quantile, probs = c(0.25, 0.75)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n25% -0.6770409 -0.2290244 -0.4850419 -0.3227267 -0.4782773 -0.97510790\n75% 0.1123667 0.2370226 0.2227648 0.8764602 0.7734987 0.02999941\n [,7] [,8] [,9] [,10] [,11] [,12]\n25% -0.64819251 -0.05986007 -1.2978967 -1.11450694 -0.4172096 -1.2630159\n75% 0.04747862 0.38562818 0.2370868 -0.08617673 0.7178610 0.9225057\n [,13] [,14] [,15] [,16] [,17] [,18]\n25% -0.4066897 -0.7840147 -0.9955023 -0.4760713 -0.9869851 -1.36026776\n75% 0.7307004 0.3125165 0.9180706 0.8888653 0.8109465 0.04670028\n [,19] [,20]\n25% -0.5785061 -0.4817867\n75% 0.3695974 
0.5743219\n```\n:::\n:::\n\n\nNotice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.\n:::\n\n## Vectorizing a Function\n\nLet's talk about how we can **"vectorize" a function**.\n\nWhat this means is that we can take a function that typically only takes single arguments and create a new function that can take vector arguments.\n\nThis is often needed when you want to plot functions.\n\n::: callout-tip\n### Example\n\nHere's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\\sum_{i=1}^n(x_i-\\mu)^2/\\sigma^2$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq <- function(mu, sigma, x) {\n sum(((x - mu) / sigma)^2)\n}\n```\n:::\n\n\nThis function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.\n\nIn many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- rnorm(100) ## Generate some data\nsumsq(mu=1, sigma=1, x) ## This works (returns one value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 167.2459\n```\n:::\n:::\n\n\nHowever, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq(1:10, 1:10, x) ## This is not what we want\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 107.6407\n```\n:::\n:::\n\n:::\n\nThere's even a function in R called `Vectorize()` that **automatically can create a vectorized version of your function**.\n\nSo we could create a `vsumsq()` function that is fully vectorized as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvsumsq <- Vectorize(sumsq, c(\"mu\", \"sigma\"))\nvsumsq(1:10, 1:10, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 167.24586 
113.22134 104.28054 101.51027 100.39215 99.87343 99.61394\n [8] 99.48004 99.41187 99.38001\n```\n:::\n:::\n\n\nPretty cool, right?\n\n# Summary\n\n- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form\n\n- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.\n\n- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere\n\n- The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write a function `compute_s_n()` that for any given `n` computes the sum\n\n$$\nS_n = 1^2 + 2^2 + 3^2 + \\ldots + n^2\n$$\n\nReport the value of the sum when $n$ = 10.\n\n2. Define an empty numerical vector `s_n` of size 25 using `s_n <- vector(\"numeric\", 25)` and store in it the results of $S_1, S_2, \\ldots, S_n$ using a for-loop.\n\n3. Repeat Q2, but this time use `sapply()`.\n\n4. Plot `s_n` versus `n`. 
Use points defined by $n= 1, \\ldots, 25$\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"17 - Vectorization and loop functionals\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: 
https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to vectorization and loop functionals\"\ncategories: [module 4, week 5, R, programming, functions]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/17-loop-functions/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n### Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n## Learning objectives\n\n**At the end of this lesson you will:**\n\n- Understand how to perform vector arithmetics in R\n- Implement the 5 functional loops in R (vs. e.g. `for` loops)\n:::\n\n# Vectorization\n\nWriting `for` and `while` loops is useful and easy to understand, but in R we rarely use them.\n\nAs you learn more R, you will realize that **vectorization** is preferred over for-loops since it results in shorter and clearer code.\n\n## Vector arithmetics\n\n### Rescaling a vector\n\nIn R, arithmetic operations on **vectors occur element-wise**. 
For a quick example, suppose we have height in inches:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)\n```\n:::\n\n\nand want to convert to centimeters.\n\nNotice what happens when we multiply inches by 2.54:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches * 2.54\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80\n```\n:::\n:::\n\n\nIn the line above, we **multiplied each element** by 2.54.\n\nSimilarly, if for each entry we want to compute how many inches taller or shorter than 69 inches (the average height for males), we can subtract it from every entry like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninches - 69\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0 -7 -3 1 1 4 -2 4 -2 1\n```\n:::\n:::\n\n\n### Two vectors\n\nIf we have **two vectors of the same length**, and we sum them in R, they will be **added entry by entry** as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\ny <- 1:10\nx + y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 2 4 6 8 10 12 14 16 18 20\n```\n:::\n:::\n\n\nThe same holds for other mathematical operations, such as `-`, `*` and `/`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\nsqrt(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427\n [9] 3.000000 3.162278\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- 1:10\nx * y\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n\n# Functional loops\n\nWhile `for` loops are perfectly valid, when you use vectorization in an element-wise fashion, there is no need for `for` loops because we can apply what are called functional loops.\n\n**Functional loops** are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. 
Here is a list of them:\n\n- `lapply()`: Loop over a list and evaluate a function on each element\n\n- `sapply()`: Same as `lapply` but try to simplify the result\n\n- `apply()`: Apply a function over the margins of an array\n\n- `tapply()`: Apply a function over subsets of a vector\n\n- `mapply()`: Multivariate version of `lapply` (won't cover)\n\nAn auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.\n\n## `lapply()`\n\nThe `lapply()` function does the following simple series of operations:\n\n1. it loops over a list, iterating over each element in that list\n2. it applies a *function* to each element of the list (a function that you specify)\n3. and returns a list (the `l` in `lapply()` is for \"list\").\n\nThis function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.\n\nThe body of the `lapply()` function can be seen here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, FUN, ...) \n{\n FUN <- match.fun(FUN)\n if (!is.vector(X) || is.object(X)) \n X <- as.list(X)\n .Internal(lapply(X, FUN))\n}\n\n\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nThe actual looping is done internally in C code for efficiency reasons.\n:::\n\nIt is important to remember that `lapply()` always returns a list, regardless of the class of the input.\n\n::: callout-tip\n### Example\n\nHere's an example of applying the `mean()` function to all elements of a list. 
If the original list has names, the names will be preserved in the output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:5, b = rnorm(10))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2 3 4 5\n\n$b\n [1] -0.6113707 0.5950531 0.6319343 0.5595441 0.3188799 -0.4400711\n [7] 1.6687028 0.4501791 1.4356856 -0.3858270\n```\n:::\n\n```{.r .cell-code}\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 3\n\n$b\n[1] 0.422271\n```\n:::\n:::\n\n\nNotice that here we are passing the `mean()` function as an argument to the `lapply()` function.\n:::\n\n**Functions in R can be** used this way and can be **passed back and forth as arguments** just like any other object in R.\n\nWhen you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are **calling** a function.\n\n::: callout-tip\n### Example\n\nHere is another example of using `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] 0.1655327\n\n$c\n[1] 0.9767504\n\n$d\n[1] 4.951283\n```\n:::\n:::\n\n:::\n\nYou can use `lapply()` to evaluate a function multiple times, each with a different argument.\n\nNext is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 0.5924944\n\n[[2]]\n[1] 0.8660588 0.3277243\n\n[[3]]\n[1] 0.5009080 0.2951163 0.6264905\n\n[[4]]\n[1] 0.04282267 0.14951908 0.82034538 0.64614463\n```\n:::\n:::\n\n\n::: callout-tip\n### What happened?\n\nWhen you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you 
are applying.\n\nIn the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.\n:::\n\nFunctions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.\n\n::: callout-note\n### Question\n\nIn the example above I used the default values for `min` and `max`.\n\n- How would you be able to specify different values for that in the context of `lapply()`?\n:::\n\nHere is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.\n\nHere, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nlapply(x, runif, min = 0, max = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 5.653385\n\n[[2]]\n[1] 8.325503 7.234466\n\n[[3]]\n[1] 5.968981 9.174316 7.920678\n\n[[4]]\n[1] 9.491500 3.023649 2.990945 8.757496\n```\n:::\n:::\n\n\nSo now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.\n\nThe `lapply()` function (and its friends) makes heavy use of *anonymous* functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated \"on the fly\" as you are using `lapply()`. 
Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.\n\n::: callout-tip\n### Example\n\nHere I am creating a list that contains two matrices.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n [,1] [,2]\n[1,] 1 3\n[2,] 2 4\n\n$b\n [,1] [,2]\n[1,] 1 4\n[2,] 2 5\n[3,] 3 6\n```\n:::\n:::\n\n\nSuppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(x, function(elt) {\n elt[, 1]\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\nNotice that I put the `function()` definition right in the call to `lapply()`.\n:::\n\nThis is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.\n\nFor example, I could have done the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(elt) {\n elt[, 1]\n}\nlapply(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 1 2\n\n$b\n[1] 1 2 3\n```\n:::\n:::\n\n\n::: callout-tip\n### Note\n\nNow the function is no longer anonymous; its name is `f`.\n:::\n\nWhether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to `lapply()`, then it is probably simpler to use an anonymous function.\n\n## `sapply()`\n\nThe `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. 
Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:\n\n- If the result is a list where every element is length 1, then a vector is returned\n\n- If the result is a list where every element is a vector of the same length (\> 1), a matrix is returned.\n\n- If it can't figure things out, a list is returned\n\nHere's the result of calling `lapply()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))\nlapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] 2.5\n\n$b\n[1] -0.1478465\n\n$c\n[1] 0.819794\n\n$d\n[1] 4.954484\n```\n:::\n:::\n\n\nNotice that `lapply()` returns a list (as usual), but that each element of the list has length 1.\n\nHere's the result of calling `sapply()` on the same list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(x, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n a b c d \n 2.5000000 -0.1478465 0.8197940 4.9544836 \n```\n:::\n:::\n\n\nBecause the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.\n\n## `split()`\n\nThe `split()` function takes a vector or other object and splits it into groups determined by a factor or list of factors.\n\nThe arguments to `split()` are\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(split)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (x, f, drop = FALSE, ...) \n```\n:::\n:::\n\n\nwhere\n\n- `x` is a vector (or list) or data frame\n- `f` is a factor (or coerced to one) or a list of factors\n- `drop` indicates whether empty factor levels should be dropped\n\nThe combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. 
The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as \"map-reduce\" in other contexts.\n\nHere we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to \"generate levels\" in a factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\nf <- gl(3, 10) # generate factor levels\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit(x, f)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n [1] 0.78541247 -0.06267966 -0.89713180 0.11796725 0.66689447 -0.02523006\n [7] -0.19081948 0.44974528 -0.51005146 -0.08103298\n\n$`2`\n [1] 0.29977033 0.31873253 0.53182993 0.85507540 0.21585775 0.89867742\n [7] 0.78109747 0.06887742 0.79661568 0.60022565\n\n$`3`\n [1] -0.38262045 0.06294368 0.41768485 1.57972821 1.17555228 1.47374130\n [7] 1.79199913 2.25569283 1.55226509 -1.51811384\n```\n:::\n:::\n\n\nA common idiom is `split` followed by an `lapply`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(split(x, f), mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] 0.0253074\n\n$`2`\n[1] 0.536676\n\n$`3`\n[1] 0.8408873\n```\n:::\n:::\n\n\n### Splitting a Data Frame\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasets)\nhead(airquality)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Ozone Solar.R Wind Temp Month Day\n1 41 190 7.4 67 5 1\n2 36 118 8.0 72 5 2\n3 12 149 12.6 74 5 3\n4 18 313 11.5 62 5 4\n5 NA NA 14.3 56 5 5\n6 28 NA 14.9 66 5 6\n```\n:::\n:::\n\n\nWe can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- split(airquality, airquality$Month)\nstr(s)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\nList of 5\n $ 5:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...\n ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...\n ..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...\n ..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...\n ..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 6:'data.frame':\t30 obs. of 6 variables:\n ..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...\n ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...\n ..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...\n ..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...\n ..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n $ 7:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...\n ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...\n ..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...\n ..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...\n ..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 8:'data.frame':\t31 obs. of 6 variables:\n ..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...\n ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...\n ..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...\n ..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...\n ..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...\n ..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...\n $ 9:'data.frame':\t30 obs. 
of 6 variables:\n ..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...\n ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...\n ..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...\n ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...\n ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...\n ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...\n```\n:::\n:::\n\n\nThen we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`5`\n Ozone Solar.R Wind \n NA NA 11.62258 \n\n$`6`\n Ozone Solar.R Wind \n NA 190.16667 10.26667 \n\n$`7`\n Ozone Solar.R Wind \n NA 216.483871 8.941935 \n\n$`8`\n Ozone Solar.R Wind \n NA NA 8.793548 \n\n$`9`\n Ozone Solar.R Wind \n NA 167.4333 10.1800 \n```\n:::\n:::\n\n\nUsing `sapply()` might be better here for a more readable output.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")])\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone NA NA NA NA NA\nSolar.R NA 190.16667 216.483871 NA 167.4333\nWind 11.62258 10.26667 8.941935 8.793548 10.1800\n```\n:::\n:::\n\n\nUnfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsapply(s, function(x) {\n colMeans(x[, c(\"Ozone\", \"Solar.R\", \"Wind\")],\n na.rm = TRUE\n )\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 5 6 7 8 9\nOzone 23.61538 29.44444 59.115385 59.961538 31.44828\nSolar.R 181.29630 190.16667 216.483871 171.857143 167.43333\nWind 11.62258 10.26667 8.941935 8.793548 10.18000\n```\n:::\n:::\n\n\n## `tapply()`\n\n`tapply()` is used to apply a function over subsets of a vector. 
It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the \"t\" in `tapply()` refers to \"table\", but that is unconfirmed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(tapply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `tapply()` are as follows:\n\n- `X` is a vector\n- `INDEX` is a factor or a list of factors (or else they are coerced to factors)\n- `FUN` is a function to be applied\n- `...` contains other arguments to be passed to `FUN`\n- `simplify`, should we simplify the result?\n\n::: callout-tip\n### Example\n\nGiven a vector of numbers, one simple operation is to take group means.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Simulate some data\nx <- c(rnorm(10), runif(10), rnorm(10, 1))\n## Define some groups with a factor variable\nf <- gl(3, 10)\nf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3\nLevels: 1 2 3\n```\n:::\n\n```{.r .cell-code}\ntapply(x, f, mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 1 2 3 \n0.3554738 0.5195466 0.6764006 \n```\n:::\n:::\n\n:::\n\nWe can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the `range()` (min and max) of each sub-group.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntapply(x, f, range)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$`1`\n[1] -1.431912 2.695089\n\n$`2`\n[1] 0.1263379 0.8959040\n\n$`3`\n[1] -1.207741 1.696309\n```\n:::\n:::\n\n\n## `apply()`\n\nThe `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). 
However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(apply)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfunction (X, MARGIN, FUN, ..., simplify = TRUE) \n```\n:::\n:::\n\n\nThe arguments to `apply()` are\n\n- `X` is an array\n- `MARGIN` is an integer vector indicating which margins should be \"retained\".\n- `FUN` is a function to be applied\n- `...` is for other arguments to be passed to `FUN`\n\n::: callout-tip\n### Example\n\nHere I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 1.589728 0.7733454 -1.3311072 -0.77084025 -0.1947478 0.1748546\n[2,] 2.395088 0.3243910 -1.5133366 0.09199955 0.3850993 0.1851718\n[3,] 1.039643 -2.1721402 -0.9933217 -1.89261272 0.1748050 1.0563987\n[4,] -1.580978 -0.9884235 -1.4976744 -0.51011200 -2.7512079 0.5547477\n[5,] 1.264799 -2.0551874 0.4483417 -3.08561764 -0.1549359 -0.8384706\n[6,] 1.756973 0.9244522 0.2740854 -0.61441465 -1.0661350 1.4497808\n [,7] [,8] [,9] [,10]\n[1,] 0.7163086 -0.01817166 0.2193225 -0.3346788\n[2,] 0.7606851 0.42082416 0.1099027 0.2834439\n[3,] -1.1218204 -1.17000278 0.4302792 -0.5684986\n[4,] 0.6082452 0.46763465 -0.3481830 -0.1765517\n[5,] -0.7460224 -0.01123782 1.8116342 -0.1033175\n[6,] 1.0160202 -0.82361401 -0.1616471 -0.1628032\n```\n:::\n\n```{.r .cell-code}\napply(x, 2, mean) ## Take the mean of each column\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.083759441 -0.134507982 -0.246473461 -0.371270102 -0.078433882\n [6] -0.101665531 -0.007126106 -0.003193726 0.114767264 0.070612124\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Example\n\nI can also compute the sum of 
each row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum) ## Take the sum of each row\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 0.82401382 3.44326903 -5.21727094 -6.22250299 -3.47001414 2.59269751\n [7] -1.76049948 -0.54534465 1.26993157 -0.05660623 1.89101638 2.60154094\n[13] -0.80804188 1.96321614 -2.68869045 0.56525640 0.44214056 -4.25890694\n[19] -3.02509115 -1.01075274\n```\n:::\n:::\n\n:::\n\n::: callout-tip\n### Note\n\nIn both calls to `apply()`, the return value was a vector of numbers.\n:::\n\nYou've probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?\n\nThe `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain.\n\nSo when taking the mean of each column, I specify\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 2, mean)\n```\n:::\n\n\nbecause I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run\n\n\n::: {.cell}\n\n```{.r .cell-code}\napply(x, 1, sum)\n```\n:::\n\n\nbecause I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).\n\n### Col/Row Sums and Means\n\n::: callout-tip\n### Pro-tip\n\nFor the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.\n\n- `rowSums` = `apply(x, 1, sum)`\n- `rowMeans` = `apply(x, 1, mean)`\n- `colSums` = `apply(x, 2, sum)`\n- `colMeans` = `apply(x, 2, mean)`\n:::\n\nThe shortcut functions are heavily optimized and hence are **much** faster, but you probably won't notice unless you're using a large matrix.\n\nAnother nice aspect of these functions is that they are a bit more descriptive. 
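\n\n::: callout-tip\n### Note\n\nA minimal sketch to convince yourself of the equivalence (assuming `x` is still the matrix from the example above): `all.equal()` compares the two forms up to numerical tolerance.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## The shortcut functions and the apply() forms compute the same values\nall.equal(colMeans(x), apply(x, 2, mean))\nall.equal(rowSums(x), apply(x, 1, sum))\n```\n:::\n\n:::\n\n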
It's arguably more clear to write `colMeans(x)` in your code than `apply(x, 2, mean)`.\n\n### Other Ways to Apply\n\nYou can do more than take sums and means with the `apply()` function.\n\n::: callout-tip\n### Example\n\nFor example, you can compute quantiles of the rows of a matrix using the `quantile()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- matrix(rnorm(200), 20, 10)\nhead(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n[1,] 0.58654399 -0.502546440 1.1493478 0.6257709 -0.02866237 1.490139530\n[2,] -0.14969248 0.327632870 0.0202589 0.2889600 -0.16552218 -0.829703298\n[3,] 1.12561766 0.707836011 0.6038607 -0.6722613 0.85092968 0.550785886\n[4,] -1.71719604 0.554424755 0.4229181 0.1484968 0.22134369 0.258853355\n[5,] 0.31827641 1.555568589 0.8971850 -0.7742244 0.45459793 -0.043814576\n[6,] -0.08429415 0.001737282 0.1906608 1.1145869 0.54156791 -0.004889302\n [,7] [,8] [,9] [,10]\n[1,] -0.7879713 1.02206400 -1.0420765 -1.2779945\n[2,] 1.7217146 0.06728039 0.6408182 -0.3551929\n[3,] -0.2439192 -0.71553120 -0.8273868 0.2559954\n[4,] -0.1085818 -0.28763268 1.9010457 1.7950971\n[5,] -1.4082747 -1.07621679 0.5428189 0.4538626\n[6,] -1.0644006 -0.04186614 -0.8150566 1.0490749\n```\n:::\n\n```{.r .cell-code}\n## Get row quantiles\napply(x, 1, quantile, probs = c(0.25, 0.75))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4] [,5] [,6]\n25% -0.7166151 -0.1615648 -0.5651758 -0.04431213 -0.5916219 -0.07368714\n75% 0.9229907 0.3179646 0.6818422 0.52154809 0.5207637 0.45384114\n [,7] [,8] [,9] [,10] [,11] [,12]\n25% -0.4355993 -0.1313015 -0.8149658 -0.9260982 0.02077709 -0.1343613\n75% 1.5985929 0.8889319 0.2213238 0.3661333 0.82424899 0.4156328\n [,13] [,14] [,15] [,16] [,17] [,18]\n25% -0.1281593 -0.6691927 -0.2824997 -0.6574923 0.06421797 -0.7905708\n75% 1.3073689 1.2450340 0.5072401 0.5023885 1.08294108 0.4653062\n [,19] [,20]\n25% -0.5826196 -0.6965163\n75% 0.1313324 
0.6849689\n```\n:::\n:::\n\n\nNotice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.\n:::\n\n## Vectorizing a Function\n\nLet's talk about how we can **"vectorize" a function**.\n\nWhat this means is that we can take a function that typically only takes single arguments and create a new function that can take vector arguments.\n\nThis is often needed when you want to plot functions.\n\n::: callout-tip\n### Example\n\nHere's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\\sum_{i=1}^n(x_i-\\mu)^2/\\sigma^2$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq <- function(mu, sigma, x) {\n sum(((x - mu) / sigma)^2)\n}\n```\n:::\n\n\nThis function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.\n\nIn many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- rnorm(100) ## Generate some data\nsumsq(mu = 1, sigma = 1, x) ## This works (returns one value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 248.8765\n```\n:::\n:::\n\n\nHowever, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumsq(1:10, 1:10, x) ## This is not what we want\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 119.3071\n```\n:::\n:::\n\n:::\n\nThere's even a function in R called `Vectorize()` that **can automatically create a vectorized version of your function**.\n\nSo we could create a `vsumsq()` function that is fully vectorized as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvsumsq <- Vectorize(sumsq, c(\"mu\", \"sigma\"))\nvsumsq(1:10, 1:10, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 248.8765 
146.5055 124.7964 116.2695 111.8983 109.2945 107.5867 106.3890\n [9] 105.5067 104.8318\n```\n:::\n:::\n\n\nPretty cool, right?\n\n# Summary\n\n- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form\n\n- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning them.\n\n- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere\n\n- The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Write a function `compute_s_n()` that for any given `n` computes the sum\n\n$$\nS_n = 1^2 + 2^2 + 3^2 + \\ldots + n^2\n$$\n\nReport the value of the sum when $n$ = 10.\n\n2. Define an empty numerical vector `s_n` of size 25 using `s_n <- vector(\"numeric\", 25)` and store the results of $S_1, S_2, \\ldots, S_{25}$ using a for-loop.\n\n3. Repeat Q2, but this time use `sapply()`.\n\n4. Plot `s_n` versus `n`. 
Use points defined by $n= 1, \\ldots, 25$\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/18-debugging-r-code/index/execute-results/html.json b/_freeze/posts/18-debugging-r-code/index/execute-results/html.json index eee8e21..7537551 100644 --- 
a/_freeze/posts/18-debugging-r-code/index/execute-results/html.json +++ b/_freeze/posts/18-debugging-r-code/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7466f680cc707a58b3cb6baa83e2368a", + "hash": "8ab23c4f0460ca2e98ce61454db02a6d", "result": { - "markdown": "---\ntitle: \"18 - Debugging R Code\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Help! What's wrong with my code???\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/18-debugging-r-code/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Discuss an overall approach to debugging code in R\n- Recognize the three main indications of a problem/condition (`message`, `warning`, `error`) and a fatal problem (`error`)\n- Understand the importance of reproducing the problem when debugging a function or piece of code\n- Learn how the interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n:::\n\n# Debugging R Code\n\n## Overall approach\n\nFinding the **root cause of a problem is always challenging**.\n\nMost bugs are subtle and hard to find because if they were obvious, you would have avoided them in the first place.\n\nA good strategy helps. Below I outline a four-step process that I have found useful:\n\n### 1. Google!\n\nWhenever you see an error message, **start by googling it**.\n\nIf you are lucky, you will discover that it's a common error with a known solution.\n\n::: callout-tip\n### Pro-tip\n\nWhen googling, improve your chances of a good match by removing any variable names or values that are specific to your problem.\n:::\n\n### 2. 
Make it repeatable\n\nTo find the root cause of an error, you are going to need to execute the code many times as you consider and reject hypotheses.\n\n**To make that iteration as quick as possible**, it's worth some upfront investment to **make the problem both easy and fast to reproduce**.\n\nStart by creating a **rep**roducible **ex**ample (reprex).\n\n- This will help others help you, and **often leads to a solution without asking others**, because in the course of making the problem reproducible you often figure out the root cause.\n\nMake the **example minimal by removing code and simplifying data**.\n\n- As you do this, you may discover inputs that do not trigger the error.\n- Make note of them: they will be helpful when diagnosing the root cause.\n\n::: callout-tip\n### Example\n\nLet's try making a **reprex** [using the `reprex` package](https://www.tidyverse.org/help) (installed with the `tidyverse`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(reprex)\n```\n:::\n\n\nWrite a bit of code and copy it to the clipboard:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(y <- 1:4)\nmean(y)\n```\n:::\n\n\nEnter `reprex()` in the R Console. In RStudio, you'll see a preview of your rendered reprex.\n\nIt is now ready and waiting on your clipboard, so you can paste it into, say, a GitHub issue.\n\nOne last step. Let's go here and open up an issue on the course website:\n\n- \n\nWe will paste in the code from our reprex.\n:::\n\nIn RStudio, you can access reprex from the addins menu, which makes it even easier to point out your code and select the output format.\n\n### 3. 
Figure out where it is\n\nIt's a great idea to adopt the scientific method here.\n\n- Generate hypotheses\n- Design experiments to test them\n- Record your results\n\nThis may seem like a lot of work, but **a systematic approach** will end up saving you time.\n\nOften **a lot of time can be wasted relying on your intuition to solve a bug** (\"oh, it must be an off-by-one error, so I'll just subtract 1 here\"), when you would have been better off taking a systematic approach.\n\nIf this fails, you **might need to ask for help from someone else**.\n\nIf you have followed the previous step, you will have a small example that is easy to share with others. That makes it much easier for other people to look at the problem, and more likely to help you find a solution.\n\n### 4. Fix it and test it\n\nOnce you have found the bug, you need to **figure out how to fix it** and to **check that the fix actually worked**.\n\nAgain, it is very useful to have automated tests in place.\n\n- Not only does this help to ensure that you **have actually fixed the bug**, it also **helps to ensure you have not introduced any new bugs** in the process.\n- In the absence of automated tests, make sure to **carefully record the correct output**, and check against the inputs that previously failed.\n\n## Something's Wrong!\n\nOnce you have made the error repeatable, the next step is to figure out where it comes from.\n\nR has a number of **ways to indicate to you that something is not right**.\n\nThere are **different levels of indication** that can be used, ranging from mere notification to fatal error. Executing any function in R may result in the following **conditions**.\n\n- `message`: A **generic notification/diagnostic message** produced by the `message()` function; execution of the function continues\n- `warning`: An indication that **something is wrong but not necessarily fatal**; execution of the function continues. 
Warnings are generated by the `warning()` function\n- `error`: An indication that **a fatal problem has occurred** and execution of the function stops. Errors are produced by the `stop()` function.\n- `condition`: A generic concept for indicating that **something unexpected has occurred**; programmers can create their own custom conditions if they want.\n\n::: callout-tip\n### Example\n\nHere is an example of a warning that you might receive in the course of using R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(-1)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(-1): NaNs produced\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN\n```\n:::\n:::\n\n\nThis warning lets you know that taking the log of a negative number results in a `NaN` value because you **can't take the log of negative numbers**.\n:::\n\nNevertheless, R doesn't give an error, because it has a useful value that it can return, the **`NaN` value**.\n\nThe **warning is just there** to let you know that **something unexpected happened**.\n\nDepending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code.\n\n::: callout-tip\n### Example\n\nHere is another function that is designed to print a message to the console depending on the nature of its input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message <- function(x) {\n if(x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n } \n invisible(x) \n}\n```\n:::\n\n\nThis function is simple:\n\n- It **prints a message** telling you whether `x` is greater than zero or less than or equal to zero.\n- It also returns its input **invisibly**, which is a common practice with \"print\" functions.\n\n**Returning an object invisibly** means that the **return value does not get auto-printed** when the function is called.\n\nTake a hard look at the function above and see if you can identify any bugs or 
problems.\n\nWe can execute the function as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nThe function seems to work fine at this point. No errors, warnings, or messages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(NA)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (x > 0) {: missing value where TRUE/FALSE needed\n```\n:::\n:::\n\n:::\n\nWhat happened?\n\n- Well, the first thing the function does is test if `x > 0`.\n- But you can't do that test if `x` is a `NA` or `NaN` value.\n- R **doesn't know what to do in this case** so it **stops with a fatal error**.\n\nWe can **fix this problem** by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2 <- function(x) {\n if(is.na(x))\n print(\"x is a missing value!\")\n else if(x > 0)\n print(\"x is greater than zero\")\n else\n print(\"x is less than or equal to zero\")\n invisible(x)\n}\n```\n:::\n\n\nNow we can run the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2(NA)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is a missing value!\"\n```\n:::\n:::\n\n\nAnd all is fine.\n\nNow what about the following situation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- log(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(c(-1, 2)): NaNs produced\n```\n:::\n\n```{.r .cell-code}\nprint_message2(x)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (is.na(x)) print(\"x is a missing value!\") else if (x > 0) print(\"x is greater than zero\") else print(\"x is less than or equal to zero\"): the condition has length > 1\n```\n:::\n:::\n\n\nNow what?? 
Why are we getting this error?\n\nThe **error** says \"the condition has length \\> 1\". (In versions of R before 4.2.0, this was only a warning: \"the condition has length \\> 1 and only the first element will be used\".)\n\nThe **problem here** is that I passed `print_message2()` a vector `x` that was of length 2 rather than length 1.\n\nInside the body of `print_message2()` the expression `is.na(x)` returns a vector that is tested in the `if` statement.\n\nHowever, `if` cannot take vector arguments, so you get an error.\n\nThe fundamental problem here is that `print_message2()` is not **vectorized**.\n\nWe can **solve this problem** in two ways.\n\n1. Simply **not allow vector arguments**.\n2. The other way is to **vectorize** the `print_message2()` function to allow it to take vector arguments.\n\nFor the **first way**, we simply need to check the length of the input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3 <- function(x) {\n if(length(x) > 1L)\n stop(\"'x' has length > 1\")\n if(is.na(x))\n print(\"x is a missing value!\")\n else if(x > 0)\n print(\"x is greater than zero\")\n else\n print(\"x is less than or equal to zero\")\n invisible(x)\n}\n```\n:::\n\n\nNow when we pass `print_message3()` a vector, we should get an **error**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3(1:2)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in print_message3(1:2): 'x' has length > 1\n```\n:::\n:::\n\n\nVectorizing the function can be accomplished easily with the `Vectorize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message4 <- Vectorize(print_message2)\nout <- print_message4(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is less than or equal to zero\"\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nYou can see now that the **correct messages are printed without any warning or error**.\n\n::: callout-tip\n### Note\n\nI stored the return value of `print_message4()` in a separate R object called `out`.\n\nThis is because when I use the `Vectorize()` function it no longer preserves the 
invisibility of the return value.\n:::\n\n::: callout-tip\n### Helpful tips\n\nThe **primary task of debugging** any R code is **correctly diagnosing what the problem is**.\n\nWhen diagnosing a problem with your code (or somebody else's), it's important to **first understand what you were expecting to occur**.\n\nThen you need to **identify what did occur** and **how it deviated from your expectations**.\n\nSome basic questions you need to ask are\n\n- What was your input? How did you call the function?\n- What were you expecting? Output, messages, other results?\n- What did you get?\n- How does what you get differ from what you were expecting?\n- Were your expectations correct in the first place?\n- Can you reproduce the problem (exactly)?\n:::\n\nBeing able to answer these questions is important not just for your own sake, but in situations where you may need to ask someone else for help with debugging the problem.\n\nSeasoned programmers will be asking you these exact questions.\n\n# Debugging Tools in R\n\nR provides a number of tools to help you with debugging your code. The primary tools for debugging functions in R are\n\n- `traceback()`: **prints out the function call stack** after an error occurs; does nothing if there's no error\n- `debug()`: **flags a function for \"debug\" mode** which allows you to step through execution of a function one line at a time\n- `browser()`: **suspends the execution of a function** wherever it is called and puts the function in debug mode\n- `trace()`: allows you to **insert debugging code into a function** at specific places\n- `recover()`: allows you to **modify the error behavior** so that you can browse the function call stack\n\nThese functions are interactive tools specifically designed to allow you to pick through a function. 
There is also the more blunt technique of inserting `print()` or `cat()` statements in the function.\n\n## Using `traceback()`\n\nThe `traceback()` function **prints out the function call stack** after an error has occurred.\n\nThe **function call stack** is the **sequence of functions that was called before the error occurred**.\n\nFor example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`.\n\nIf an error occurs, it may not be immediately clear in which function the error occurred.\n\nThe `traceback()` function **shows you how many levels deep** you were when the error occurred.\n\n::: callout-tip\n### Example\n\nLet's use the `mean()` function on a vector `z` that does not exist in our R environment.\n\n``` r\n> mean(z)\nError in mean(z) : object 'z' not found\n> traceback()\n1: mean(z)\n```\n\nHere, it's **clear that the error occurred** inside the `mean()` function because the object `z` does not exist.\n:::\n\nThe `traceback()` function **must be called immediately after an error** occurs. Once another function is called, you lose the traceback.\n\n::: callout-tip\n### Example\n\nHere is a slightly more complicated example using the `lm()` function for linear modeling.\n\n``` r\n> lm(y ~ x)\nError in eval(expr, envir, enclos) : object 'y' not found\n> traceback()\n7: eval(expr, envir, enclos)\n6: eval(predvars, data, env)\n5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)\n4: model.frame(formula = y ~ x, drop.unused.levels = TRUE)\n3: eval(expr, envir, enclos)\n2: eval(mf, parent.frame())\n1: lm(y ~ x)\n```\n\nYou can see now that the **error did not get thrown until the 7th level of the function call stack**, in which case the `eval()` function tried to evaluate the formula `y ~ x` and **realized the object `y` did not exist**.\n:::\n\nLooking at the traceback is useful for figuring out roughly where an error occurred but it's not useful for more detailed debugging. 
For that you might turn to the `debug()` function.\n\n## Using `debug()`\n\n
\n\nClick here for how to use `debug()` with an interactive browser.\n\nThe `debug()` function initiates an interactive debugger (also known as the \"browser\" in R) for a function. With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs.\n\nThe `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function.\n\n``` r\n> debug(lm) ## Flag the 'lm()' function for interactive debugging\n> lm(y ~ x)\ndebugging in: lm(y ~ x)\ndebug: {\n ret.x <- x\n ret.y <- y\n cl <- match.call()\n ...\n if (!qr)\n z$qr <- NULL \n z\n} \nBrowse[2]>\n```\n\nNow, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function.\n\nThe debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser:\n\n- `n` executes the current expression and moves to the next expression\n- `c` continues execution of the function and does not stop until either an error occurs or the function exits\n- `Q` quits the browser\n\nHere's an example of a browser session with the `lm()` function.\n\n``` r\nBrowse[2]> n ## Evaluate this expression and move to the next one\ndebug: ret.x <- x\nBrowse[2]> n\ndebug: ret.y <- y\nBrowse[2]> n\ndebug: cl <- match.call()\nBrowse[2]> n\ndebug: mf <- match.call(expand.dots = FALSE)\nBrowse[2]> n\ndebug: m <- match(c(\"formula\", \"data\", \"subset\", \"weights\", \"na.action\",\n \"offset\"), names(mf), 0L)\n```\n\nWhile you are in the browser you can execute any other R function that might be available to you in a regular session. 
In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment.\n\nYou can turn off interactive debugging with the `undebug()` function.\n\n``` r\nundebug(lm) ## Unflag the 'lm()' function for debugging\n```\n\n
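The `browser()` function from the tools list above deserves a quick illustration of its own: instead of flagging a whole function with `debug()`, you can pause execution at a specific line by inserting a call to `browser()` yourself. This is an illustrative sketch, not part of the original lecture code; the function `ratio_mean()` is made up for this example.

``` r
ratio_mean <- function(x) {
    total <- sum(x)
    browser() ## Execution pauses here; try ls() or print(total), then n, c, or Q
    total / length(x)
}
ratio_mean(1:10) ## Drops into the browser before returning 5.5
```

Unlike `debug()`, there is no flag to unset afterwards: you simply delete the `browser()` call when you are done.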
\n\n## Using `recover()`\n\n
\n\nClick here for how to use `recover()` with an interactive browser.\n\nThe `recover()` function can be used to modify the error behavior of R when an error occurs. Normally, when an error occurs in a function, R will print out an error message, exit out of the function, and return you to your workspace to await further commands.\n\nWith `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.\n\n``` r\n> options(error = recover) ## Change default R error behavior\n> read.csv(\"nosuchfile\") ## This code doesn't work\nError in file(file, \"rt\") : cannot open the connection\nIn addition: Warning message:\nIn file(file, \"rt\") :\n cannot open file 'nosuchfile': No such file or directory\n \nEnter a frame number, or 0 to exit\n\n1: read.csv(\"nosuchfile\")\n2: read.table(file = file, header = header, sep = sep, quote = quote, dec =\n3: file(file, \"rt\")\n\nSelection:\n```\n\nThe `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.\n\n
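Two small follow-ups, sketched under the assumption that the `print_message2()` function from earlier in this lecture is still defined: you can restore R's default error behavior with `options(error = NULL)`, and the `trace()` function from the tools list, which does not get its own section here, can inject a call to `browser()` into a function without editing its source.

``` r
options(error = NULL)          ## Restore R's default error behavior
trace(print_message2, browser) ## Enter the browser whenever print_message2() is called
print_message2(2)
untrace(print_message2)        ## Remove the injected debugging code
```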
\n\n# Summary\n\n- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal\n- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation\n- Interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n- Debugging tools are not a substitute for thinking!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try using `traceback()` to debug this piece of code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) g(a)\ng <- function(b) h(b)\nh <- function(c) i(c)\ni <- function(d) {\n if (!is.numeric(d)) {\n stop(\"`d` must be numeric\", call. = FALSE)\n }\n d + 10\n}\nf(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: `d` must be numeric\n```\n:::\n:::\n\n\nDescribe in words what is happening above?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 
0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n reprex * 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"18 - Debugging R Code\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Help! What's wrong with my code???\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/18-debugging-r-code/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n2. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adapted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Discuss an overall approach to debugging code in R\n- Recognize the three main indications of a problem/condition (`message`, `warning`, `error`) and a fatal problem (`error`)\n- Understand the importance of reproducing the problem when debugging a function or piece of code\n- Learn how the interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n:::\n\n# Debugging R Code\n\n## Overall approach\n\nFinding the **root cause of a problem is always challenging**.\n\nMost bugs are subtle and hard to find because if they were obvious, you would have avoided them in the first place.\n\nA good strategy helps. Below I outline a four-step process that I have found useful:\n\n### 1. Google!\n\nWhenever you see an error message, **start by googling it**.\n\nIf you are lucky, you will discover that it's a common error with a known solution.\n\n::: callout-tip\n### Pro-tip\n\nWhen googling, improve your chances of a good match by removing any variable names or values that are specific to your problem.\n:::\n\n### 2. 
Make it repeatable\n\nTo find the root cause of an error, you are going to need to execute the code many times as you consider and reject hypotheses.\n\n**To make that iteration as quick as possible**, it's worth some upfront investment to **make the problem both easy and fast to reproduce**.\n\nStart by creating a **rep**roducible **ex**ample (reprex).\n\n- This will help others help you, and **often leads to a solution without asking others**, because in the course of making the problem reproducible you often figure out the root cause.\n\nMake the **example minimal by removing code and simplifying data**.\n\n- As you do this, you may discover inputs that do not trigger the error.\n- Make note of them: they will be helpful when diagnosing the root cause.\n\n::: callout-tip\n### Example\n\nLet's try making a **reprex** [using the `reprex` package](https://www.tidyverse.org/help) (installed with the `tidyverse`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(reprex)\n```\n:::\n\n\nWrite a bit of code and copy it to the clipboard:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(y <- 1:4)\nmean(y)\n```\n:::\n\n\nEnter `reprex()` in the R Console. In RStudio, you'll see a preview of your rendered reprex.\n\nIt is now ready and waiting on your clipboard, so you can paste it into, say, a GitHub issue.\n\nOne last step. Let's go here and open up an issue on the course website:\n\n- \n\nWe will paste in the code from our reprex.\n:::\n\nIn RStudio, you can access reprex from the addins menu, which makes it even easier to point out your code and select the output format.\n\n### 3. 
Figure out where it is\n\nIt's a great idea to adopt the scientific method here.\n\n- Generate hypotheses\n- Design experiments to test them\n- Record your results\n\nThis may seem like a lot of work, but **a systematic approach** will end up saving you time.\n\nOften **a lot of time can be wasted relying on your intuition to solve a bug** (\"oh, it must be an off-by-one error, so I'll just subtract 1 here\"), when you would have been better off taking a systematic approach.\n\nIf this fails, you **might need to ask for help from someone else**.\n\nIf you have followed the previous step, you will have a small example that is easy to share with others. That makes it much easier for other people to look at the problem, and more likely to help you find a solution.\n\n### 4. Fix it and test it\n\nOnce you have found the bug, you need to **figure out how to fix it** and to **check that the fix actually worked**.\n\nAgain, it is very useful to have automated tests in place.\n\n- Not only does this help to ensure that you **have actually fixed the bug**, it also **helps to ensure you have not introduced any new bugs** in the process.\n- In the absence of automated tests, make sure to **carefully record the correct output**, and check against the inputs that previously failed.\n\n## Something's Wrong!\n\nOnce you have made the error repeatable, the next step is to figure out where it comes from.\n\nR has a number of **ways to indicate to you that something is not right**.\n\nThere are **different levels of indication** that can be used, ranging from mere notification to fatal error. Executing any function in R may result in the following **conditions**.\n\n- `message`: A **generic notification/diagnostic message** produced by the `message()` function; execution of the function continues\n- `warning`: An indication that **something is wrong but not necessarily fatal**; execution of the function continues. 
Warnings are generated by the `warning()` function\n- `error`: An indication that **a fatal problem has occurred** and execution of the function stops. Errors are produced by the `stop()` function.\n- `condition`: A generic concept for indicating that **something unexpected has occurred**; programmers can create their own custom conditions if they want.\n\n::: callout-tip\n### Example\n\nHere is an example of a warning that you might receive in the course of using R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(-1)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(-1): NaNs produced\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN\n```\n:::\n:::\n\n\nThis warning lets you know that taking the log of a negative number results in a `NaN` value because you **can't take the log of negative numbers**.\n:::\n\nNevertheless, R doesn't give an error, because it has a useful value that it can return, the **`NaN` value**.\n\nThe **warning is just there** to let you know that **something unexpected happened**.\n\nDepending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code.\n\n::: callout-tip\n### Example\n\nHere is another function that is designed to print a message to the console depending on the nature of its input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message <- function(x) {\n if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nThis function is simple:\n\n- It **prints a message** telling you whether `x` is greater than zero or less than or equal to zero.\n- It also returns its input **invisibly**, which is a common practice with \"print\" functions.\n\n**Returning an object invisibly** means that the **return value does not get auto-printed** when the function is called.\n\nTake a hard look at the function above and see if you can identify any bugs or 
problems.\n\nWe can execute the function as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nThe function seems to work fine at this point. No errors, warnings, or messages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message(NA)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (x > 0) {: missing value where TRUE/FALSE needed\n```\n:::\n:::\n\n:::\n\nWhat happened?\n\n- Well, the first thing the function does is test if `x > 0`.\n- But you can't do that test if `x` is a `NA` or `NaN` value.\n- R **doesn't know what to do in this case** so it **stops with a fatal error**.\n\nWe can **fix this problem** by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2 <- function(x) {\n if (is.na(x)) {\n print(\"x is a missing value!\")\n } else if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nNow we can run the following.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message2(NA)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is a missing value!\"\n```\n:::\n:::\n\n\nAnd all is fine.\n\nNow what about the following situation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- log(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in log(c(-1, 2)): NaNs produced\n```\n:::\n\n```{.r .cell-code}\nprint_message2(x)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in if (is.na(x)) {: the condition has length > 1\n```\n:::\n:::\n\n\nNow what?? 
Why are we getting this error?\n\nThe **error** says \"the condition has length \\> 1\". (In versions of R before 4.2, this was only a warning, saying \"the condition has length \\> 1 and only the first element will be used\".)\n\nThe **problem here** is that I passed `print_message2()` a vector `x` that was of length 2 rather than length 1.\n\nInside the body of `print_message2()` the expression `is.na(x)` returns a vector that is tested in the `if` statement.\n\nHowever, `if` cannot take vector arguments, so you get an error.\n\nThe fundamental problem here is that `print_message2()` is not **vectorized**.\n\nWe can **solve this problem** in two ways.\n\n1. Simply **not allow vector arguments**.\n2. The other way is to **vectorize** the `print_message2()` function to allow it to take vector arguments.\n\nFor the **first way**, we simply need to check the length of the input.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3 <- function(x) {\n if (length(x) > 1L) {\n stop(\"'x' has length > 1\")\n }\n if (is.na(x)) {\n print(\"x is a missing value!\")\n } else if (x > 0) {\n print(\"x is greater than zero\")\n } else {\n print(\"x is less than or equal to zero\")\n }\n invisible(x)\n}\n```\n:::\n\n\nNow when we pass `print_message3()` a vector, we should get an **error**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message3(1:2)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in print_message3(1:2): 'x' has length > 1\n```\n:::\n:::\n\n\nVectorizing the function can be accomplished easily with the `Vectorize()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint_message4 <- Vectorize(print_message2)\nout <- print_message4(c(-1, 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"x is less than or equal to zero\"\n[1] \"x is greater than zero\"\n```\n:::\n:::\n\n\nYou can see now that the **correct messages are printed without any warning or error**.\n\n::: callout-tip\n### Note\n\nI stored the return value of `print_message4()` in a separate R object called `out`.\n\nThis is because when I use the `Vectorize()` function it no 
longer preserves the invisibility of the return value.\n:::\n\n::: callout-tip\n### Helpful tips\n\nThe **primary task of debugging** any R code is **correctly diagnosing what the problem is**.\n\nWhen diagnosing a problem with your code (or somebody else's), it's important to **first understand what you were expecting to occur**.\n\nThen you need to **identify what did occur** and **how it deviated from your expectations**.\n\nSome basic questions you need to ask are\n\n- What was your input? How did you call the function?\n- What were you expecting? Output, messages, other results?\n- What did you get?\n- How does what you get differ from what you were expecting?\n- Were your expectations correct in the first place?\n- Can you reproduce the problem (exactly)?\n:::\n\nBeing able to answer these questions is important not just for your own sake, but in situations where you may need to ask someone else for help with debugging the problem.\n\nSeasoned programmers will be asking you these exact questions.\n\n# Debugging Tools in R\n\nR provides a number of tools to help you with debugging your code. The primary tools for debugging functions in R are\n\n- `traceback()`: **prints out the function call stack** after an error occurs; does nothing if there's no error\n- `debug()`: **flags a function for \"debug\" mode** which allows you to step through execution of a function one line at a time\n- `browser()`: **suspends the execution of a function** wherever it is called and puts the function in debug mode\n- `trace()`: allows you to **insert debugging code into a function** at specific places\n- `recover()`: allows you to **modify the error behavior** so that you can browse the function call stack\n\nThese functions are interactive tools specifically designed to allow you to pick through a function. 
There is also the more blunt technique of inserting `print()` or `cat()` statements in the function.\n\n## Using `traceback()`\n\nThe `traceback()` function **prints out the function call stack** after an error has occurred.\n\nThe **function call stack** is the **sequence of functions that was called before the error occurred**.\n\nFor example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`.\n\nIf an error occurs, it may not be immediately clear in which function the error occurred.\n\nThe `traceback()` function **shows you how many levels deep** you were when the error occurred.\n\n::: callout-tip\n### Example\n\nLet's use the `mean()` function on a vector `z` that does not exist in our R environment.\n\n``` r\n> mean(z)\nError in mean(z) : object 'z' not found\n> traceback()\n1: mean(z)\n```\n\nHere, it's **clear that the error occurred** inside the `mean()` function because the object `z` does not exist.\n:::\n\nThe `traceback()` function **must be called immediately after an error** occurs. Once another function is called, you lose the traceback.\n\n::: callout-tip\n### Example\n\nHere is a slightly more complicated example using the `lm()` function for linear modeling.\n\n``` r\n> lm(y ~ x)\nError in eval(expr, envir, enclos) : object 'y' not found\n> traceback()\n7: eval(expr, envir, enclos)\n6: eval(predvars, data, env)\n5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)\n4: model.frame(formula = y ~ x, drop.unused.levels = TRUE)\n3: eval(expr, envir, enclos)\n2: eval(mf, parent.frame())\n1: lm(y ~ x)\n```\n\nYou can see now that the **error did not get thrown until the 7th level of the function call stack**, at which point the `eval()` function tried to evaluate the formula `y ~ x` and **realized the object `y` did not exist**.\n:::\n\nLooking at the traceback is useful for figuring out roughly where an error occurred but it's not useful for more detailed debugging. 
For that you might turn to the `debug()` function.\n\n## Using `debug()`\n\n
\n\nHere is how to use `debug()` with an interactive browser.\n\nThe `debug()` function initiates an interactive debugger (also known as the \"browser\" in R) for a function. With the debugger, you can step through an R function one expression at a time to pinpoint exactly where an error occurs.\n\nThe `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function.\n\n``` r\n> debug(lm) ## Flag the 'lm()' function for interactive debugging\n> lm(y ~ x)\ndebugging in: lm(y ~ x)\ndebug: {\n ret.x <- x\n ret.y <- y\n cl <- match.call()\n ...\n if (!qr)\n z$qr <- NULL \n z\n} \nBrowse[2]>\n```\n\nNow, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function.\n\nThe debugger calls the browser at the very top level of the function body. From there you can step through each expression in the body. There are a few special commands you can call in the browser:\n\n- `n` executes the current expression and moves to the next expression\n- `c` continues execution of the function and does not stop until either an error occurs or the function exits\n- `Q` quits the browser\n\nHere's an example of a browser session with the `lm()` function.\n\n``` r\nBrowse[2]> n ## Evaluate this expression and move to the next one\ndebug: ret.x <- x\nBrowse[2]> n\ndebug: ret.y <- y\nBrowse[2]> n\ndebug: cl <- match.call()\nBrowse[2]> n\ndebug: mf <- match.call(expand.dots = FALSE)\nBrowse[2]> n\ndebug: m <- match(c(\"formula\", \"data\", \"subset\", \"weights\", \"na.action\",\n \"offset\"), names(mf), 0L)\n```\n\nWhile you are in the browser you can execute any other R function that might be available to you in a regular session. 
In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment.\n\nYou can turn off interactive debugging with the `undebug()` function.\n\n``` r\nundebug(lm) ## Unflag the 'lm()' function for debugging\n```\n\n
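## Using `browser()`\n\nInstead of flagging a whole function with `debug()`, you can pause execution at a specific point by calling `browser()` directly inside your own code. When R reaches that call, it suspends execution and drops you into the same interactive browser described above, where `n`, `c`, and `Q` work as before. Here is a minimal sketch (the function `first_over()` below is a made-up example for illustration, not from base R):\n\n``` r\nfirst_over <- function(x, threshold) {\n idx <- which(x > threshold) ## candidate positions\n browser() ## execution pauses here; inspect the environment with ls() and print(idx)\n x[idx[1]]\n}\n\nfirst_over(c(1, 5, 3), threshold = 2)\n```\n\nBecause `browser()` only fires where you placed it, remember to delete the call once you have fixed the bug.\n\n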
\n\n## Using `recover()`\n\n
\n\nHere is how to use `recover()` with an interactive browser.\n\nThe `recover()` function can be used to modify the error behavior of R when an error occurs. Normally, when an error occurs in a function, R will print out an error message, exit out of the function, and return you to your workspace to await further commands.\n\nWith `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.\n\n``` r\n> options(error = recover) ## Change default R error behavior\n> read.csv(\"nosuchfile\") ## This code doesn't work\nError in file(file, \"rt\") : cannot open the connection\nIn addition: Warning message:\nIn file(file, \"rt\") :\n cannot open file 'nosuchfile': No such file or directory\n \nEnter a frame number, or 0 to exit\n\n1: read.csv(\"nosuchfile\")\n2: read.table(file = file, header = header, sep = sep, quote = quote, dec =\n3: file(file, \"rt\")\n\nSelection:\n```\n\nThe `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.\n\n
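When you are done debugging, reset the error option so that R returns to its default behavior:\n\n``` r\noptions(error = NULL) ## restore the default error behavior\n```\n\nOtherwise, every subsequent error in your session will keep dropping you into the frame-selection menu.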
\n\n# Summary\n\n- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal\n- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation\n- Interactive debugging tools `traceback`, `debug`, `recover`, `browser`, and `trace` can be used to find problematic code in functions\n- Debugging tools are not a substitute for thinking!\n\n# Post-lecture materials\n\n### Final Questions\n\nHere are some post-lecture questions to help you think about the material discussed.\n\n::: callout-note\n### Questions\n\n1. Try using `traceback()` to debug this piece of code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) g(a)\ng <- function(b) h(b)\nh <- function(c) i(c)\ni <- function(d) {\n if (!is.numeric(d)) {\n stop(\"`d` must be numeric\", call. = FALSE)\n }\n d + 10\n}\nf(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: `d` must be numeric\n```\n:::\n:::\n\n\nDescribe in words what is happening above?\n:::\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 
0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n reprex * 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json b/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json index bc1a96c..3bd7d55 100644 --- a/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json +++ b/_freeze/posts/19-error-handling-and-generation/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "0526af525b38b7e0548e2c7611c85749", + "hash": "09a27c1c17c10264f75e75c5203f76e9", "result": { - "markdown": "---\ntitle: \"19 - Error Handling and Generation\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: 
https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Implement exception handling routines in R functions\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/19-error-handling-and-generation/index.qmd).*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Create errors, warnings, and messages in R functions using the functions `stop`, `stopifnot`, `warning`, and `message`.\n- Understand the importance of providing useful error messaging to improve user experience with functions. 
However, these can also slow down code substantially.\n:::\n\n# Error Handling and Generation\n\n## What is an error?\n\n**Errors most often occur** when code is used in a way that **it is not intended to be used**.\n\n::: callout-tip\n### Example\n\nFor example, adding two strings together produces the following error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\"hello\" + \"world\"\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"hello\" + \"world\": non-numeric argument to binary operator\n```\n:::\n:::\n\n:::\n\nThe `+` operator is essentially a **function** that takes two numbers as arguments and finds their sum.\n\nSince neither `\"hello\"` nor `\"world\"` is a number, the R interpreter produces an error.\n\n**Errors will stop the execution of your program**, and they will (hopefully) print an error message to the R console.\n\nIn R there are two other constructs which are related to errors:\n\n1. Warnings\n2. Messages\n\n**Warnings** are meant to indicate that **something seems to have gone wrong** in your program that should be inspected.\n\n::: callout-tip\n### Example\n\nHere's a simple example of a warning being generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe `as.numeric()` function attempts to **convert each string** in `c(\"5\", \"6\", \"seven\")` into a number; however, it is impossible to convert `\"seven\"`, so a warning is generated.\n\nExecution of the code is not halted, and an `NA` is produced for `\"seven\"` instead of a number.\n:::\n\n**Messages** simply **print to the R console**, though they are generated by an underlying mechanism that is similar to how errors and warnings are generated.\n\n::: callout-tip\n### Example\n\nHere's a small function that will generate a message:\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nf <- function(){\n message(\"This is a message.\")\n}\n\nf()\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nThis is a message.\n```\n:::\n:::\n\n:::\n\n## Generating Errors\n\nThere are a few essential functions for **generating** errors, warnings, and messages in R.\n\nThe `stop()` function will generate an error.\n\n::: callout-tip\n### Example\n\nLet's generate an error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstop(\"Something erroneous has occurred!\")\n```\n:::\n\n\n``` r\nError: Something erroneous has occurred!\n```\n:::\n\nIf an error occurs inside of a function, then the **name of that function will appear in the error message**:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname_of_function <- function(){\n stop(\"Something bad happened.\")\n}\n\nname_of_function()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in name_of_function(): Something bad happened.\n```\n:::\n:::\n\n\nThe `stopifnot()` function takes a series of logical expressions as arguments and if any of them are false an error is generated specifying which expression is false.\n\n::: callout-tip\n### Example\n\nLet's take a look at an example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_if_n_is_greater_than_zero <- function(n){\n stopifnot(n <= 0)\n n\n}\n\nerror_if_n_is_greater_than_zero(5)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in error_if_n_is_greater_than_zero(5): n <= 0 is not TRUE\n```\n:::\n:::\n\n:::\n\nThe `warning()` function creates a warning, and the function itself is very similar to the `stop()` function. 
Remember that a warning does not stop the execution of a program (unlike an error).\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwarning(\"Consider yourself warned!\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Consider yourself warned!\n```\n:::\n:::\n\n:::\n\nJust like errors, a warning generated inside of a function will include the name of the function in which it was generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_NA <- function(x){\n warning(\"Generating an NA.\")\n NA\n}\n\nmake_NA(\"Sodium\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in make_NA(\"Sodium\"): Generating an NA.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\nMessages are simpler than errors or warnings; they just print strings to the R console.\n\nYou can issue a message with the `message()` function:\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmessage(\"In a bottle.\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nIn a bottle.\n```\n:::\n:::\n\n:::\n\n## When to generate errors or warnings\n\nStopping the execution of your program with `stop()` should only happen in the event of a catastrophe - meaning only if it is impossible for your program to continue.\n\n- If there are **conditions that you can anticipate** that would cause your program to create an error, then you **should document those conditions** so whoever uses your software is aware.\n\nOne example:\n\n- Providing invalid arguments to a function. 
You could check this at the beginning of your program using `stopifnot()` so that the user can quickly realize something has gone wrong.\n\nYou can think of a function as kind of contract between you and the user:\n\n- if the user provides specified arguments, your program will provide predictable results.\n\nOf course it's **impossible for you to anticipate** all of the potential uses of your program.\n\nIt's **appropriate to create a warning** when this contract between you and the user is violated.\n\nA perfect example of this situation is the result of\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe user expects a vector of numbers to be returned as the result of `as.numeric()` but `\"seven\"` is coerced into being NA, which is not completely intuitive.\n\nR has largely been developed according to the [Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which generally **discourages printing text to the console unless something unexpected has occurred**.\n\nLanguages that commonly run on Unix systems like C and C++ are rarely used interactively, meaning that they usually underpin computer infrastructure (computers \"talking\" to other computers).\n\n**Messages printed to the console** are therefore not very useful since nobody will ever read them and it's not straightforward for other programs to capture and interpret them.\n\nIn contrast, R code is frequently executed by human beings in the R console, which serves as an interactive environment between the computer and person at the keyboard.\n\nIf you **think your program should produce a message**, make sure that the **output of the message is primarily meant for a human to read**.\n\nYou should avoid signaling a condition or the result of your program to another program by creating a 
message.\n\n## How should errors be handled?\n\nImagine writing a program that will take a long time to complete because of a complex calculation or because you're handling a large amount of data. If an error occurs during this computation then you're liable to lose all of the results that were calculated before the error, or your program may not finish a critical task that a program further down your pipeline is depending on. If you anticipate the possibility of errors occurring during the execution of your program, then you can design your program to handle them appropriately.\n\nThe `tryCatch()` function is the workhorse of handling errors and warnings in R. The first argument of this function is any R expression, followed by conditions which specify how to handle an error or a warning. The last argument, `finally`, specifies a function or expression that will be executed after the expression no matter what, even in the event of an error or a warning.\n\nLet's construct a simple function I'm going to call [`beera`](https://en.wikipedia.org/wiki/Yogi_Berra) that catches errors and warnings gracefully.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera <- function(expr){\n tryCatch(expr,\n error = function(e){\n message(\"An error occurred:\\n\", e)\n },\n warning = function(w){\n message(\"A warning occurred:\\n\", w)\n },\n finally = {\n message(\"Finally done!\")\n })\n}\n```\n:::\n\n\nThis function takes an expression as an argument and tries to evaluate it. If the expression can be evaluated without any errors or warnings then the result of the expression is returned and the message `Finally done!` is printed to the R console. If an error or warning is generated, then the functions that are provided to the `error` or `warning` arguments are executed. Let's try this function out with a few examples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera({\n 2 + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFinally done!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\nbeera({\n \"two\" + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nAn error occurred:\nError in \"two\" + 2: non-numeric argument to binary operator\n\nFinally done!\n```\n:::\n\n```{.r .cell-code}\nbeera({\n as.numeric(c(1, \"two\", 3))\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nA warning occurred:\nsimpleWarning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced by coercion\n\nFinally done!\n```\n:::\n:::\n\n\nNotice that we've effectively transformed errors and warnings into messages.\n\nNow that you know the basics of generating and catching errors you'll need to decide when your program should generate an error. My advice to you is to limit the number of errors your program generates as much as possible. Even if you design your program so that it's able to catch and handle errors, the error handling process slows down your program substantially. Imagine you wanted to write a simple function that checks if an argument is an even number. You might write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even <- function(n){\n n %% 2 == 0\n}\n\nis_even(768)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even(\"two\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in n%%2: non-numeric argument to binary operator\n```\n:::\n:::\n\n\nYou can see that providing a string causes this function to raise an error. You could imagine though that you want to use this function across a list of different data types, and you only want to know which elements of that list are even numbers. 
You might think to write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_error <- function(n){\n tryCatch(n %% 2 == 0,\n error = function(e){\n FALSE\n })\n}\n\nis_even_error(714)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_error(\"eight\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\nThis appears to be working the way you intended, however when applied to more data this function will be seriously slow compared to alternatives. For example I could check that `n` is numeric before treating `n` like a number:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_check <- function(n){\n is.numeric(n) && n %% 2 == 0\n}\n\nis_even_check(1876)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_check(\"twelve\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n::: keyideas\nNotice that by using `is.numeric()` before the \"AND\" operator (`&&`), the expression `n %% 2 == 0` is never evaluated. 
This is a programming language design feature called \"short circuiting.\" The expression can never evaluate to `TRUE` if the left hand side of `&&` evaluates to `FALSE`, so the right hand side is ignored.\n:::\n\nTo demonstrate the difference in the speed of the code, we will use the `microbenchmark` package to measure how long it takes for each function to be applied to the same data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(microbenchmark)\nmicrobenchmark(sapply(letters, is_even_check))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_check) 46.224 47.7975 61.43616 48.6445 58.4755 167.091 100\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmicrobenchmark(sapply(letters, is_even_error))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_error) 640.067 678.0285 906.3037 784.4315 1044.501 2308.931 100\n```\n\nThe error catching approach is nearly 15 times slower!\n\nProper error handling is an essential tool for any software developer so that you can design programs that are error tolerant. Creating clear and informative error messages is essential for building quality software.\n\n::: callout-tip\n### Pro-tip\n\nOne closing tip I recommend is to put documentation for your software online, including the meaning of the errors that your software can potentially throw. 
Often a user's first instinct when encountering an error is to search online for that error message, which should lead them to your documentation!\n:::\n\n# Summary\n\n- Errors, warnings, and messages can be generated within R code using the functions `stop`, `stopifnot`, `warning`, and `message`.\n\n- Catching errors, and providing useful error messaging, can improve user experience with functions but can also slow down code substantially.\n\n# Post-lecture materials\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"19 - Error Handling and Generation\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Implement exception handling routines in R functions\"\ncategories: [module 4, week 5, programming, debugging]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/19-error-handling-and-generation/index.qmd).*\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. \n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Create errors, warnings, and messages in R functions using the functions `stop`, `stopifnot`, `warning`, and `message`.\n- Understand the importance of providing useful error messaging to improve user experience with functions. 
However, these can also slow down code substantially.\n:::\n\n# Error Handling and Generation\n\n## What is an error?\n\n**Errors most often occur** when code is used in a way that **it is not intended to be used**.\n\n::: callout-tip\n### Example\n\nFor example, adding two strings together produces the following error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n\"hello\" + \"world\"\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in \"hello\" + \"world\": non-numeric argument to binary operator\n```\n:::\n:::\n\n:::\n\nThe `+` operator is essentially a **function** that takes two numbers as arguments and finds their sum.\n\nSince neither `\"hello\"` nor `\"world\"` is a number, the R interpreter produces an error.\n\n**Errors will stop the execution of your program**, and they will (hopefully) print an error message to the R console.\n\nIn R, there are two other constructs which are related to errors:\n\n1. Warnings\n2. Messages\n\n**Warnings** are meant to indicate that **something seems to have gone wrong** in your program that should be inspected.\n\n::: callout-tip\n### Example\n\nHere's a simple example of a warning being generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe `as.numeric()` function attempts to **convert each string** in `c(\"5\", \"6\", \"seven\")` into a number; however, it is impossible to convert `\"seven\"`, so a warning is generated.\n\nExecution of the code is not halted, and an `NA` is produced for `\"seven\"` instead of a number.\n:::\n\n**Messages** simply **print to the R console**, though they are generated by an underlying mechanism that is similar to how errors and warnings are generated.\n\n::: callout-tip\n### Example\n\nHere's a small function that will generate a message:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function() {\n message(\"This is a message.\")\n}\n\nf()\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nThis is a message.\n```\n:::\n:::\n\n:::\n\n## Generating Errors\n\nThere are a few essential functions for **generating** errors, warnings, and messages in R.\n\nThe `stop()` function will generate an error.\n\n::: callout-tip\n### Example\n\nLet's generate an error:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstop(\"Something erroneous has occurred!\")\n```\n:::\n\n\n``` r\nError: Something erroneous has occurred!\n```\n:::\n\nIf an error occurs inside of a function, then the **name of that function will appear in the error message**:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nname_of_function <- function() {\n stop(\"Something bad happened.\")\n}\n\nname_of_function()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in name_of_function(): Something bad happened.\n```\n:::\n:::\n\n\nThe `stopifnot()` function takes a series of logical expressions as arguments, and if any of them are false, an error is generated specifying which expression is false.\n\n::: callout-tip\n### Example\n\nLet's take a look at an example:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_if_n_is_greater_than_zero <- function(n) {\n stopifnot(n <= 0)\n n\n}\n\nerror_if_n_is_greater_than_zero(5)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in error_if_n_is_greater_than_zero(5): n <= 0 is not TRUE\n```\n:::\n:::\n\n:::\n\nThe `warning()` function creates a warning, and the function itself is very similar to the `stop()` function. 
Remember that a warning does not stop the execution of a program (unlike an error).\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwarning(\"Consider yourself warned!\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: Consider yourself warned!\n```\n:::\n:::\n\n:::\n\nJust like errors, a warning generated inside of a function will include the name of the function in which it was generated:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_NA <- function(x) {\n warning(\"Generating an NA.\")\n NA\n}\n\nmake_NA(\"Sodium\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in make_NA(\"Sodium\"): Generating an NA.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\nMessages are simpler than errors or warnings; they just print strings to the R console.\n\nYou can issue a message with the `message()` function:\n\n::: callout-tip\n### Example\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmessage(\"In a bottle.\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nIn a bottle.\n```\n:::\n:::\n\n:::\n\n## When to generate errors or warnings\n\nStopping the execution of your program with `stop()` should only happen in the event of a catastrophe, meaning only if it is impossible for your program to continue.\n\n- If there are **conditions that you can anticipate** that would cause your program to create an error, then you **should document those conditions** so whoever uses your software is aware.\n\nAn example includes:\n\n- Providing invalid arguments to a function. 
You could check this at the beginning of your program using `stopifnot()` so that the user can quickly realize something has gone wrong.\n\nYou can think of a function as a kind of contract between you and the user:\n\n- if the user provides the specified arguments, your program will provide predictable results.\n\nOf course, it's **impossible for you to anticipate** all of the potential uses of your program.\n\nIt's **appropriate to create a warning** when this contract between you and the user is violated.\n\nA perfect example of this situation is the result of\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"5\", \"6\", \"seven\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5 6 NA\n```\n:::\n:::\n\n\nThe user expects a vector of numbers to be returned as the result of `as.numeric()`, but `\"seven\"` is coerced to `NA`, which is not completely intuitive.\n\nR has largely been developed according to the [Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which generally **discourages printing text to the console unless something unexpected has occurred**.\n\nLanguages that commonly run on Unix systems like C and C++ are rarely used interactively, meaning that they usually underpin computer infrastructure (computers \"talking\" to other computers).\n\n**Messages printed to the console** are therefore not very useful since nobody will ever read them and it's not straightforward for other programs to capture and interpret them.\n\nIn contrast, R code is frequently executed by human beings in the R console, which serves as an interactive environment between the computer and the person at the keyboard.\n\nIf you **think your program should produce a message**, make sure that the **output of the message is primarily meant for a human to read**.\n\nYou should avoid signaling a condition or the result of your program to another program by creating a 
message.\n\n## How should errors be handled?\n\nImagine writing a program that will take a long time to complete because of a complex calculation or because you're handling a large amount of data. If an error occurs during this computation, then you're liable to lose all of the results that were calculated before the error, or your program may not finish a critical task that a program further down your pipeline is depending on. If you anticipate the possibility of errors occurring during the execution of your program, then you can design your program to handle them appropriately.\n\nThe `tryCatch()` function is the workhorse of handling errors and warnings in R. The first argument of this function is any R expression, followed by conditions which specify how to handle an error or a warning. The last argument, `finally`, specifies a function or expression that will be executed after the expression no matter what, even in the event of an error or a warning.\n\nLet's construct a simple function I'm going to call [`beera`](https://en.wikipedia.org/wiki/Yogi_Berra) that catches errors and warnings gracefully.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera <- function(expr) {\n tryCatch(expr,\n error = function(e) {\n message(\"An error occurred:\\n\", e)\n },\n warning = function(w) {\n message(\"A warning occurred:\\n\", w)\n },\n finally = {\n message(\"Finally done!\")\n }\n )\n}\n```\n:::\n\n\nThis function takes an expression as an argument and tries to evaluate it. If the expression can be evaluated without any errors or warnings, then the result of the expression is returned and the message `Finally done!` is printed to the R console. If an error or warning is generated, then the function provided to the `error` or `warning` argument is executed instead. Let's try this function out with a few examples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeera({\n 2 + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFinally done!\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\nbeera({\n \"two\" + 2\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nAn error occurred:\nError in \"two\" + 2: non-numeric argument to binary operator\n\nFinally done!\n```\n:::\n\n```{.r .cell-code}\nbeera({\n as.numeric(c(1, \"two\", 3))\n})\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nA warning occurred:\nsimpleWarning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced by coercion\n\nFinally done!\n```\n:::\n:::\n\n\nNotice that we've effectively transformed errors and warnings into messages.\n\nNow that you know the basics of generating and catching errors, you'll need to decide when your program should generate an error. My advice to you is to limit the number of errors your program generates as much as possible. Even if you design your program so that it's able to catch and handle errors, the error handling process can slow down your program substantially. Imagine you wanted to write a simple function that checks if an argument is an even number. You might write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even <- function(n) {\n n %% 2 == 0\n}\n\nis_even(768)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even(\"two\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in n%%2: non-numeric argument to binary operator\n```\n:::\n:::\n\n\nYou can see that providing a string causes this function to raise an error. You could imagine, though, that you want to use this function across a list of different data types, and you only want to know which elements of that list are even numbers. 
You might think to write the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_error <- function(n) {\n tryCatch(n %% 2 == 0,\n error = function(e) {\n FALSE\n }\n )\n}\n\nis_even_error(714)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_error(\"eight\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\nThis appears to work the way you intended; however, when applied to more data, this function will be seriously slow compared to alternatives. For example, I could check that `n` is numeric before treating `n` like a number:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis_even_check <- function(n) {\n is.numeric(n) && n %% 2 == 0\n}\n\nis_even_check(1876)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis_even_check(\"twelve\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n::: keyideas\nNotice that by using `is.numeric()` before the \"AND\" operator (`&&`), the expression `n %% 2 == 0` is never evaluated. 
This is a programming language design feature called \"short circuiting.\" The expression can never evaluate to `TRUE` if the left hand side of `&&` evaluates to `FALSE`, so the right hand side is ignored.\n:::\n\nTo demonstrate the difference in the speed of the code, we will use the `microbenchmark` package to measure how long it takes for each function to be applied to the same data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(microbenchmark)\nmicrobenchmark(sapply(letters, is_even_check))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_check) 46.224 47.7975 61.43616 48.6445 58.4755 167.091 100\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmicrobenchmark(sapply(letters, is_even_error))\n```\n:::\n\n\n``` \nUnit: microseconds\n expr min lq mean median uq max neval\n sapply(letters, is_even_error) 640.067 678.0285 906.3037 784.4315 1044.501 2308.931 100\n```\n\nThe error catching approach is nearly 15 times slower!\n\nProper error handling is an essential tool for any software developer so that you can design programs that are error tolerant. Creating clear and informative error messages is essential for building quality software.\n\n::: callout-tip\n### Pro-tip\n\nOne closing tip I recommend is to put documentation for your software online, including the meaning of the errors that your software can potentially throw. 
Often a user's first instinct when encountering an error is to search online for that error message, which should lead them to your documentation!\n:::\n\n# Summary\n\n- Errors, warnings, and messages can be generated within R code using the functions `stop`, `stopifnot`, `warning`, and `message`.\n\n- Catching errors, and providing useful error messaging, can improve user experience with functions but can also slow down code substantially.\n\n# Post-lecture materials\n\n### Additional Resources\n\n::: callout-tip\n- \n- \n:::\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-08-17\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json b/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json index 95b68c9..b82baf4 100644 --- a/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json +++ b/_freeze/posts/20-working-with-dates-and-times/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "f3b1703648e41b35e71a956912954ed5", + "hash": "9fcf3d6b99d63f4cc32e29c25e524a43", "result": { - "markdown": "---\ntitle: \"20 - Working with dates and times\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Introduction to lubridate for dates and times in R\"\ncategories: [module 5, week 6, tidyverse, R, programming, dates and times]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/posts/20-working-with-dates-and-times/index.qmd).*\n\n\n\n# Pre-lecture materials\n\n### Read ahead\n\n::: callout-note\n## Read ahead\n\n**Before class, you can prepare by reading the following materials:**\n\n1. 
\n:::\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\n- Recognize the `Date`, `POSIXct` and `POSIXlt` class types in R to represent dates and times\n- Learn how to create date and time objects in R using functions from the `lubridate` package\n- Learn how dealing with time zones can be frustrating 🙀 but hopefully less so after today's lecture 😺\n- Learn how to perform arithmetic operations on dates and times\n- Learn how plotting systems in R \"know\" about dates and times to appropriately handle axis labels\n:::\n\n# Introduction\n\nIn this lesson, we will **learn how to work with dates and times** in R. These may seem simple as you use them all of the time in your day-to-day life, but the more you work with them, the more complicated they seem to get.\n\n**Dates and times are hard to work with** because they have to reconcile **two physical phenomena**\n\n1. The rotation of the Earth and its orbit around the sun AND\n2. 
A whole raft of geopolitical phenomena including months, time zones, and daylight savings time (DST)\n\nThis lesson will not teach you every last detail about dates and times, but it will give you a solid grounding of **practical skills** that will help you with common data analysis challenges.\n\n::: callout-tip\n### Classes for dates and times in R\n\nR has developed a special representation of dates and times\n\n- Dates are represented by the `Date` class\n- Times are represented by the `POSIXct` or the `POSIXlt` class\n:::\n\n::: callout-tip\n### Important point in time\n\n- Dates are stored internally as the number of days since 1970-01-01\n- Times are stored internally as the number of seconds since 1970-01-01\n\nIn computing, **Unix time** (also known as Epoch time, Posix time, seconds since the Epoch, Unix timestamp, or UNIX Epoch time) is a system for **describing a point in time**.\n\nIt is the number of seconds that have elapsed since the Unix epoch, excluding leap seconds. The Unix epoch is 00:00:00 UTC on 1 January 1970.\n\nUnix time originally appeared as the system time of Unix, but is now used widely in computing, for example by filesystems; some Python language library functions handle Unix time.\\[4\\]\n\n\n:::\n\n## The `lubridate` package\n\nHere, we will focus on the `lubridate` R package, which makes it easier to work with dates and times in R.\n\n::: callout-tip\n### Pro-tip\n\n**Check out the `lubridate` cheat sheet** at \n:::\n\nA few things to note about it:\n\n- It largely **replaces the default date/time functions in base R**\n- It contains **methods for date/time arithmetic**\n- It **handles time zones**, leap year, leap seconds, etc.\n\n![Artwork by Allison Horst on the dplyr package](https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/lubridate_ymd.png){preview=\"TRUE\"} \\[**Source**: [Artwork by Allison Horst](https://github.com/allisonhorst/stats-illustrations)\\]\n\n`lubridate` is installed when you install 
`tidyverse`, but it is not loaded when you load `tidyverse`. Alternatively, you can install it separately.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"lubridate\") \n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(lubridate) \n```\n:::\n\n\n# Creating date/times\n\nThere are three types of date/time data that refer to an instant in time:\n\n- A **date**. Tibbles print this as ``.\n- A **time** within a day. Tibbles print this as `