-
Notifications
You must be signed in to change notification settings - Fork 0
/
technical_definitions.qmd
100 lines (64 loc) · 10.1 KB
/
technical_definitions.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
---
title: "Technical Definitions"
bibliography: references.bib
---
```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set()
library(dplyr)
library(magrittr)
```
Prior to discussing the various package managers in R, it is important to understand some R terminology relevant to dependency management, namely the **concepts of and differences between packages, libraries, and repositories**.
# Packages
Packages are collections of functions, data, and documentation that are bundled together in a single unit. They are the primary way that R users can extend the functionality of the base R language. For example, the {dplyr} package is a popular package for data manipulation that extends the base R language with functions like `filter()` and `mutate()`. The code below demonstrates how one can filter a data set using *only* R's built-in capabilities and by using the popular data-cleaning package {dplyr}.
```{r, eval=FALSE}
#| code-fold: true
iris[iris$Species == "setosa" & iris$Sepal.Length > 4.5, ]
dplyr::filter(iris, Species == "setosa", Sepal.Length > 4.5)
```
Base R already has a large number of built-in functions and data structures that make it possible to work with data so what is the benefit in creating a whole new package like {dplyr} to accomplish the same tasks?
Subjectively, base R can be difficult to use for complex data manipulation tasks, the syntax is verbose, and it is overall difficult for beginners and intermediate users to gain mastery. Packages like {dplyr}, however, provide a more user-friendly interface for working with data, and they can make complex data manipulation tasks much easier to accomplish for users who are more concerned with accomplishing a task rather than learning a programming language in depth.
This example is not meant to highlight the benefits of {dplyr} specifically, but rather to illustrate the general utility of packages in R. Packages are, above all, a way to extend base R to make life easier for users, they provide alternatives to the base R packages, and they provide excellent extensions to the base R language for specific tasks.
## Pre-Installed Packages
Upon installing R, you will have access to a number of pre-installed packages. These packages are part of the "base" R distribution and are considered essential for working with R. A list of such packages can be identified by searching your installed packages for the "base" and "recommended" packages (also summarized as "high" priority packages). The code below generates this list of pre-installed packages in R, along with some information about each package.
```{r}
#| code-fold: true
installed.packages(priority = c("base", "recommended")) %>%
as_tibble() %>%
select(Package, Version, Priority, Depends, Imports) %>%
DT::datatable(options = list(paging=TRUE, scrollY = '300px', pageLength = 11))
```
You will likely recognize some of these packages, such as the {stats} package, which contains many of the commonly used statistical functions that are built into R like `glm()`. Other packages, like {utils} and {survival}, contain functions for other common activities like reading in files and for performing survival analysis, respectively.
Somewhat confusingly, the base R distribution also includes a package called {base} containing many functions and data structures that are considered essential for working with R. Some references to "base R" are therefore referring to the base package, while others refer to the base distribution of the R language. This distinction is not particularly important for the purposes of this tutorial, but it is worth keeping in mind when reading other R documentation.
## User-Installed Packages
In addition to the pre-installed packages, you can also install additional packages from online repositories like CRAN, Bioconductor, and GitHub. You have likely already done this in the past by using the `install.packages()` function to download from CRAN and/or Bioconductor or `devtools::install_github()` to get a package from GitHub. Beyond that, there are principally no differences between the pre-installed packages and user-installed packages; they are both just collections of functions, data, and documentation in a defined format.
Packages, of course, need to be stored somewhere on your computer, and this is where the concept of libraries comes in.
(Side note: you may hear people refer to R packages as libraries, likely stemming from the fact that packages are attached to your R session using the `library()` function. While this generally will not lead to much confusion, it is important to keep in mind the distinction between packages and libraries for the purposes of this tutorial.)
# Libraries {#libraries}
Libraries are the location(s) on your computer[^1] where R packages are stored. When you install a package, the contents are downloaded from a repository and stored in a library on your computer which is simply a directory/folder. Note that we said "a library," not "the library!" This is because it is possible to have multiple libraries on your computer.
[^1]: Strictly speaking, the libraries do not have to be on your computer, but they do need to be accessible to your R session. For example, you could store your libraries on a network drive or on a cloud storage service like Dropbox, but this is not at all recommended because it can lead to performance issues and other problems. A more likely scenario is that you use RStudio through Posit Cloud (i.e. RStudio on a web interface) in which case the available libraries are stored and managed either by Posit or by an IT group for your institution on a separate server.
R will look for packages in a set of default library locations, typically one in the user's home directory and one in the system directory. These paths vary based on operating system, but the `.libPaths()` function can be used to see the libraries that are currently detected by default on your computer.
``` r
# Example of running .libPaths() on a Mac. Results will vary across systems
.libPaths()
#> [1] "/Users/<USER>/Library/R/arm64/4.4/library"
#> [2] "/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library"
```
There are typically 1-2 default libraries on your system, and these **system libraries** are shared across all of your R sessions and projects. However, you can also create **project libraries** that are specific to a single project or session. This is useful if you want to ensure that a project uses a specific version of a package.
For example, different versions of the same package {pkg} might have important differences in how the function `pkg::foo()` works. By storing packages in separate libraries, R can be set up to use the version of `foo()` you actually want to use. These differences between versions happen more often than you'd expect, and should be logged in the package's `NEWS` file or in the package's documentation. Check out this example of "breaking changes" [from the {ggplot2} 3.5.0 package release](https://ggplot2.tidyverse.org/news/index.html#breaking-changes-3-5-0) to get a sense of how behavior of functions can change across versions.
Of course, managing multiple libraries would be incredibly cumbersome, and it requires one to actively be aware of which libraries hold the correct versions of packages. Doing this manually would likely be a worse solution than simply dealing with broken code, but this is where package managers come in; they can help you manage your libraries and package versions effectively, automatically, and without having to be an expert on navigating the file system of your computer. More specifically, package managers will create and manage project libraries in order to ensure that the correct version of a package is used in a given project.
# Repositories {#repositories}
```{r, echo=FALSE}
cran_avail_pkgs <- available.packages(
contriburl = contrib.url("https://cran.r-project.org/")
)
cran_package_count <- cran_avail_pkgs %>%
as_tibble() %>%
nrow()
```
::: center-larger
![Popular R repositories: Comprehensive R Archive Network (CRAN) and Bioconductor](assets/img/cran_and_bioconductor.png)
:::
A repository is a collection of packages that are available for download from a specific location. The most common and general-purpose repository for R packages is the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/), which hosted `r cran_package_count` freely available packages at the last time this page was generated on `r format(Sys.Date(), "%B %d %Y")`. Another popular repository, [Bioconductor](https://www.bioconductor.org/), also hosts several thousand packages, but these packages are focused on bioinformatics, genomics, and computational biology. Although not exactly a unified "repository," GitHub is another popular location for R packages, and many developers use it to host their R packages and to make the development process viewable by the public.
Repositories are important because they provide a centralized location for R users to find and download packages. When you use the `install.packages()` function in R, you are downloading a package from a specific repository. By default, R will download packages from CRAN, but you can specify a different repository if you want to download a package from a different location. This is useful if you want to download a package from [Bioconductor](https://www.bioconductor.org/), a single GitHub repository, or the growing [R-universe](https://r-universe.dev/search) repo for example.
Repositories are particularly relevant for package managers, as they provide a way for the package manager to know where to look for packages. When you use a package manager to install a package, the package manager will download the package from a specific repository and install it in a specific library on your computer. By specifying the repository and library, the package manager can ensure that the correct package is downloaded and installed in the correct location. Like with managing libraries, managing download repositories manually would be a tedious task at best, and package managers also largely automate this procedure.
[See also the Posit article on packages, libraries, and repositories.](https://solutions.posit.co/envs-pkgs/repos_and_libs/)