# Spooky action {#sec-spooky-action}

```{r}
#| include = FALSE
source("common.R")
```

## What's the problem?

There are no limits to what a function or script can do.
After you call `draw_plot()` or `source("analyse-data.R")`, you *could* discover that all the variables in your global environment have been deleted, or that 1000 new files have been created on your desktop.
But these actions would be surprising, because generally you expect the impact of a function (or script) to be as limited as possible.
Collectively, we call such side-effects "spooky actions" because the connection between action (calling a function or sourcing a script) and result (deleting objects or upgrading packages) is surprising.
It's like flipping a light-switch and discovering that the shower starts running, or having a poltergeist that rearranges the contents of your kitchen cupboards when you're not looking.

Deleting variables and creating files on your desktop are obviously surprising even if you've only just started using R.
But there are other actions that are less obviously destructive, and only start to become surprising as your mental model of R matures.
These include actions like:

-   **Attaching packages with `library()`**.
    For example, `ggplot2::geom_map()` used to call `library(maps)` in order to make map data available to the function.
    This seems harmless, but if you were using purrr, it would break `map()`: `map()` would now refer to `maps::map()` rather than `purrr::map()`.
    Because functions in different packages can have the same name, attaching a package can change the behaviour of existing code.

-   **Installing packages with `install.packages()`**.
    If a script needs dplyr to work, and it's not installed, it seems polite to install it on behalf of the user.
    But installing a new package can upgrade existing packages, which might break code in other projects.
    Installing a package is a potentially destructive operation that should be done with care.

-   **Deleting objects in the global environment with `rm(list = ls())`**.
    This might seem like a good way to reset the environment so that your script can run cleanly.
    But if someone else `source()`s your script, it will delete objects that might be important to them.
    (Of course, you'd hope that all of those objects could easily be recreated from another script, but that is not always the case).
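
As a minimal sketch of how the first of these actions can be avoided, a function can call what it needs with `::` instead of attaching a package.
The helper `file_kind()` is hypothetical, and the tools package (which ships with R) only stands in for whatever package your own function depends on.

```{r}
# Hypothetical helper: uses a function from the tools package (shipped with R)
# without attaching the package.
file_kind <- function(path) {
  if (!requireNamespace("tools", quietly = TRUE)) {
    stop("The tools package is required", call. = FALSE)
  }
  toupper(tools::file_ext(path))
}

before <- search()
file_kind("report.csv")
# Compare the search path before and after the call
identical(before, search())
```

Because nothing is attached, calling `file_kind()` can't change what a name like `map()` refers to in the caller's code.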

Because R doesn't constrain the potential scope of functions and scripts, *you* have to.
By avoiding these actions, you will create code that is less surprising to other R users.
At first, this might seem like tedious busywork.
You might find that spooky action is convenient in the moment, and you might convince yourself that it's necessary or a good idea.
But as you share your code with more people and run more code that has been shared with you[^spooky-action-1], you'll find spooky actions more and more surprising and frustrating.

[^spooky-action-1]: Spooky actions tend to be particularly frustrating to those who teach R, because they have to run scripts from many students, and those scripts can end up doing wacky things to their computers.

## What precisely is a spooky action?

We can make the notion of spooky action precise by thinking about trees.
Code should only affect the tree beneath where it lives, so any action that reaches up, or across, the tree is a **spooky action**.

There are two important types of trees to consider:

-   **The tree formed by files and directories.** A script should only read from and write to directories beneath the directory where it lives.
    This explains why you shouldn't install packages (because the package library usually lives elsewhere), and also explains why you shouldn't create files on the desktop.

    This rule can be relaxed in two small ways.
    Firstly, if the script lives in a project, it's ok to read from and write to anywhere in the project (e.g. a file in `R/` can read from `data-raw/` and write to `data/`).
    Secondly, it's always ok to write to the session-specific temporary directory, `tempdir()`.
    This directory is automatically deleted when R closes, so writing to it has no lasting effects.

-   **The tree of environments created by function calls.** A function should only create and modify variables in its own environment or environments that it creates (typically by calling other functions).
    This explains why you shouldn't attach packages (because that changes the [search path](https://adv-r.hadley.nz/environments.html#search-path)), why you shouldn't delete variables with `rm(list = ls())`, and why you shouldn't use `<<-` to assign to variables that you didn't create.
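
The temporary-directory exception for the file tree might look like the sketch below.
The file name pattern and the use of `mtcars` are illustrative assumptions; the point is only that `tempfile()` creates paths inside `tempdir()` by default.

```{r}
# Create a scratch path inside the session-specific temporary directory
scratch <- tempfile(pattern = "summary-", fileext = ".csv")
write.csv(head(mtcars), scratch, row.names = FALSE)

# Check that the file really lives beneath tempdir()
in_tempdir <- normalizePath(dirname(scratch)) == normalizePath(tempdir())
in_tempdir

# Optional tidy-up; R removes tempdir() itself when the session ends
unlink(scratch)
```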

## How can I remediate spooky actions?

If you have read the above cautions, and still want to proceed, there are four ways you can make the spooky action as safe as possible:

-   Allow the user to control the scope.

-   Make the action less spooky by giving it a name that clearly describes what it will do.

-   Explicitly check with the user before proceeding with the action.

-   Advertise what's happening, so while the action might still be spooky, at least it isn't surprising.

### Parameterise the action

The first technique is to allow the user to control where the action will occur.
For example, instead of `save_output_desktop()`, you would write `save_output(path)`, and require that the user provide the path.
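
A minimal sketch of such a function is shown below.
The source only names `save_output(path)`; the extra `x` argument and the choice of CSV output are assumptions made for illustration.

```{r}
# Hypothetical implementation: `path` has no default, so the caller
# always decides where the file is written.
save_output <- function(x, path) {
  write.csv(x, path, row.names = FALSE)
  invisible(path)
}

# The caller chooses the location; here, a file in the temporary directory
out_path <- save_output(head(mtcars), tempfile(fileext = ".csv"))
written <- file.exists(out_path)
written
unlink(out_path)
```

Because `path` has no default, forgetting it produces an error instead of a file in an unexpected place.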

### Advertise the action with a clear name

If you can't parameterise the action, make it clear what's going to happen from the outside.
It is fine for functions or scripts to have actions outside of their usual trees as long as the action is clear from the name:

-   It's ok for `<-` to modify the global environment, because that is its one job, and it's obvious from the name (once you've learned about `<-`, which happens very early).
    Similarly, it's ok for `save_output_to_desktop()` to create files on the desktop, or `copy_to_clipboard()` to copy text to the clipboard, because the action is clear from the name.

-   It's fine for `install.packages()` to modify files outside of the current working directory because it's designed specifically to install packages.
    Similarly, it's ok for `source("class-setup.R")` to install packages because the intent of a setup script is to get your computer into the same state as someone else's.

Note that it's the name of the function or script that's important: it's fine for `install.packages()` to install packages, but it's not ok as soon as it's hidden behind even a very simple wrapper:

```{r}
current_time <- function() {
  if (!requireNamespace("lubridate", quietly = TRUE)) {
    install.packages("lubridate")
  }
  lubridate::now()
}
current_time()
```

### Ask for confirmation

If you can't parameterise the operation, and need to perform it from somewhere deep within the code, make sure to confirm with the user before performing the action.
The code below shows how you might do so when installing a package:

```{r}
install_if_needed <- function(package) {
  # Nothing to do if the package is already available
  if (requireNamespace(package, quietly = TRUE)) {
    return(invisible(TRUE))
  }
  
  # Never make changes if we can't ask the user first
  if (!interactive()) {
    stop(package, " is not installed", call. = FALSE)
  }
  
  title <- paste0(package, " is not installed. Do you wish to install now?")
  if (menu(c("Yes", "No"), title = title) != 1) {
    stop("Confirmation not received", call. = FALSE)
  }
  
  # Only install after explicit confirmation
  install.packages(package)
  invisible(TRUE)
}
```

Note the use of `interactive()` here: if the user is not in an interactive setting (e.g. the code is being run with `Rscript`) and we cannot get explicit confirmation, we shouldn't make any changes.
Also note that all failures are errors: this ensures that the remainder of the function or script does not run if the user doesn't confirm.

Ideally this function would also clearly describe the consequences of your decision.
For example, it would be nice to know if it will download a significant amount of data (since you might want to wait until you're on a fast connection before downloading a 1 GB data package), or if it will upgrade existing packages (since that might break other code).

Writing code that checks with the user requires some care, and it's easy to get the details wrong.
That's why it's better to prefer one of the prior techniques.

### Advertise the side-effects

If you can't get explicit confirmation from the user, at the very minimum you should clearly advertise what is happening.
For example, when you call `install.packages()` it notifies you:

```{r}
#| eval = FALSE
install.packages("dplyr")
#> Installing package into ‘/Users/hadley/R’
#> (as ‘lib’ is unspecified)
#> Trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/dplyr_0.8.0.1.tgz'
#> Content type 'application/x-gzip' length 6587031 bytes (6.3 MB)
#> ==================================================
#> downloaded 6.3 MB
```

However, this message could do with some work:

-   It says installing "package", without specifying which package (so if this is called inside another function it won't be informative).

-   It doesn't notify me which dependencies it's also going to update.

-   It notifies me of the URL it's downloading from, which I don't care about, and it only notifies me about the size when it's finished downloading, by which time it's too late to stop it.

I would instead write something like this:

```{r}
#| eval = FALSE
install.packages("dplyr")
#> Installing package dplyr to `/Users/hadley/R`
#> Also installing 3 dependencies: glue, rlang, R6
```
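
In your own functions, one way to advertise a side-effect is to announce it with `message()` before performing it.
The sketch below is illustrative only: `write_report()` is a hypothetical helper, and the wording of the message is an assumption.

```{r}
# Hypothetical helper: says what it is about to do, then does it
write_report <- function(x, path) {
  message("Writing ", nrow(x), " rows to '", path, "'")
  write.csv(x, path, row.names = FALSE)
  invisible(path)
}

report_path <- write_report(head(mtcars), tempfile(fileext = ".csv"))
unlink(report_path)
```

Using `message()` rather than `print()` or `cat()` means that a caller who really doesn't want the notification can silence it with `suppressMessages()`.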

We'll come back to the issue of informing the user in ...

## Case studies

### `save()` and `load()`

`load()` has a spooky action because it modifies variables in the current environment:

```{r}
#| include = FALSE
if (!file.exists("spooky-action.rds")) {
  x <- 10
  y <- 100
  save(x, y, file = "spooky-action.rds")
  rm(x, y)
}
```

```{r}
x <- 1
load("spooky-action.rds")
x
```

You can make it less spooky by supplying `verbose = TRUE`.
Here we learn that it also loaded a `y` object:

```{r}
load("spooky-action.rds", verbose = TRUE)
y
```

(In an ideal world, `verbose = TRUE` would be the default.)

But generally, I'd avoid `save()` and `load()` altogether, and instead use `saveRDS()` and `readRDS()`, which write and read a single R object to and from a single file.
`readRDS()` returns the object, so you choose its name yourself with `<-`.
This eliminates all spooky action:

```{r}
saveRDS(x, "x.rds")
x <- readRDS("x.rds")
```

```{r}
unlink("x.rds")
```

(readr provides `readr::read_rds()` and `readr::write_rds()` if the inconsistent naming conventions bother you like they bother me.)
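
If you are handed a file that was created with `save()`, a sketch of one way to contain the action is to load it into an environment that you create, using the `envir` argument of `load()`.
The file and the object names below are made up for illustration.

```{r}
# Create an example file with save(), without touching the global environment
rdata_path <- tempfile(fileext = ".RData")
local({
  x <- 10
  y <- 100
  save(x, y, file = rdata_path)
})

x <- 1
loaded <- new.env()
load(rdata_path, envir = loaded)

# The loaded objects live in `loaded`; the existing `x` is left alone
ls(loaded)
loaded$x
x

unlink(rdata_path)
```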

### usethis

The usethis package is designed to support the process of developing a package of R code.
It automates many of the tedious setup steps by providing functions like `use_r()` or `use_test()`.
Many usethis functions modify the `DESCRIPTION` and create other files.
usethis makes these actions as pedestrian as possible by:

-   Making it clear that the entire package is designed for the purpose of creating and modifying files, and the purpose of each function is clearly encoded in its name.

-   For any potentially risky operation, e.g. overwriting an existing file, usethis explicitly asks for confirmation from the user.
    To make it harder to "click" through prompts without reading them, usethis uses random prompts in a random ordering.

    ```{r}
    #| eval = FALSE
    usethis::ui_yeah("Do you want to proceed?")
    #> Do you want to proceed?
    #> 
    #> 1: Absolutely not
    #> 2: Not now
    #> 3: I agree
    ```

    usethis also works in concert with git to make sure that changes are captured in a way that can easily be undone.

-   When you call it, every usethis function describes what it is doing as it is doing it:

    ```{r}
    #| eval = FALSE
    usethis::create_package("mypackage", open = FALSE)
    #> ✔ Creating 'mypackage'
    #> ✔ Setting active project to 'mypackage'
    #> ✔ Creating 'R/'
    #> ✔ Writing 'DESCRIPTION'
    #> ✔ Writing 'NAMESPACE'
    #> ✔ Writing 'mypackage.Rproj'
    #> ✔ Adding '.Rproj.user' to '.gitignore'
    #> ✔ Adding '^mypackage\\.Rproj$', '^\\.Rproj\\.user$' to '.Rbuildignore'
    ```

    This is important, but it's not clear how much it helps: many functions produce so much output that reading through it all seems onerous, so most people don't.
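
The randomised prompt can be sketched in a few lines of base R.
This is not the usethis implementation: `random_choices()` and `confirm()` are hypothetical helpers, and the wording of the options is an assumption.

```{r}
# Hypothetical helper: pick one "yes" wording and several "no" wordings,
# then shuffle them so the position of "yes" changes from call to call.
random_choices <- function(yes = c("Yes", "Definitely", "I agree"),
                           no = c("No", "Not now", "Absolutely not"),
                           n_no = 2) {
  choices <- c(sample(yes, 1), sample(no, n_no))
  is_yes <- c(TRUE, rep(FALSE, n_no))
  shuffled <- sample(length(choices))
  list(choices = choices[shuffled], yes = which(is_yes[shuffled]))
}

# Hypothetical wrapper: TRUE only if the user picks the "yes" option
confirm <- function(question) {
  if (!interactive()) {
    stop("Can't ask for confirmation in a non-interactive session", call. = FALSE)
  }
  opts <- random_choices()
  menu(opts$choices, title = question) == opts$yes
}
```

Because the position and wording of the affirmative option vary, the user has to read the choices before answering.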

### `<<-`

If you haven't heard of `<<-`, the super-assignment operator, before, feel free to skip this section as it's an advanced technique that has relatively limited applications.
It's most important in the context of function factories, which you can read more about in [Advanced R](https://adv-r.hadley.nz/function-factories.html#stateful-funs).

`<<-` is safe if you use it to modify a variable in an environment that you control.
For example, the following code creates a function that counts the number of times it is called.
The use of `<<-` is safe because it only affects the environment created by `make_counter()`, not an external environment.

```{r}
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}
c1 <- make_counter()
c2 <- make_counter()
c1()
c1()
c2()
```

A more common use of `<<-` is to break one of the limitations of `map()`[^spooky-action-2] and use it like a for loop to iteratively modify input.
For example, imagine you want to compute a cumulative sum.
That's straightforward to write with a for loop:

[^spooky-action-2]: `purrr::map()` is basically interchangeable with `base::lapply()` so if you're more familiar with `lapply()`, you can mentally substitute it for `map()` in all the code here.

```{r}
x <- rpois(10, 10)

out <- numeric(length(x))
for (i in seq_along(x)) {
  if (i == 1) {
    out[[i]] <- x[[i]]
  } else {
    out[[i]] <- x[[i]] + out[[i - 1]]
  }
}

rbind(x, out)
```

A simple transformation to use `map_dbl()` doesn't work:

```{r}
library(purrr)
out <- numeric(length(x))
map_dbl(seq_along(x), function(i) {
  if (i == 1) {
    out[[i]] <- x[[i]]
  } else {
    out[[i]] <- x[[i]] + out[[i - 1]]
  }
})
```

Because the modification of `out` is happening inside of a function, R creates a copy of `out` (this is called the [copy-on-modify principle](https://adv-r.hadley.nz/names-values.html#copy-on-modify)), so the outer `out` is never changed.
Instead we need to use `<<-` to reach outside of the function to modify the outer `out`:

```{r}
map_dbl(seq_along(x), function(i) {
  if (i == 1) {
    out[[i]] <<- x[[i]]
  } else {
    out[[i]] <<- x[[i]] + out[[i - 1]]
  }
})
```

This use of `<<-` is a spooky action because we're reaching up the tree of environments to modify an object created outside of the function.
In this case, however, there's no point in using `map()`: the point of `map()` and friends is to restrict what you can do compared to a for loop so that your code is easier to understand.
[R for data science](https://r4ds.had.co.nz/iteration.html#for-loop-variations) has other examples of for loops that *could* be rewritten with `map()`, but shouldn't be.

Note that we could still wrap this code into a function to eliminate the spooky action:

```{r}
cumsum2 <- function(x) {
  out <- numeric(length(x))
  map_dbl(seq_along(x), function(i) {
    if (i == 1) {
      out[[i]] <<- x[[i]]
    } else {
      out[[i]] <<- x[[i]] + out[[i - 1]]
    }
  })
  out
}
```

This eliminates the spooky action because it's now modifying an object that the function "owns", but I still wouldn't recommend it, as the use of `map()` and `<<-` only increases complexity for no gain compared to the use of a for loop.
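
For comparison, here is a sketch that needs neither `<<-` nor an explicit loop.
It swaps in a different technique, base R's `Reduce()` with `accumulate = TRUE`, which passes each running total to the next step; the input vector `vals` is made up for illustration.

```{r}
vals <- c(3, 1, 4, 1, 5)

# Each step receives the running total so far and the next element
running <- Reduce(`+`, vals, accumulate = TRUE)
running

# Agrees with the built-in cumulative sum
all.equal(running, cumsum(vals))
```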

### `assign()` in a for loop

It's not uncommon for people to ask how to create multiple objects from a for loop.
For example, maybe you have a vector of file names, and want to read each file into an individual object.
With some effort you typically discover `assign()`:

```{r}
#| eval = FALSE
paths <- c("a.csv", "b.csv", "c.csv")
names(paths) <- c("a", "b", "c")

for (i in seq_along(paths)) {
  assign(names(paths)[[i]], read.csv(paths[[i]]))
}
```

The main problem with this approach is that it does not facilitate composition.
For example, imagine that you now wanted to figure out how many rows are in each of the data frames.
Now you need to learn how to loop over a series of objects, where the name of the object is stored in a character vector.
With some work, you might discover `get()` and write:

```{r}
#| eval = FALSE
lengths <- numeric(length(paths))
for (i in seq_along(paths)) {
  lengths[[i]] <- nrow(get(names(paths)[[i]]))
}
```

This approach is not necessarily bad in and of itself[^spooky-action-3], but it tends to lead you down a high-friction path.
Instead, if you learn a little about lists and functional programming techniques (e.g. with [purrr](http://purrr.tidyverse.org/)), you'll be able to write code like this:

[^spooky-action-3]: However, if you take this approach in multiple places in your code, you'll need to make sure that you don't use the same name in multiple loops because `assign()` will silently overwrite an existing variable.
    This might not happen commonly but because it's silent, it will create a bug that is hard to detect.

```{r}
#| eval = FALSE
library(purrr)

files <- map(paths, read.csv)
lengths <- map_int(files, nrow)
```

This obviously requires that you learn some new tools - but learning about `map()` and `map_int()` will pay off in many more situations than learning about `assign()` and `get()`.
And because you can reuse `map()` and friends in many places, you'll find that they get easier and easier to use over time.
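
The sketch below shows the list-based approach end to end with files that actually exist.
It is an illustration under assumptions: the three CSV files are created in a temporary directory from `mtcars`, and it uses base R's `lapply()` and `vapply()` as stand-ins for `map()` and `map_int()` so that it runs without purrr.

```{r}
# Create three small CSV files in a temporary directory
demo_dir <- tempfile("csv-demo-")
dir.create(demo_dir)
paths <- file.path(demo_dir, c("a.csv", "b.csv", "c.csv"))
names(paths) <- c("a", "b", "c")
for (i in seq_along(paths)) {
  write.csv(head(mtcars, i + 1), paths[[i]], row.names = FALSE)
}

# One named list holds all the data frames; no assign() or get() needed
files <- lapply(paths, read.csv)
n_rows <- vapply(files, nrow, integer(1))
n_rows

unlink(demo_dir, recursive = TRUE)
```

The names of `paths` carry through to `files` and `n_rows`, so an individual data frame is still easy to reach with `files$a` or `files[["a"]]`.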

It would certainly be possible to build tools in purrr to avoid having to learn about `assign()` and `get()` and to provide a polished interface for working with character vectors containing object names.
But such functions would need to reach up the tree of environments, so would violate the "spooky action" principle, and thus I believe are best avoided.