-
Notifications
You must be signed in to change notification settings - Fork 54
/
Copy pathspooky-action.qmd
400 lines (301 loc) · 17.5 KB
/
spooky-action.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
# Spooky action {#sec-spooky-action}
```{r}
#| include = FALSE
source("common.R")
```
## What's the problem?
There are no limits to what an function or script can do.
After you call `draw_plot()` or `source("analyse-data.R")`, you *could* discover that all the variables in your global environment have been deleted, or that 1000 new files have been created on your desktop.
But these actions would be surprising, because generally you expect the impact of a function (or script) to be as limited as possible.
Collectively, we call such side-effects "spooky actions" because the connection between action (calling a function or sourcing a script) and result (deleting objects or upgrading packages) is surprising.
It's like flipping a light-switch and discovering that the shower starts running, or having a poltergeist that rearranges the contents of your kitchen cupboards when you're not looking.
Deleting variables and creating files on your desktop are obviously surprising even if you've only just started using R.
But there are other actions that are less obviously destructive, and only start to become surprising as your mental model of R matures.
These include actions like:
- **Attaching packages with `library()`**.
For example, `ggplot2::geom_map()` used to call `library(maps)` in order to make map data available to the function.
This seems harmless, but if you were using purrr, it would break `map()` `map()` would now refer to `maps::map()` rather than `purrr::map()`.
Because of functions in different packages can have the same name, attaching a package can change the behaviour of existing code.
- **Installing packages with `install.packages()`**.
If a script needs dplyr to work, and it's not installed, it seems polite to install it on behalf of the user.
But installing a new package can upgrade existing packages, which might break code in other projects.
Install a package is a potentially destructive operation which should be done with care.
- **Deleting objects in the global environment with `rm(list = ls())`**.
This might seem like a good way to reset the environment so that your script can run cleanly.
But if someone else `source()`s your script, it will delete objects that might be important to them.
(Of course, you'd hope that all of those objects could easily be recreated from another script, but that is not always the case).
Because R doesn't constrain the potential scope of functions and scripts, *you* have to.
By avoiding these actions, you will create code that is less surprising to other R users.
At first, this might seem like tedious busywork.
You might find that spooky action is convenient in the moment, and you might convince yourself that it's necessary or a good idea.
But as you share your code with more people and run more code that has been shared with you[^spooky-action-1], you'll find spooky action to get more and more surprising and frustrating.
[^spooky-action-1]: Spooky actions tend to be particularly frustrating to those who teach R, because they have to run scripts from many students, and those scripts can end up doing wacky things to their computers.
## What precisely is a spooky action?
We can make the notion of spooky action precise by thinking about trees.
Code should only affect the tree beneath where it lives, so any action that reaches up, or across, the tree is a **spooky action**.
There are two important types of trees to consider:
- **The tree formed by files and directories.** A script should only read from and write to directories beneath the directory where it lives.
This explains why you shouldn't install packages (because the package library usually lives elsewhere), and also explains why you shouldn't create files on the desktop.
This rule can be relaxed in two small ways.
Firstly, if the script lives in a project, it's ok to read from and write to anywhere in the project (i.e. a file in `R/` can read from `data-raw/` and write to `data/`).
Secondly, it's always ok to write to the session specific temporary directory, `tempdir()`.
This directory is automatically deleted when R closes, so does not have any lasting effects.
- **The tree of environments created by function calls.** A function should only create and modify variables in its own environment or environments that it creates (typically by calling other functions).
This explains why you shouldn't attach packages (because that changes the [search path](https://adv-r.hadley.nz/environments.html#search-path)), why you shouldn't delete variables with `rm(list = ls())`, or assign to variables that you didn't create with `<<-`.
## How can I remediate spooky actions?
If you have read the above cautions, and still want to proceed, there are three ways you can make the spooky action as safe as possible:
- Allow the user to control the scope.
- Make the action less spooky by giving it a name that clearly describes what it will do.
- Explicitly check with the user before proceeding with the action.
- Advertise what's happening, so while the action might still be spooky, at least it isn't surprising.
### Parameterise the action
The first technique is to allow the user to control where the action will occur.
For example, instead of `save_output_desktop()`, you would write `save_output(path)`, and require that the user provide the path.
### Advertise the action with a clear name
If you can't parameterise the action, make it clear what's going to happen from the outside.
It is fine for function or scripts to have actions outside of their usual trees as long as it is implicit in the name:
- It's ok for `<-` to modify the global environment, because that is its one job, and it's obvious from the name (once you've learned about `<-`, which happens very early).
Similarly, it's ok for `save_output_to_desktop()` to create files in on the desktop, or `copy_to_clipboard()` to copy text to the clipboard, because the action is clear from the name.
- It's fine for `install.packages()` to modify files outside of the current working directory because it's designed specifically to install packages.
Similarly, it's ok for `source("class-setup.R")` to install packages because the intent of a setup script is to get your computer into the same state as someone else's.
Here, it's the name of the function or script that is really important.
As soon as you
Note that it's the name that's important - it's fine for `install.packages()` to install packages, but it's not ok as soon as it's hidden behind even a very simple wrapper:
```{r}
current_time <- function() {
if (!requireNamespace("lubridate", quietly = TRUE)) {
install.packages("lubridate")
}
lubridate::now()
}
current_time()
```
### Ask for confirmation
If you can't parameterise the operation, and need to perform it from somewhere deep within the cope, make sure to confirm with the user before performing the action.
The code below shows how you might do so when installing a package:
```{r}
install_if_needed <- function(package) {
if (requireNamespace(package, quietly = TRUE)) {
return(invisible(TRUE))
}
if (!interactive()) {
stop(package, " is not installed", call. = FALSE)
}
title <- paste0(package, " is not installed. Do you wish to install now?")
if (menu(c("Yes", "No"), title = title) != 1) {
stop("Confirmation not received", call. = FALSE)
}
invisible(TRUE)
}
```
Note the use of `interactive()` here: if the user is not in an interactive setting (i.e. the code is being run with `Rscript`) and we can not get explicit confirmation, we shouldn't make any changes.
Also that all failures are errors: this ensures that the remainder of the function or script does not run if the user doesn't confirm.
Ideally this function would also clearly describe the consequences of your decision.
For example, it would be nice to know if it will download a significant amount of data (since you might want to wait until your at a fast connection if downloading a 1 Gb data package), or if it will upgrade existing packages (since that might break other code).
Writing code that checks with the user requires some care, and it's easy to get the details wrong.
That's why it's better to prefer one of the prior techniques.
### Advertise the side-effects
If you can't get explicit confirmation from the user, at the very minimum you should clearly advertise what is happening.
For example, when you call `install.packages()` it notifies you:
```{r}
#| eval = FALSE
install.packages("dplyr")
#> Installing package into ‘/Users/hadley/R’
#> (as ‘lib’ is unspecified)
#> Trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/dplyr_0.8.0.1.tgz'
#> Content type 'application/x-gzip' length 6587031 bytes (6.3 MB)
#> ==================================================
#> downloaded 6.3 MB
```
However, this message could do with some work:
- It says installing "package", without specifying which package (so if this is called inside another function it won't be informative).
- It doesn't notify me which dependencies it's also going to update.
- It notifies me of the url it's downloading from, which I don't care about, and it only notifies me about the size when it's finished downloading, by which time it too late to stop it.
I would instead write something like this:
```{r}
#| eval = FALSE
install.packages("dplyr")
#> Installing package dplyr to `/Users/hadley/R`
#> Also installing 3 dependencies: glue, rlang, R6
```
We'll come back to the issue of informing the user in ...
## Case studies
### `save()` and `load()`
`load()` has a spooky action because it modifies variables in the current environment:
```{r}
#| include = FALSE
if (!file.exists("spooky-action.rds")) {
x <- 10
y <- 100
save(x, y, file = "spooky-action.rds")
rm(x, y)
}
```
```{r}
x <- 1
load("spooky-action.rds")
x
```
You can make it less spooky by supplying `verbose = TRUE`.
Here we learn that it also loaded a `y` object:
```{r}
load("spooky-action.rds", verbose = TRUE)
y
```
(In an ideal world `verbose = TRUE` would be default)
But generally, I'd avoid `save()` and `load()` altogether, and instead use `saveRDS()` and `readRDS()`, which read and write individual R objects to individual R files and work with `<-`.
This eliminates all spooky action:
```{r}
saveRDS(x, "x.rds")
x <- readRDS("x.rds")
```
```{r}
unlink("x.rds")
```
(readr provides `readr::read_rds()` and `readr::write_rds()` if the inconsistent naming conventions bother you like they bother me.)
### usethis
The usethis package is designed to support the process of developing a package of R code.
It automates many of tedious setup steps by providing function like `use_r()` or `use_test()`.
Many usethis functions modify the `DESCRIPTION` and create other files.
usethis makes these actions as pedestrian as possible by:
- Making it clear that the entire package is designed for the purpose of creating and modifying files, and the purpose of each function is clearly encoded in its named.
- For any potential risky operation, e.g. overwriting an existing file, usethis explicitly asks for confirmation from the user.
To make it harder to "click" through prompts without reading them, usethis uses random prompts in a random ordering.
```{r}
#| eval = FALSE
usethis::ui_yeah("Do you want to proceed?")
#> Do you want to proceed?
#>
#> 1: Absolutely not
#> 2: Not now
#> 3: I agree
```
usethis also works in concert with git to make sure that change are captured in a way that can easily be undone.
- When you call it, every usethis function describes what it is doing as it it doing it:
```{r}
#| eval = FALSE
usethis::create_package("mypackage", open = FALSE)
#> ✔ Creating 'mypackage'
#> ✔ Setting active project to 'mypackage'
#> ✔ Creating 'R/'
#> ✔ Writing 'DESCRIPTION'
#> ✔ Writing 'NAMESPACE'
#> ✔ Writing 'mypackage.Rproj'
#> ✔ Adding '.Rproj.user' to '.gitignore'
#> ✔ Adding '^mypackage\\.Rproj$', '^\\.Rproj\\.user$' to '.Rbuildignore'
```
This is important, but it's not clear how impactful it is because many functions produce enough output that reading through it all seems onerous and so generally most people don't read it.
### `<<-`
If you haven't heard of `<<-`, the super-assignment operator, before, feel free to skip this section as it's an advanced technique that has relatively limited applications.
They're most important in the context of functionals, which you can read more about in [Advanced R](https://adv-r.hadley.nz/function-factories.html#stateful-funs).
`<<-` is safe if you use it to modify a variable in an environment that you control.
For example, the following code creates a function that counts the number of times it is called.
The use of `<<-` is safe because it only affects the environment created by `make_counter()`, not an external environment.
```{r}
make_counter <- function() {
i <- 0
function() {
i <<- i + 1
i
}
}
c1 <- make_counter()
c2 <- make_counter()
c1()
c1()
c2()
```
A more common use of `<<-` is to break one of the limitations of `map()`[^spooky-action-2] and use it like a for loop to iteratively modify input.
For example, imagine you want to compute a cumulative sum.
That's straightforward to write with a for loop:
[^spooky-action-2]: `purrr::map()` is basically interchangeable with `base::lapply()` so if you're more familiar with `lapply()`, you can mentally substitute it for `map()` in all the code here.
```{r}
x <- rpois(10, 10)
out <- numeric(length(x))
for (i in seq_along(x)) {
if (i == 1) {
out[[i]] <- x[[i]]
} else {
out[[i]] <- x[[i]] + out[[i - 1]]
}
}
rbind(x, out)
```
A simple transformation from to use `map()` doesn't work:
```{r}
library(purrr)
out <- numeric(length(x))
map_dbl(seq_along(x), function(i) {
if (i == 1) {
out[[i]] <- x[[i]]
} else {
out[[i]] <- x[[i]] + out[[i - 1]]
}
})
```
Because the modification of `out` is happening inside of a function, R creates a copy of `out` (this is called [copy-on-modify principle](https://adv-r.hadley.nz/names-values.html#copy-on-modify)).
Instead we need to use `<<-` to reach outside of the function to modify the outer `out`:
```{r}
map_dbl(seq_along(x), function(i) {
if (i == 1) {
out[[i]] <<- x[[i]]
} else {
out[[i]] <<- x[[i]] + out[[i - 1]]
}
})
```
This use of `<<-` is a spooky action because we're reaching up the tree of environments to modify an object created outside of the function.
In this case, however, there's no point in using `map()`: the point of those functions is to restrict what you can do compared to a for loop so that your code is easier to understand.
[R for data science](https://r4ds.had.co.nz/iteration.html#for-loop-variations) has other examples of for loops that *could* be rewritten with `map()`, but shouldn't be.
Note that we could still wrap this code into a function to eliminate the spooky action:
```{r}
cumsum2 <- function(x) {
out <- numeric(length(x))
map_dbl(seq_along(x), function(i) {
if (i == 1) {
out[[i]] <<- x[[i]]
} else {
out[[i]] <<- x[[i]] + out[[i - 1]]
}
})
out
}
```
This eliminates the spooky action because it's now modifying an object that the function "owns", but I still wouldn't recommend it, as the use of `map()` and `<<-` only increases complexity for no gain compared to the use of a for loop.
### `assign()` in a for loop
It's not uncommon for people to ask how to create multiple objects from for a loop.
For example, maybe you have a vector of file names, and want to read each file into an individual object.
With some effort you typically discover `assign()`:
```{r}
#| eval = FALSE
paths <- c("a.csv", "b.csv", "c.csv")
names(paths) <- c("a", "b", "c")
for (i in seq_along(paths)) {
assign(names(paths)[[i]], read.csv(paths[[i]]))
}
```
The main problem with this approach is that it does not facilitate composition.
For example, imagine that you now wanted to figure out how many rows are in each of the data frames.
Now you need to learn how to loop over a series of objects, where the name of the object is stored in a character vector.
With some work, you might discover `get()` and write:
```{r}
#| eval = FALSE
lengths <- numeric(length(names))
for (i in seq_along(paths)) {
lengths[[i]] <- nrow(get(names(paths)[[i]]))
}
```
This approach is not necessarily bad in and of itself[^spooky-action-3], but it tends to lead you down a high-friction path.
Instead, if you learn a little about lists and functional programming techniques (e.g. with [purrr](http://purrr.tidyverse.org/)), you'll be able to write code like this:
[^spooky-action-3]: However, if you take this approach in multiple places in your code, you'll need to make sure that you don't use the same name in multiple loops because `assign()` will silently overwrite an existing variable.
This might not happen commonly but because it's silent, it will create a bug that is hard to detect.
```{r}
#| eval = FALSE
library(purrr)
files <- map(paths, read.csv)
lengths <- map_int(files, nrow)
```
This obviously requires that you learn some new tools - but learning about `map()` and `map_int()` will pay off in many more situations than learning about `assign()` and `get()`.
And because you can reuse `map()` and friends in many places, you'll find that they get easier and easier to use over time.
It would certainly be possible to build tools in purrr to avoid having to learn about `assign()` and `get()` and to provide a polished interface for working with character vectors containing object names.
But such functions would need to reach up the tree of environments, so would violate the "spooky action" principle, and thus I believe are best avoided.