# Loop Functions {#loop-functions}
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE, error = TRUE)
set.seed(1)
```
## Looping on the Command Line
Writing `for` and `while` loops is useful when programming but not
particularly easy when working interactively on the command
line. Multi-line expressions with curly braces are hard to type and edit at the prompt. R has some functions which implement looping in a compact form to make your life
easier.
- `lapply()`: Loop over a list and evaluate a function on each element
- `sapply()`: Same as `lapply` but try to simplify the result
- `apply()`: Apply a function over the margins of an array
- `tapply()`: Apply a function over subsets of a vector
- `mapply()`: Multivariate version of `lapply`
An auxiliary function `split` is also useful, particularly in conjunction with `lapply`.
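As a rough illustration of what "compact form" means here, the sketch below computes the same result twice, once with an explicit `for` loop and once with `sapply()`. The vector `vals` and the helper `square()` are made-up names for this example only.

```{r}
## Hypothetical data and helper, used only for this illustration
vals <- c(2, 4, 6)
square <- function(v) v^2

## Explicit loop: pre-allocate the result, then fill it in
out_loop <- numeric(length(vals))
for (i in seq_along(vals)) {
    out_loop[i] <- square(vals[i])
}

## Loop function: one line, easy to type at the prompt
out_apply <- sapply(vals, square)

identical(out_loop, out_apply)
```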
## `lapply()`
[Watch a video of this section](https://youtu.be/E1_NlFb0E4g)
The `lapply()` function does the following simple series of operations:
1. it loops over a list, iterating over each element in that list
2. it applies a *function* to each element of the list (a function that you specify)
3. and returns a list (the `l` is for "list").
This function takes three arguments: (1) a list `X`; (2) a function (or the name of a function) `FUN`; (3) other arguments via its `...` argument. If `X` is not a list, it will be coerced to a list using `as.list()`.
The body of the `lapply()` function can be seen here.
```{r}
lapply
```
Note that the actual looping is done internally in C code for efficiency reasons.
It's important to remember that `lapply()` always returns a list, regardless of the class of the input.
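A small check of that claim, assuming a made-up integer vector `int_vec` and a made-up data frame `df_demo` as inputs (neither appears elsewhere in this chapter):

```{r}
## Hypothetical inputs of two different classes
int_vec <- 1:3
df_demo <- data.frame(u = 1:2, v = c(10, 20))

## An atomic vector is coerced with as.list(), so each element is visited
res_vec <- lapply(int_vec, function(i) i * 2)
## A data frame is a list of columns, so each column is visited
res_df <- lapply(df_demo, sum)

class(res_vec)
class(res_df)
```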
Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output.
```{r}
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
```
Notice that here we are passing the `mean()` function as an argument to the `lapply()` function. Functions in R can be used this way and can be passed back and forth as arguments just like any other object. When you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are *calling* a function.
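To make the distinction concrete, here is a small sketch with a made-up list `nums`. It passes `mean` once as a function object and once by name; the second form works because `lapply()` resolves `FUN` with `match.fun()`.

```{r}
## Hypothetical list, used only for this illustration
nums <- list(p = 1:4, q = c(2, 4))

by_object <- lapply(nums, mean)    ## the function itself, no parentheses
by_name   <- lapply(nums, "mean")  ## the name, looked up by match.fun()

identical(by_object, by_name)
```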
Here is another example of using `lapply()`.
```{r}
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
```
You can use `lapply()` to evaluate a function multiple times, each time with a different argument. Below is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.
```{r}
x <- 1:4
lapply(x, runif)
```
When you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you are applying. In the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.
Functions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too. In the example above I used the default values for `min` and `max`. How would you be able to specify different values for that in the context of `lapply()`?
Here is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.
Here, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.
```{r}
x <- 1:4
lapply(x, runif, min = 0, max = 10)
```
So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.
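The same mechanism works for any function, not just `runif()`. As a deterministic sketch, assume a made-up list `readings` with one missing value; the `na.rm = TRUE` argument is handed to `mean()` on every call, which should match wrapping `mean()` in an explicit function.

```{r}
## Hypothetical list with a missing value in one element
readings <- list(a = c(1, 2, NA), b = c(4, 6))

## Without na.rm, the first mean is NA
lapply(readings, mean)

## na.rm = TRUE travels through `...` to every call of mean()
via_dots <- lapply(readings, mean, na.rm = TRUE)

## The same thing written with an explicit wrapper
via_wrapper <- lapply(readings, function(v) mean(v, na.rm = TRUE))

identical(via_dots, via_wrapper)
```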
The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club): they have no names. These functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.
Here I am creating a list that contains two matrices.
```{r}
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
x
```
Suppose I wanted to extract the first column of each matrix in the list. I could write
an anonymous function for extracting the first column of each matrix.
```{r}
lapply(x, function(elt) { elt[,1] })
```
Notice that I put the `function()` definition right in the call to `lapply()`. This is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.
For example, I could have done the following.
```{r}
f <- function(elt) {
    elt[, 1]
}
lapply(x, f)
```
Now the function is no longer anonymous; its name is `f`. Whether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you're going to need a lot in other parts of your code, you might want to define it separately. But if you're just going to use it for this call to `lapply()`, then it's probably simpler to use an anonymous function.
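Anonymous functions can take extra arguments too, supplied through `...` like any other function. The sketch below assumes a list `mats` built the same way as the matrices above, and a parameter `j` that I made up to select the column.

```{r}
## Hypothetical list of matrices, mirroring the example above
mats <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

## `j` is an extra argument of the anonymous function, supplied via `...`
second_cols <- lapply(mats, function(elt, j) elt[, j], j = 2)
second_cols
```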
## `sapply()`
The `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:
- If the result is a list where every element is length 1, then a vector is returned.
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
- If it can’t figure things out, a list is returned.
Here's the result of calling `lapply()`.
```{r}
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
```
Notice that `lapply()` returns a list (as usual), but that each element of the list has length 1.
Here's the result of calling `sapply()` on the same list.
```{r}
sapply(x, mean)
```
Because the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.
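The three simplification cases listed at the start of this section can be sketched with one made-up list, `pieces`, and three different functions. This is only an illustration of the rules as stated, not an exhaustive description of what `sapply()` does.

```{r}
## Hypothetical input: three small integer vectors
pieces <- list(a = 1:3, b = 4:6, c = 7:9)

## Case 1: every result has length 1, so we get a named vector
sapply(pieces, sum)

## Case 2: every result has the same length (2), so we get a matrix
## with one column per list element
sapply(pieces, range)

## Case 3: results have different lengths, so a list comes back
sapply(pieces, function(v) v[v > 2])
```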
## `split()`
[Watch a video of this section](https://youtu.be/TjwE5b0fOcs)
The `split()` function takes a vector or other object and splits it into groups determined by a factor or list of factors.
The arguments to `split()` are
```{r}
str(split)
```
where
- `x` is a vector (or list) or data frame
- `f` is a factor (or coerced to one) or a list of factors
- `drop` indicates whether empty factor levels should be dropped
The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts.
Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable.
```{r}
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
split(x, f)
```
A common idiom is `split` followed by an `lapply`.
```{r}
lapply(split(x, f), mean)
```
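Because the simulated data above are random, it can be hard to see what each step contributes. Here is a deterministic sketch of the same split-then-apply pattern, assuming made-up `heights` and a character vector `group` (which `split()` coerces to a factor).

```{r}
## Hypothetical measurements and group labels
heights <- c(150, 160, 170, 180, 165, 175)
group   <- c("a", "b", "a", "b", "c", "c")

## Step 1: split into one vector per group, named by level
parts <- split(heights, group)
names(parts)

## Step 2: apply a function to each piece and collate the results
sapply(parts, mean)
```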
## Splitting a Data Frame
```{r}
library(datasets)
head(airquality)
```
We can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month.
```{r}
s <- split(airquality, airquality$Month)
str(s)
```
Then we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame.
```{r}
lapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```
Using `sapply()` might be better here for a more readable output.
```{r}
sapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```
Unfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean.
```{r}
sapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")],
             na.rm = TRUE)
})
```
Occasionally, we may want to split an R object according to levels defined in more than one variable. We can do this by creating an interaction of the variables with the `interaction()` function.
```{r}
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)
f1
f2
## Create interaction of two factors
interaction(f1, f2)
```
With multiple factors and many levels, creating an interaction can result in many levels that are empty.
```{r}
str(split(x, list(f1, f2)))
```
Notice that there are 4 categories with no data. But we can drop empty levels when we call the `split()` function.
```{r}
str(split(x, list(f1, f2), drop = TRUE))
```
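To see how the group names are formed and which groups get dropped, here is a small deterministic sketch. The factors `g1` and `g2` and the vector `vals_demo` are made up for this illustration.

```{r}
## Hypothetical data: 4 values classified by two factors
vals_demo <- 1:4
g1 <- factor(c("x", "x", "y", "y"))
g2 <- factor(c("p", "p", "p", "q"))

## 2 x 2 = 4 possible combinations, but only three occur in the data
all_groups  <- split(vals_demo, list(g1, g2))
kept_groups <- split(vals_demo, list(g1, g2), drop = TRUE)

## Number of observations in every combination, including the empty one
lengths(all_groups)

## Only the combinations that actually occur
names(kept_groups)
```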
## `tapply()`
[Watch a video of this section](https://youtu.be/6YEPWjbk3GA)
`tapply()` is used to apply a function over subsets of a vector. It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the "t" in `tapply()` refers to "table", but that is unconfirmed.
```{r}
str(tapply)
```
The arguments to `tapply()` are as follows:
- `X` is a vector
- `INDEX` is a factor or a list of factors (or else they are coerced to factors)
- `FUN` is a function to be applied
- `...` contains other arguments to be passed to `FUN`
- `simplify` indicates whether the result should be simplified
Given a vector of numbers, one simple operation is to take group means.
```{r}
## Simulate some data
x <- c(rnorm(10), runif(10), rnorm(10, 1))
## Define some groups with a factor variable
f <- gl(3, 10)
f
tapply(x, f, mean)
```
We can also take the group means without simplifying the result, which will give us a list. For functions that return a single value, usually, this is not what we want, but it can be done.
```{r}
tapply(x, f, mean, simplify = FALSE)
```
We can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the range of each sub-group.
```{r}
tapply(x, f, range)
```
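The description of `tapply()` as a combination of `split()` and `sapply()` can be checked directly. This sketch assumes made-up values `amounts` and a grouping factor `grp`; the two approaches should agree on the numbers and the group labels, differing only in the container they return.

```{r}
## Hypothetical values and a grouping factor with two levels
amounts <- c(1, 2, 3, 10, 20, 30)
grp <- gl(2, 3)

via_tapply <- tapply(amounts, grp, mean)
via_split  <- sapply(split(amounts, grp), mean)

## Same numbers and same group labels; tapply() returns a 1-d array,
## sapply() a named vector
all.equal(as.vector(via_tapply), unname(via_split))
identical(names(via_tapply), names(via_split))
```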
## `apply()`
[Watch a video of this section](https://youtu.be/F54ixFPq_xQ)
The `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices. Using `apply()` is not really faster than writing a loop, but it works in one line and is highly compact.
```{r}
str(apply)
```
The arguments to `apply()` are
- `X` is an array
- `MARGIN` is an integer vector indicating which margins should be “retained”.
- `FUN` is a function to be applied
- `...` is for other arguments to be passed to `FUN`
Here I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.
```{r}
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean) ## Take the mean of each column
```
I can also compute the sum of each row.
```{r}
apply(x, 1, sum)   ## Take the sum of each row
```
Note that in both calls to `apply()`, the return value was a vector of numbers.
You've probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly *is* the second argument to `apply()`?
The `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain. So when taking the mean of each column, I specify
```{r,eval=FALSE}
apply(x, 2, mean)
```
because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run
```{r,eval=FALSE}
apply(x, 1, sum)
```
because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).
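One way to convince yourself of this is to use a non-square matrix, so that the two margins give results of different lengths. The matrix `m_demo` below is a made-up example.

```{r}
## Hypothetical 2 x 3 matrix
m_demo <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(m_demo, 1, sum)  ## keep dimension 1: one value per row
col_totals <- apply(m_demo, 2, sum)  ## keep dimension 2: one value per column

length(row_totals) == nrow(m_demo)
length(col_totals) == ncol(m_demo)
```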
## Col/Row Sums and Means
For the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.
- `rowSums` = `apply(x, 1, sum)`
- `rowMeans` = `apply(x, 1, mean)`
- `colSums` = `apply(x, 2, sum)`
- `colMeans` = `apply(x, 2, mean)`
The shortcut functions are heavily optimized and hence are _much_ faster, but you probably won’t notice unless you’re using a large matrix. Another nice aspect of these functions is that they are a bit more descriptive. It's arguably more clear to write `colMeans(x)` in your code than `apply(x, 2, mean)`.
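As a quick sanity check of the four equivalences, the sketch below compares each shortcut with its `apply()` counterpart on a small made-up matrix, `m_small`. It says nothing about speed; it only confirms that the values agree.

```{r}
## Hypothetical 2 x 3 matrix
m_small <- matrix(1:6, nrow = 2, ncol = 3)

all.equal(rowSums(m_small),  apply(m_small, 1, sum))
all.equal(rowMeans(m_small), apply(m_small, 1, mean))
all.equal(colSums(m_small),  apply(m_small, 2, sum))
all.equal(colMeans(m_small), apply(m_small, 2, mean))
```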
## Other Ways to Apply
You can do more than take sums and means with the `apply()` function. For example, you can compute quantiles of the rows of a matrix using the `quantile()` function.
```{r}
x <- matrix(rnorm(200), 20, 10)
## Get row quantiles
apply(x, 1, quantile, probs = c(0.25, 0.75))
```
Notice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.
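One detail worth knowing when the applied function returns more than one value: the results come back as the *columns* of a matrix, even when you iterate over rows. A small sketch with a made-up matrix `m_shape` and `range()`, which returns two values:

```{r}
## Hypothetical 3 x 2 matrix
m_shape <- matrix(1:6, nrow = 3, ncol = 2)

## range() returns 2 values per row, so the result has 2 rows
## and one column for each of the 3 rows of the input
r <- apply(m_shape, 1, range)
dim(r)

## Transpose to get one row of output per row of input
t(r)
```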
For a higher dimensional example, I can create an array of $2\times2$ matrices and then compute the average of the matrices in the array.
```{r}
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)
```
In the call to `apply()` here, I indicated via the `MARGIN` argument that I wanted to preserve the first and second dimensions and to collapse the third dimension by taking the mean.
There is a faster way to do this specific operation via the `rowMeans()` function.
```{r}
rowMeans(a, dims = 2) ## Faster
```
In this situation, I might argue that the use of `rowMeans()` is less readable, but it is substantially faster with large arrays.
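Since `dims = 2` is easy to misread, here is a check on a made-up array with known values, `arr_demo`, confirming that the two approaches agree. With `dims = 2`, `rowMeans()` treats the first two dimensions as the "rows" and averages over the rest.

```{r}
## Hypothetical 2 x 2 x 3 array with known values
arr_demo <- array(1:12, c(2, 2, 3))

slow <- apply(arr_demo, c(1, 2), mean)
fast <- rowMeans(arr_demo, dims = 2)

all.equal(slow, fast)
fast
```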
## `mapply()`
[Watch a video of this section](https://youtu.be/z8jC_h7S0VE)
The `mapply()` function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that `lapply()` and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what `mapply()` is for.
```{r}
str(mapply)
```
The arguments to `mapply()` are
- `FUN` is a function to apply
- `...` contains R objects to apply over
- `MoreArgs` is a list of other arguments to `FUN`.
- `SIMPLIFY` indicates whether the result should be simplified
The `mapply()` function has a different argument order from `lapply()` because the function to apply comes first rather than the object to iterate over. The R objects over which we apply the function are given in the `...` argument because we can apply over an arbitrary number of R objects.
For example, the following is tedious to type
`list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))`
With `mapply()`, instead we can do
```{r}
mapply(rep, 1:4, 4:1)
```
This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument.
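To confirm that the two forms agree, the sketch below builds the list by hand and compares it with the `mapply()` result. I use `1L`, `2L`, and so on in the hand-written version so that the element types match the integer sequence `1:4`, and I name the arguments `x` and `times` (the first two parameters of `rep()`) to make the pairing explicit.

```{r}
## Written out by hand, with integer literals to match 1:4
by_hand <- list(rep(1L, 4), rep(2L, 3), rep(3L, 2), rep(4L, 1))

## Naming the arguments shows which vector feeds which parameter
by_mapply <- mapply(rep, x = 1:4, times = 4:1)

identical(by_hand, by_mapply)
```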
Here's another example for simulating random Normal variables.
```{r}
noise <- function(n, mean, sd) {
    rnorm(n, mean, sd)
}
## Simulate 5 random numbers
noise(5, 1, 2)
## This only simulates 1 set of numbers, not 5
noise(1:5, 1:5, 2)
```
Here we can use `mapply()` to pass the sequence `1:5` separately to the `noise()` function so that we can get 5 sets of random numbers, each with a different length and mean.
```{r}
mapply(noise, 1:5, 1:5, 2)
```
The above call to `mapply()` is the same as
```{r}
list(noise(1, 1, 2), noise(2, 2, 2),
     noise(3, 3, 2), noise(4, 4, 2),
     noise(5, 5, 2))
```
## Vectorizing a Function
The `mapply()` function can be used to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions.
Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\sum_{i=1}^n(x_i-\mu)^2/\sigma^2$.
```{r}
sumsq <- function(mu, sigma, x) {
    sum(((x - mu) / sigma)^2)
}
```
This function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.
In many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`. However, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.
```{r}
x <- rnorm(100) ## Generate some data
sumsq(1:10, 1:10, x) ## This is not what we want
```
Note that the call to `sumsq()` only produced one value instead of 10 values.
However, we can do what we want to do by using `mapply()`.
```{r}
mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x))
```
There's even a function in R called `Vectorize()` that can automatically create a vectorized version of your function. So we could create a `vsumsq()` function that is fully vectorized as follows.
```{r}
vsumsq <- Vectorize(sumsq, c("mu", "sigma"))
vsumsq(1:10, 1:10, x)
```
Pretty cool, right?
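With random data it is hard to verify the numbers by eye, so here is a deterministic sketch. It repeats the function definition under the made-up name `sumsq_demo` so the chunk stands alone, uses three made-up observations `obs`, and compares both vectorized forms against calls written out one at a time.

```{r}
## Same body as sumsq() above, under a hypothetical name
sumsq_demo <- function(mu, sigma, x) {
    sum(((x - mu) / sigma)^2)
}
obs <- c(1, 2, 3)   ## hypothetical data

## One call per (mu, sigma) pair, written out by hand
expected <- c(sumsq_demo(1, 1, obs), sumsq_demo(2, 2, obs))

via_mapply    <- mapply(sumsq_demo, 1:2, 1:2, MoreArgs = list(x = obs))
via_vectorize <- Vectorize(sumsq_demo, c("mu", "sigma"))(1:2, 1:2, obs)

all.equal(expected, via_mapply)
all.equal(expected, via_vectorize)
```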
## Summary
* The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form
* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.
* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere
* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.