This repository has been archived by the owner on Aug 20, 2022. It is now read-only.
forked from hadley/adv-r
# Code generation
```{r setup, include = FALSE}
source("common.R")
library(pryr)
```
## Calls {#calls}
### Modifying a call
You can add, modify, and delete elements of a call with the standard replacement operators, `$<-` and `[[<-`: \index{calls!modifying}
```{r}
y <- quote(read.csv("important.csv", row.names = FALSE))
y$row.names <- TRUE
y$col.names <- FALSE
y
y[[2]] <- quote(paste0(filename, ".csv"))
y[[4]] <- NULL
y
y$sep <- ","
y
```
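As a small illustration of the same operators on a call that can actually be evaluated, here is a sketch using `round()`. The numbers are chosen only for illustration and have nothing to do with the `read.csv()` example above:

```{r}
cl <- quote(round(3.14159, digits = 2))
cl$digits <- 4             # modify a named argument
eval(cl)
cl[[1]] <- quote(signif)   # swap the function being called
cl
eval(cl)
```

Because the call is just data until you `eval()` it, you can make as many modifications as you like before anything is run.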
Calls also support the `[` method. But use it with care. Removing the first element is unlikely to create a useful call.
```{r}
x <- quote(read.csv("important.csv", row.names = FALSE))
x[-3] # remove the second argument
x[-1] # remove the function name - but it's still a call!
x
```
If you want a list of the unevaluated arguments (expressions), use explicit coercion:
```{r}
# A list of the unevaluated arguments
as.list(x[-1])
```
Generally speaking, because R's function calling semantics are so flexible, getting or setting arguments by position is dangerous. For example, even though the values at each position are different, the following three calls all have the same effect:
```{r}
m1 <- quote(read.delim("data.txt", sep = "|"))
m2 <- quote(read.delim(s = "|", "data.txt"))
m3 <- quote(read.delim(file = "data.txt", , "|"))
```
To work around this problem, pryr provides `standardise_call()`. It uses the base `match.call()` function to convert all positional arguments to named arguments: \indexc{standardise\_call()} \indexc{match.call()}
```{r}
standardise_call(m1)
standardise_call(m2)
standardise_call(m3)
```
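If you don't have pryr installed, the idea can be sketched in a few lines of base R. `standardise_sketch()` is a hypothetical helper written for this illustration, not pryr's actual source; it assumes the function named in the call can be found from the calling environment:

```{r}
# Hypothetical helper: a base R sketch of the idea, not pryr's implementation
standardise_sketch <- function(call, env = parent.frame()) {
  stopifnot(is.call(call))
  fun <- eval(call[[1]], env)   # look up the function being called
  if (is.primitive(fun)) {
    return(call)                # primitives have no formals to match against
  }
  match.call(fun, call)
}
s1 <- standardise_sketch(quote(read.delim("data.txt", sep = "|")))
s2 <- standardise_sketch(quote(read.delim(s = "|", "data.txt")))
s1
identical(s1, s2)
```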
### Creating a call from its components
To create a new call from its components, you can use `call()` or `as.call()`. The first argument to `call()` is a string which gives a function name. The other arguments are expressions that represent the arguments of the call. \indexc{call()} \indexc{as.call()}
```{r}
call(":", 1, 10)
call("mean", quote(1:10), na.rm = TRUE)
```
`as.call()` is a minor variant of `call()` that takes a single list as input. The first element is a name or call. The subsequent elements are the arguments.
```{r}
as.call(list(quote(mean), quote(1:10)))
as.call(list(quote(adder(10)), 20))
```
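`as.call()` is handy when the function and its arguments are already stored in separate objects. The following sketch uses illustrative names (`fn_name`, `fn_args`) to glue the pieces into one list, convert it to a call, and evaluate it:

```{r}
fn_name <- quote(mean)
fn_args <- list(quote(1:10), na.rm = TRUE)
cl <- as.call(c(list(fn_name), fn_args))  # names in the list become argument names
cl
eval(cl)
```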
### Exercises
1. The following two calls look the same, but are actually different:
```{r}
(a <- call("mean", 1:10))
(b <- call("mean", quote(1:10)))
identical(a, b)
```
What's the difference? Which one should you prefer?
1. Implement a pure R version of `do.call()`.
1. Concatenating a call and an expression with `c()` creates a list. Implement
`concat()` so that the following code works to combine a call and
an additional argument.
```{r, eval = FALSE}
concat(quote(f), a = 1, b = quote(mean(a)))
#> f(a = 1, b = mean(a))
```
1. Since `list()`s don't belong in expressions, we could create a more
convenient call constructor that automatically combines lists into the
arguments. Implement `make_call()` so that the following code works.
```{r, eval = FALSE}
make_call(quote(mean), list(quote(x), na.rm = TRUE))
#> mean(x, na.rm = TRUE)
make_call(quote(mean), quote(x), na.rm = TRUE)
#> mean(x, na.rm = TRUE)
```
1. How does `mode<-` work? How does it use `call()`?
1. Read the source for `pryr::standardise_call()`. How does it work?
Why is `is.primitive()` needed?
1. `standardise_call()` doesn't work so well for the following calls.
Why?
```{r}
standardise_call(quote(mean(1:10, na.rm = TRUE)))
standardise_call(quote(mean(n = T, 1:10)))
standardise_call(quote(mean(x = 1:10, , TRUE)))
```
1. Read the documentation for `pryr::modify_call()`. How do you think
it works? Read the source code.
1. Use `ast()` and experimentation to figure out the three arguments in an
`if()` call. Which components are required? What are the arguments to
the `for()` and `while()` calls?
## Capturing the current call {#capturing-call}
```{r, eval = FALSE, echo = FALSE}
std <- c("package:base", "package:utils", "package:stats")
names(find_uses(std, "sys.call"))
names(find_uses(std, "match.call"))
```
Many base R functions use the current call: the expression that caused the current function to be run. There are two ways to capture the current call: \index{calls!capturing current}
* `sys.call()` captures exactly what the user typed. \indexc{sys.call()}
* `match.call()` makes a call that only uses named arguments. It's like
  automatically calling `pryr::standardise_call()` on the result of
  `sys.call()`. \indexc{match.call()}
The following example illustrates the difference between the two:
```{r}
f <- function(abc = 1, def = 2, ghi = 3) {
  list(sys = sys.call(), match = match.call())
}
f(d = 2, 2)
```
Modelling functions often use `match.call()` to capture the call used to create the model. This makes it possible to `update()` a model, re-fitting the model after modifying some of the original arguments. Here's an example of `update()` in action: \indexc{update()}
```{r}
mod <- lm(mpg ~ wt, data = mtcars)
update(mod, formula = . ~ . + cyl)
```
How does `update()` work? We can rewrite it using some tools from pryr to focus on the essence of the algorithm.
```{r}
update_call <- function (object, formula., ...) {
  call <- object$call
  # Use update.formula to deal with formulas like . ~ .
  if (!missing(formula.)) {
    call$formula <- update.formula(formula(object), formula.)
  }
  modify_call(call, dots(...))
}
update_model <- function(object, formula., ...) {
  call <- update_call(object, formula., ...)
  eval(call, parent.frame())
}
update_model(mod, formula = . ~ . + cyl)
```
The original `update()` has an `evaluate` argument that controls whether the function returns the call or the result. But I think it's better, on principle, that a function returns only one type of object, rather than different types depending on the function's arguments.
This rewrite also allows us to fix a small bug in `update()`: it re-evaluates the call in the environment of its caller (`parent.frame()`, typically the global environment), when what we really want is to re-evaluate it in the environment where the model was originally fit, which is recorded in the formula.
```{r, error = TRUE}
f <- function() {
  n <- 3
  lm(mpg ~ poly(wt, n), data = mtcars)
}
mod <- f()
update(mod, data = mtcars)
update_model <- function(object, formula., ...) {
  call <- update_call(object, formula., ...)
  eval(call, environment(formula(object)))
}
update_model(mod, data = mtcars)
```
This is an important principle to remember: if you want to re-run code captured with `match.call()`, you also need to capture the environment in which it was evaluated, usually the `parent.frame()`. The downside to this is that capturing the environment also means capturing any large objects which happen to be in that environment, which prevents their memory from being released. This topic is explored in more detail in [garbage collection](#gc). \index{environments!capturing}
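To make the principle concrete outside of modelling, here is a minimal sketch. `scaled()` and `rescale()` are hypothetical functions invented for this illustration: the constructor stores both its call and its caller's environment, and the helper re-runs the modified call in that stored environment.

```{r}
# Hypothetical constructor: records the call and where it was made
scaled <- function(x, by = 1) {
  list(value = x * by, call = match.call(), env = parent.frame())
}
# Hypothetical helper: change arguments by name, then re-evaluate the
# call in the environment captured at construction time
rescale <- function(obj, ...) {
  call <- obj$call
  extras <- list(...)
  for (nm in names(extras)) call[[nm]] <- extras[[nm]]
  eval(call, obj$env)
}
make <- function() {
  n <- 1:3             # local variable, not visible from the global environment
  scaled(n, by = 2)
}
obj <- make()
obj$call
rescale(obj, by = 10)$value
```

If `rescale()` evaluated the call in its own caller's environment instead of `obj$env`, the lookup of `n` would fail (or, worse, silently find an unrelated `n`).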
Some base R functions use `match.call()` where it's not necessary. For example, `write.csv()` captures the call to `write.csv()` and mangles it to call `write.table()` instead:
```{r}
write.csv <- function(...) {
  Call <- match.call(expand.dots = TRUE)
  for (arg in c("append", "col.names", "sep", "dec", "qmethod")) {
    if (!is.null(Call[[arg]])) {
      warning(gettextf("attempt to set '%s' ignored", arg))
    }
  }
  rn <- eval.parent(Call$row.names)
  Call$append <- NULL
  Call$col.names <- if (is.logical(rn) && !rn) TRUE else NA
  Call$sep <- ","
  Call$dec <- "."
  Call$qmethod <- "double"
  Call[[1L]] <- as.name("write.table")
  eval.parent(Call)
}
```
To fix this, we could implement `write.csv()` using regular function call semantics:
```{r}
write.csv <- function(x, file = "", sep = ",", qmethod = "double",
                      ...) {
  write.table(x = x, file = file, sep = sep, qmethod = qmethod,
    ...)
}
```
This is much easier to understand: it's just calling `write.table()` with different defaults. This also fixes a subtle bug in the original `write.csv()`: `write.csv(mtcars, row = FALSE)` raises an error, but `write.csv(mtcars, row.names = FALSE)` does not. The lesson here is that it's always better to solve a problem with the simplest tool possible.
### Exercises
1. Compare and contrast `update_model()` with `update.default()`.
1. Why doesn't `write.csv(mtcars, "mtcars.csv", row = FALSE)` work?
What property of argument matching has the original author forgotten?
1. Rewrite `update.formula()` to use R code instead of C code.
1. Sometimes it's necessary to uncover the function that called the
function that called the current function (i.e., the grandparent, not
the parent). How can you use `sys.call()` or `match.call()` to find
this function?
## Walking the AST with recursive functions {#ast-funs}
It's easy to modify a single call with `substitute()` or `pryr::modify_call()`. For more complicated tasks we need to work directly with the AST. The `codetools` package, a recommended package that ships with R, provides some useful motivating examples of how we can do this: \index{recursion!over ASTs}
* `findGlobals()` locates all global variables used by a function. This
  can be useful if you want to check that your function doesn't inadvertently
  rely on variables defined in its parent environment.
* `checkUsage()` checks for a range of common problems including
  unused local variables, unused parameters, and the use of partial
  argument matching.
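To get a feel for what `findGlobals()` reports, here is a short sketch. `scale_by()` and `multiplier` are hypothetical names made up for this illustration; with `merge = FALSE` the result separates functions from variables:

```{r}
# `scale_by` is a hypothetical function written for this illustration
scale_by <- function(x) x * multiplier   # `multiplier` is not defined locally
globals <- codetools::findGlobals(scale_by, merge = FALSE)
globals$variables
globals$functions
```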
To write functions like `findGlobals()` and `checkUsage()`, we'll need a new tool. Because expressions have a tree structure, using a recursive function would be the natural choice. The key to doing that is getting the recursion right. This means making sure that you know what the base case is and figuring out how to combine the results from the recursive case. For calls, there are two base cases (atomic vectors and names) and two recursive cases (calls and pairlists). This means that a function for working with expressions will look like:
```{r, eval = FALSE}
recurse_call <- function(x) {
  if (is.atomic(x)) {
    # Return a value
  } else if (is.name(x)) {
    # Return a value
  } else if (is.call(x)) {
    # Call recurse_call recursively
  } else if (is.pairlist(x)) {
    # Call recurse_call recursively
  } else {
    # User supplied incorrect input
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
```
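Before tackling the real examples, it may help to see the template filled in for about the simplest possible task. `all_names_sketch()` is a hypothetical function written for this illustration: it collects every name that appears in an expression, including function names such as `+`. It makes no attempt to handle missing arguments (as in `x[1, ]`).

```{r}
all_names_sketch <- function(x) {
  if (is.atomic(x)) {
    character()                 # base case: constants contain no names
  } else if (is.name(x)) {
    as.character(x)             # base case: a name contributes itself
  } else if (is.call(x) || is.pairlist(x)) {
    # recursive cases: combine the results from every element
    unique(unlist(lapply(x, all_names_sketch)))
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
all_names_sketch(quote(y <- x * 2 + z))
```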
### Finding F and T
We'll start simple with a function that determines whether a function uses the logical abbreviations `T` and `F`. Using `T` and `F` is generally considered to be poor coding practice, and is something that `R CMD check` will warn about. Let's first compare the AST for `T` vs. `TRUE`:
```{r}
ast(TRUE)
ast(T)
```
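If pryr isn't available, base R can show the same distinction. This sketch simply asks for the type of each quoted object:

```{r}
typeof(quote(TRUE))   # a length-one logical vector
typeof(quote(T))      # a symbol, i.e. a name
is.name(quote(T))
```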
`TRUE` is parsed as a logical vector of length one, while `T` is parsed as a name. This tells us how to write our base cases for the recursive function: while an atomic vector will never be a logical abbreviation, a name might, so we'll need to test for both `T` and `F`. The recursive cases can be combined because they do the same thing: they recursively call `logical_abbr()` on each element of the object. \indexc{logical\_abbr()}
```{r}
logical_abbr <- function(x) {
  if (is.atomic(x)) {
    FALSE
  } else if (is.name(x)) {
    identical(x, quote(T)) || identical(x, quote(F))
  } else if (is.call(x) || is.pairlist(x)) {
    for (i in seq_along(x)) {
      if (logical_abbr(x[[i]])) return(TRUE)
    }
    FALSE
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
logical_abbr(quote(TRUE))
logical_abbr(quote(T))
logical_abbr(quote(mean(x, na.rm = T)))
logical_abbr(quote(function(x, na.rm = T) FALSE))
```
### Finding all variables created by assignment
`logical_abbr()` is very simple: it only returns a single `TRUE` or `FALSE`. The next task, listing all variables created by assignment, is a little more complicated. We'll start simply, and then make the function progressively more rigorous. \indexc{find\_assign()}
Again, we start by looking at the AST for assignment:
```{r}
ast(x <- 10)
```
Assignment is a call where the first element is the name `<-`, the second is the name being assigned to, and the third is the value to be assigned. This makes the base cases simple: constants and names don't create assignments, so they return `NULL`. The recursive cases aren't too hard either. We `lapply()` over pairlists and over calls to functions other than `<-`.
```{r}
find_assign <- function(x) {
  if (is.atomic(x) || is.name(x)) {
    NULL
  } else if (is.call(x)) {
    if (identical(x[[1]], quote(`<-`))) {
      x[[2]]
    } else {
      lapply(x, find_assign)
    }
  } else if (is.pairlist(x)) {
    lapply(x, find_assign)
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
find_assign(quote(a <- 1))
find_assign(quote({
  a <- 1
  b <- 2
}))
```
This function works for these simple cases, but the output is rather verbose and includes some extraneous `NULL`s. Instead of returning a list, let's keep it simple and use a character vector. We'll also test it with two slightly more complicated examples:
```{r}
find_assign2 <- function(x) {
  if (is.atomic(x) || is.name(x)) {
    character()
  } else if (is.call(x)) {
    if (identical(x[[1]], quote(`<-`))) {
      as.character(x[[2]])
    } else {
      unlist(lapply(x, find_assign2))
    }
  } else if (is.pairlist(x)) {
    unlist(lapply(x, find_assign2))
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
find_assign2(quote({
  a <- 1
  b <- 2
  a <- 3
}))
find_assign2(quote({
  system.time(x <- print(y <- 5))
}))
```
This is better, but we have two problems: dealing with repeated names and neglecting assignments inside other assignments. The fix for the first problem is easy. We need to wrap `unique()` around the recursive case to remove duplicate assignments. The fix for the second problem is a bit more tricky. We also need to recurse when the call is to `<-`. `find_assign3()` implements both strategies:
```{r}
find_assign3 <- function(x) {
  if (is.atomic(x) || is.name(x)) {
    character()
  } else if (is.call(x)) {
    if (identical(x[[1]], quote(`<-`))) {
      lhs <- as.character(x[[2]])
    } else {
      lhs <- character()
    }
    unique(c(lhs, unlist(lapply(x, find_assign3))))
  } else if (is.pairlist(x)) {
    unique(unlist(lapply(x, find_assign3)))
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
find_assign3(quote({
  a <- 1
  b <- 2
  a <- 3
}))
find_assign3(quote({
  system.time(x <- print(y <- 5))
}))
```
We also need to test subassignment:
```{r}
find_assign3(quote({
  l <- list()
  l$a <- 5
  names(l) <- "b"
}))
```
We only want assignment of the object itself, not assignment that modifies a property of the object. Drawing the tree for the quoted object will help us see what condition to test for. The second element of the call to `<-` should be a name, not another call.
```{r}
ast(l$a <- 5)
ast(names(l) <- "b")
```
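The same condition can be checked directly in base R, without drawing the tree. In this sketch (the variable names are illustrative), only the plain assignment has a name as the second element of the call:

```{r}
plain <- quote(l <- list())
sub1  <- quote(l$a <- 5)
sub2  <- quote(names(l) <- "b")
is.name(plain[[2]])   # the target is the name `l`
is.name(sub1[[2]])    # the target is the call l$a
is.name(sub2[[2]])    # the target is the call names(l)
sub2[[2]]
```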
Now we have a complete version:
```{r}
find_assign4 <- function(x) {
  if (is.atomic(x) || is.name(x)) {
    character()
  } else if (is.call(x)) {
    if (identical(x[[1]], quote(`<-`)) && is.name(x[[2]])) {
      lhs <- as.character(x[[2]])
    } else {
      lhs <- character()
    }
    unique(c(lhs, unlist(lapply(x, find_assign4))))
  } else if (is.pairlist(x)) {
    unique(unlist(lapply(x, find_assign4)))
  } else {
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
find_assign4(quote({
  l <- list()
  l$a <- 5
  names(l) <- "b"
}))
```
While the complete version of this function is quite complicated, it's important to remember that we built it up step by step from simple component parts.
### Modifying the call tree {#modifying-code}
The next step up in complexity is returning a modified call tree, like what you get with `bquote()`. `bquote()` is a slightly more flexible form of `quote()`: it allows you to optionally quote and unquote some parts of an expression (it's similar to the backtick operator in Lisp). Everything is quoted, _unless_ it's encapsulated in `.()`, in which case it's evaluated and the result is inserted: \indexc{bquote()}
```{r}
a <- 1
b <- 3
bquote(a + b)
bquote(a + .(b))
bquote(.(a) + .(b))
bquote(.(a + b))
```
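One place this is useful is when you want to bake a value computed now into an expression that will be evaluated later. In the sketch below, `threshold` and `cond` are illustrative names; the current value of `threshold` is inserted into the expression, while `x` stays quoted until `eval()` supplies it:

```{r}
threshold <- 5
cond <- bquote(x > .(threshold))
cond
eval(cond, list(x = c(3, 7)))
```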
This provides a fairly easy way to control what gets evaluated and when. How does `bquote()` work? Below, I've rewritten `bquote()` to use the same style as our other functions: it expects input to be quoted already, and makes the base and recursive cases more explicit:
```{r}
bquote2 <- function (x, where = parent.frame()) {
  if (is.atomic(x) || is.name(x)) {
    # Leave unchanged
    x
  } else if (is.call(x)) {
    if (identical(x[[1]], quote(.))) {
      # Call to .(), so evaluate
      eval(x[[2]], where)
    } else {
      # Otherwise apply recursively, turning result back into call
      as.call(lapply(x, bquote2, where = where))
    }
  } else if (is.pairlist(x)) {
    as.pairlist(lapply(x, bquote2, where = where))
  } else {
    # User supplied incorrect input
    stop("Don't know how to handle type ", typeof(x),
      call. = FALSE)
  }
}
x <- 1
y <- 2
bquote2(quote(x == .(x)))
bquote2(quote(function(x = .(x)) {
  x + .(y)
}))
```
The main difference between this and the previous recursive functions is that after we process each element of calls and pairlists, we need to coerce them back to their original types.
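The need for coercion is easy to see in isolation. In this small sketch (names are illustrative), `lapply()` over a call hands back a plain list, and `as.call()` turns it into a call again:

```{r}
cl <- quote(f(x, y))
pieces <- lapply(cl, identity)   # process each element, here without changing it
class(pieces)
rebuilt <- as.call(pieces)
rebuilt
identical(rebuilt, cl)
```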
Note that functions that modify the source tree are most useful for creating expressions that are used at run-time, rather than those that are saved back to the original source file. This is because all non-code information is lost:
```{r}
bquote2(quote(function(x = .(x)) {
  # This is a comment
  x +   # funky spacing
        .(y)
}))
```
These tools are somewhat similar to Lisp macros, as discussed in [Programmer's Niche: Macros in R](http://www.r-project.org/doc/Rnews/Rnews_2001-3.pdf#page=10) by Thomas Lumley. However, macros are run at compile-time, which doesn't have any meaning in R, and always return expressions. They're also somewhat like Lisp [fexprs](http://en.wikipedia.org/wiki/Fexpr). A fexpr is a function where the arguments are not evaluated by default. The terms macro and fexpr are useful to know when looking for useful techniques from other languages. \index{macros} \index{fexprs}
### Exercises
1. Why does `logical_abbr()` use a for loop instead of a functional
like `lapply()`?
1. `logical_abbr()` works when given quoted objects, but doesn't work when
given an existing function, as in the example below. Why not? How could
you modify `logical_abbr()` to work with functions? Think about what
components make up a function.
```{r, eval = FALSE}
f <- function(x = TRUE) {
  g(x + T)
}
logical_abbr(f)
```
1. Write a function called `ast_type()` that returns either "constant",
"name", "call", or "pairlist". Rewrite `logical_abbr()`, `find_assign()`,
and `bquote2()` to use this function with `switch()` instead of nested if
statements.
1. Write a function that extracts all calls to a function. Compare your
function to `pryr::fun_calls()`.
1. Write a wrapper around `bquote2()` that does non-standard evaluation
so that you don't need to explicitly `quote()` the input.
1. Compare `bquote2()` to `bquote()`. There is a subtle bug in `bquote()`:
it won't replace calls to functions with no arguments. Why?
```{r}
bquote(.(x)(), list(x = quote(f)))
bquote(.(x)(1), list(x = quote(f)))
```
1. Improve the base `recurse_call()` template to also work with lists of
functions and expressions (e.g., as from `parse(path_to_file)`).