# Non-parametric regression
```{r setup}
#| include: false
source("_common.R")
```
We now turn our attention to the situation where the regression function $E(Y|X)$ is not necessarily linear. Furthermore, we will assume that its *form* is **unknown**. If we knew, for example, that the regression function was a linear combination of a sine and a cosine function, $E(Y|X=x) = a + b \sin(x) + c \cos(x)$, where $a$, $b$ and $c$ are unknown, then the problem would in fact be a linear regression problem. More generally, when the true regression function is known (or assumed) to belong to a family of functions that we can parametrize, the estimation can be done via standard least squares. Here, instead, we focus on the case where the regression function is **completely unknown**.

In this note and the next we will discuss two ways of estimating $E(Y|X)$:

(a) one using bases (e.g. a polynomial basis or a spline basis); and
(b) one using kernels (a.k.a. local regression).

To simplify the presentation (but also because of an intrinsic limitation of these methods, which will be discussed in more detail later in the course), we will initially only consider the case of a single explanatory variable (i.e. $X$ above is a scalar, not a vector).
## Polynomial regression
To illustrate these basis methods, we will consider the
`lidar` data, available in the package `SemiPar`. More
information is available from the
corresponding help page: `help(lidar, package='SemiPar')`.
We now load the data and plot it; the response variable is `logratio` and the explanatory variable is `range`:
```{r nonparam, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
data(lidar, package = "SemiPar")
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
```
In class we discussed the formal motivation for approximating the regression function with a polynomial. That argument, however, does not specify which degree of approximating polynomial to use. Here we first try a 4th degree polynomial, which reduces the problem to a linear regression one (see the lecture slides). We might be tempted to use a command like `lm(logratio ~ range + range^2 + range^3 + range^4)`. However, this call to `lm` will not do what we intend (I recommend that you check this and find out why). Instead, we would need to use something like `lm(logratio ~ range + I(range^2) + I(range^3) + I(range^4))`. To avoid typing a long formula, we can instead use the function `poly()` in `R` to generate the design matrix containing the desired powers of `range`, and plug that into the call to `lm()`.
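As a quick check of the point above, the following sketch shows what happens: inside an `R` model formula the `^` operator denotes crossing of terms, so `range^2` collapses to just `range` and the quadratic term is silently dropped, whereas `I()` protects the arithmetic:

```{r poly.formula.check}
# `range^2` inside a formula means "range crossed with itself", which
# simplifies to `range`, so the quadratic term is silently dropped:
coef(lm(logratio ~ range + range^2, data = lidar))
# wrapping the power in I() keeps the quadratic term:
coef(lm(logratio ~ range + I(range^2), data = lidar))
```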
The code below fits two such approximations (a 3rd and a 4th degree polynomial), plots the data, and overlays the estimated regression functions:
```{r poly4, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# Degree 4 polynomials
pm <- lm(logratio ~ poly(range, 4), data = lidar)
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(pm)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "blue")
pm3 <- lm(logratio ~ poly(range, 3), data = lidar)
lines(predict(pm3)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "hotpink")
legend("topright", legend = c("3rd degree", "4th degree"), lwd = 6, col = c("hotpink", "blue"))
```
Note that this fit is reasonable, although there is probably
room for improvement. Given the formal motivation discussed in class for using polynomials in the first place, it may seem natural to increase the degree of the approximating polynomial to improve the quality of the approximation. However, this is easily seen not to be a good idea. Below we compare the
4th degree approximation used above (in blue) with
a 10th degree one (in red):
```{r poly10, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# Degree 10 polynomials
pm2 <- lm(logratio ~ poly(range, 10), data = lidar)
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(pm)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "blue")
lines(predict(pm2)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "red")
```
Note that the 10th degree fit follows the data much more closely, but it starts to become "too adaptive", departing quite often from the main (larger-scale) trend we associate with the regression (conditional mean) function.

(Can you explain the discrepancy between what we observe above and the motivation we used in class, which suggests that higher-degree polynomials provide better approximations?)
## A more stable basis: splines
Part of the problem with global polynomial bases such as the ones used above is that, as the degree grows, they become increasingly wiggly within the range of the data and increase or decrease very quickly near the edges of the observations. A more stable
but also remarkably flexible basis is given by spline functions,
as discussed in class.
We first show how to build a naive linear spline basis with 5 knots (placed at the `(1:5)/6` quantiles, i.e. the `r round((1:5)/6, 2)` quantiles, of the observed values of the explanatory variable), and use it to estimate the regression function. Recall that a linear spline function with knot $w$ is given by $f_w(x) = \max(x - w, \, 0)$. Given a fixed set of pre-selected knots $w_1, w_2, \ldots, w_k$, we consider regression functions that are linear combinations of the corresponding $k$ linear spline functions.
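Concretely, the fitted regression function is then of the form
$$
\hat{f}(x) \, = \, \hat{\beta}_0 \, + \, \hat{\beta}_{k+1} \, x \, + \, \sum_{j=1}^{k} \hat{\beta}_j \, (x - w_j)_+ \, , \qquad (u)_+ = \max(u, 0) \, ,
$$
where a linear term and an intercept are also included (this matches the basis constructed in the code below, with the intercept added automatically by `lm()`).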
Note that for higher-order splines (e.g. the cubic splines discussed below), this kind of naive spline basis is numerically very unstable and usually works poorly in practice. I include it here simply as an illustration of the methodology, and to stress the point that this type of approach (estimating the regression function as a linear combination of an explicit basis) is in fact nothing more than a slightly more complex linear model.
First we find the 5 knots mentioned above that will be used to
construct the spline basis:
```{r knots, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# select the knots at 5 specific quantiles
(kn <- as.numeric(quantile(lidar$range, (1:5) / 6)))
```
Now we compute the matrix of "explanatory variables": the matrix whose columns contain each of the 5 basis functions $f_1, f_2, \ldots, f_5$ evaluated at the $n$ observed values $x_1, \ldots, x_n$ of the (single) explanatory variable, plus one column containing the explanatory variable itself (the linear term). In other words, the matrix $\mathbf{X}$ has in its $(i, j)$ cell the value $f_j(x_i)$, for $j = 1, \ldots, k$, and $i = 1, \ldots, n$. In the code below we use (abuse?) `R`'s *recycling* rules when operating on vectors and arrays (can you spot it?):
```{r linearbasis, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
# prepare the matrix of covariates / explanatory variables
x <- matrix(0, dim(lidar)[1], length(kn) + 1)
for (j in 1:length(kn)) {
  x[, j] <- pmax(lidar$range - kn[j], 0) # truncated linear basis function (x - w_j)_+
}
x[, length(kn) + 1] <- lidar$range # add the linear term as the last column
```
Now that we have the matrix of our "explanatory variables", we can simply use `lm` to estimate
the coefficients of the linear combination of the functions in the spline basis
that will provide our regression function estimator. We then plot the data and overlay the
fitted / estimated regression function:
```{r splines1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# Fit the regression model
ppm <- lm(lidar$logratio ~ x)
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(ppm)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "hotpink")
```
There are better (numerically more stable) bases for the same linear
space spanned by these spline functions. These bases have different
numerical properties and can become cumbersome to describe. Here we
use the function `bs` (in package `splines`) to build a B-spline basis.
For an accessible discussion, see for example [@wood2017generalized, Section 4.1].
Given the chosen knots and the degree of the splines (linear, quadratic, cubic, etc.), the set (linear space) of functions we use to construct our regression estimate does not depend on the specific basis chosen (in other words, these are different bases that span the same linear space of functions). As a consequence, the estimated regression function should be the same regardless of the basis we use (provided we do not run into serious numerical issues with our naive basis). To illustrate this fact, we will use a B-spline basis with the same 5 knots as above, and compare the estimated regression function with the one we obtained above using our **poor man's naive basis**. The plot below overlays both fits (the naive one with a thick pink line, as above, and the one using B-splines with a thinner blue line):
```{r bsplines1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
library(splines)
ppm2 <- lm(logratio ~ bs(range, degree = 1, knots = kn), data = lidar)
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(ppm)[order(range)] ~ sort(range), data = lidar, lwd = 8, col = "hotpink")
lines(predict(ppm2)[order(range)] ~ sort(range), data = lidar, lwd = 3, col = "darkblue")
```
As expected, both fits produce the same estimated regression function, although the estimated coefficients are naturally different (**but their lengths are the same**: is this a coincidence, or will it always happen?)
```{r bsplines.coef}
as.vector(coef(ppm))
as.vector(coef(ppm2))
```
Note that, because we are using a set of linear splines, our estimated regression functions will always be piecewise linear (i.e. linear between consecutive knots). To obtain smoother (e.g. differentiable, continuously differentiable, or even twice continuously differentiable) regression estimators, below we use higher-order splines.
## Higher order splines (quadratic, cubic, etc.)
In what follows (as above) we will use the function `bs` to evaluate the desired
spline basis on the observed values of the explanatory variable (in this case `range`).
We use the arguments `degree = 2` and `knots = kn` to indicate we want a
quadratic spline basis with knots located at the elements of the vector `kn`.
As before, we then simply use `lm`
to estimate the coefficients, and overlay the estimated regression function on the data:
```{r bsplines2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
ppmq <- lm(logratio ~ bs(range, degree = 2, knots = kn), data = lidar)
lines(predict(ppmq)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "steelblue")
```
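As a side note, the space of splines of degree $d$ with $k$ interior knots has dimension $k + d + 1$; since `lm()` supplies the intercept, `bs()` returns the remaining $k + d$ columns (here $5 + 2 = 7$), for a total of $k + d + 1 = 8$ estimated coefficients. A quick sketch to verify this:

```{r bs.dim}
# dimension of the quadratic B-spline basis with the 5 knots used above
dim(bs(lidar$range, degree = 2, knots = kn))
# number of estimated coefficients (basis columns plus the intercept)
length(coef(ppmq))
```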
A useful consequence of the fact that these regression estimators are just linear regression estimators (albeit using a richer / more flexible basis than the raw predictors) is that we can easily compute (pointwise) standard errors for the fitted regression curve, as follows. We first fit and plot a quadratic spline using the same 5 knots as before:
```{r bsplines.se1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, eval=FALSE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
ppmc <- lm(logratio ~ bs(range, degree = 2, knots = kn), data = lidar)
lines(predict(ppmc)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "gray30")
```
To compute the estimated standard error of the predicted regression curve on a grid
of values of the explanatory variable `range`, we first build a grid of 200 equally spaced points
within the observed scope of the variable `range`:
```{r bsplines.se2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
xx <- seq(min(lidar$range), max(lidar$range), length = 200)
```
The `predict` method for objects of class `lm` returns estimated standard
errors for each fitted value if we set the argument
`se.fit = TRUE`:
```{r bsplines.se3, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
ppmc <- lm(logratio ~ bs(range, degree = 2, knots = kn), data = lidar)
ps <- predict(ppmc, newdata = list(range = xx), se.fit = TRUE)
```
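Where do these standard errors come from? Since the fit is just a linear model, the variance of the fitted value at a point with basis row $\mathbf{x}_0$ is $\mathbf{x}_0^\top \widehat{\mathrm{Var}}(\hat{\beta}) \, \mathbf{x}_0$. The following sketch (using only standard `lm` accessors) reproduces the `se.fit` values returned by `predict()` at the observed data points:

```{r se.byhand}
# pointwise SEs "by hand": square roots of the diagonal of X %*% vcov(beta-hat) %*% t(X)
Xmat <- model.matrix(ppmc)
se.byhand <- sqrt(rowSums((Xmat %*% vcov(ppmc)) * Xmat))
max(abs(se.byhand - predict(ppmc, se.fit = TRUE)$se.fit)) # essentially zero
```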
We now compute upper and lower confidence bands (I used 2 standard errors) around the fitted regression line:
```{r bsplines.se4, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
up <- (ps$fit + 2 * ps$se.fit)
lo <- (ps$fit - 2 * ps$se.fit)
```
Finally, we display the *confidence bands* we just constructed (using **base R** graphics,
but also consider using `ggplot2`):
```{r bsplines.se5, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(ppmc)[order(range)] ~ sort(range), data = lidar, lwd = 4, col = "gray30")
myrgb <- col2rgb("red") / 256
myrgb <- rgb(red = myrgb[1], green = myrgb[2], blue = myrgb[3], alpha = .3) # semi-transparent red
polygon(c(xx, rev(xx)), c(up, rev(lo)), density = NA, col = myrgb)
lines(ps$fit ~ xx, lwd = 4, col = "blue")
```
**It is important to note that the above confidence bands were constructed assuming that the knots were fixed (not random), and similarly for the degree of the spline basis.**
Increasing the degree of the spline basis yields smoother fits (having higher-order continuous derivatives). For example, using cubic splines yields an even smoother fit:
```{r bsplines3, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
# cubic splines
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
ppmc <- lm(logratio ~ bs(range, degree = 3, knots = kn), data = lidar)
lines(predict(ppmc)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "tomato3")
```
Note that the estimated regression function seems to have started to "twitch"
and wiggle, particularly at the upper end of our observations.
## How many knots should we use?
So far we have used 5 knots, but we could have used any other number of knots. If we consider a quadratic
spline basis with 10 knots, the fit appears a bit better (at least aesthetically):
```{r bsplines.10knots, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
k <- 10
kn <- as.numeric(quantile(lidar$range, (1:k) / (k + 1)))
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
ppmc <- lm(logratio ~ bs(range, degree = 2, knots = kn), data = lidar)
lines(predict(ppmc)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "tomato3")
```
What about using even more knots? The following plot uses 50 knots:
```{r bsplines.50knots, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
k <- 50
kn <- as.numeric(quantile(lidar$range, (1:k) / (k + 1)))
ppmc <- lm(logratio ~ bs(range, degree = 2, knots = kn), data = lidar)
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(predict(ppmc)[order(range)] ~ sort(range), data = lidar, lwd = 6, col = "hotpink")
```
Clearly not a good idea!
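One way to quantify why: with $k = 50$ interior knots and degree $d = 2$, the fit estimates $k + d + 1 = 53$ coefficients from only `r nrow(lidar)` observations, which leaves it free to chase local noise:

```{r knots50.df}
# number of estimated coefficients in the 50-knot quadratic spline fit above
length(coef(ppmc))
```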
## Smoothing splines
If we were to follow the approach discussed so far, we would need to find an "optimal" way of selecting the number of knots and their positions, **plus** the order of the spline basis. Although one could consider using cross-validation for this, we note that it would require considerable computational effort (we would need to perform an exhaustive search on a 3-dimensional grid).
We saw in class that *natural cubic splines* provide an *optimal* space in which to look for a good regression estimator. For a formal but surprisingly simple proof of this optimality result, see again Section 4.1 of
> Wood, S. (2006). *Generalized additive models : an introduction with R*.
> Chapman & Hall/CRC,
> Boca Raton, FL. [Library link](http://resolve.library.ubc.ca/cgi-bin/catsearch?bid=8140311).
This result not only justifies using natural cubic splines,
but also eliminates many of the unknown "tuning parameters" (the degree
of the spline basis, the number of knots, and their locations).
In fact, we only need to select one tuning parameter, the penalty term, and this can be done using any cross-validation "flavour" (although in this setting leave-one-out CV is particularly appealing, as we discussed in class).
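For reference, the natural cubic smoothing spline minimizes, over all (twice continuously differentiable) functions $f$, the penalized least-squares criterion
$$
\sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \, + \, \lambda \int \bigl( f''(t) \bigr)^2 \, dt \, ,
$$
where the penalty parameter $\lambda \ge 0$ controls the trade-off between fidelity to the data and smoothness of the fit. (The argument `spar` of `smooth.spline`, used below, is a monotone transformation of $\lambda$.)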
The function `smooth.spline` in `R` computes a cubic smoothing spline
(natural cubic spline). Details on its arguments and different options
are available from its help page.
When applied to the `lidar` data with the smoothing parameter set to 0.2 (via the argument `spar = 0.2`), we obtain the following estimated regression function:
```{r smoothing1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
tmp <- smooth.spline(
x = lidar$range, y = lidar$logratio, spar = .2, cv = FALSE,
all.knots = TRUE
)
lines(tmp$y ~ tmp$x, lwd = 6, col = "magenta")
```
This fit is clearly too wiggly and unsatisfactory. To obtain a smoother fit we increase `spar` to 0.5:
```{r smoothing1.5, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
tmp <- smooth.spline(x = lidar$range, y = lidar$logratio, spar = .5, cv = FALSE, all.knots = TRUE)
lines(tmp$y ~ tmp$x, lwd = 6, col = "red")
```
The larger the smoothing parameter, the smoother the fit. Setting `spar = 0.8` yields:
```{r smoothing2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
tmp <- smooth.spline(x = lidar$range, y = lidar$logratio, spar = .8, cv = FALSE, all.knots = TRUE)
lines(tmp$y ~ tmp$x, lwd = 6, col = "blue")
```
It is easy to see that the larger the penalty coefficient, the closer the resulting natural cubic spline becomes to a linear function (why?). For example, if we use `smooth.spline(spar = 2)`:
```{r smoothing3, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
tmp <- smooth.spline(x = lidar$range, y = lidar$logratio, spar = 2, cv = FALSE, all.knots = TRUE)
lines(tmp$y ~ tmp$x, lwd = 6, col = "tomato3")
```
## Selecting an "optimal" penalty parameter
As discussed in class, an "optimal" natural cubic spline can be found using
cross-validation, and for these linear predictors, leave-one-out cross-validation
is particularly attractive (in terms of computational cost).
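To see why, note that for a linear smoother the leave-one-out CV score can be obtained from a single fit, without refitting $n$ times. For an unpenalized least-squares fit such as the spline regressions above, the identity uses the hat values $h_{ii}$:
$$
\mathrm{LOOCV} \, = \, \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2 \, ;
$$
for penalized smoothers such as `smooth.spline`, the analogous formula uses the diagonal of the smoother matrix. As a small illustration (a sketch using the last spline fit above):

```{r loocv.shortcut}
# leave-one-out CV from a single least-squares fit, via the hat values
loocv <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)
loocv(ppmc) # e.g. for the 50-knot quadratic spline fit from above
```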
The function `smooth.spline` in `R` will compute (and use) an optimal value for the penalty term using
leave-one-out cross-validation when we set the argument `cv = TRUE`:
```{r smoothing.cv1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
tmp.cv <- smooth.spline(x = lidar$range, y = lidar$logratio, cv = TRUE, all.knots = TRUE)
# tmp.cv$spar = 0.974
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(tmp.cv$y ~ tmp.cv$x, lwd = 6, col = "blue")
```
### Sanity check (always a good idea)
Note that the optimal value found for the regularization parameter (`spar`) is
also returned in the element
`$spar` of the object returned by `smooth.spline`. Just as a **sanity check**
we can now call `smooth.spline` with `cv = FALSE` and manually set
`spar` to this optimal value, and verify that we obtain the same fit:
```{r smoothing.cv2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE, tidy=TRUE}
plot(logratio ~ range, data = lidar, pch = 19, col = "gray", cex = 1.5)
lines(tmp.cv$y ~ tmp.cv$x, lwd = 8, col = "blue")
tmp <- smooth.spline(x = lidar$range, y = lidar$logratio, spar = tmp.cv$spar, cv = FALSE, all.knots = TRUE)
lines(tmp$y ~ tmp$x, lwd = 3, col = "red")
```
## The problem of outliers and other model departures
When the data may contain outliers and/or other atypical observations,
the estimation methods discussed above may be seriously affected, even
if there are only a few such aberrant data points in the training set
(possible outliers in the test / validation set are also a concern, but we do not have time to discuss them here). Robust estimation methods based on splines do exist; see for example [@TharmaratnamClaeskens2010] and the references therein. An `R` implementation of this method is available [here](https://github.com/msalibian/PenalizedS).