# LASSO
```{r setup}
#| include: false
source("_common.R")
```
A different approach to perform *some kind* of variable selection that may be
more stable than stepwise methods is to use an L1 regularization term
(instead of the L2 one used in ridge regression). Notwithstanding the
geometric "interpretation" of the effect of using an L1 penalty,
it can also be argued that the L1 norm is, in some cases, a convex relaxation
(envelope) of the "L0" norm (the number of non-zero elements). As a result,
estimators based on the LASSO (L1-regularized regression) will typically have some
of their entries exactly equal to zero.
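Written out, the LASSO estimator solves a penalized least-squares problem of the form (the exact scaling of the two terms varies between implementations)
$$
\hat{\boldsymbol{\beta}}_{\mathrm{LASSO}}(\lambda) \, = \, \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^n \left( y_i - \mathbf{x}_i^\top \boldsymbol{\beta} \right)^2 \, + \, \lambda \, \sum_{j=1}^p | \beta_j | \, ,
$$
where $\lambda \ge 0$ is the penalty (regularization) parameter.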
Just as was the case for Ridge Regression,
as the value of the penalty parameter increases, the solutions
to the L1-regularized problem change
from the usual least squares estimator (when the regularization parameter equals
zero) to a vector of all zeroes (when the penalty constant is sufficiently
large). One difference between using an L1 or an L2 penalty is that
for an L1-regularized problem there is usually a finite value of the penalty
that produces a solution of all zeroes, whereas for an L2 penalty
this is generally not true.
The sequence of solutions, as the penalty parameter varies,
is often used as a way to rank (or "sequence") the explanatory variables, listing them
in the order in which they "enter" the model (that is, the order in which their estimated
coefficients change from zero to a non-zero value). We can
also estimate the MSPE of each solution (on a finite
grid of values of the penalty parameter) to select one with
good prediction properties. If any of the
estimated regression coefficients in the selected solution are exactly zero, it
is commonly said that those explanatory variables are not included
in the chosen model.
There are two main implementations of the LASSO in `R`: one is
the `glmnet` function (in package `glmnet`), and the other
is the function `lars` (in package `lars`). Both, of course,
compute the same estimators, but they do so in different ways.
We first compute the path of LASSO solutions for the `credit` data
used in previous lectures:
```{r creditlasso, warning=FALSE, message=FALSE}
x <- read.table("data/Credit.csv", sep = ",", header = TRUE, row.names = 1)
# use non-factor variables
x <- x[, c(1:6, 11)]
y <- as.vector(x$Balance)
xm <- as.matrix(x[, -7])
library(glmnet)
# alpha = 1 - LASSO
lambdas <- exp(seq(-3, 10, length = 50))
a <- glmnet(
  x = xm, y = y, lambda = rev(lambdas),
  family = "gaussian", alpha = 1, intercept = TRUE
)
```
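Before plotting the solution path, we can quickly verify the earlier claim that a finite value of the penalty already produces a vector of all zeroes. The following sketch (not evaluated here) lets `glmnet` choose its own grid of penalty values; the largest value on that grid is essentially the smallest penalty at which all slope estimates vanish:
```{r creditlassozero, eval=FALSE}
# sketch: let glmnet pick its own lambda grid and look at its largest value
a0 <- glmnet(x = xm, y = y, family = "gaussian", alpha = 1, intercept = TRUE)
max(a0$lambda)
# at that penalty value only the intercept is non-zero
coef(a0, s = max(a0$lambda))
```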
The `plot` method can be used to show the path of solutions, just as
we did for ridge regression:
```{r creditlasso3, fig.width=5, fig.height=5}
plot(a, xvar = "lambda", label = TRUE, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
Using `lars::lars()` we obtain:
```{r creditlars1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
library(lars)
b <- lars(x = xm, y = y, type = "lasso", intercept = TRUE)
plot(b, lwd = 4)
```
With `lars`, the `coef` method returns a matrix of regression estimators, one
row for each value of the penalty constant at which a new coefficient "enters" the
model:
```{r creditlars2}
# see the variables
coef(b)
b
```
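Reading the order in which variables enter the model off the plot (or off the printed `lars` output) can be tedious. The following sketch (not evaluated here) recovers this ordering directly from the `glmnet` fit `a` computed above:
```{r creditlassoentry, eval=FALSE}
# columns of coef(a) correspond to a$lambda, which here is sorted in decreasing
# order, so the first non-zero column marks the step at which a variable "enters"
betas <- as.matrix(coef(a))[-1, ] # drop the intercept row
entry.step <- apply(betas != 0, 1, function(v) if (any(v)) which(v)[1] else NA)
sort(entry.step) # smaller step = enters earlier (i.e. at a larger penalty)
```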
The presentation below exploits the fact that the LASSO regression estimators
are piecewise linear between values of the regularization parameter where
a variable enters or drops the model.
In order to select one LASSO estimator (among the infinitely many that
are possible) we can use K-fold CV to estimate the MSPE of a few of them
(for a grid of values of the penalty parameter, for example), and
choose the one with smallest estimated MSPE:
```{r creditlars3, fig.width=5, fig.height=5}
# select one solution
set.seed(123)
tmp.la <- cv.lars(
  x = xm, y = y, intercept = TRUE, type = "lasso", K = 5,
  index = seq(0, 1, length = 20)
)
```
Given their random nature, it is always a good idea to run K-fold CV experiments
more than once:
```{r creditlars4, fig.width=5, fig.height=5}
set.seed(23)
tmp.la <- cv.lars(
  x = xm, y = y, intercept = TRUE, type = "lasso", K = 5,
  index = seq(0, 1, length = 20)
)
```
We now repeat the same steps as above but using the implementation
in `glmnet`:
```{r creditcv, fig.width=5, fig.height=5}
# run 5-fold CV with glmnet()
set.seed(123)
tmp <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
We run the CV experiment again, with a different seed:
```{r creditcv2, fig.width=5, fig.height=5}
set.seed(23)
tmp <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
Zoom in the CV plot to check the 1-SE rule:
```{r creditcv4, fig.width=5, fig.height=5}
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2, ylim = c(22000, 33000))
```
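Recall that the 1-SE rule selects the largest penalty whose estimated MSPE is within one standard error of the smallest one. The object returned by `cv.glmnet` stores this value, so we can inspect it directly (a quick sketch, not evaluated here):
```{r creditcv1se, eval=FALSE}
# penalty selected by the 1-SE rule, and the corresponding coefficients
tmp$lambda.1se
coef(tmp, s = tmp$lambda.1se)
```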
The returned object includes the "optimal" value of the
penalization parameter, which can be used to
find the corresponding estimates for the regression
coefficients, using the method `coef`:
```{r creditcv3}
# optimal lambda
tmp$lambda.min
# coefficients for the optimal lambda
coef(tmp, s = tmp$lambda.min)
```
We can also use `coef` to compute the coefficients at
any value of the penalty parameter. For example we
show below the coefficients corresponding
to penalty values of exp(4) and exp(4.5):
```{r creditcoeffs}
# coefficients for other values of lambda
coef(tmp, s = exp(4))
coef(tmp, s = exp(4.5)) # note no. of zeroes...
```
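To see how the number of non-zero coefficients changes along the whole path, we can also look at the `df` component of the `glmnet` fit `a` from the beginning of this section, which records the number of non-zero coefficients at each penalty value (a quick sketch, not evaluated here):
```{r creditlassodf, eval=FALSE}
# number of non-zero coefficients at each value of the penalty (largest first)
cbind(lambda = a$lambda, nonzero = a$df)
```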
## Compare MSPEs of Ridge & LASSO on the credit data
We now use 50 runs of 5-fold cross-validation to
estimate (and compare) the MSPEs of the different
estimators / predictors:
```{r mspecredit, warning=FALSE, message=FALSE, fig.width=5, fig.height=5, tidy=TRUE, cache=TRUE}
library(MASS)
n <- nrow(xm)
k <- 5
ii <- (1:n) %% k + 1 # fold labels 1, ..., k (one per observation)
set.seed(123)
N <- 50 # number of times the 5-fold CV experiment is repeated
mspe.la <- mspe.st <- mspe.ri <- mspe.f <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii) # randomly re-assign observations to folds
  pr.la <- pr.f <- pr.ri <- pr.st <- rep(0, n)
  for (j in 1:k) {
    # Ridge and LASSO fits (lambda chosen by CV within the training folds)
    tmp.ri <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0, family = "gaussian"
    )
    tmp.la <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 1, family = "gaussian"
    )
    # stepwise (AIC) and full linear models fit on the training folds
    null <- lm(Balance ~ 1, data = x[ii != j, ])
    full <- lm(Balance ~ ., data = x[ii != j, ])
    tmp.st <- stepAIC(null, scope = list(lower = null, upper = full), trace = 0)
    # predictions on the held-out fold
    pr.ri[ii == j] <- predict(tmp.ri, s = "lambda.min", newx = xm[ii == j, ])
    pr.la[ii == j] <- predict(tmp.la, s = "lambda.min", newx = xm[ii == j, ])
    pr.st[ii == j] <- predict(tmp.st, newdata = x[ii == j, ])
    pr.f[ii == j] <- predict(full, newdata = x[ii == j, ])
  }
  mspe.ri[i] <- mean((x$Balance - pr.ri)^2)
  mspe.la[i] <- mean((x$Balance - pr.la)^2)
  mspe.st[i] <- mean((x$Balance - pr.st)^2)
  mspe.f[i] <- mean((x$Balance - pr.f)^2)
}
boxplot(mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("LASSO", "Ridge", "Stepwise", "Full"),
  col = c("steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that in this example LASSO does not seem to provide better
predictions than Ridge Regression. However, it does yield an
ordering of the explanatory variables that can be interpreted as a
ranking of their "importance" for the linear regression model (see
above).
## Comparing LASSO with Ridge Regression on the air pollution data
Let us compare the Ridge Regression and LASSO fits to the
air pollution data. Of course, by *the Ridge Regression fit*
and *the LASSO fit* we mean the fit obtained with the
optimal value of the penalty constant, chosen by minimizing the
corresponding estimated MSPE (which is generally
obtained via K-fold cross-validation).
We first load the data and use `cv.glmnet()` with
`alpha = 0` to select an **approximately optimal**
Ridge Regression fit (what makes the calculation
below **only approximately** optimal?).
```{r comparing.airp, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
airp <- read.table("data/rutgers-lib-30861_CSV-1.csv", header = TRUE, sep = ",")
y <- as.vector(airp$MORT)
xm <- as.matrix(airp[, names(airp) != "MORT"])
lambdas <- exp(seq(-3, 10, length = 50))
# Ridge Regression
set.seed(23)
air.l2 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0,
  family = "gaussian", intercept = TRUE
)
plot(air.l2)
```
The plot above is included for illustration purposes only.
Similarly, we now compute an approximately optimal LASSO fit,
and look at the curve of estimated MSPEs:
```{r airp.lasso, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
# LASSO
set.seed(23)
air.l1 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(air.l1)
```
It is interesting to compare the corresponding estimated regression coefficients,
so we put them side by side in two columns:
```{r airp.corr.groups}
cbind(
round(coef(air.l2, s = "lambda.min"), 3),
round(coef(air.l1, s = "lambda.min"), 3)
)
```
Note how several of the estimated coefficients are fairly similar across the two fits, but LASSO includes fewer variables.
A possible explanation for this is the particular correlation structure among the
explanatory variables. More specifically, when groups of
correlated covariates are present,
LASSO tends to choose only one of them, whereas Ridge Regression will tend
to keep all of them. For a formal statement see [@ZouHastie2005, Lemma 2].
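A tiny simulated sketch (not part of the air pollution analysis, and not evaluated here) may help to visualize this behaviour: with two nearly identical covariates, the LASSO will typically keep only one of them, while Ridge Regression spreads the coefficient over both. The penalty value `s = 0.5` below is an arbitrary choice for illustration.
```{r corrgroupssim, eval=FALSE}
set.seed(1)
n.sim <- 100
z <- rnorm(n.sim)
x.sim <- cbind(
  x1 = z + rnorm(n.sim, sd = 0.01), # x1 and x2 are near-duplicates of each other
  x2 = z + rnorm(n.sim, sd = 0.01),
  x3 = rnorm(n.sim)
)
y.sim <- 3 * z + rnorm(n.sim)
coef(glmnet(x.sim, y.sim, alpha = 1), s = 0.5) # LASSO: usually one of x1 / x2 is zero
coef(glmnet(x.sim, y.sim, alpha = 0), s = 0.5) # Ridge: the weight is split between them
```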
It is important to note here that the above observations regarding the Ridge Regression
and LASSO fits trained on the air pollution data should be based on a more
reliable (more stable, less variable) choice of penalty parameter. For example,
we may want to run the above 5-fold CV experiments several times and take the
average of the estimated optimal penalty parameters. To simplify the presentation
we do not pursue this here, but it may be a very good exercise for the reader;
a sketch of one way to do it follows.
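The sketch below (not evaluated here) simply repeats the 5-fold CV several times and averages the selected penalty values:
```{r stablelambda, eval=FALSE}
n.reps <- 20
la.min <- rep(NA, n.reps)
set.seed(123)
for (r in 1:n.reps) {
  tmp.cv <- cv.glmnet(
    x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
    family = "gaussian", intercept = TRUE
  )
  la.min[r] <- tmp.cv$lambda.min
}
mean(la.min) # a more stable penalty value to use with coef() or predict()
```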
The following heatmap of the pairwise correlations among explanatory variables
reveals certain patterns that may be used to explain the difference
mentioned above. Note that in this visualization method variables were
grouped ("clustered") according to their pairwise correlations in order to
improve the interpretability of the plot. We will see later in this course
the particular clustering method used here (hierarchical clustering).
```{r airp.correlations, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
library(ggcorrplot)
ggcorrplot(cor(xm), hc.order = TRUE, outline.col = "white")
```
## Compare MSPE of Ridge and LASSO on air pollution data
Since our focus was on the properties of the resulting predictions, it may be
interesting to compare the estimated MSPE of the different models / predictors
we have considered so far: a full linear model, a model selected via stepwise + AIC,
ridge regression and LASSO. As usual, we use 50 runs of 5-fold CV, and obtain
the following boxplots:
```{r bigcompare, fig.width=5, fig.height=5, warning=FALSE, message=FALSE, echo=TRUE, cache=TRUE}
library(MASS)
n <- nrow(xm)
k <- 5
ii <- (1:n) %% k + 1
set.seed(123)
N <- 50
mspe.la <- mspe.st <- mspe.ri <- mspe.f <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii)
  pr.la <- pr.f <- pr.ri <- pr.st <- rep(0, n)
  for (j in 1:k) {
    tmp.ri <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0, family = "gaussian"
    )
    tmp.la <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 1, family = "gaussian"
    )
    null <- lm(MORT ~ 1, data = airp[ii != j, ])
    full <- lm(MORT ~ ., data = airp[ii != j, ])
    tmp.st <- stepAIC(null, scope = list(lower = null, upper = full), trace = FALSE)
    pr.ri[ii == j] <- predict(tmp.ri, s = "lambda.min", newx = xm[ii == j, ])
    pr.la[ii == j] <- predict(tmp.la, s = "lambda.min", newx = xm[ii == j, ])
    pr.st[ii == j] <- predict(tmp.st, newdata = airp[ii == j, ])
    pr.f[ii == j] <- predict(full, newdata = airp[ii == j, ])
  }
  mspe.ri[i] <- mean((airp$MORT - pr.ri)^2)
  mspe.la[i] <- mean((airp$MORT - pr.la)^2)
  mspe.st[i] <- mean((airp$MORT - pr.st)^2)
  mspe.f[i] <- mean((airp$MORT - pr.f)^2)
}
boxplot(mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("LASSO", "Ridge", "Stepwise", "Full"),
  col = c("steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that LASSO has a slight advantage, but it is rather minor, and
the three methods considered so far (LASSO, Ridge and stepwise) improve by similar margins
on the predictions obtained with the full linear regression model.
## Less desirable properties of LASSO
As important as the LASSO estimator has been, its properties may sometimes
not be fully satisfactory. In particular:
* The LASSO selects the right variables only under very restrictive conditions (in other words, it is generally not "variable selection"-consistent).
* The sampling distribution of the LASSO estimator is not the same as the one we would obtain with the standard least squares estimator if we knew which features to include in and which ones to exclude from the model (in other words, the LASSO does not have an "oracle" property).
* When groups of correlated explanatory variables are present, the LASSO tends to include only one variable (essentially at random) from the group, relegating the others to the end of the sequence.
For precise statements and theoretical results regarding the three points above, see [@ZouHastie2005; @Zou2006].
## Elastic net
Elastic Net estimators were introduced to find an
informative compromise between LASSO and Ridge Regression.
Note that `cv.glmnet` only considers fits with varying
values of one of the penalty constants (`lambda`), while the other
one (`alpha`) is kept fixed. To compare different
Elastic Net fits we run `cv.glmnet` with 4 values of
`alpha`: 0.05, 0.1, 0.5 and 0.75.
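For reference, for a given value of `alpha` the penalty that `glmnet` applies to the vector of regression coefficients is
$$
\lambda \left[ \frac{1 - \alpha}{2} \, \| \boldsymbol{\beta} \|_2^2 \, + \, \alpha \, \| \boldsymbol{\beta} \|_1 \right] ,
$$
so that `alpha = 1` corresponds to the LASSO and `alpha = 0` to Ridge Regression.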
```{r airp.en, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
# EN
set.seed(23)
air.en.75 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.75,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.05 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.05,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.1 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.1,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.5 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.5,
  family = "gaussian", intercept = TRUE
)
plot(air.en.05)
plot(air.en.5)
plot(air.en.75)
```
### Run EN on air pollution data, compare fits
We now compare the estimates of the regression coefficients
obtained with the different methods discussed so far to
alleviate potential problems caused by correlated covariates.
```{r airp.en.compare}
a <- cbind(
  round(coef(air.l2, s = "lambda.min"), 3),
  round(coef(air.l1, s = "lambda.min"), 3),
  round(coef(air.en.05, s = "lambda.min"), 3),
  round(coef(air.en.1, s = "lambda.min"), 3),
  round(coef(air.en.5, s = "lambda.min"), 3),
  round(coef(air.en.75, s = "lambda.min"), 3)
)
colnames(a) <- c("Ridge", "LASSO", "EN-05", "EN-10", "EN-50", "EN-75")
a
```
The same comment made above regarding the need for a
more stable choice of "optimal" fit (for each of these
methods) applies here. Again, we limit ourselves to a single
run of 5-fold CV purely to simplify the
presentation.
### Compare MSPEs of Full, LASSO, Ridge, EN and stepwise
```{r bigcompare2, fig.width=5, fig.height=5, warning=FALSE, message=FALSE, cache=TRUE, echo=TRUE}
ii <- (1:n) %% k + 1
set.seed(123)
N <- 50
mspe.en <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii)
  pr.en <- rep(0, n)
  for (j in 1:k) {
    tmp.en <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0.75, family = "gaussian"
    )
    pr.en[ii == j] <- predict(tmp.en, s = "lambda.min", newx = xm[ii == j, ])
  }
  mspe.en[i] <- mean((airp$MORT - pr.en)^2)
}
boxplot(mspe.en, mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("EN", "LASSO", "Ridge", "Stepwise", "Full"),
  col = c("hotpink", "steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that in this example Elastic Net with `alpha = 0.75` (which is not far from
the LASSO) provides slightly better estimated MSPEs.