# LASSO
```{r setup}
#| include: false
source("_common.R")
```
A different approach to perform *some kind* of variable selection that may be
more stable than stepwise methods is to use an L1 regularization term
(instead of the L2 one used in ridge regression). Notwithstanding the
geometric "interpretation" of the effect of using an L1 penalty,
it can also be argued that the L1 norm is, in some cases, a convex relaxation
(envelope) of the "L0" norm (the number of non-zero elements). As a result,
estimators based on the LASSO (L1-regularized regression) will typically have some
of their entries exactly equal to zero.
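Written out, the LASSO estimator solves a penalized least-squares problem of the form (the exact scaling of the two terms varies between implementations)
$$
\hat{\boldsymbol{\beta}}_{\mathrm{LASSO}}(\lambda) \, = \, \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^n \left( y_i - \mathbf{x}_i^\top \boldsymbol{\beta} \right)^2 \, + \, \lambda \, \sum_{j=1}^p | \beta_j | \, ,
$$
where $\lambda \ge 0$ is the penalty (regularization) parameter.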
Just as was the case for Ridge Regression,
as the value of the penalty parameter increases, the solutions
to the L1-regularized problem change
from the usual least squares estimator (when the regularization parameter equals
zero) to a vector of all zeroes (when the penalty constant is sufficiently
large). One difference between using an L1 or an L2 penalty is that
for an L1-regularized problem there is usually a finite value of the penalty
that produces a solution of all zeroes, whereas for an L2 penalty
this is generally not true.
The sequence of solutions, as the penalty parameter varies,
is often used as a way to rank (or "sequence") the explanatory variables, listing them
in the order in which they "enter" the model (that is, the order in which their estimated
coefficients change from zero to a non-zero value). We can
also estimate the MSPE of each solution (on a finite
grid of values of the penalty parameter) to select one with
good prediction properties. If any of the
estimated regression coefficients in the selected solution are exactly zero, it
is commonly said that those explanatory variables are not included
in the chosen model.
There are two main implementations of the LASSO in `R`: one is
the `glmnet` function (in package `glmnet`), and the other
is the function `lars` (in package `lars`). Both, of course,
compute the same estimators, but they do so in different ways.
We first compute the path of LASSO solutions for the `credit` data
used in previous lectures:
```{r creditlasso, warning=FALSE, message=FALSE}
x <- read.table("data/Credit.csv", sep = ",", header = TRUE, row.names = 1)
# use non-factor variables
x <- x[, c(1:6, 11)]
y <- as.vector(x$Balance)
xm <- as.matrix(x[, -7])
library(glmnet)
# alpha = 1 - LASSO
lambdas <- exp(seq(-3, 10, length = 50))
a <- glmnet(
  x = xm, y = y, lambda = rev(lambdas),
  family = "gaussian", alpha = 1, intercept = TRUE
)
```
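Before plotting the solution path, we can quickly verify the earlier claim that a finite value of the penalty already produces a vector of all zeroes. The following sketch (not evaluated here) lets `glmnet` choose its own grid of penalty values; the largest value on that grid is essentially the smallest penalty at which all slope estimates vanish:
```{r creditlassozero, eval=FALSE}
# sketch: let glmnet pick its own lambda grid and look at its largest value
a0 <- glmnet(x = xm, y = y, family = "gaussian", alpha = 1, intercept = TRUE)
max(a0$lambda)
# at that penalty value only the intercept is non-zero
coef(a0, s = max(a0$lambda))
```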
The `plot` method can be used to show the path of solutions, just as
we did for ridge regression:
```{r creditlasso3, fig.width=5, fig.height=5}
plot(a, xvar = "lambda", label = TRUE, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
Using `lars::lars()` we obtain:
```{r creditlars1, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
library(lars)
b <- lars(x = xm, y = y, type = "lasso", intercept = TRUE)
plot(b, lwd = 4)
```
With `lars`, the `coef` method returns a matrix of regression estimators, one
row for each value of the penalty constant at which a new coefficient "enters" the
model:
```{r creditlars2}
# see the variables
coef(b)
b
```
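Reading the order in which variables enter the model off the plot (or off the printed `lars` output) can be tedious. The following sketch (not evaluated here) recovers this ordering directly from the `glmnet` fit `a` computed above:
```{r creditlassoentry, eval=FALSE}
# columns of coef(a) correspond to a$lambda, which here is sorted in decreasing
# order, so the first non-zero column marks the step at which a variable "enters"
betas <- as.matrix(coef(a))[-1, ] # drop the intercept row
entry.step <- apply(betas != 0, 1, function(v) if (any(v)) which(v)[1] else NA)
sort(entry.step) # smaller step = enters earlier (i.e. at a larger penalty)
```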
The presentation below exploits the fact that the LASSO regression estimators
are piecewise linear between values of the regularization parameter where
a variable enters or drops the model.
In order to select one LASSO estimator (among the infinitely many that
are possible) we can use K-fold CV to estimate the MSPE of a few of them
(for a grid of values of the penalty parameter, for example), and
choose the one with smallest estimated MSPE:
```{r creditlars3, fig.width=5, fig.height=5}
# select one solution
set.seed(123)
tmp.la <- cv.lars(
  x = xm, y = y, intercept = TRUE, type = "lasso", K = 5,
  index = seq(0, 1, length = 20)
)
```
Given their random nature, it is always a good idea to run K-fold CV experiments
more than once:
```{r creditlars4, fig.width=5, fig.height=5}
set.seed(23)
tmp.la <- cv.lars(
  x = xm, y = y, intercept = TRUE, type = "lasso", K = 5,
  index = seq(0, 1, length = 20)
)
```
We now repeat the same steps as above but using the implementation
in `glmnet`:
```{r creditcv, fig.width=5, fig.height=5}
# run 5-fold CV with glmnet()
set.seed(123)
tmp <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
We run the CV experiment again, with a different seed:
```{r creditcv2, fig.width=5, fig.height=5}
set.seed(23)
tmp <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2)
```
Zoom in the CV plot to check the 1-SE rule:
```{r creditcv4, fig.width=5, fig.height=5}
plot(tmp, lwd = 6, cex.axis = 1.5, cex.lab = 1.2, ylim = c(22000, 33000))
```
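Recall that the 1-SE rule selects the largest penalty whose estimated MSPE is within one standard error of the smallest one. The object returned by `cv.glmnet` stores this value, so we can inspect it directly (a quick sketch, not evaluated here):
```{r creditcv1se, eval=FALSE}
# penalty selected by the 1-SE rule, and the corresponding coefficients
tmp$lambda.1se
coef(tmp, s = tmp$lambda.1se)
```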
The returned object includes the "optimal" value of the
penalization parameter, which can be used to
find the corresponding estimates for the regression
coefficients, using the method `coef`:
```{r creditcv3}
# optimal lambda
tmp$lambda.min
# coefficients for the optimal lambda
coef(tmp, s = tmp$lambda.min)
```
We can also use `coef` to compute the coefficients at
any value of the penalty parameter. For example we
show below the coefficients corresponding
to penalty values of exp(4) and exp(4.5):
```{r creditcoeffs}
# coefficients for other values of lambda
coef(tmp, s = exp(4))
coef(tmp, s = exp(4.5)) # note no. of zeroes...
```
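To see how the number of non-zero coefficients changes along the whole path, we can also look at the `df` component of the `glmnet` fit `a` from the beginning of this section, which records the number of non-zero coefficients at each penalty value (a quick sketch, not evaluated here):
```{r creditlassodf, eval=FALSE}
# number of non-zero coefficients at each value of the penalty (largest first)
cbind(lambda = a$lambda, nonzero = a$df)
```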
## Compare MSPEs of Ridge & LASSO on the credit data
We now use 50 runs of 5-fold cross-validation to
estimate (and compare) the MSPEs of the different
estimators / predictors:
```{r mspecredit, warning=FALSE, message=FALSE, fig.width=5, fig.height=5, tidy=TRUE, cache=TRUE}
library(MASS)
n <- nrow(xm)
k <- 5
ii <- (1:n) %% k + 1 # fold labels 1, ..., k (one per observation)
set.seed(123)
N <- 50 # number of times the 5-fold CV experiment is repeated
mspe.la <- mspe.st <- mspe.ri <- mspe.f <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii) # randomly re-assign observations to folds
  pr.la <- pr.f <- pr.ri <- pr.st <- rep(0, n)
  for (j in 1:k) {
    # Ridge and LASSO fits (lambda chosen by CV within the training folds)
    tmp.ri <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0, family = "gaussian"
    )
    tmp.la <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 1, family = "gaussian"
    )
    # stepwise (AIC) and full linear models fit on the training folds
    null <- lm(Balance ~ 1, data = x[ii != j, ])
    full <- lm(Balance ~ ., data = x[ii != j, ])
    tmp.st <- stepAIC(null, scope = list(lower = null, upper = full), trace = 0)
    # predictions on the held-out fold
    pr.ri[ii == j] <- predict(tmp.ri, s = "lambda.min", newx = xm[ii == j, ])
    pr.la[ii == j] <- predict(tmp.la, s = "lambda.min", newx = xm[ii == j, ])
    pr.st[ii == j] <- predict(tmp.st, newdata = x[ii == j, ])
    pr.f[ii == j] <- predict(full, newdata = x[ii == j, ])
  }
  mspe.ri[i] <- mean((x$Balance - pr.ri)^2)
  mspe.la[i] <- mean((x$Balance - pr.la)^2)
  mspe.st[i] <- mean((x$Balance - pr.st)^2)
  mspe.f[i] <- mean((x$Balance - pr.f)^2)
}
boxplot(mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("LASSO", "Ridge", "Stepwise", "Full"),
  col = c("steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that in this example LASSO does not seem to provide better
predictions than Ridge Regression. However, it does yield an
ordering of the explanatory variables that can be interpreted as a
ranking of their "importance" for the linear regression model (see
above).
## Comparing LASSO with Ridge Regression on the air pollution data
Let us compare the Ridge Regression and LASSO fits to the
air pollution data. Of course, by *the Ridge Regression fit*
and *the LASSO fit* we mean the fit obtained with the
optimal value of the penalty constant, chosen by minimizing the
corresponding estimated MSPE (which is generally
obtained via K-fold cross-validation).
We first load the data and use `cv.glmnet()` with
`alpha = 0` to select an **approximately optimal**
Ridge Regression fit (what makes the calculation
below **only approximately** optimal?).
```{r comparing.airp, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
airp <- read.table("data/rutgers-lib-30861_CSV-1.csv", header = TRUE, sep = ",")
y <- as.vector(airp$MORT)
xm <- as.matrix(airp[, names(airp) != "MORT"])
lambdas <- exp(seq(-3, 10, length = 50))
# Ridge Regression
set.seed(23)
air.l2 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0,
  family = "gaussian", intercept = TRUE
)
plot(air.l2)
```
The plot above is included for illustration purposes only.
Similarly, we now compute an approximately optimal LASSO fit,
and look at the curve of estimated MSPEs:
```{r airp.lasso, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
# LASSO
set.seed(23)
air.l1 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
  family = "gaussian", intercept = TRUE
)
plot(air.l1)
```
It is interesting to compare the corresponding estimated regression coefficients,
so we put them side by side in two columns:
```{r airp.corr.groups}
cbind(
round(coef(air.l2, s = "lambda.min"), 3),
round(coef(air.l1, s = "lambda.min"), 3)
)
```
Note how several of the estimated coefficients are fairly similar across the two fits, but LASSO includes fewer variables.
A possible explanation for this is the particular correlation structure among the
explanatory variables. More specifically, when groups of
correlated covariates are present,
LASSO tends to choose only one of them, whereas Ridge Regression will tend
to keep all of them. For a formal statement see [@ZouHastie2005, Lemma 2].
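A tiny simulated sketch (not part of the air pollution analysis, and not evaluated here) may help to visualize this behaviour: with two nearly identical covariates, the LASSO will typically keep only one of them, while Ridge Regression spreads the coefficient over both. The penalty value `s = 0.5` below is an arbitrary choice for illustration.
```{r corrgroupssim, eval=FALSE}
set.seed(1)
n.sim <- 100
z <- rnorm(n.sim)
x.sim <- cbind(
  x1 = z + rnorm(n.sim, sd = 0.01), # x1 and x2 are near-duplicates of each other
  x2 = z + rnorm(n.sim, sd = 0.01),
  x3 = rnorm(n.sim)
)
y.sim <- 3 * z + rnorm(n.sim)
coef(glmnet(x.sim, y.sim, alpha = 1), s = 0.5) # LASSO: usually one of x1 / x2 is zero
coef(glmnet(x.sim, y.sim, alpha = 0), s = 0.5) # Ridge: the weight is split between them
```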
It is important to note here that the above observations regarding the Ridge Regression
and LASSO fits trained on the air pollution data should be based on a more
reliable (more stable, less variable) choice of penalty parameter. For example,
we may want to run the above 5-fold CV experiments several times and take the
average of the estimated optimal penalty parameters. To simplify the presentation
we do not pursue this here, but it may be a very good exercise for the reader;
a sketch of one way to do it follows.
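The sketch below (not evaluated here) simply repeats the 5-fold CV several times and averages the selected penalty values:
```{r stablelambda, eval=FALSE}
n.reps <- 20
la.min <- rep(NA, n.reps)
set.seed(123)
for (r in 1:n.reps) {
  tmp.cv <- cv.glmnet(
    x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 1,
    family = "gaussian", intercept = TRUE
  )
  la.min[r] <- tmp.cv$lambda.min
}
mean(la.min) # a more stable penalty value to use with coef() or predict()
```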
The following heatmap of the pairwise correlations among explanatory variables
reveals certain patterns that may be used to explain the difference
mentioned above. Note that in this visualization method variables were
grouped ("clustered") according to their pairwise correlations in order to
improve the interpretability of the plot. We will see later in this course
the particular clustering method used here (hierarchical clustering).
```{r airp.correlations, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
library(ggcorrplot)
ggcorrplot(cor(xm), hc.order = TRUE, outline.col = "white")
```
## Compare MSPE of Ridge and LASSO on air pollution data
Since our focus was on the properties of the resulting predictions, it may be
interesting to compare the estimated MSPE of the different models / predictors
we have considered so far: a full linear model, a model selected via stepwise + AIC,
ridge regression and LASSO. As usual, we use 50 runs of 5-fold CV, and obtain
the following boxplots:
```{r bigcompare, fig.width=5, fig.height=5, warning=FALSE, message=FALSE, echo=TRUE, cache=TRUE}
library(MASS)
n <- nrow(xm)
k <- 5
ii <- (1:n) %% k + 1
set.seed(123)
N <- 50
mspe.la <- mspe.st <- mspe.ri <- mspe.f <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii)
  pr.la <- pr.f <- pr.ri <- pr.st <- rep(0, n)
  for (j in 1:k) {
    tmp.ri <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0, family = "gaussian"
    )
    tmp.la <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 1, family = "gaussian"
    )
    null <- lm(MORT ~ 1, data = airp[ii != j, ])
    full <- lm(MORT ~ ., data = airp[ii != j, ])
    tmp.st <- stepAIC(null, scope = list(lower = null, upper = full), trace = FALSE)
    pr.ri[ii == j] <- predict(tmp.ri, s = "lambda.min", newx = xm[ii == j, ])
    pr.la[ii == j] <- predict(tmp.la, s = "lambda.min", newx = xm[ii == j, ])
    pr.st[ii == j] <- predict(tmp.st, newdata = airp[ii == j, ])
    pr.f[ii == j] <- predict(full, newdata = airp[ii == j, ])
  }
  mspe.ri[i] <- mean((airp$MORT - pr.ri)^2)
  mspe.la[i] <- mean((airp$MORT - pr.la)^2)
  mspe.st[i] <- mean((airp$MORT - pr.st)^2)
  mspe.f[i] <- mean((airp$MORT - pr.f)^2)
}
boxplot(mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("LASSO", "Ridge", "Stepwise", "Full"),
  col = c("steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that LASSO has a slight advantage, but it is rather minor, and
the three methods considered so far (LASSO, Ridge and stepwise) improve by similar margins
on the predictions obtained with the full linear regression model.
## Less desirable properties of LASSO
As important as the LASSO estimator has been, its properties may sometimes
not be fully satisfactory. In particular:
* The LASSO selects the right variables only under very restrictive conditions (in other words, it is generally not "variable selection"-consistent).
* The sampling distribution of the LASSO estimator is not the same as the one we would obtain with the standard least squares estimator if we knew which features to include in and which ones to exclude from the model (in other words, the LASSO does not have an "oracle" property).
* When groups of correlated explanatory variables are present, the LASSO tends to include only one variable (essentially at random) from the group, relegating the others to the end of the sequence.
For precise statements and theoretical results regarding the three points above, see [@ZouHastie2005; @Zou2006].
## Elastic net
Elastic Net estimators were introduced to find an
informative compromise between LASSO and Ridge Regression.
Note that `cv.glmnet` only considers fits with varying
values of one of the penalty constants (`lambda`), while the other
one (`alpha`) is kept fixed. To compare different
Elastic Net fits we run `cv.glmnet` with 4 values of
`alpha`: 0.05, 0.1, 0.5 and 0.75.
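For reference, for a given value of `alpha` the penalty that `glmnet` applies to the vector of regression coefficients is
$$
\lambda \left[ \frac{1 - \alpha}{2} \, \| \boldsymbol{\beta} \|_2^2 \, + \, \alpha \, \| \boldsymbol{\beta} \|_1 \right] ,
$$
so that `alpha = 1` corresponds to the LASSO and `alpha = 0` to Ridge Regression.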
```{r airp.en, fig.width=5, fig.height=5, warning=FALSE, message=FALSE}
# EN
set.seed(23)
air.en.75 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.75,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.05 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.05,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.1 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.1,
  family = "gaussian", intercept = TRUE
)
set.seed(23)
air.en.5 <- cv.glmnet(
  x = xm, y = y, lambda = lambdas, nfolds = 5, alpha = 0.5,
  family = "gaussian", intercept = TRUE
)
plot(air.en.05)
plot(air.en.5)
plot(air.en.75)
```
### Run EN on air pollution data, compare fits
We now compare the estimates of the regression coefficients
obtained with the different methods discussed so far to
alleviate potential problems caused by correlated covariates.
```{r airp.en.compare}
a <- cbind(
  round(coef(air.l2, s = "lambda.min"), 3),
  round(coef(air.l1, s = "lambda.min"), 3),
  round(coef(air.en.05, s = "lambda.min"), 3),
  round(coef(air.en.1, s = "lambda.min"), 3),
  round(coef(air.en.5, s = "lambda.min"), 3),
  round(coef(air.en.75, s = "lambda.min"), 3)
)
colnames(a) <- c("Ridge", "LASSO", "EN-05", "EN-10", "EN-50", "EN-75")
a
```
The same comment made above regarding the need for a
more stable choice of "optimal" fit (for each of these
methods) applies here. Again, we limit ourselves to a single
run of 5-fold CV purely to simplify the
presentation.
### Compare MSPEs of Full, LASSO, Ridge, EN and stepwise
```{r bigcompare2, fig.width=5, fig.height=5, warning=FALSE, message=FALSE, cache=TRUE, echo=TRUE}
ii <- (1:n) %% k + 1
set.seed(123)
N <- 50
mspe.en <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii)
  pr.en <- rep(0, n)
  for (j in 1:k) {
    tmp.en <- cv.glmnet(
      x = xm[ii != j, ], y = y[ii != j], lambda = lambdas,
      nfolds = 5, alpha = 0.75, family = "gaussian"
    )
    pr.en[ii == j] <- predict(tmp.en, s = "lambda.min", newx = xm[ii == j, ])
  }
  mspe.en[i] <- mean((airp$MORT - pr.en)^2)
}
boxplot(mspe.en, mspe.la, mspe.ri, mspe.st, mspe.f,
  names = c("EN", "LASSO", "Ridge", "Stepwise", "Full"),
  col = c("hotpink", "steelblue", "gray80", "tomato", "springgreen"),
  cex.axis = 1, cex.lab = 1, cex.main = 2
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
We see that in this example Elastic Net with `alpha = 0.75` (which is not far from
the LASSO) provides slightly better estimated MSPEs.