# Regression trees
```{r setup}
#| include: false
source("_common.R")
```
Trees provide a non-parametric regression estimator that is
able to overcome a serious limitation of "classical" non-parametric
estimators (such as those based on splines or kernels)
when several (more than 2 or 3) explanatory variables are
available.
Below we first describe the problem afflicting classical
non-parametric methods (also known as the "curse of dimensionality")
and then show how to compute regression trees in `R` using the
`rpart` package (although other implementations exist).
Details were discussed in class.
## Curse of dimensionality
Suppose you have a random sample of *n = 100* observations,
uniformly distributed on the [0, 1] interval. How many do you
expect to find within 0.25 of the middle point of the
interval (i.e. how many will be between 0.25 and 0.75)?
A trivial calculation shows that the expected number of
observations falling between 0.25 and 0.75 is *n/2*,
in this case *50*. This is easily verified with a simple
numerical experiment:
```{r curse, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# X ~ U(0,1)
# how many points do you expect within 0.25 of 1/2?
set.seed(1234)
n <- 100
x <- runif(n)
(sum(abs(x - 1 / 2) < 0.25)) # how many fall in (0.25, 0.75)?
```
(wow! what are the chances?)
Consider now a sample of 100 observations, each with
5 variables (5-dimensional observations),
uniformly distributed in the 5-dimensional unit cube
(*[0,1]^5*). How many do you expect to see in the
*central hypercube* with sides [0.25, 0.75] x [0.25, 0.75] ...
x [0.25, 0.75] = [0.25, 0.75]^5? A simple experiment shows
that this number is probably rather small:
```{r curse.p5, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
p <- 5
x <- matrix(runif(n * p), n, p)
# how many points in the hypercube (0.25, 0.75)^p ?
tmp <- apply(x, 1, function(a) all(abs(a - 1 / 2) < 0.25))
(sum(tmp))
```
In fact, the expected number of observations
in that central hypercube is exactly *n / 2^5*,
which is approximately *3* when *n = 100*.
A relevant question for our local regression estimation
problem is: "how large should our sample be if we want
to still have about 50 observations in our central hypercube?".
Easy calculations show that this number is *50 / (1/2)^p*,
which, for *p = 5* is *1600*. Again, we can verify
this with a simple experiment:
```{r curse.2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# how many obs do we need to have 50 in the hypercube?
n <- 50 / (0.5^p)
x <- matrix(runif(n * p), n, p)
# how many points in the hypercube (0.25, 0.75)^p ?
tmp <- apply(x, 1, function(a) all(abs(a - 1 / 2) < 0.25))
(sum(tmp))
```
So we see that if the dimension of our problem increases
from *p = 1* to *p = 5*, the number of observations we
need to maintain an expectation of having about 50 points
in our central hypercube increases by a factor of 16 (not 5).
However, if we double the dimension of the problem (to *p = 10*), in order to expect
50 observations in the central [0.25, 0.75] hypercube
we need a sample of size *n = 51,200*. In other words, we
doubled the dimension, but need 32 times more data (!)
to *fill* the central hypercube with the same number of points.
Moreover, if we doubled the dimension again (to *p = 20*) we would need over
52 million observations to have (just!) 50 in the central hypercube!
Note that now we doubled the
dimension again but need 1024 times more data! The number of
observations needed to maintain a fixed expected number of points
in a fixed region of the space grows exponentially with the
dimension of the space.
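These required sample sizes are easy to compute directly; the short chunk below is only a sanity check of the arithmetic above (the vector of dimensions `ps` is our own choice, used for illustration):
```{r curse.3}
# sample size needed so that, on average, about 50 points fall in the
# central hypercube [0.25, 0.75]^p, for a few dimensions p
ps <- c(1, 5, 10, 20)
setNames(50 / (0.5^ps), paste0("p=", ps))
```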
Another way to think about this problem is to
ask: "given a sample size of *n = 1000*, say, how wide / large
should the central hypercube be to expect
about *50* observations in it?". The answer, expressed as the
half-width of the cube (the distance from its centre to each face),
is easily found to be *(1/2) (50/n)^(1/p)*, which for
*n = 1000* and *p = 5* equals 0.27, for
*p = 10* is 0.37, and for *p = 20* is
0.43, so the cube spans almost the full unit interval in each coordinate!
In this sense it is fair to say that in moderate to high dimensions
*local neighbourhoods* are either empty or not really *local*.
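These widths are also simple to check numerically; below is a small sketch of that computation (again, the vector of dimensions `ps` is our own choice):
```{r curse.4}
# half-width of the central hypercube expected to contain about 50
# of n = 1000 uniform observations, for a few dimensions p
n <- 1000
ps <- c(5, 10, 20)
setNames(round((1 / 2) * (50 / n)^(1 / ps), 2), paste0("p=", ps))
```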
## Regression trees as constrained non-parametric regression
Regression trees provide an alternative non-parametric regression
estimator that works well, even with many available features.
As discussed in class, the basic idea is to approximate the
regression function by a linear combination of "simple"
functions (i.e. functions $h(x) = I( x \in A )$ which equal
1 if the argument *x* belongs to the set *A*, and 0 otherwise).
Each function has its own support set *A*. Furthermore,
this linear combination is not estimated all at once, but
iteratively, and only considering a specific class of
sets *A* (which ones?). As a result, the regression tree
is not the *global* optimal approximation by simple functions,
but a good *suboptimal* one that can be computed very rapidly.
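To make the construction concrete, a fitted regression tree with $J$ terminal nodes (leaves) can be written as the piecewise-constant function
$$
\hat{f}(x) = \sum_{j=1}^{J} \hat{c}_j \, I(x \in A_j) \, ,
$$
where the sets $A_j$ partition the space of explanatory variables (each one a "box" obtained by successive axis-aligned splits), and $\hat{c}_j$ is the average of the responses of the training points falling in $A_j$ (the notation $J$, $A_j$ and $\hat{c}_j$ is used here only for illustration).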
Details were discussed in class; refer to your notes and
the corresponding slides.
There are several packages in `R` implementing trees;
in this course we will use `rpart`. To illustrate their
use we will consider the `Boston` data set, which contains
information on housing in the US city of Boston. The
corresponding help page contains more information.
**Here, to simplify the comparison** of the predictions obtained
by trees and other regression estimators, instead of using
K-fold CV, we start by randomly splitting the
available data into a training and a test set:
```{r tree, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
library(rpart)
data(Boston, package = "MASS")
# split data into a training and
# a test set
set.seed(123456)
n <- nrow(Boston)
ii <- sample(n, floor(n / 4))
dat.te <- Boston[ii, ]
dat.tr <- Boston[-ii, ]
```
We now build a regression tree using the function `rpart`, leaving most of
its arguments at their
default values. We specify the response and explanatory variables using
a `formula`, as usual, and set `method='anova'` to indicate that we
want to train a regression tree (as opposed to a classification tree, for example).
Finally, we use the corresponding `plot` method to display the tree
structure:
```{r tree3, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
set.seed(123)
bos.t <- rpart(medv ~ ., data = dat.tr, method = "anova")
plot(bos.t, uniform = FALSE, margin = 0.05)
text(bos.t, pretty = TRUE)
```
A few questions for you:
* Why did we set the seed of the pseudo-random number generator (`set.seed(123)`)
before calling `rpart`? Is there anything random about building these trees?
* What does the `uniform` argument for `plot.rpart` do? What does `text` do here?
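If you want to explore the second question empirically, one possibility (just a sketch, re-using the tree `bos.t` fitted above) is to re-draw the same tree changing only the `uniform` argument and compare the two displays:
```{r tree3b, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# re-draw the same fitted tree, changing only the `uniform` argument
plot(bos.t, uniform = TRUE, margin = 0.05)
text(bos.t, pretty = TRUE)
```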
## Compare predictions
We now compare the predictions obtained on the test set using the
above regression tree, the usual linear model with all
explanatory variables, another linear model constructed using stepwise
variable selection, and the "optimal" LASSO fit.
First, we estimate the MSPE of the regression tree using the test set:
```{r tree4, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# predictions on the test set
pr.t <- predict(bos.t, newdata = dat.te, type = "vector")
with(dat.te, mean((medv - pr.t)^2))
```
For a full linear model, the estimated MSPE using the test set is:
```{r tree5, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# full linear model
bos.lm <- lm(medv ~ ., data = dat.tr)
pr.lm <- predict(bos.lm, newdata = dat.te)
with(dat.te, mean((medv - pr.lm)^2))
```
The estimated MSPE of a linear model constructed via stepwise is:
```{r tree6, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(MASS)
null <- lm(medv ~ 1, data = dat.tr)
full <- lm(medv ~ ., data = dat.tr)
bos.aic <- stepAIC(null, scope = list(lower = null, upper = full), trace = FALSE)
pr.aic <- predict(bos.aic, newdata = dat.te)
with(dat.te, mean((medv - pr.aic)^2))
```
Finally, the estimated MSPE of the "optimal" LASSO fit is:
```{r tree7, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# LASSO, with the level of penalization chosen by cross-validation
library(glmnet)
x.tr <- as.matrix(dat.tr[, -14]) # remove the response (column 14, medv)
y.tr <- as.vector(dat.tr$medv)
set.seed(123)
bos.la <- cv.glmnet(x = x.tr, y = y.tr, alpha = 1)
x.te <- as.matrix(dat.te[, -14])
pr.la <- predict(bos.la, s = "lambda.1se", newx = x.te)
with(dat.te, mean((medv - pr.la)^2))
```
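It may be easier to look at the four estimates side by side. The following short chunk simply collects the numbers computed above into a named vector (a convenience step only, no new computations):
```{r tree8}
# gather the four test-set MSPE estimates computed above
mspe <- c(
  tree     = with(dat.te, mean((medv - pr.t)^2)),
  full.lm  = with(dat.te, mean((medv - pr.lm)^2)),
  stepwise = with(dat.te, mean((medv - pr.aic)^2)),
  lasso    = with(dat.te, mean((medv - pr.la)^2))
)
round(sort(mspe), 2)
```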
Note that the regression tree appears to have the best MSPE, although
we cannot really assess whether the observed differences are beyond
the uncertainty associated with our MSPE estimators. In other words,
would these differences persist if we used a different
training / test split? In fact, a very good exercise for you
would be to repeat the above comparison using **many**
different training/test splits, or even better: using all the data for
training and K-fold CV to estimate the different MSPEs.
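For reference, a possible skeleton for the first of these exercises (many random training/test splits) is sketched below. Only the tree and the full linear model are included, the number of repetitions `N` and all object names are our own arbitrary choices, and the chunk is not evaluated here:
```{r tree9, eval=FALSE}
# sketch: repeat the training/test comparison over N random splits
N <- 100 # number of repetitions (an arbitrary choice)
mspe.tree <- mspe.lm <- rep(NA, N)
for (i in 1:N) {
  jj <- sample(n, floor(n / 4)) # n = nrow(Boston), as above
  te <- Boston[jj, ]
  tr <- Boston[-jj, ]
  fit.tree <- rpart(medv ~ ., data = tr, method = "anova")
  fit.lm <- lm(medv ~ ., data = tr)
  mspe.tree[i] <- mean((te$medv - predict(fit.tree, newdata = te))^2)
  mspe.lm[i] <- mean((te$medv - predict(fit.lm, newdata = te))^2)
}
boxplot(cbind(tree = mspe.tree, lm = mspe.lm), ylab = "Estimated MSPE")
```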