# Regression trees
```{r setup}
#| include: false
source("_common.R")
```
Trees provide a non-parametric regression estimator that is
able to overcome a serious limitation of "classical" non-parametric
estimators (such as those based on splines or kernels)
when several (more than 2 or 3) explanatory variables are
available.
Below we first describe the problem afflicting classical
non-parametric methods (also known as the "curse of dimensionality")
and then show how to compute regression trees in `R` using the
`rpart` package (although other implementations exist).
Details were discussed in class.
## Curse of dimensionality
Suppose you have a random sample of *n = 100* observations,
uniformly distributed on the [0, 1] interval. How many do you
expect to find within 0.25 of the middle point of the
interval (i.e. how many will be between 0.25 and 0.75)?
A trivial calculation shows that the expected number of
observations falling between 0.25 and 0.75 is *n/2*,
in this case *50*. This is easily verified with a simple
numerical experiment:
```{r curse, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# X ~ U(0,1)
# how many points do you expect within 0.25 of 1/2?
set.seed(1234)
n <- 100
x <- runif(n)
(sum(abs(x - 1 / 2) < 0.25)) # how many fall in (0.25, 0.75)?
```
(wow! what are the chances?)
Consider now a sample of 100 observations, each with
5 variables (5-dimensional observations),
uniformly distributed in the 5-dimensional unit cube
(*[0,1]^5*). How many do you expect to see in the
*central hypercube* with sides [0.25, 0.75] x [0.25, 0.75] ...
x [0.25, 0.75] = [0.25, 0.75]^5? A simple experiment shows
that this number is probably rather small:
```{r curse.p5, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
p <- 5
x <- matrix(runif(n * p), n, p)
# how many points in the hypercube (0.25, 0.75)^p ?
tmp <- apply(x, 1, function(a) all(abs(a - 1 / 2) < 0.25))
(sum(tmp))
```
In fact, the expected number of observations
in that central hypercube is exactly *n / 2^5*,
which is approximately *3* when *n = 100*.
A relevant question for our local regression estimation
problem is: "how large should our sample be if we want
to still have about 50 observations in our central hypercube?".
Easy calculations show that this number is *50 / (1/2)^p*,
which, for *p = 5* is *1600*. Again, we can verify
this with a simple experiment:
```{r curse.2, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# how many obs do we need to have 50 in the hypercube?
n <- 50 / (0.5^p)
x <- matrix(runif(n * p), n, p)
# how many points in the hypercube (0.25, 0.75)^p ?
tmp <- apply(x, 1, function(a) all(abs(a - 1 / 2) < 0.25))
(sum(tmp))
```
So we see that if the dimension of our problem increases
from *p = 1* to *p = 5*, the number of observations we
need to maintain an expectation of having about 50 points
in our central hypercube increases by a factor of 16 (not 5).
However, if we double the dimension of the problem (to *p = 10*), in order to expect
50 observations in the central [0.25, 0.75] hypercube
we need a sample of size *n = 51,200*. In other words, we
doubled the dimension, but need 32 times more data (!)
to *fill* the central hypercube with the same number of points.
Moreover, if we doubled the dimension again (to *p = 20*) we would need over
52 million observations to have (just!) 50 in the central hypercube!
Note that now we doubled the
dimension again but need 1024 times more data! The number of
observations needed to maintain a fixed expected number of points
in a fixed region of the space grows exponentially with the
dimension of the space.
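These required sample sizes are easy to compute directly; the short chunk below is only a sanity check of the arithmetic above (the vector of dimensions `ps` is our own choice, used for illustration):
```{r curse.3}
# sample size needed so that, on average, about 50 points fall in the
# central hypercube [0.25, 0.75]^p, for a few dimensions p
ps <- c(1, 5, 10, 20)
setNames(50 / (0.5^ps), paste0("p=", ps))
```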
Another way to think about this problem is to
ask: "given a sample size of *n = 1000*, say, how wide / large
should the central hypercube be to expect
about *50* observations in it?". The answer, expressed as the
half-width of the cube (the distance from its centre to each face),
is easily found to be *(1/2) (50/n)^(1/p)*, which for
*n = 1000* and *p = 5* equals 0.27, for
*p = 10* is 0.37, and for *p = 20* is
0.43, so the cube spans almost the full unit interval in each coordinate!
In this sense it is fair to say that in moderate to high dimensions
*local neighbourhoods* are either empty or not really *local*.
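These widths are also simple to check numerically; below is a small sketch of that computation (again, the vector of dimensions `ps` is our own choice):
```{r curse.4}
# half-width of the central hypercube expected to contain about 50
# of n = 1000 uniform observations, for a few dimensions p
n <- 1000
ps <- c(5, 10, 20)
setNames(round((1 / 2) * (50 / n)^(1 / ps), 2), paste0("p=", ps))
```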
## Regression trees as constrained non-parametric regression
Regression trees provide an alternative non-parametric regression
estimator that works well, even with many available features.
As discussed in class, the basic idea is to approximate the
regression function by a linear combination of "simple"
functions (i.e. functions $h(x) = I( x \in A )$ which equal
1 if the argument *x* belongs to the set *A*, and 0 otherwise).
Each function has its own support set *A*. Furthermore,
this linear combination is not estimated all at once, but
iteratively, and only considering a specific class of
sets *A* (which ones?). As a result, the regression tree
is not the *global* optimal approximation by simple functions,
but a good *suboptimal* one that can be computed very rapidly.
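To make the construction concrete, a fitted regression tree with $J$ terminal nodes (leaves) can be written as the piecewise-constant function
$$
\hat{f}(x) = \sum_{j=1}^{J} \hat{c}_j \, I(x \in A_j) \, ,
$$
where the sets $A_j$ partition the space of explanatory variables (each one a "box" obtained by successive axis-aligned splits), and $\hat{c}_j$ is the average of the responses of the training points falling in $A_j$ (the notation $J$, $A_j$ and $\hat{c}_j$ is used here only for illustration).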
Details were discussed in class; refer to your notes and
the corresponding slides.
There are several packages in `R` implementing trees;
in this course we will use `rpart`. To illustrate their
use we will consider the `Boston` data set, which contains
information on housing in the US city of Boston. The
corresponding help page contains more information.
**Here, to simplify the comparison** of the predictions obtained
by trees and other regression estimators, instead of using
K-fold CV, we start by randomly splitting the
available data into a training and a test set:
```{r tree, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
library(rpart)
data(Boston, package = "MASS")
# split data into a training and
# a test set
set.seed(123456)
n <- nrow(Boston)
ii <- sample(n, floor(n / 4))
dat.te <- Boston[ii, ]
dat.tr <- Boston[-ii, ]
```
We now build a regression tree using the function `rpart`, leaving most of
its arguments at their
default values. We specify the response and explanatory variables using
a `formula`, as usual, and set `method='anova'` to indicate that we
want to train a regression tree (as opposed to a classification tree, for example).
Finally, we use the corresponding `plot` method to display the tree
structure:
```{r tree3, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
set.seed(123)
bos.t <- rpart(medv ~ ., data = dat.tr, method = "anova")
plot(bos.t, uniform = FALSE, margin = 0.05)
text(bos.t, pretty = TRUE)
```
A few questions for you:
* Why did we set the seed of the pseudo-random number generator (`set.seed(123)`)
before calling `rpart`? Is there anything random about building these trees?
* What does the `uniform` argument for `plot.rpart` do? What does `text` do here?
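If you want to explore the second question empirically, one possibility (just a sketch, re-using the tree `bos.t` fitted above) is to re-draw the same tree changing only the `uniform` argument and compare the two displays:
```{r tree3b, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# re-draw the same fitted tree, changing only the `uniform` argument
plot(bos.t, uniform = TRUE, margin = 0.05)
text(bos.t, pretty = TRUE)
```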
## Compare predictions
We now compare the predictions obtained on the test set using the
above regression tree, the usual linear model with all
explanatory variables, another linear model constructed using stepwise
variable selection, and the "optimal" LASSO fit.
First, we estimate the MSPE of the regression tree using the test set:
```{r tree4, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# predictions on the test set
pr.t <- predict(bos.t, newdata = dat.te, type = "vector")
with(dat.te, mean((medv - pr.t)^2))
```
For a full linear model, the estimated MSPE using the test set is:
```{r tree5, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# full linear model
bos.lm <- lm(medv ~ ., data = dat.tr)
pr.lm <- predict(bos.lm, newdata = dat.te)
with(dat.te, mean((medv - pr.lm)^2))
```
The estimated MSPE of a linear model constructed via stepwise is:
```{r tree6, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(MASS)
null <- lm(medv ~ 1, data = dat.tr)
full <- lm(medv ~ ., data = dat.tr)
bos.aic <- stepAIC(null, scope = list(lower = null, upper = full), trace = FALSE)
pr.aic <- predict(bos.aic, newdata = dat.te)
with(dat.te, mean((medv - pr.aic)^2))
```
Finally, the estimated MSPE of the "optimal" LASSO fit is:
```{r tree7, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
# LASSO, with the level of penalization chosen by cross-validation
library(glmnet)
x.tr <- as.matrix(dat.tr[, -14]) # remove the response (column 14, medv)
y.tr <- as.vector(dat.tr$medv)
set.seed(123)
bos.la <- cv.glmnet(x = x.tr, y = y.tr, alpha = 1)
x.te <- as.matrix(dat.te[, -14])
pr.la <- predict(bos.la, s = "lambda.1se", newx = x.te)
with(dat.te, mean((medv - pr.la)^2))
```
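It may be easier to look at the four estimates side by side. The following short chunk simply collects the numbers computed above into a named vector (a convenience step only, no new computations):
```{r tree8}
# gather the four test-set MSPE estimates computed above
mspe <- c(
  tree     = with(dat.te, mean((medv - pr.t)^2)),
  full.lm  = with(dat.te, mean((medv - pr.lm)^2)),
  stepwise = with(dat.te, mean((medv - pr.aic)^2)),
  lasso    = with(dat.te, mean((medv - pr.la)^2))
)
round(sort(mspe), 2)
```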
Note that the regression tree appears to have the best MSPE, although
we cannot really assess whether the observed differences are beyond
the uncertainty associated with our MSPE estimators. In other words,
would these differences persist if we used a different
training / test split? In fact, a very good exercise for you
would be to repeat the above comparison using **many**
different training/test splits, or even better: using all the data for
training and K-fold CV to estimate the different MSPEs.
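For reference, a possible skeleton for the first of these exercises (many random training/test splits) is sketched below. Only the tree and the full linear model are included, the number of repetitions `N` and all object names are our own arbitrary choices, and the chunk is not evaluated here:
```{r tree9, eval=FALSE}
# sketch: repeat the training/test comparison over N random splits
N <- 100 # number of repetitions (an arbitrary choice)
mspe.tree <- mspe.lm <- rep(NA, N)
for (i in 1:N) {
  jj <- sample(n, floor(n / 4)) # n = nrow(Boston), as above
  te <- Boston[jj, ]
  tr <- Boston[-jj, ]
  fit.tree <- rpart(medv ~ ., data = tr, method = "anova")
  fit.lm <- lm(medv ~ ., data = tr)
  mspe.tree[i] <- mean((te$medv - predict(fit.tree, newdata = te))^2)
  mspe.lm[i] <- mean((te$medv - predict(fit.lm, newdata = te))^2)
}
boxplot(cbind(tree = mspe.tree, lm = mspe.lm), ylab = "Estimated MSPE")
```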