# Predictions using a linear model (continued)
```{r setup}
#| include: false
source("_common.R")
```
In these notes we continue looking at the problem of
comparing different models based on their
prediction properties. As in the previous lecture, we consider
a **full** and a **reduced** model, and in all that follows we assume that the
variables included in the **reduced** model were not selected using the training data.
**This seemingly innocent assumption is in fact critical, and we will later come back to it.**
## Estimating the MSPE with a test set
One way to estimate the mean squared prediction error of a model or predictor is to use it on a test set (one where the responses are known, but which was not used to train the predictor or estimate the model), and then compare the predictions with the actual responses.
First, we load the training set and fit both models:
```{r read}
x.tr <- read.table("data/pollution-train.dat", header = TRUE, sep = ",")
full <- lm(MORT ~ ., data = x.tr)
reduced <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x.tr)
```
Recall that although the **full** model fits the training data better than the
**reduced** one (see Lecture 1), the predictions of the **reduced** model on the test set are better.
First, compute the two vectors of test set predictions:
```{r pred1}
x.te <- read.table("data/pollution-test.dat", header = TRUE, sep = ",")
pr.full <- predict(full, newdata = x.te)
pr.reduced <- predict(reduced, newdata = x.te)
```
And now, use them to estimate the mean squared prediction error of
each model:
```{r pred1.2}
with(x.te, mean((MORT - pr.full)^2))
with(x.te, mean((MORT - pr.reduced)^2))
```
Previously, we also saw that
this is not just an artifact of the specific
training / test split of the data. The **reduced**
model generally produces better predictions,
regardless of the specific training / test
split we use. We can verify this by repeating
the procedure many times (say, 100) and looking
at the estimated mean squared prediction errors
obtained each time for each model.
```{r testrain, fig.width=5, fig.height=5}
x <- read.csv("data/rutgers-lib-30861_CSV-1.csv")
set.seed(123)
Nsplits <- 100
mspe.full <- mspe.red <- vector("numeric", Nsplits)
for (j in 1:Nsplits) {
  # randomly split the observations into 4 groups of 15;
  # group 2 is the test set, the other 3 groups form the training set
  g <- sample(rep(1:4, each = 15))
  a.tr <- x[g != 2, ]
  a.te <- x[g == 2, ]
  full <- lm(MORT ~ ., data = a.tr)
  reduced <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = a.tr)
  a.te$pr.full <- predict(full, newdata = a.te)
  a.te$pr.reduced <- predict(reduced, newdata = a.te)
  mspe.full[j] <- with(a.te, mean((MORT - pr.full)^2))
  mspe.red[j] <- with(a.te, mean((MORT - pr.reduced)^2))
}
boxplot(mspe.full, mspe.red,
  names = c("Full", "Reduced"),
  col = c("gray80", "tomato3"),
  main = "Air Pollution - 100 training/test splits", ylim = c(0, 5000)
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
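As a quick numerical complement to the boxplots, we can also count in how many of the 100 splits the **reduced** model produced the smaller estimated MSPE (this simply re-uses the vectors `mspe.full` and `mspe.red` computed in the chunk above):
```{r testrain-prop}
# proportion of splits in which the reduced model has the smaller estimated MSPE
mean(mspe.red < mspe.full)
```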
## Leave-one-out cross-validation
A different procedure to estimate the prediction power
of a model or method is called **leave-one-out CV**.
One advantage of this method is that each model is fit on almost all of the
available data, so it uses a larger training set than a single training / test split would allow.
We discussed the procedure in class. Here
we apply it to estimate the mean squared
prediction error of the **full** and **reduced**
models. Again, we assume that the reduced model
was chosen independently from the training set.
```{r loocv}
x <- read.csv("data/rutgers-lib-30861_CSV-1.csv")
n <- nrow(x)
pr.full <- pr.reduced <- rep(0, n)
for (i in 1:n) {
  # leave observation i out, fit both models, and predict the i-th response
  full <- lm(MORT ~ ., data = x[-i, ])
  reduced <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x[-i, ])
  pr.full[i] <- predict(full, newdata = x[i, ])
  pr.reduced[i] <- predict(reduced, newdata = x[i, ])
}
```
Now we have the leave-one-out predictions for each model
and can compute the corresponding estimated mean squared
prediction errors:
```{r loocv2}
mean((x$MORT - pr.full)^2)
mean((x$MORT - pr.reduced)^2)
```
Note that here again the **reduced** model seems to yield smaller
prediction errors.
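As an aside: for models fit with `lm()`, the leave-one-out predictions do not actually require refitting the model `n` times, because the i-th leave-one-out residual equals the i-th ordinary residual divided by `1 - h_i`, where `h_i` is the i-th leverage. A minimal sketch (the fits `full.all` and `reduced.all` on the whole data set are introduced here only for this shortcut):
```{r loocv-shortcut}
# closed-form LOOCV for linear models: leave-one-out residual = e_i / (1 - h_i)
full.all <- lm(MORT ~ ., data = x)
reduced.all <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x)
mean((resid(full.all) / (1 - hatvalues(full.all)))^2)
mean((resid(reduced.all) / (1 - hatvalues(reduced.all)))^2)
```
These values should agree (up to numerical precision) with the loop-based estimates above.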
## K-fold cross-validation
Leave-one-out cross-validation can be computationally
very demanding (or even infeasible) when the sample size
is large and training the predictor is relatively costly.
One solution is called **K-fold CV**. We split the data
into **K** folds, train the predictor on all the data except
one fold, and use it to predict the responses
in the held-out fold. We cycle through the folds, and
use the average of the squared prediction errors as
an estimate of the mean squared prediction error.
The following script does **5-fold CV** for the
`full` and `reduced` linear models on the
pollution dataset, once again assuming that the reduced model
was originally chosen without using the data.
```{r kfold, fig.width=5, fig.height=5, echo=TRUE}
n <- nrow(x)
k <- 5
pr.full <- pr.reduced <- rep(0, n)
# Create labels for the "folds"
inds <- (1:n) %% k + 1
# shuffle the rows of x, this is bad coding!
set.seed(123)
xs <- x[sample(n, replace = FALSE), ]
# loop through the folds
for (j in 1:k) {
  # fold j is the test set, the remaining k - 1 folds form the training set
  x.tr <- xs[inds != j, ]
  x.te <- xs[inds == j, ]
  full <- lm(MORT ~ ., data = x.tr)
  reduced <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x.tr)
  pr.full[inds == j] <- predict(full, newdata = x.te)
  pr.reduced[inds == j] <- predict(reduced, newdata = x.te)
}
```
We now compute the estimated mean squared prediction
error of each model:
```{r kfold2}
mean((xs$MORT - pr.full)^2)
mean((xs$MORT - pr.reduced)^2)
```
This method is clearly faster than leave-one-out CV, but
the results may depend on the specific fold partition,
and on the number **K** of folds used.
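To get a feel for the dependence on **K**, here is a small, purely illustrative check using the **reduced** model only; the helper `kfold.mspe()` is introduced just for this sketch and is not used elsewhere in these notes:
```{r k-dependence}
# pooled K-fold MSPE of the reduced model for a few values of K
kfold.mspe <- function(K, data) {
  labs <- sample((1:nrow(data)) %% K + 1) # random fold labels
  pr <- rep(0, nrow(data))
  for (j in 1:K) {
    fit <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = data[labs != j, ])
    pr[labs == j] <- predict(fit, newdata = data[labs == j, ])
  }
  mean((data$MORT - pr)^2)
}
set.seed(123)
sapply(c(2, 5, 10), kfold.mspe, data = x)
```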
* One way to obtain more stable mean squared prediction errors
using K-fold CV is to repeat the above procedure
many times, and compare the distribution of the
mean squared prediction errors for each estimator.
First, fit the **full** and **reduced** models using the
whole data set as training:
```{r fits, fig.width=5, fig.height=5, echo=TRUE}
m.f <- lm(MORT ~ ., data = x)
m.r <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x)
```
We will use 50 runs of 5-fold CV comparing the **full** and **reduced** models.
Again, here we assume that the reduced model was not obtained
using the training data.
```{r cv10runs, fig.width=5, fig.height=5, echo=TRUE}
N <- 50
mspe1 <- mspe2 <- vector("double", N)
ii <- (1:(n <- nrow(x))) %% 5 + 1
set.seed(327)
for (i in 1:N) {
  # new random assignment of observations to the 5 folds for each run
  ii <- sample(ii)
  pr.f <- pr.r <- vector("double", n)
  for (j in 1:5) {
    pr.f[ii == j] <- predict(
      update(m.f, data = x[ii != j, ]),
      newdata = x[ii == j, ]
    )
    pr.r[ii == j] <- predict(
      update(m.r, data = x[ii != j, ]),
      newdata = x[ii == j, ]
    )
  }
  mspe1[i] <- with(x, mean((MORT - pr.f)^2))
  mspe2[i] <- with(x, mean((MORT - pr.r)^2))
}
boxplot(mspe1, mspe2,
  names = c("Full", "Reduced"),
  col = c("gray80", "tomato3"),
  main = "Air Pollution - 50 runs 5-fold CV", ylim = c(0, 5000)
)
mtext(expression(hat(MSPE)), side = 2, line = 2.5)
```
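A quick numerical summary of the 50 estimates for each model (re-using `mspe1` and `mspe2` from the chunk above) confirms what the boxplots show:
```{r cv50runs-summary}
# numerical summaries of the 50 MSPE estimates for each model
summary(mspe1)
summary(mspe2)
```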
Note that across the 50 runs the estimated mean squared prediction
errors of the **reduced** model have a smaller mean / median
than those of the **full** one. This tells us that
the conclusion we reached favouring the reduced model
(in terms of its prediction mean squared error) does
not depend on a particular choice of folds. In other
words, this provides more evidence to conclude that
the reduced model will produce better predictions
than the full one.
* A computationally simpler (albeit possibly less precise) way
to account for the K-fold variability is to run
K-fold CV once and
use the sample standard error of the
**K** fold-level mean squared prediction errors to
construct a rough *confidence interval* around
the overall mean squared prediction error estimate (that is,
the average of the mean squared prediction errors
over the **K** folds); a short sketch is given after this list.
* The dependency of this MSPE on **K** is more involved.
We will discuss it later.
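As a rough sketch of the confidence-interval idea in the first bullet above, using the single 5-fold run from before (this re-uses `xs`, `inds`, `k` and `pr.full` from the 5-fold CV chunk; the "2 standard errors" width is just a crude choice):
```{r kfold-se}
# fold-level MSPEs of the full model from the single 5-fold run above
mspe.folds <- sapply(1:k, function(j) mean((xs$MORT[inds == j] - pr.full[inds == j])^2))
# rough interval: average fold-level MSPE +/- 2 standard errors
mean(mspe.folds) + c(-2, 2) * sd(mspe.folds) / sqrt(k)
```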