# Cross-validation concerns
```{r setup}
#| include: false
source("_common.R")
```
In this document we study how to perform cross-validation
when the model was selected or determined using the
training data. Consider the following synthetic data set:
```{r load.fallacy}
dat <- read.table("data/fallacy.dat", header = TRUE, sep = ",")
```
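
Since the discussion below hinges on how many candidate explanatory variables are available, a quick look at the dimensions of the data set may be useful:

```{r fallacy.dim}
# number of observations (rows) and columns (response plus explanatory variables)
dim(dat)
```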
This is the same data used in class. In this example
we know what the "true" model is, and thus we also know
what the "optimal" predictor is.
However, let us ignore this knowledge, and build a
linear model instead.
Given how many variables are available, we use
forward stepwise (AIC-based) to select a good subset of
them to include in our linear model:
```{r fallacy}
library(MASS)
p <- ncol(dat)
null <- lm(Y ~ 1, data = dat)
full <- lm(Y ~ ., data = dat) # needed for stepwise
step.lm <- stepAIC(null, scope = list(lower = null, upper = full), trace = FALSE)
```
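
It may be informative to take a quick look at the model that the stepwise search selected, for example:

```{r fallacy.selected}
# formula of the selected model
formula(step.lm)
# number of estimated coefficients, excluding the intercept
length(coef(step.lm)) - 1
```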
Without thinking too much, we use 50 runs of 5-fold CV
to compare the MSPE of the
**null** model (which we know is "true") and the
one we obtained using forward stepwise:
```{r wrong}
n <- nrow(dat)
ii <- (1:n) %% 5 + 1 # fold labels 1, ..., 5
set.seed(17)
N <- 50
mspe.n <- mspe.st <- rep(0, N)
for (i in 1:N) {
  ii <- sample(ii) # shuffle the fold labels for this run
  pr.n <- pr.st <- rep(0, n)
  for (j in 1:5) {
    # re-fit the model selected above using the training folds only
    tmp.st <- update(step.lm, data = dat[ii != j, ])
    pr.st[ii == j] <- predict(tmp.st, newdata = dat[ii == j, ])
    # null model predictions: the mean of Y over the training folds
    pr.n[ii == j] <- with(dat[ii != j, ], mean(Y))
  }
  mspe.st[i] <- with(dat, mean((Y - pr.st)^2))
  mspe.n[i] <- with(dat, mean((Y - pr.n)^2))
}
boxplot(mspe.st, mspe.n,
  names = c("Stepwise", "NULL"),
  col = c("gray60", "hotpink"), main = "Wrong"
)
summary(mspe.st)
summary(mspe.n)
```
* **Something is wrong!** What? Why?
* What would you change above to obtain reliable estimates for the MSPE of the
model selected with the stepwise approach?
## Estimating MSPE with CV when the model was built using the data
One needs to be careful when using cross-validation (in any of its flavours: leave-one-out, K-fold, etc.) whenever the model was selected or tuned using the training data.
In particular, within every fold one needs to repeat **everything** that was done with the training set (selecting variables, looking at pairwise correlations, computing AIC values, etc.).
Misuse of cross-validation is, unfortunately, not unusual; for [one example](https://doi.org/10.1073/pnas.102102699) see [@Ambroise6562].
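
For the stepwise example above, a minimal sketch of a correct implementation is shown below: within each fold the variable selection is repeated from scratch using only the training portion of the data. The chunk reuses `dat` from above, the object names (`mspe.right`, `x.tr`, etc.) are ours, and the number of runs is kept deliberately small to limit computation time.

```{r fallacy.right}
library(MASS) # for stepAIC()
n <- nrow(dat)
k <- 5 # number of folds
N.r <- 5 # only a few runs, to keep the computation light
set.seed(17)
mspe.right <- rep(0, N.r)
ii <- (1:n) %% k + 1
for (i in 1:N.r) {
  ii <- sample(ii) # shuffle the fold labels for this run
  pr.r <- rep(0, n)
  for (j in 1:k) {
    x.tr <- dat[ii != j, ]
    # repeat the *whole* model-building process using only the training folds
    null.tr <- lm(Y ~ 1, data = x.tr)
    full.tr <- lm(Y ~ ., data = x.tr)
    step.tr <- stepAIC(null.tr,
      scope = list(lower = null.tr, upper = full.tr), trace = FALSE
    )
    pr.r[ii == j] <- predict(step.tr, newdata = dat[ii == j, ])
  }
  mspe.right[i] <- with(dat, mean((Y - pr.r)^2))
}
summary(mspe.right)
```

The resulting MSPE estimates for the stepwise procedure now reflect the full model-building process, and can be compared with those of the **null** model obtained above.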
## Correlated covariates
Technological advances in recent decades have resulted in data
being collected in a fundamentally different manner from the way
it was done when most "classical" statistical methods were developed
(early to mid-1900s).
Specifically, it is now not at all uncommon to have data sets with
an abundance of potentially useful explanatory variables
(for example, with more variables than observations).
Sometimes the investigators are not sure which of the collected variables
can be expected to be useful or meaningful.
A consequence of this "wide net" data collection strategy is
that many of the explanatory variables may be correlated with
each other. In what follows we will illustrate some of the
problems that this can cause both when training and interpreting
models, and also with the resulting predictions.
### Variables that were important may suddenly "disappear"
Consider the air pollution data set we used
earlier, and the
**reduced** linear regression model discussed in class:
```{r signif}
x <- read.table("data/rutgers-lib-30861_CSV-1.csv", header = TRUE, sep = ",")
reduced <- lm(MORT ~ POOR + HC + NOX + HOUS + NONW, data = x)
round(summary(reduced)$coef, 3)
```
Note that all coefficients appear to be significant based on
the individual hypothesis tests (with `POOR` and
`HOUS` perhaps only marginally so). In this sense, all 5
explanatory variables in this model appear to be relevant.
Now, we fit the **full** model, that is, we include
all available explanatory variables in the data set:
```{r signif2}
full <- lm(MORT ~ ., data = x)
round(summary(full)$coef, 3)
```
In the **full** model there
are many more parameters to estimate, and while two of
them appear to be significantly different from zero (`NONW`
and `PREC`), all the others appear to be redundant.
In particular, note that the p-values of the individual
tests for 4 out of the 5
regression coefficients of the variables in the **reduced**
model are now no longer significant.
```{r signif3}
round(summary(full)$coef[names(coef(reduced)), ], 3)
```
In other words, the coefficients of
explanatory variables that appeared to
be relevant in one model may turn out to
be "not significant" when other variables
are included. This could pose some challenges
when interpreting the estimated parameters of the
models.
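
One quick, informal way to check that correlations among covariates are at play here is to look at the pairwise correlations between the explanatory variables of the **reduced** model (variance inflation factors would be a more formal alternative):

```{r collin.check}
# pairwise correlations among the covariates of the reduced model
round(cor(x[, c("POOR", "HC", "NOX", "HOUS", "NONW")]), 2)
```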
### Why does this happen?
Recall that the covariance matrix of the least squares estimator involves the
inverse of X'X, where X' denotes the transpose of the n x p design matrix X
(which contains each vector of explanatory variables as a row). It is easy to see
that if two columns of X are linearly dependent, then X'X will be rank deficient.
When two columns of X are "close" to being linearly dependent (e.g. when their
linear correlation is high), the matrix X'X will be ill-conditioned, and
its inverse will have very large entries. This means that the estimated
standard errors of the least squares estimator will be unduly large, resulting
in non-significant tests of hypotheses for each parameter separately, even if
the global test for all of them simultaneously is highly significant.
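
As a small synthetic illustration (a made-up sketch, not part of the air pollution data), compare the estimated standard errors when two covariates are nearly collinear with those obtained when they are independent, with the same true coefficients in both cases:

```{r collin.sim}
set.seed(123)
n.sim <- 100
x1 <- rnorm(n.sim)
x2.ind <- rnorm(n.sim) # independent of x1
x2.cor <- x1 + rnorm(n.sim, sd = 0.05) # nearly collinear with x1
y.ind <- 1 + 2 * x1 + 3 * x2.ind + rnorm(n.sim)
y.cor <- 1 + 2 * x1 + 3 * x2.cor + rnorm(n.sim)
# the estimated standard errors (2nd column) are much larger in the collinear case
round(summary(lm(y.ind ~ x1 + x2.ind))$coef, 3)
round(summary(lm(y.cor ~ x1 + x2.cor))$coef, 3)
```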
### Why is this a problem if we are interested in prediction?
Although in many applications one is interested in interpreting the parameters
of the model, highly variable parameter estimators are a problem even when one
is only trying to fit / train a model to do predictions: they will typically
result in a noticeable loss of prediction accuracy. This can be easily seen
from the bias / variance factorization of the mean squared prediction error
(MSPE) mentioned in class. Hence, better predictions can be obtained if one
uses less-variable parameter (or regression function) estimators.
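
As a reminder (written here in generic notation, which may differ slightly from the one used in class), if $Y_0 = f(X_0) + \varepsilon$ with $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the training data, then for a prediction $\hat{f}(X_0)$ at a fixed point $X_0$ we have

$$
\text{MSPE}(X_0) \, = \, E\left[ \left( Y_0 - \hat{f}(X_0) \right)^2 \right]
\, = \, \sigma^2 \, + \, \left[ E\,\hat{f}(X_0) - f(X_0) \right]^2 \, + \, \text{Var}\left( \hat{f}(X_0) \right) .
$$

Estimators with unnecessarily large variance inflate the last term without reducing the bias term, which is why less-variable estimators tend to predict better.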
### What can we do?
A commonly used strategy is to remove some explanatory variables from the
model, leaving only non-redundant covariates. However, this is easier said than
done. You will have seen some strategies in previous Statistics
courses (e.g. stepwise variable selection).
In coming weeks we will investigate other methods to deal with this problem.