# Regression II: linear regression {#regression2}
## Overview
This chapter provides an introduction to linear regression models
in a predictive context, focusing primarily on the case where
there is a single predictor and a single response variable of interest,
and compares linear regression to K-nearest neighbours regression. The
chapter concludes with a discussion of linear regression with
multiple predictors.
## Chapter learning objectives
By the end of the chapter, students will be able to:
* Perform linear regression in R using `tidymodels` and evaluate it on a test dataset.
* Compare and contrast predictions obtained from K-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
* In R, overlay regression lines from `geom_smooth` on a single plot.
## Simple linear regression
K-NN is not the only type of regression; another
quite useful, and arguably the most common, type of regression is
called simple linear regression. Simple
linear regression is similar to K-NN regression in that the target/response
variable is quantitative. However, it differs in how the training data are used
to predict the value for a new observation. Instead of looking at the $K$ nearest
neighbours and averaging their values to make a prediction, simple linear
regression uses all of the training data points to create a straight line of best
fit, and then uses that line to "look up" the predicted value.
> Note: for simple linear
regression there is only one response variable and only one predictor. Later in
this chapter we introduce the more general linear regression case where more
than one predictor can be used.
For example, let's revisit the smaller version of the Sacramento housing data
set. Recall that we have come across a new 2,000 square foot house we are interested
in purchasing with an advertised list price of
\$350,000. Should we offer the list price, or is the house over- or undervalued?
To answer this question using simple linear regression, we use the data we have
to draw the straight line of best fit through our existing data points:
```{r 08-lin-reg1, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5}
library(tidyverse)
library(scales)
library(gridExtra)
library(caret)

data(Sacramento)

set.seed(1234)
small_sacramento <- sample_n(Sacramento, size = 30)

small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
  geom_point() +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  geom_smooth(method = "lm", se = FALSE)
small_plot
```
The equation for the straight line is:
$$\text{house price} = \beta_0 + \beta_1 \cdot (\text{house size}),$$
where
- $\beta_0$ is the vertical intercept of the line (the value where the line cuts the vertical axis)
- $\beta_1$ is the slope of the line
Therefore, using the data to find the line of best fit is equivalent to finding the coefficients
$\beta_0$ and $\beta_1$ that *parametrize* (correspond to) the line of best fit.
Once we have the coefficients, we can use the equation above to evaluate the predicted price given the value we
have for the predictor/explanatory variable—here 2,000 square feet.
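To make the "look up" step concrete, here is a minimal sketch of the arithmetic using *hypothetical* coefficient values (chosen only for illustration, not the ones we will actually fit below):

```{r 08-plug-in-sketch}
# hypothetical coefficients, chosen only to illustrate plugging into the equation
beta_0 <- 15000 # intercept: predicted price (USD) of a 0 square foot house
beta_1 <- 140   # slope: predicted change in price (USD) per extra square foot

house_size <- 2000
beta_0 + beta_1 * house_size # predicted price for a 2,000 square foot house
```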
```{r 08-lin-reg2, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5}
small_model <- lm(price ~ sqft, data = small_sacramento)
prediction <- predict(small_model, data.frame(sqft = 2000))

small_plot +
  geom_vline(xintercept = 2000, linetype = "dotted") +
  geom_point(aes(x = 2000, y = prediction[[1]]), color = "red", size = 2.5)

print(prediction[[1]])
```
By using simple linear regression on this small data set to predict the sale price
for a 2,000 square foot house, we get a predicted value of
\$`r format(round(prediction[[1]]), scientific = FALSE)`. But wait a minute...how
exactly does simple linear regression choose the line of best fit? Many
different lines could be drawn through the data points. We show some examples
below:
```{r 08-several-lines, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
small_plot +
  geom_abline(intercept = -64542.23, slope = 190, color = "green") +
  geom_abline(intercept = -6900, slope = 175, color = "purple") +
  geom_abline(intercept = -64542.23, slope = 160, color = "red")
```
Simple linear regression chooses the straight line of best fit by choosing
the line that minimizes the **average squared vertical distance** between itself and
each of the observed data points. From the lines shown above, that is the blue
line. What exactly do we mean by the vertical distance between the predicted
values (which fall along the line of best fit) and the observed data points?
We illustrate these distances in the plot below with a red line:
```{r 08-verticalDistToMin, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
small_sacramento <- small_sacramento %>%
  mutate(predicted = predict(small_model))

small_plot +
  geom_segment(data = small_sacramento, aes(xend = sqft, yend = predicted), colour = "red")
```
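More precisely, simple linear regression finds the coefficients $\beta_0$ and $\beta_1$ that minimize the average of the squared vertical distances (the red segments above) between the line and the observed data points:

$$\frac{1}{n}\sum_{i=1}^{n}\Big(\text{house price}_i - \big(\beta_0 + \beta_1 \cdot \text{house size}_i\big)\Big)^2,$$

where $n$ is the number of observations in the training data.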
To assess the predictive accuracy of a simple linear regression model,
we use RMSPE—the same measure of predictive performance we used with K-NN regression.
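As a reminder, for a test set with $n$ observations, RMSPE is computed as

$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the observed response for the $i$-th test observation and $\hat{y}_i$ is the corresponding predicted value.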
## Linear regression in R
We can perform simple linear regression in R using `tidymodels` in a
very similar manner to how we performed K-NN regression.
To do this, instead of creating a `nearest_neighbor` model specification with
the `kknn` engine, we use a `linear_reg` model specification
with the `lm` engine. Another difference is that we do not need to choose $K$ in the
context of linear regression, and so we do not need to perform cross-validation.
Below we illustrate how we can use the usual `tidymodels` workflow to predict house sale
price given house size with a simple linear regression approach on the full
Sacramento real estate data set.
> An additional difference that you will notice below is that we do not standardize
> (i.e., scale and center) our predictors. In K-nearest neighbours models, recall that
> the model fit changes depending on whether we standardize first or not. In linear regression,
> standardization does not affect the fit (it *does* affect the coefficients in the equation, though!).
> So you can standardize if you want—it won't hurt anything—but if you leave the
> predictors in their original form, the best fit coefficients are usually easier to interpret afterward.
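As a quick sanity check of that claim, here is a minimal sketch using base R's `lm` and `small_sacramento`, the 30-house sample used for the plots above: the coefficients change when the predictor is standardized, but the fitted values do not.

```{r 08-standardize-check}
# fit on the original predictor and on a standardized version of it
raw_fit <- lm(price ~ sqft, data = small_sacramento)
scaled_fit <- lm(price ~ scale(sqft), data = small_sacramento)

# different coefficients ...
coef(raw_fit)
coef(scaled_fit)

# ... but identical fitted values
all.equal(unname(fitted(raw_fit)), unname(fitted(scaled_fit)))
```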
As usual, we start by putting some test data away in a lock box that we
can come back to after we choose our final model. Let's take care of that now.
```{r 08-test-train-split}
library(tidymodels)

set.seed(1234)
# use the full Sacramento data set (loaded via the caret package above)
sacramento <- Sacramento

sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)
```
Now that we have our training data, we will create the model specification
and recipe, and fit our simple linear regression model:
```{r 08-fitLM, fig.height = 4, fig.width = 5}
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)
lm_fit
```
Our coefficients are
(intercept) $\beta_0=$ `r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[1]), scientific=FALSE)`
and (slope) $\beta_1=$ `r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[2]), scientific=FALSE)`.
This means that the equation of the line of best fit is
$$\text{house price} = `r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[1]), scientific=FALSE)` + `r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[2]), scientific=FALSE)`\cdot (\text{house size}),$$
and that the model predicts that houses
start at \$`r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[1]), scientific=FALSE)` for 0 square feet, and that
every extra square foot increases the cost of the house by \$`r format(round(pull(tidy(pull_workflow_fit(lm_fit)), estimate)[2]), scientific=FALSE)`. Finally, we predict on the test data set to assess how well our model does:
```{r 08-assessFinal}
lm_test_results <- lm_fit %>%
  predict(sacramento_test) %>%
  bind_cols(sacramento_test) %>%
  metrics(truth = price, estimate = .pred)
lm_test_results
```
Our final model's test error as assessed by RMSPE
is `r format(round(lm_test_results %>% filter(.metric == 'rmse') %>% pull(.estimate)), scientific=FALSE)`.
Remember that this is in units of the target/response variable, and here that
is US Dollars (USD). Does this mean our model is "good" at predicting house
sale price based on home size? Again, answering this is
tricky and requires domain knowledge and careful thought about the
application you are using the prediction for.
To visualize the simple linear regression model, we can plot the predicted house
price across all possible house sizes we might encounter superimposed on a scatter
plot of the original housing price data. There is a plotting function in
the `tidyverse`, `geom_smooth`, that
allows us to do this easily by adding a layer to our plot with the simple
linear regression predicted line of best fit. By default, `geom_smooth` also adds a
shaded band around this line indicating a plausible range for it; we are not interested in that at this point, so to
avoid plotting it, we provide the argument `se = FALSE` in our call to
`geom_smooth`.
```{r 08-lm-predict-all, fig.height = 4, fig.width = 5, warning = FALSE, message = FALSE}
lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  geom_smooth(method = "lm", se = FALSE)
lm_plot_final
```
We can extract the coefficients from our model by accessing the
fit object that is output by the `fit` function; we first have to extract
it from the workflow using the `pull_workflow_fit` function, and then apply
the `tidy` function to convert the result into a data frame:
```{r 08-lm-get-coeffs}
coeffs <- tidy(pull_workflow_fit(lm_fit))
coeffs
```
## Comparing simple linear and K-NN regression
Now that we have a general understanding of both simple linear and K-NN
regression, we can start to compare and contrast these methods as well as the
predictions made by them. To start, let's look at the visualization of the
simple linear regression model predictions for the Sacramento real estate data
(predicting price from house size) and the "best" K-NN regression model
obtained from the same problem:
```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4, fig.width = 10}
set.seed(1234)

# recipe from the K-NN regression chapter: standardize the single predictor
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 30) %>%
  set_engine("kknn") %>%
  set_mode("regression")

sacr_wkflw <- workflow() %>%
  add_recipe(sacr_recipe) %>%
  add_model(sacr_spec)

sacr_fit <- sacr_wkflw %>%
  fit(data = sacramento_train)

sacr_preds <- sacr_fit %>%
  predict(sacramento_train) %>%
  bind_cols(sacramento_train)

sacr_rmse <- sacr_preds %>%
  metrics(truth = price, estimate = .pred) %>%
  filter(.metric == 'rmse') %>%
  pull(.estimate) %>%
  round(2)

sacr_rmspe <- sacr_fit %>%
  predict(sacramento_test) %>%
  bind_cols(sacramento_test) %>%
  metrics(truth = price, estimate = .pred) %>%
  filter(.metric == 'rmse') %>%
  pull(.estimate) %>%
  round()

knn_plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  xlab("House size (square footage)") +
  ylab("Price (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
  ggtitle("K-NN regression") +
  annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", sacr_rmspe))

lm_rmspe <- lm_test_results %>%
  filter(.metric == 'rmse') %>%
  pull(.estimate) %>%
  round()

lm_plot_final <- lm_plot_final +
  annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", lm_rmspe)) +
  ggtitle("linear regression")

grid.arrange(lm_plot_final, knn_plot_final, ncol = 2)
```
What differences do we observe from the visualization above? One obvious
difference is the shape of the blue lines. In simple linear regression we are
restricted to a straight line, whereas in K-NN regression our line is much more
flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the
model to a straight line. A
straight line can be defined by two numbers, the
vertical intercept and the slope. The intercept tells us what the prediction is when
all of the predictors are equal to 0, and the slope tells us by how much we predict the
target/response variable to change for each one-unit increase in the predictor/explanatory
variable. K-NN regression, as simple as it is to implement and understand, has no such
interpretability from its wiggly line.
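To make that interpretability concrete, here is a small sketch that reproduces a prediction "by hand" directly from the intercept and slope stored in the `coeffs` data frame we extracted above:

```{r 08-interpret-coeffs-sketch}
beta_0 <- coeffs %>% filter(term == "(Intercept)") %>% pull(estimate)
beta_1 <- coeffs %>% filter(term == "sqft") %>% pull(estimate)

# prediction for a 2,000 square foot house, straight from the line's equation
beta_0 + beta_1 * 2000
```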
There can, however, also be a disadvantage to using a simple linear regression
model in some cases, particularly when the relationship between the target and
the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
these cases the simple linear regression model
will underfit (have high bias), meaning that the predicted values do not
match the actual observed values very well. Such a model would probably have a
quite high RMSE when assessing model goodness of fit on the training data and
a quite high RMSPE when assessing model prediction quality on a test data
set. On such a data set, K-NN regression may fare better. Additionally, there
are other types of regression you can learn about in future courses that may do
even better at predicting with such data.
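To see what underfitting looks like, here is a small sketch with simulated (made-up) data that follow a curved, oscillating relationship; the straight line fit by `geom_smooth(method = "lm")` clearly misses the pattern.

```{r 08-underfit-sketch, fig.height = 4, fig.width = 5, message = FALSE}
set.seed(2020)
curved_data <- tibble(
  x = runif(200, 0, 10),
  y = sin(x) + rnorm(200, sd = 0.2)
)

ggplot(curved_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) # a straight line underfits this pattern
```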
How do these two models compare on the Sacramento house prices data set? On
the visualizations above we also printed the RMSPE as calculated from
predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear
regression model is slightly lower than the RMSPE for the K-NN regression model.
Considering that the simple linear regression model is also more interpretable,
if we were comparing these in practice we would likely choose to use the simple
linear regression model.
Finally, note that the K-NN regression model becomes "flat"
at the left and right boundaries of the data, while the linear model
predicts a constant slope. Predicting outside the range of the observed
data is known as *extrapolation*; K-NN and linear models behave quite differently
when extrapolating. Depending on the application, the flat
or constant slope trend may make more sense. For example, if our housing
data were slightly different, the linear model might have predicted
a *negative* price for small houses (if the intercept $\beta_0$ were negative),
which obviously does not match reality. On the other hand, the trend of increasing
house size corresponding to increasing house price probably continues for large houses,
so the "flat" extrapolation of K-NN likely does not match reality.
## Multivariate linear regression
As in K-NN classification and K-NN regression, we can move beyond the simple
case of one response variable and only one predictor and perform multivariate
linear regression where we can have multiple predictors. In this case we fit a
plane to the data, as opposed to a straight line.
To do this, we follow a very similar approach to what we did for
K-NN regression; but recall that we do not need to use cross-validation to choose any parameters,
nor do we need to standardize (i.e., center and scale) the data for linear regression.
We demonstrate how to do this below using the Sacramento real estate data with both house size
(measured in square feet) and the number of bedrooms as our predictors, and we
continue to use house sale price as the outcome/target variable that we are
trying to predict. We will start by changing the formula in the recipe to
include both the `sqft` and `beds` variables as predictors:
```{r 08-lm-mult-test-train-split}
lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train)
```
Now we can build our workflow and fit the model:
```{r 08-fitlm}
lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)
lm_fit
```
And finally, we predict on the test data set to assess how well our model does:
```{r 08-assessFinal-multi}
lm_mult_test_results <- lm_fit %>%
  predict(sacramento_test) %>%
  bind_cols(sacramento_test) %>%
  metrics(truth = price, estimate = .pred)
lm_mult_test_results
```
In the case of two predictors, our linear regression creates a *plane* of best fit, shown below:
```{r 08-3DlinReg, echo = FALSE, message = FALSE, warning = FALSE}
library(plotly)

xvals <- seq(from = min(sacramento_train$sqft), to = max(sacramento_train$sqft), length = 50)
yvals <- seq(from = min(sacramento_train$beds), to = max(sacramento_train$beds), length = 50)

zvals <- lm_fit %>%
  predict(crossing(xvals, yvals) %>% mutate(sqft = xvals, beds = yvals)) %>%
  pull(.pred)
zvalsm <- matrix(zvals, nrow = length(xvals))

plot_ly() %>%
  add_markers(data = sacramento_train,
              x = ~sqft,
              y = ~beds,
              z = ~price,
              marker = list(size = 5, opacity = 0.4, color = "red")) %>%
  layout(scene = list(xaxis = list(title = 'House size (square feet)'),
                      yaxis = list(title = 'Number of bedrooms'),
                      zaxis = list(title = 'Price (USD)'))) %>%
  add_surface(x = ~xvals,
              y = ~yvals,
              z = ~zvalsm,
              colorbar = list(title = 'Price (USD)'))
```
We see that the predictions from linear regression with two predictors form a
flat plane. This is the hallmark of linear regression, and differs from the
wiggly, flexible surface we get from other methods such as K-NN regression.
As discussed, this is advantageous in one aspect: linear regression gives us an
intercept and a slope for each predictor, and thus we can describe the
plane mathematically. We can extract those values from our model object
as shown below:
```{r 08-lm-multi-get-coeffs}
coeffs <- tidy(pull_workflow_fit(lm_fit))
coeffs
```
And then use those slopes to write a mathematical equation to describe the prediction plane:
$$\text{house price} = \beta_0 + \beta_1\cdot(\text{house size}) + \beta_2\cdot(\text{number of bedrooms}),$$
where:
- $\beta_0$ is the vertical intercept of the hyperplane (the value where it cuts the vertical axis)
- $\beta_1$ is the slope for the first predictor (house size)
- $\beta_2$ is the slope for the second predictor (number of bedrooms)
Finally, we can fill in the values for $\beta_0$, $\beta_1$ and $\beta_2$ from the model output above
to create the equation of the plane of best fit to the data:
```{r 08-lm-multi-get-coeffs-hidden, echo = FALSE}
icept <- format(round(coeffs %>% filter(term == '(Intercept)') %>% pull(estimate)), scientific = FALSE)
sqftc <- format(round(coeffs %>% filter(term == 'sqft') %>% pull(estimate)), scientific = FALSE)
bedsc <- format(round(coeffs %>% filter(term == 'beds') %>% pull(estimate)), scientific = FALSE)
```
$$\text{house price} = `r icept` + `r sqftc`\cdot (\text{house size}) `r bedsc` \cdot (\text{number of bedrooms})$$
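As a quick check, here is a minimal sketch that plugs a hypothetical 2,000 square foot, 3-bedroom house into the fitted plane via `predict`; up to the rounding of the displayed coefficients, the result matches what you get by substituting into the equation above.

```{r 08-plane-prediction-sketch}
new_house <- tibble(sqft = 2000, beds = 3)
predict(lm_fit, new_house)
```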
This model is more interpretable than the multivariate K-NN
regression model; we can write a mathematical equation that explains how
each predictor is affecting the predictions. But as always, we should look at
the test error and ask whether linear regression is doing a better job of
predicting compared to K-NN regression in this multivariate regression case. To
do that we can use this linear regression model to predict on the test data to
get our test error.
```{r 08-get-RMSPE}
lm_mult_test_results
```
We get an RMSPE for the multivariate linear regression model
of `r format(lm_mult_test_results %>% filter(.metric == 'rmse') %>% pull(.estimate), scientific = FALSE)`. This prediction error
is less than the prediction error for the multivariate K-NN regression model,
indicating that we should likely choose linear regression for predictions of
house price on this data set. But we should also ask if this more complex
model is doing a better job of predicting compared to our simple linear
regression model with only a single predictor (house size). Revisiting last
section, we see that our RMSPE for our simple linear regression model with
only a single predictor was
`r format(lm_test_results %>% filter(.metric == 'rmse') %>% pull(.estimate), scientific = FALSE)`,
which is slightly more than that of our more complex model. Our model with two predictors
provided a slightly better fit on test data than our model with just one.
But should we always end up choosing the model with more predictors rather than fewer?
The answer is no; you never know which model will be the best until you compare
their performance on data that were not used to fit them. Exploratory
data analysis can give you some hints, but until you look
at the prediction errors to compare the models you don't really know.
Additionally, here we compare test errors purely for the purposes of teaching.
In practice, when you want to compare several regression models with
differing numbers of predictor variables, you should use
cross-validation on the training set only; in this case choosing the model is part
of tuning, so you cannot use the test data. There are several well-known and more advanced
methods to do this that are beyond the scope of this course, including
backward or forward selection, and L1 or L2 regularization (also known as lasso
and ridge regression, respectively).
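For example, here is a minimal sketch of evaluating the two-predictor model with cross-validation on the training set only (the choice of 5 folds is arbitrary); repeating this with the `sqft`-only recipe would let you compare the two candidate models without ever touching the test set.

```{r 08-cv-compare-sketch}
sacr_folds <- vfold_cv(sacramento_train, v = 5, strata = price)

two_predictor_cv <- workflow() %>%
  add_recipe(recipe(price ~ sqft + beds, data = sacramento_train)) %>%
  add_model(lm_spec) %>%
  fit_resamples(resamples = sacr_folds) %>%
  collect_metrics()

two_predictor_cv
```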
## The other side of regression
So far in this textbook we have used regression only in the context of
prediction. However, regression is also a powerful method to understand and/or
describe the relationship between a quantitative response variable and
one or more explanatory variables. Extending the case we have been working with
in this chapter (where we are interested in house price as the outcome/response
variable), we might also be interested in describing the
individual effects of house size and the number of bedrooms on house price,
quantifying how big each of these effects is, and assessing how accurately we
can estimate each of them. This side of regression is the topic of
many follow-on statistics courses and beyond the scope of this course.
## Additional readings/resources
- Pages 59-71 and 104-109 of [An Introduction to Statistical Learning with Applications in R](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- [The `caret` Package](https://topepo.github.io/caret/index.html)
- Chapters 6-11 of [ModernDive: Statistical Inference via Data Science](https://moderndive.com/) by Chester Ismay and Albert Y. Kim