# Regression I: K-nearest neighbours {#regression1}
## Overview
This chapter will provide an introduction to regression through K-nearest
neighbours (K-NN) in a predictive context, focusing primarily on the case where
there is a single predictor and single response variable of interest. The
chapter concludes with an example of K-nearest neighbours regression with
multiple predictors.
## Chapter learning objectives
By the end of the chapter, students will be able to:
* Recognize situations where a simple regression analysis would be appropriate for making predictions.
* Explain the K-nearest neighbour (K-NN) regression algorithm and describe how it differs from K-NN classification.
* Interpret the output of a K-NN regression.
* In a dataset with two or more variables, perform K-nearest neighbour regression in R using a `tidymodels` workflow.
* Execute cross-validation in R to choose the number of neighbours.
* Evaluate K-NN regression prediction accuracy in R using a test data set and an appropriate metric (*e.g.*, root mean square prediction error).
* In the context of K-NN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
* Describe advantages and disadvantages of the K-nearest neighbour regression approach.
## Regression
Regression, like classification, is a predictive problem setting where we want
to use past information to predict future observations. But in the case of
regression, the goal is to predict numerical values instead of class labels.
For example, we could try to use the number of hours a person spends on
exercise each week to predict whether they would qualify for the annual Boston
marathon (*classification*) or to predict their race time itself
(*regression*). As another example, we could try to use the size of a house to
predict whether it sold for more than \$500,000 (*classification*) or to
predict its sale price itself (*regression*). We will use K-nearest neighbours
to explore this question in the rest of this chapter, using a real estate data
set from Sacramento, California.
## Sacramento real estate example
Let's start by loading the libraries we need and doing some preliminary
exploratory analysis. The Sacramento real estate data set we will study in this chapter
was [originally reported in the Sacramento Bee](https://support.spatialkey.com/spatialkey-sample-csv-data/),
but we have provided it with this repository as a stable source for the data.
```{r 07-load, message = FALSE}
library(tidyverse)
library(tidymodels)
library(gridExtra)
sacramento <- read_csv('data/sacramento.csv')
head(sacramento)
```
The purpose of this exercise is to understand whether we can use house size
to predict house sale price in the Sacramento, CA area. The columns in this
data that we are interested in are `sqft` (house size, in livable square feet)
and `price` (house price, in US dollars (USD)). The first step is to visualize
the data as a scatter plot where we place the predictor/explanatory variable
(house size) on the x-axis, and we place the target/response variable that we
want to predict (price) on the y-axis:
```{r 07-edaRegr, message = FALSE, fig.height = 4, fig.width = 5}
eda <- ggplot(sacramento, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
xlab("House size (square footage)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format())
eda
```
Based on the visualization above, we can see that in Sacramento, CA, as the
size of a house increases, so does its sale price. Thus, we can reason that we
may be able to use the size of a not-yet-sold house (for which we don't know
the sale price) to predict its final sale price.
## K-nearest neighbours regression
Much like in the case of classification, we can use a K-nearest
neighbours-based approach in regression to make predictions. Let's take a small
sample of the data above and walk through how K-nearest neighbours (K-NN) works
in a regression context before we dive into creating our model and assessing
how well it predicts house price. This subsample is taken to allow us to
illustrate the mechanics of K-NN regression with a few data points; later in
this chapter we will use all the data.
To take a small random sample of size 30, we'll use the function `sample_n`.
This function takes two arguments:
1. `tbl` (a data frame-like object to sample from)
2. `size` (the number of observations/rows to be randomly selected/sampled)
```{r 07-sacramento}
set.seed(1234)
small_sacramento <- sample_n(sacramento, size = 30)
```
Next let's say we come across a 2,000 square-foot house in Sacramento we are
interested in purchasing, with an advertised list price of \$350,000. Should we
offer to pay the asking price for this house, or is it overpriced and we should
offer less? Absent any other information, we can get a sense for a good answer
to this question by using the data we have to predict the sale price given the
sale prices we have already observed. But in the plot below, we have no
observations of a house of size *exactly* 2000 square feet. How can we predict
the price?
```{r 07-small-eda-regr, fig.height = 4, fig.width = 5}
small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
geom_point() +
xlab("House size (square footage)") +
ylab("Price (USD)") +
scale_y_continuous(labels=dollar_format()) +
geom_vline(xintercept = 2000, linetype = "dotted")
small_plot
```
We will employ the same intuition from the classification chapter, and use the
neighbouring points to the new point of interest to suggest/predict what its
price should be. For the example above, we find and label the 5 nearest
neighbours to our observation of a house that is 2000 square feet:
```{r 07-find-k3}
nearest_neighbours <- small_sacramento %>%
mutate(diff = abs(2000 - sqft)) %>%
arrange(diff) %>%
head(5)
nearest_neighbours
```
```{r 07-knn3-example, echo = FALSE, fig.height = 4, fig.width = 5}
nearest_neighbours <- mutate(nearest_neighbours, twothou = rep(2000, 5))
nn_plot <- small_plot +
geom_segment(data = nearest_neighbours,
aes(x = twothou, xend = sqft, y = price, yend = price), color = "orange")
nn_plot
```
Now that we have the 5 nearest neighbours (in terms of house size) to our new
2,000 square-foot house of interest, we can use their values to predict a
selling price for the new home. Specifically, we can take the mean (or
average) of these 5 values as our predicted value.
```{r 07-predicted-knn}
prediction <- nearest_neighbours %>%
summarise(predicted = mean(price))
prediction
```
```{r 07-predictedViz-knn, echo = FALSE, fig.height = 4, fig.width = 5}
nn_plot +
geom_point(aes(x = 2000, y = prediction[[1]]), color = "red", size = 2.5)
```
Our predicted price is \$`r format(round(prediction[[1]]), scientific = FALSE)`
(shown as a red point above), which is much less than \$350,000; perhaps we
might want to offer less than the list price at which the house is advertised.
But this is only the very beginning of the story. We still have all the same
unanswered questions here with K-NN regression that we had with K-NN
classification: which $K$ do we choose, and is our model any good at making
predictions? In the next few sections, we will address these questions in the
context of K-NN regression.
## Training, evaluating, and tuning the model
As usual, we must start by putting some test data away in a lock box that we
will come back to only after we choose our final model. Let's take care of that
now. Note that for the remainder of the chapter we'll be working with the
entire Sacramento data set, as opposed to the smaller sample of 30 points above.
```{r 07-test-train-split}
set.seed(1234)
sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)
```
Next, we'll use cross-validation to choose $K$. In K-NN classification, we used
accuracy to see how well our predictions matched the true labels. Here in the
context of K-NN regression we will use root mean square prediction error
(RMSPE) instead. The mathematical formula for calculating RMSPE is:
$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Where:
- $n$ is the number of observations
- $y_i$ is the observed value for the $i^\text{th}$ observation
- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation
A key feature of the formula for RMSPE is the squared difference between the observed
target/response variable value, $y_i$, and the predicted target/response
variable value, $\hat{y}_i$, for each observation (from $i = 1$ to $n$).
If the predictions are very close to the true values, then
RMSPE will be small. If, on the other hand, the predictions are very
different from the true values, then RMSPE will be quite large. When we
use cross-validation, we will choose the $K$ that gives
us the smallest RMSPE.
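To make the formula concrete, we can compute RMSPE by hand for a few made-up observed and predicted prices (the values below are chosen only to illustrate the arithmetic; they are not from the Sacramento data):
```{r 07-rmspe-by-hand}
# hypothetical observed and predicted values, for illustration only
observed <- c(200000, 250000, 300000)
predicted <- c(210000, 240000, 320000)

# square the errors, average them, then take the square root
sqrt(mean((observed - predicted)^2))
```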
> **RMSPE versus RMSE**
When using many code packages (`tidymodels` included), the evaluation output
we will get to assess the prediction quality of
our K-NN regression models is labelled "RMSE", or "root mean squared
error". Why is this so, and why not just RMSPE?
In statistics, we try to be very precise with our
language to indicate whether we are calculating the prediction error on the
training data (*in-sample* prediction) versus on the testing data
(*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
say RMSE. By contrast, when predicting and evaluating prediction quality
on the testing or validation data, we say RMSPE.
The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
training or testing data. But many people just use RMSE for both,
and rely on context to denote which data the root mean squared error is being calculated on.
Now that we know how we can assess how well our model predicts a numerical
value, let's use R to perform cross-validation and to choose the optimal $K$.
First, we will create a model specification for K-nearest neighbours regression,
as well as a recipe for preprocessing our data. Note that we use `set_mode("regression")`
now in the model specification to denote a regression problem, as opposed to the classification
problems from the previous chapters. Note also that we include standardization
in our preprocessing to build good habits, but since we only have one
predictor it is technically not necessary; there is no risk of comparing two predictors
of different scales.
```{r 07-choose-k-knn}
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors())
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)
sacr_wkflw <- workflow() %>%
add_recipe(sacr_recipe) %>%
add_model(sacr_spec)
sacr_wkflw
```
The major difference you can see in the above workflow compared to previous
chapters is that we are running regression rather than classification. The fact
that we use `set_mode("regression")` essentially
tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
for tuning and evaluation. You can see this in the following code, which tunes
the model and returns the RMSPE for each number of neighbours.
```{r 07-choose-k-knn-results}
gridvals <- tibble(neighbors = seq(1,200))
sacr_results <- sacr_wkflw %>%
tune_grid(resamples = sacr_vfold, grid = gridvals) %>%
collect_metrics()
# show all the results
sacr_results
```
We take the *minimum* RMSPE to find the best setting for the number of neighbours:
```{r 07-choose-k-knn-results-2}
# show only the row of minimum RMSPE
sacr_min <- sacr_results %>%
filter(.metric == 'rmse') %>%
filter(mean == min(mean))
sacr_min
```
Here we can see that the smallest RMSPE occurs when $K =$ `r sacr_min %>% pull(neighbors)`.
## Underfitting and overfitting
As in the classification setting, choosing a number of neighbours
that is too small or too large causes the RMSPE to increase:
```{r 07-choose-k-knn-plot, echo = FALSE, fig.height = 4, fig.width = 5}
sacr_results <- sacr_results %>%
filter(.metric == 'rmse')
sacr_tunek_plot <- ggplot(sacr_results, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
labs(x = 'Neighbours', y = 'RMSPE')
sacr_tunek_plot
```
What is happening here? To visualize the effect of different settings of $K$ on the
regression model, we will plot the predicted values for house price from
our K-NN regression models for 6 different values of $K$. For each model, we
predict a price for every possible home size across the range of home sizes we
observed in the data set (here 500 to 4250 square feet) and we plot the
predicted prices as a blue line:
```{r 07-howK, echo = FALSE, fig.height = 12, fig.width = 10}
kmin <- sacr_min %>% pull(neighbors)
gridvals <- c(1, 3, kmin, 41, 250, 450)
plots <- list()
for (i in 1:6) {
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = gridvals[[i]]) %>%
set_engine("kknn") %>%
set_mode("regression")
sacr_wkflw <- workflow() %>%
add_recipe(sacr_recipe) %>%
add_model(sacr_spec)
sacr_preds <- sacr_wkflw %>%
fit(data = sacramento_train) %>%
predict(sacramento_train) %>%
bind_cols(sacramento_train)
plots[[i]] <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
xlab("House size (square footage)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
ggtitle(paste0("K = ", gridvals[[i]]))
}
grid.arrange(grobs = plots, ncol=2)
```
Based on the plots above, we see that when $K$ = 1, the blue line runs perfectly
through almost all of our training observations. This happens because our
predicted values for a given region depend on just a single observation. A
model like this has high variance and low bias (intuitively, it provides unreliable
predictions). It has high variance because
the flexible blue line follows the training observations very closely, and if
we were to change any one of the training observation data points we would
change the flexible blue line quite a lot. This means that the blue line
matches the data we happen to have in this training data set; however, if we
were to collect another training data set from the Sacramento real estate
market it likely wouldn't match those observations as well. Another term
that we use to collectively describe this phenomenon is *overfitting*.
What about the plot where $K$ is quite large, say $K$ = 450, for example? When
$K$ = 450 for this data set, the blue line is extremely smooth, and almost
flat. This happens because our predicted values for a given x value (here home
size) depend on many (450) neighbouring observations. A model
like this has low variance and high bias (intuitively, it provides very reliable,
but generally very inaccurate predictions). It has low variance because the
smooth, inflexible blue line does not follow the training observations very
closely, and if we were to change any one of the training observation data
points it really wouldn't affect the shape of the smooth blue line at all. This
means that although the blue line does not match the data we happen to
have in this particular training data set perfectly, if we were to collect
another training data set from the Sacramento real estate market it likely
would match those observations just as well as it matches those in this
training data set. Another term that
we use to collectively describe this kind of model is *underfitting*.
Ideally, what we want is neither of the two situations discussed above. Instead,
we would like a model with low variance (so that it will transfer/generalize
well to other data sets, and isn't too dependent on the
observations that happen to be in the training set) **and** low bias
(where the model does not completely ignore our training data). If we explore
the other values for $K$, in particular $K$ = `r sacr_min %>% pull(neighbors)`
(as suggested by cross-validation),
we can see it has a lower bias than our model with a very high $K$ (e.g., 450),
and thus the model/predicted values better match the actual observed values
than the high $K$ model. Additionally, it has lower variance than our model
with a very low $K$ (e.g., 1) and thus it should better transfer/generalize to
other data sets compared to the low $K$ model. All of this is similar to how
the choice of $K$ affects K-NN classification (discussed in the previous
chapter).
## Evaluating on the test set
To assess how well our model might do at predicting on unseen data, we will
assess its RMSPE on the test data. To do this, we will first
re-train our K-NN regression model on the entire training data set,
using $K =$ `r sacr_min %>% pull(neighbors)` neighbours. Then we will
use `predict` to make predictions on the test data, and use the `metrics`
function again to compute the summary of regression quality. Because
we specify that we are performing regression in `set_mode`, the `metrics`
function knows to output a quality summary related to regression, and not, say, classification.
```{r 07-predict}
set.seed(1234)
kmin <- sacr_min %>% pull(neighbors)
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) %>%
set_engine("kknn") %>%
set_mode("regression")
sacr_fit <- workflow() %>%
add_recipe(sacr_recipe) %>%
add_model(sacr_spec) %>%
fit(data = sacramento_train)
sacr_summary <- sacr_fit %>%
predict(sacramento_test) %>%
bind_cols(sacramento_test) %>%
metrics(truth = price, estimate = .pred)
sacr_summary
```
Our final model's test error as assessed by RMSPE
is `r format(round(sacr_summary %>% filter(.metric == 'rmse') %>% pull(.estimate)), scientific=FALSE)`.
But what does this RMSPE score mean? When we calculated test set prediction accuracy in K-NN
classification, the highest possible value was 1 and the lowest possible value was 0.
If we got a value close to 1, our model was "good"; otherwise,
the model was "not good." What about RMSPE? Unfortunately there is no default scale
for RMSPE. Instead, it is measured in
the units of the target/response variable, and so it is a bit hard to
interpret. For now, let's consider this approach to thinking about RMSPE from
our testing data set: as long as it's not significantly worse than the cross-validation
RMSPE of our best model, we can say that we're not doing too much worse on
the test data than we did on the training data. So the model appears to be
generalizing well to a new data set it has never seen before. In future courses
on statistical/machine learning, you will learn more about how to interpret RMSPE
from testing data and other ways to assess models.
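One quick sanity check along those lines is to place the test-set RMSPE next to the cross-validation RMSPE of the chosen model. The sketch below simply collects the two numbers from the `sacr_summary` and `sacr_min` objects we created above into one small table:
```{r 07-compare-rmspe}
# test-set RMSPE from the metrics computed above
test_rmspe <- sacr_summary %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)

# cross-validation RMSPE for the chosen number of neighbours
cv_rmspe <- sacr_min %>%
  pull(mean)

tibble(cv_rmspe = cv_rmspe, test_rmspe = test_rmspe)
```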
Finally, what does our model look like when we predict across all possible
house sizes we might encounter in the Sacramento area? We plotted it above
where we explored how $K$ affects K-NN regression, but we show it again now,
along with the code that generated it:
```{r 07-predict-all, fig.height = 4, fig.width = 5}
set.seed(1234)
sacr_preds <- sacr_fit %>%
predict(sacramento_train) %>%
bind_cols(sacramento_train)
plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
xlab("House size (square footage)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
ggtitle(paste0("K = ", kmin))
plot_final
```
## Strengths and limitations of K-NN regression
As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here:
**Strengths of K-NN regression**
1. Simple and easy to understand
2. No assumptions about what the data must look like
3. Works well with non-linear relationships (i.e., if the relationship is not a straight line)
**Limitations of K-NN regression**
1. As data gets bigger and bigger, K-NN gets slower and slower, quite quickly
2. Does not perform well with a large number of predictors unless the size of the training set is exponentially larger
3. Does not predict well beyond the range of values input in your training data
## Multivariate K-NN regression
As in K-NN classification, in K-NN regression we can have multiple predictors.
When we have multiple predictors in K-NN regression, we have the same concern
regarding the scale of the predictors. This is because once again,
predictions are made by identifying the $K$
observations that are nearest to the new point we want to predict, and any
variables that are on a large scale will have a much larger effect than
variables on a small scale. Since the `recipe` we built above scales and centers
all predictor variables, this is handled for us.
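If we want to double-check that the preprocessing behaves as expected, one option is to `prep()` a recipe and `bake()` it on the training data, then verify that each predictor ends up with mean roughly 0 and standard deviation roughly 1. The short sketch below does this for a recipe with both predictors (this recipe is defined only for the check; we build the one used in the analysis below):
```{r 07-check-standardization}
# prep a two-predictor recipe and apply it to the training data,
# then summarize the standardized columns
recipe(price ~ sqft + beds, data = sacramento_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  prep() %>%
  bake(new_data = sacramento_train) %>%
  summarise(across(c(sqft, beds), list(mean = mean, sd = sd)))
```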
We will now demonstrate a multivariate K-NN regression analysis of the
Sacramento real estate data using `tidymodels`. This time we will use
house size (measured in square feet) as well as number of bedrooms as our
predictors, and continue to use house sale price as our outcome/target variable
that we are trying to predict.
It is always a good practice to do exploratory data analysis, such as
visualizing the data, before we start modeling the data. Thus the first thing
we will do is use `ggpairs` (from the `GGally` package) to plot all the variables
we are interested in using in our analyses:
```{r 07-ggpairs, fig.height = 5, fig.width = 6}
library(GGally)
plot_pairs <- sacramento %>%
select(price, sqft, beds) %>%
ggpairs()
plot_pairs
```
From this we can see that generally, as both house size and number of bedrooms increase, so does price. Does adding the number of bedrooms to our model improve our ability to predict house price? To answer that question, we will have to come up with the test error for a K-NN regression model using house size and number of bedrooms, and then we can compare it to the test error for the model we previously came up with that only used house size to see if it is smaller (decreased test error indicates increased prediction quality). Let's do that now!
First we'll build a new model specification and recipe for the analysis. Note that
we use the formula `price ~ sqft + beds` to denote that we have two predictors,
and set `neighbors = tune()` to tell `tidymodels` to tune the number of neighbours for us.
```{r 07-mult-setup}
sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors())
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
```
Next, we'll use 5-fold cross-validation to choose the number of neighbours via the minimum RMSPE:
```{r 07-mult-cv}
gridvals <- tibble(neighbors = seq(1,200))
sacr_k <- workflow() %>%
add_recipe(sacr_recipe) %>%
add_model(sacr_spec) %>%
tune_grid(sacr_vfold, grid = gridvals) %>%
collect_metrics() %>%
filter(.metric == 'rmse') %>%
filter(mean == min(mean)) %>%
pull(neighbors)
sacr_k
```
Here we see that the smallest RMSPE occurs when $K =$ `r sacr_k`.
Now that we have chosen $K$, we need to re-train the model on the entire
training data set, and after that we can use that model to predict
on the test data to get our test error.
```{r 07-re-train}
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = sacr_k) %>%
set_engine("kknn") %>%
set_mode("regression")
knn_mult_fit <- workflow() %>%
add_recipe(sacr_recipe) %>%
add_model(sacr_spec) %>%
fit(data = sacramento_train)
knn_mult_preds <- knn_mult_fit %>%
predict(sacramento_test) %>%
bind_cols(sacramento_test)
knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred)
knn_mult_mets
```
This time, when we performed K-NN regression on the same data set but also
included the number of bedrooms as a predictor, we obtained an RMSPE test error
of `r format(round(knn_mult_mets %>% filter(.metric == 'rmse') %>% pull(.estimate)), scientific=FALSE)`.
This compares to an RMSPE test error
of `r format(round(sacr_summary %>% filter(.metric == 'rmse') %>% pull(.estimate)), scientific=FALSE)`
when we used only house size as the
single predictor. Thus in this case, we did not improve the model
by a large amount by adding this additional predictor.
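The sketch below gathers the two test-set RMSPEs into one small table so they are easy to compare side by side, reusing the `sacr_summary` and `knn_mult_mets` objects computed above:
```{r 07-compare-models}
# test-set RMSPE for the single-predictor and two-predictor models
bind_rows(
  sacr_summary %>% filter(.metric == "rmse") %>% mutate(model = "sqft only"),
  knn_mult_mets %>% filter(.metric == "rmse") %>% mutate(model = "sqft + beds")
) %>%
  select(model, .estimate)
```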
We can also visualize the model's predictions overlaid on top of the data. This time the predictions
will be a surface in 3-D space, instead of a line in 2-D space, as we have 2
predictors instead of 1.
```{r 07-knn-mult-viz, echo = FALSE, message = FALSE, warning = FALSE}
library(plotly)
xvals <- seq(from = min(sacramento_train$sqft), to = max(sacramento_train$sqft), length = 50)
yvals <- seq(from = min(sacramento_train$beds), to = max(sacramento_train$beds), length = 50)
zvals <- knn_mult_fit %>%
predict(crossing(xvals, yvals) %>% mutate(sqft = xvals, beds = yvals)) %>%
pull(.pred)
zvalsm <- matrix(zvals, nrow=length(xvals))
plot_ly() %>%
add_markers(data = sacramento_train,
x = ~sqft,
y = ~beds,
z = ~price,
marker = list(size = 5, opacity = 0.4, color = "red")) %>%
layout(scene = list(xaxis = list(title = 'House size (square feet)'),
zaxis = list(title = 'Price (USD)'),
yaxis = list(title = 'Number of bedrooms'))) %>%
add_surface(x = ~xvals,
y = ~yvals,
z = ~zvalsm,
colorbar=list(title='Price (USD)'))
```
We can see that the predictions in this case, where we have 2 predictors, form
a surface instead of a line. Because the newly added predictor, number of
bedrooms, is correlated with price (USD) (meaning as price changes, so does
number of bedrooms) and not totally determined by house size (our other predictor),
we get additional and useful information for making our
predictions. For example, in this model we would predict that the cost of a
house with a size of 2,500 square feet generally increases slightly as the number
of bedrooms increases. Without the additional predictor of number of
bedrooms, we would predict the same price for two 2,500 square-foot houses with different numbers of bedrooms.
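One way to see this numerically is to ask the fitted multivariable model for predictions at a fixed house size of 2,500 square feet while varying the number of bedrooms (the bedroom counts and the `new_houses` object below are chosen just for this illustration):
```{r 07-mult-bedrooms-check}
# predictions for a 2,500 square-foot house with different numbers of bedrooms
new_houses <- tibble(sqft = 2500, beds = c(2, 3, 4, 5))

knn_mult_fit %>%
  predict(new_houses) %>%
  bind_cols(new_houses)
```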