---
title: "5. Trees"
author: "Michael Mayer"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: yes
    df_print: paged
    theme: paper
    code_folding: show
    math_method: katex
subtitle: "Statistical Computing"
bibliography: biblio.bib
link-citations: yes
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
knit: (function(input, ...) {rmarkdown::render(input, output_dir = "docs")})
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  eval = TRUE
)
```
# Introduction
A decision tree is a simple, easy-to-interpret modeling technique for
both regression and classification problems. Compared to other methods,
decision trees usually do not perform very well. Their relevance lies in
the fact that they are the building blocks of two of the most successful
ML algorithms: random forests and gradient boosted trees. In this
chapter, we will introduce these tree-based methods. For more
background, the reader is referred to the main references
@hastie01statisticallearning and @james2021.
# Decision Trees
## How they work
On our journey to estimate the model $f$ by $\hat f$, we have so far
mainly considered linear functions $f$. We now turn to another class of
functions: decision trees. They were introduced in @breiman1984 and are
sometimes called "Classification and Regression Trees" (CART).
(Binary) decision trees are computed recursively by partitioning the
data into two parts. Partitions are chosen to optimize total loss by
asking the best "yes/no" question about the covariates, e.g.: Is the age
of the driver $\le 22$?
![A regression tree for claims frequency](figs/a_tree.png){width="50%"}
For regression problems, the most frequently used loss function $L$ is
the squared error. For classification, it is "information" (=
cross-entropy = log loss = half the unit logistic deviance) or the very
similar Gini impurity.
Predictions are computed by sending an observation down the tree,
starting with the question at the "trunk" and ending in a leaf, say in
leaf $j$. The prediction is the value $\gamma_j$ associated with that
leaf. In regression situations, $\gamma_j$ usually corresponds to the
average response of all training observations that fall into that leaf.
In classification situations, it may be the most frequent class in the
leaf or the vector of class probabilities.
More formally, let $R_1, \dots, R_J$ denote the terminal regions
corresponding to the $J$ leaves in the sense that $\boldsymbol x$ falls
in leaf $j \Leftrightarrow \boldsymbol x \in R_j$. Then, the prediction
function can be expressed as $$
\hat f(\boldsymbol x) = \sum_{j = 1}^{J} \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}.
$$
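
To make this formula concrete, here is a minimal hand-written sketch of such a prediction function for a stump with two leaves. The split at age 22 mirrors the question above, while the leaf values are made up for illustration:

```{r}
# Prediction function of a hypothetical stump with regions R_1 (age <= 22)
# and R_2 (age > 22) and made-up leaf values gamma_1, gamma_2
predict_stump <- function(age) {
  gamma <- c(0.12, 0.07)                 # leaf values (illustrative only)
  ifelse(age <= 22, gamma[1], gamma[2])  # indicator of the region age falls into
}
predict_stump(c(19, 45))
```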
The concept of a decision tree is best understood with an example.
## Example
We will use the `dataCar` data set to predict the claim probability with
a decision tree. As features, we use `veh_value`, `veh_body`, `veh_age`,
`gender`, `area` and `agecat`.
```{r}
library(rpart)
library(rpart.plot)
library(insuranceData)
data(dataCar)
fit <- rpart(
  clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
  data = dataCar,
  method = "class",
  parms = list(split = "information"),
  xval = 0,
  cp = -1,
  maxdepth = 3
)
prp(fit, type = 2, extra = 7, shadow.col = "gray",
    faclen = 0, box.palette = "auto", branch.type = 4,
    varlen = 0, cex = 0.9, digits = 3, split.cex = 0.8)
dataCar[1, c("agecat", "veh_value", "veh_body")]
predict(fit, dataCar[1, ])
```
**Comments**
- The first observation belongs to a person in age category 2 and has
a \$10'600 hatchback: the first question sends us to the right, the
second to the left and the third to the right. This gives us a claim
probability of 6.7%.
- For example, how was the first question (`agecat >= 5`) selected?
The algorithm searches all covariates for all possible split
positions and selects the one with the best total loss improvement.
In this case, the split on the covariate `agecat` at the value 5
reduced the total loss the most. As loss function, we used
"information", which is equivalent to (half) the unit logistic
deviance.
## Properties
Here, we list a couple of properties of decision trees.
- **Outliers:** In contrast to linear models, outliers in covariates
are not an issue because the algorithm only takes into account the
sort order of the values. Similarly, taking logarithms or other
monotone transformations of covariates has no effect. Both
statements do not hold for the response variable.
- **Missing values:** Some implementations can deal with missing
values in the input. Alternatively, missing values are often
replaced by a typical value or a value smaller/larger than the
smallest non-missing value (such as -1 for a positive variable).
- **Categorical covariates:** Unordered categorical covariates are
difficult to split, since $\ell$ levels theoretically give rise to
$2^\ell$ possible partitions. Try grouping small categories
together, or consider representing the levels by ordered categories
(even if it does not make too much sense). One-hot-encoding is also
an option. Some algorithms provide ways to handle unordered
categories internally.
- **Greedy:** Partitions are made in a greedy way to optimize the
objective in *one split*. Looking ahead more than one split would
lead to better models but is not feasible from a computational point
of view.
- **Interactions:** By their flexible structure, a decision tree can
automatically capture interaction effects of any order (and other
non-linear effects), at least if the data set is large and the tree
is deep enough. This is a clear advantage of tree-based methods
compared to linear models, where these elements must be carefully
and manually selected. In the next chapter, we will meet another
model class with this advantage: neural nets.
- **Extrapolation:** By construction, a decision tree cannot
extrapolate. Thus, even a diamond weighing 10 carats cannot get a
higher price prediction than the most expensive diamond in the
training data. In fact, the prediction will be even lower, because it
equals the *average* response of all training observations in the
corresponding leaf (a small demonstration follows below).
- **Instability:** A small change in the data can result in a
completely different tree. The reason for this is that a change in
an early split also affects later splits. Averaging many trees
reduces this variance, which leads us to random forests.
These properties (except the last one) translate to combinations of
trees like random forests or boosted trees.
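
The following small sketch illustrates the extrapolation issue on simulated data (not part of the diamonds example): predictions for covariate values far outside the training range stay at the mean of the outermost leaf.

```{r}
library(rpart)

set.seed(1)
sim <- data.frame(x = runif(1000, 0, 2))
sim$y <- 3 * sim$x + rnorm(1000, sd = 0.1)

tree_sim <- rpart(y ~ x, data = sim, xval = 0, cp = -1, maxdepth = 4)

# Predictions for x = 2, 5, 10 are identical: the mean of the rightmost leaf
predict(tree_sim, data.frame(x = c(2, 5, 10)))
max(sim$y)  # ... and therefore below the largest observed response
```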
Before we delve into these tree ensemble techniques, here is a [short
video](https://www.youtube.com/watch?v=8hupHmBVvb0) of Jerome Friedman
explaining how he tried to efficiently approximate nearest neighbor and
came up with a new method: decision trees.
# Random Forests
In 2001, Leo Breiman introduced a powerful tree-based algorithm called
*random forest* [@breiman2001]. A random forest consists of many
decision trees. The trees differ for two reasons. One of them is an
*ensembling* technique called "bagging". Before studying random forests,
we first have a look at bagging.
## Bagging
Combining multiple models (*base learners*) into a single one is a
common technique to slightly improve the overall performance of a model.
This is referred to as *model ensembling*. For example, one could create
an ensemble model by running multiple $k$-nearest neighbor regression
models, each with a different value for $k$ (e.g., 5, 6, 7), and then
averaging the resulting predictions. The combined model predictions tend
to have lower variance and thus often a better out-of-sample
performance, just like a diversified stock portfolio.
Another way of creating an ensemble model is based on the Bootstrap:
*Bagging* (from "**b**ootstrap **agg**regat**ing**") draws $B$ bootstrap
samples and fits a model on each of those samples. The resulting $B$
models are then combined into a single one.
**Algorithm: bagging** (for regression)
1. Select $B$ bootstrapped training data sets from the original training data.
2. Fit a model $\hat f^{*j}(\boldsymbol x)$ on each of them.
3. Return the bagged model $$
   \hat f(\boldsymbol x) = \frac{1}{B} \sum_{j = 1}^B \hat f^{*j}(\boldsymbol x)
   $$
For classification, the last step is replaced by taking averages for
each predicted probability or doing majority voting.
### Example
Let's try out the idea of bagging using diamonds data. We first fit a
relatively deep decision tree on the training data without bagging. This
is our reference model. Then, we fit a tree on each of 20 bootstrapped
versions of the training data. In order to evaluate the results, we
store the validation predictions of each of the 20 bootstrapped models.
We evaluate them by taking their successive means (averaging the
predictions of the first $b$ models for $b = 1, \dots, 20$).
```{r}
library(rpart)
library(tidyverse)
library(withr)
RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
# Split diamonds into 80% for "training" and 20% for validation
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
train <- diamonds[ix, ]
valid <- diamonds[-ix, ]
# Tree without bagging for comparison
form <- log_price ~ log_carat + clarity + color + cut
fit_unbagged <- rpart(form, data = train, xval = 0, cp = -1, maxdepth = 8)
rmse_unbagged <- RMSE(valid$log_price, predict(fit_unbagged, valid))
# Fit B models on B bootstrap samples and store their predictions on the validation data
# (one column per bootstrap sample)
B <- 20
valid_pred <- matrix(nrow = nrow(valid), ncol = B)
for (j in 1:B) {
  data_boot <- train[sample(nrow(train), replace = TRUE), ]
  fit_boot <- rpart(form, data = data_boot, xval = 0, cp = -1, maxdepth = 8)
  valid_pred[, j] <- predict(fit_boot, valid)
}
# Very high correlations across models (subset of three models shown)
round(cor(valid_pred[, 1:3]), 3)
# Rowwise successive mean predictions
pred_bagged <- t(apply(valid_pred, 1, function(z) cumsum(z) / 1:length(z)))
# Validation RMSE per column
rmse_bagged <- apply(pred_bagged, 2, function(z) RMSE(valid$log_price, z))
# Visualization
rmse_bagged %>%
  as_tibble_col("Validation RMSE") %>%
  mutate(`Bagging round` = row_number()) %>%
  ggplot(aes(`Bagging round`, `Validation RMSE`)) +
  geom_point(color = "chartreuse4") +
  geom_hline(yintercept = rmse_unbagged)
```
**Comments**
- The more bagging rounds, the better the validation performance.
- Already after two bagging rounds, performance is better compared
with the unbagged model (straight black line).
- The effect tails off.
- The predictions of the 20 models are extremely strongly correlated.
Still, they are not perfectly correlated, so there *is* some
diversification going on.
**Remarks on bagging**
- *Base learners*: Bagging works best with unstable models like deep
decision trees as base learners. In principle, it can also be
applied to any other modeling technique, e.g., linear regression.
- *OOB validation*: In each tree, about 1/3 of the observations are
not included in the bootstrap sample. Put differently: each
observation is used in about 2/3 of all models. If its prediction is
calculated from the other 1/3 of the models, we get an
"out-of-sample" prediction, also called "out-of-bag" (OOB)
prediction. If rows are independent, the performance measure derived
from these OOB predictions is usually a good estimate of true
performance, replacing (cross-)validation. Do not trust OOB results
when rows are dependent, such as with grouped rows. (A small sketch
after these remarks shows how OOB predictions can be computed by hand.)
- *Parallel computing:* Since the models can be fitted fully
independently, bagging is ideal for parallel computing.
- *Performance versus complexity*: A bagged model consisting of $B$
base models takes $B$ times as long to train, is $B$ times as large,
and is usually more difficult to interpret than a single model.
Still, the performance gain is sometimes worth the effort. The same
tradeoff applies to ensembling techniques in general.
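
As announced in the OOB remark above, here is a sketch of how OOB predictions could be computed by hand. It reuses `train`, `form`, and `RMSE()` from the bagging example; the bootstrap indices are drawn anew, so numbers will differ slightly from the earlier loop.

```{r}
B <- 20
n <- nrow(train)
oob_pred <- matrix(NA, nrow = n, ncol = B)

for (j in 1:B) {
  in_bag <- sample(n, replace = TRUE)
  fit_boot <- rpart(form, data = train[in_bag, ], xval = 0, cp = -1, maxdepth = 8)
  oob <- setdiff(1:n, in_bag)  # rows not drawn into this bootstrap sample
  oob_pred[oob, j] <- predict(fit_boot, train[oob, ])
}

# Average each row's predictions over the models where it was out-of-bag
oob_mean <- rowMeans(oob_pred, na.rm = TRUE)
has_oob <- rowSums(!is.na(oob_pred)) > 0  # a few rows may never be out-of-bag
RMSE(train$log_price[has_oob], oob_mean[has_oob])  # OOB estimate of the RMSE
```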
### Bias-Variance Trade-Off
We have mentioned that ensemble models try to reduce variance to improve their test performance. But what does *variance* mean in this context? And why does a smaller variance lead to better test performance? For the MSE as performance measure, we have the following decomposition of the generalization error (@james2021):
$$
\underbrace{\mathbb E(y_o - \hat f(x_o))^2}_{\text{Expected test MSE of } x_o} = \text{Var}(\hat f(x_o)) + \left[\text{Bias}(\hat f(x_o))\right]^2 + \underbrace{\text{Var}(\varepsilon)}_{\text{Irreducible error}},
$$
where
- $(y_o, x_o)$ is a specific observation,
- expectations and variances are calculated over a large number of training sets, and
- "Bias" is the error introduced by approximating true model by $f$.
Models with low bias (e.g. deep decision trees or $k$-nearest-neighbor with small $k$) usually have high variance. On the other hand, models with low variance (like linear regression) tend to have high bias. This is the famous *Bias-Variance Trade-Off*. A bagged decision tree tries to have low bias (deep trees) and low variance (from bagging).
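
The trade-off can be made tangible with a small simulation (illustrative only, with a made-up true function): at a fixed test point, a deep tree shows a low squared bias but a high variance across training sets, while a linear model shows the opposite.

```{r}
library(rpart)

set.seed(1)
true_f <- function(x) sin(3 * x)
x0 <- data.frame(x = 0.5)  # fixed test point
n_sim <- 200

pred_tree <- pred_lm <- numeric(n_sim)
for (s in 1:n_sim) {
  dat <- data.frame(x = runif(100))
  dat$y <- true_f(dat$x) + rnorm(100, sd = 0.3)
  fit_tree <- rpart(y ~ x, data = dat, xval = 0, cp = -1, minsplit = 2, maxdepth = 6)
  pred_tree[s] <- predict(fit_tree, x0)
  pred_lm[s] <- predict(lm(y ~ x, data = dat), x0)
}

# Variance and squared bias at x0: deep tree vs linear model
c(var_tree = var(pred_tree), bias2_tree = (mean(pred_tree) - true_f(0.5))^2)
c(var_lm = var(pred_lm), bias2_lm = (mean(pred_lm) - true_f(0.5))^2)
```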
## From bagging to random forests
To turn a bagged decision tree into a random forest, a second source of
randomness is injected: *Each split* scans only a random selection
"mtry" of the $p$ covariates to find the best split, usually $\sqrt{p}$
or $p/3$. "mtry" is the main tuning parameter of a random forest. This
extra randomness further decorrelates the bagged trees, which improves
the diversification effect. Properties of bagging also apply to random
forests.
**Algorithm: random forest** (for regression)
1. Select $B$ bootstrapped training data sets from the original training data.
2. Fit a (usually deep) decision tree $\hat f^{*j}(\boldsymbol x)$ on each of them. During each split, consider only a random subset of features.
3. Return the model $$
   \hat f(\boldsymbol x) = \frac{1}{B} \sum_{j = 1}^B \hat f^{*j}(\boldsymbol x).
   $$
For classification, the last step is again replaced by taking averages
for each predicted probability or doing majority voting.
**Comments about random forests**
- **Number of trees:** Usually, $100-1000$ trees are grown. The more, the better. More trees also mean longer training time and larger models.
- **Deep trees and diversification:** Single trees in a random forest are usually deep, thus they overfit on the training data. It is the diversity across trees that produces a good and stable model, just like a well-diversified stock portfolio.
- **Never trust performance on the training set for random forests.** The reason is the overfitting caused by the deep decision trees. Besides using the standard approaches to assess true performance (simple or cross-validation), we can study OOB results.
- **Parameter tuning:** Random forests offer many tuning parameters. Since the results typically do not depend too much on their choice, untuned random forests are often great benchmark models (remember: the main tuning parameter is "mtry"). See also this [short video](https://www.youtube.com/watch?v=t8ooi_tJHSE) by Adele Cutler on working with Leo Breiman on random forests.
### Example
Let us now fit a random forest for (logarithmic) diamond prices with
default parameters and 500 trees. 80% of the data is used for training,
the remaining 20% for evaluating the performance. (Throughout the
rest of the lecture, we will ignore the problematic aspect of having
repeated rows for some diamonds.)
```{r}
library(tidyverse)
library(withr)
library(ranger)
library(MetricsWeighted)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
x <- c("log_carat", "color", "cut", "clarity")
# Train/test split
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
fit <- ranger(
  reformulate(x, y),  # reformulate creates a formula from a character vector
  num.trees = 500,
  data = diamonds[ix, ],
  importance = "impurity",
  seed = 83
)
fit
# Performance on test data
pred <- predict(fit, diamonds[-ix, ])$predictions
y_test <- diamonds[-ix, y]
rmse(y_test, pred)
r_squared(y_test, pred)
# Size of the fitted model
format(object.size(fit), units = "auto")
```
**Comments**
- Performance is much better than with bagging alone. Besides the
random feature selection, the main reason for this is that very deep
decision trees are grown.
- The OOB estimate of performance is extremely close to the test set
performance.
- How can we interpret a complex model like this?
- The model fills about 130 MB of RAM. This is a lot, given the fact
that the diamonds data is not too large.
## Interpreting a black box
In contrast to a single decision tree or a linear model, a combination
of many trees is not easy to interpret. It is good practice for any ML
model to not only study
1. performance, but also\
2. variable importance, and
3. feature effects.
A pure prediction machine is hardly of any interest and might even
contain mistakes like using covariates derived from the response. Model
interpretation helps to fight such problems and thus also to increase
trust in a model and its creator. The field of ML/statistics concerned
with interpreting complex models is called XAI (from e*X*plainable
*A*rtificial *I*ntelligence). We present some of the methods here, and
some others later in the chapter on neural networks.
> Pure prediction machines can have drawbacks. One potential issue is
> that they may use covariates that are derived from the response
> variable itself, which can introduce biases and inaccuracies into the
> model. For example, imagine that we are building a model to predict
> whether a customer will buy a particular product. If we include the
> customer's past purchase history as a covariate, we might
> inadvertently create a self-fulfilling prophecy: customers who have
> bought the product before may be more likely to buy it again,
> regardless of any other factors that the model is considering.
> (thx, chatgpt :) )

Most of the text of this section is adapted from our [responsible ML
lecture](https://github.com/lorentzenchr/responsible_ml_material).
### Variable importance
There are different approaches to measure the importance of a covariate.
Since no general mathematical definition of "importance" exists, the
results of different approaches might be inconsistent with each other.
For tree-based methods, a usual approach is to measure how many times a
covariate $X^{(j)}$ was used in a split or how much total loss
improvement came from splitting on $X^{(j)}$. For a GLM, we could study
partial Chi-squared tests (model with feature $X^{(j)}$ versus model
without).
Approaches that work for *any* supervised model include *permutation
importance* (see later) and *SHAP importance* (not covered).
#### Example: Random forest (continued)
Let us study relative split gain importance regarding MSE improvement
for the features in our random forest.
```{r}
imp <- sort(importance(fit))
imp <- imp / sum(imp)
barplot(imp, horiz = TRUE, col = "chartreuse4")
```
**Comment**: As expected, logarithmic carat is by far the most important
feature.
### Effects
One of the advantages of an *additive* linear regression over a complex
ML model is its simple structure. The coefficient(s) of a feature
$X^{(j)}$ immediately tell us how predictions will react to changes in
$X^{(j)}$, keeping all other covariates fixed ("Ceteris Paribus"). Due
to additivity, the effect of $X^{(j)}$ is the same for all observations.
But what if the linear regression contains complex interaction effects?
Or if the model is an ML model such as a random forest or $k$-nearest
neighbor? Options include studying individual conditional expectations
and/or partial dependence plots.
#### Individual conditional expectations
The *individual conditional expectation (ICE)* function [@goldstein2015]
for covariate $X^{(j)}$ of model $f$ and an observation with covariate
vector $\boldsymbol x \in {\mathbb R}^p$ is given by $$
\text{ICE}_j: v \in {\mathbb R} \mapsto f(v, \boldsymbol x_{\setminus j}),
$$ where $\boldsymbol x_{\setminus j}$ denotes all but the $j$-th
component of $\boldsymbol x$, which is being replaced by the value $v$.
The corresponding ICE *curve* represents the graph
$(v, \text{ICE}_j(v))$ for a grid of values $v \in \mathbb R$. An ICE
*plot* finally shows the ICE curves of many observations.
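
Before using a package, the ICE definition can be translated into a few lines of code. The following sketch assumes `fit`, `diamonds`, and `ix` from the random forest example above and computes the ICE curve of `log_carat` for a single observation.

```{r}
# ICE curve of log_carat for the first training observation, by hand
obs <- diamonds[ix[1], c("log_carat", "color", "cut", "clarity")]
grid <- seq(-1.5, 1.5, by = 0.1)  # grid of values v

ice_data <- obs[rep(1, length(grid)), ]  # replicate the observation
ice_data$log_carat <- grid               # replace the j-th component by v
ice_curve <- predict(fit, ice_data)$predictions

plot(grid, ice_curve, type = "l", xlab = "log_carat", ylab = "Prediction")
```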
**Remarks**
- The more different the curves in an ICE plot for covariate $X^{(j)}$
(ignoring vertical shifts), the stronger the interaction effects
with other covariates.
- If there are no interaction effects between $X^{(j)}$ and the other
covariates (additive feature), the curves are parallel. If such a
covariate enters the model as a linear function with slope $\beta$,
the ICE curves will have slope $\beta$ as well.
- ICE curves can be vertically centered to focus on interaction
strength. These curves are called "c-ICE" curves. For a model
additive in $X^{(j)}$, the c-ICE plot of $X^{(j)}$ shows a single
curve.
- If the model $f$ involves a link function, ICE curves are often
plotted on the link scale (why?).
**Pros and Cons**
- Pro: Very simple to compute.
- Pro: Easy to interpret thanks to Ceteris Paribus logic.
- Pro: Gives an impression about associated interaction strength.
- Con: Suboptimal when Ceteris Paribus clause is unnatural.
- Con: Model is applied to rare/impossible feature value combinations.
- Con: One does not see which variable is interacting.
##### Example: Random forest (continued)
How do ICE and c-ICE plots for two selected features look?
```{r}
library(flashlight)
# Explainer object
fl <- flashlight(
  model = fit,
  data = diamonds[ix, ],
  label = "Random forest",
  predict_function = function(m, X) predict(m, X)$predictions
)
# ICE and c-ICE plots of log_carat and clarity
v <- "log_carat"
light_ice(fl, v = v, n_max = 200) %>%
  plot(alpha = 0.2, color = "chartreuse4") +
  labs(title = paste("ICE plot for", v), y = "Prediction")
light_ice(fl, v = v, n_max = 200, center = "middle") %>%
  plot(alpha = 0.1, color = "chartreuse4") +
  labs(title = paste("c-ICE plot for", v), y = "Centered prediction")
v <- "clarity"
light_ice(fl, v = v, n_max = 200) %>%
  plot(alpha = 0.2, color = "chartreuse4") +
  labs(title = paste("ICE plot for", v), y = "Prediction")
light_ice(fl, v = v, n_max = 200, center = "middle") %>%
  plot(alpha = 0.1, color = "chartreuse4") +
  labs(title = paste("c-ICE plot for", v), y = "Centered prediction")
```
**Comments**
- ICE plot for `log_carat`: The curves are increasing, which makes
sense: everything else fixed, the expected (log) price increases
with (log) carat. For very large values of log carat, predictions
tail off. This is not realistic but a consequence of the fact that
tree-based models cannot extrapolate.
- c-ICE plot for `log_carat`: Vertical scatter is not strong,
indicating weak interaction effects. If our model used `price`
as the response, the interaction effects would be much stronger (why?).
- `clarity`: Better clarity is associated with higher predictions.
This trend is easier to see in the centered ICE plot. There, we can
also see strong interaction effects (some curves are much steeper
than others, etc.).
#### Partial dependence
The pointwise average of many ICE curves is called a *partial dependence
plot* (PDP), originally introduced by @friedman2001 in his seminal work
on gradient boosting. It describes the effect of the variable $X^{(j)}$,
averaged over all interaction effects and holding all other variables
fixed. The empirical partial dependence function for model $\hat f$ and
feature $X^{(j)}$ is given by $$
\text{PD}_j(v) = \frac{1}{n} \sum_{i = 1}^n \hat f(v, \boldsymbol x_{i,\setminus j}),
$$ where $\boldsymbol x_{i,\setminus j}$ denotes the feature vector
without $j$-th component of observation $i$ in some reference dataset,
often the training or the test data.
The corresponding PDP represents the graph $(v, \text{PD}_j(v))$ for a
grid of values $v \in \mathbb R$.
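
Again, the definition is easy to translate into code. The sketch below assumes `fit`, `diamonds`, and `ix` from the random forest example and uses a random subsample of the training data as reference dataset to keep it fast.

```{r}
# Partial dependence of log_carat, by hand
X_ref <- diamonds[sample(ix, 1000), c("log_carat", "color", "cut", "clarity")]
grid <- seq(-1.5, 1.5, by = 0.25)

pd <- sapply(grid, function(v) {
  X_mod <- X_ref
  X_mod$log_carat <- v                   # replace the j-th component by v ...
  mean(predict(fit, X_mod)$predictions)  # ... and average the predictions
})

plot(grid, pd, type = "b", xlab = "log_carat", ylab = "Partial dependence")
```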
**Pros and Cons**
- Pro: Simple to compute.
- Pro: Easy to interpret thanks to Ceteris Paribus logic.
- Con: Suboptimal when Ceteris Paribus clause is unnatural.
- Con: Model is applied to rare/impossible feature value combinations.
- Con: No information about interactions.
An alternative to the partial dependence plot, the *accumulated local
effects* (ALE) plot [@apley2016], attempts to overcome the first two
(related) downsides. However, ALE plots are more difficult to calculate
and to interpret. We will not consider them here.
##### Example: Random forest (continued)
Let us now study the partial dependence plots of all features.
```{r}
for (v in x) {
  p <- light_profile(fl, v = v, n_bins = 40) %>%
    plot(color = "chartreuse4") +
    labs(title = paste("PDP for", v), y = "Prediction")
  print(p)
}
```
**Comment:** All effects as assessed by partial dependence plots make
sense.
# Gradient Boosted Trees
Another way of combining decision trees into an ensemble model is
gradient boosted trees, which are among the most successful ML models.
We will first explain what boosting is. Then, we introduce the gradient
boosted trees algorithm.
## Boosting
The idea of *boosting* was investigated e.g. in @schapire1990 and could
be described as follows:
1. Fit simple model $\hat f$ to data.
2. For $k = 1, \dots, K$ do:
a. Find simple model $\hat f^k$ that corrects the mistakes of
$\hat f$.
b. Update: $\hat f \leftarrow \hat f + \hat f^k.$
As base learners (or weak learners) $\hat f^k$, short decision trees
are often used. But how do we find them specifically?
- **AdaBoost:** In 1995, Freund and Schapire proposed the famous
AdaBoost algorithm for binary classification. It combines decision
trees with a particular reweighting scheme.
- **Gradient boosting:** While AdaBoost is still in use, the most
common way to find the updates $\hat f^k$ goes back to Jerome
Friedman and his article on Gradient Boosting Machines
[@friedman2001]. The key idea of gradient boosting is to choose
$\hat f^k$ to make the total loss
$Q(\hat f + \hat f^k) = \sum_{i = 1}^n L(y_i, \hat f(\boldsymbol x_i) + \hat f^k (\boldsymbol x_i))$
smaller.
## Gradient boosting
Gradient boosting is gradient descent in function space. Gradient
descent is a numerical method to find the (local) minimum of a smooth
function $h: \mathbb R^n \to \mathbb R$. It works as follows:
1. Start at some value $\hat x \in \mathbb R^n$.
2. Repeat: $\hat x \leftarrow \hat x - \lambda g$.
   $\lambda > 0$ is the step size or learning rate, and $$
g = \left[\frac{\partial h(x)}{\partial x}\right]_{x = \hat x}
$$ is the gradient of $h$ at $\hat x$.
It points in the direction of the steepest ascent, so the update step
goes in the opposite direction to make $h$ smaller. Imagine you are
standing on a hill in foggy weather: which direction should you go to
get down? You look for the direction in which the hill is steepest, and
then take a step in the opposite direction, and so on.
The following picture on
[Wikipedia](https://en.wikipedia.org/wiki/Gradient_descent) illustrates
the situation:
![<https://en.wikipedia.org/wiki/Gradient_descent>](figs/gradient_descent.png){width="50%"}
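
As a tiny illustration of plain gradient descent (with a made-up function, unrelated to the tree models), consider minimizing $h(x) = (x_1 - 3)^2 + (x_2 + 1)^2$:

```{r}
grad_h <- function(x) c(2 * (x[1] - 3), 2 * (x[2] + 1))  # gradient of h

x_hat <- c(0, 0)  # start value
lambda <- 0.1     # step size / learning rate
for (i in 1:100) {
  x_hat <- x_hat - lambda * grad_h(x_hat)  # gradient descent update
}
x_hat  # close to the minimum (3, -1)
```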
Let's apply gradient descent to the total loss $$
Q(f) = \sum_{i = 1}^n L(y_i, f(\boldsymbol x_i))
$$ viewed as a function of the $n$ parameters $$
f = \left(f(\boldsymbol x_1), \dots, f(\boldsymbol x_n)\right) \in \mathbb R^n,
$$ i.e. of the $n$ predictions on the training data.
We proceed as in standard gradient descent:
1. Start at some value $\hat f$.
2. Repeat: $\hat f \leftarrow \hat f - \lambda g$ with $$
g = \left[\frac{\partial Q(f)}{\partial f}\right]_{f = \hat f}
$$ having components $$
g_i = \left[\frac{\partial L(y_i, f(\boldsymbol x_i))}{\partial f(\boldsymbol x_i)}\right]_{f(\boldsymbol x_i) = \hat f(\boldsymbol x_i)}.
$$
For the squared error loss $L(y, z) = (y-z)^2/2$, these components equal
$$
g_i = -(\underbrace{y_i - \hat f(\boldsymbol x_i)}_{\text{Residual } r_i}),
$$ i.e., the gradient descent updates equal $$
\hat f(\boldsymbol x_i) \leftarrow \hat f(\boldsymbol x_i) + \underbrace{\lambda r_i}_{\hat f^k \text{?}}.
$$ Does this mean that the update steps in gradient boosting are
$\hat f^k = -\lambda g_i$? This would not make sense for two major
reasons:
1. $y_i$ and thus $g_i$ are unknown at prediction stage.
2. The model should be applicable not only to
$\boldsymbol x \in \{\boldsymbol x_1, \dots, \boldsymbol x_n\}$.
Rather we would like to apply it to any $\boldsymbol x$.
The solution is to replace $-g_i$ by predictions of a decision tree (or
of another base learner) with covariates $\boldsymbol X$. This leads us
to the following algorithms (slightly modified from
@hastie01statisticallearning):
**Algorithm: Gradient boosted trees for the squared error loss**
1. Initialize $\hat f(\boldsymbol x) = \frac{1}{n}\sum_{i = 1}^{n} y_i$
2. For $k = 1, \dots, K$ do:
a. For $i = 1, \dots, n$, calculate residuals
$r_i = y_i - \hat f(\boldsymbol x_i)$
b. Model the $r_i$ as a function of the $\boldsymbol x_i$ by
fitting a regression tree $\hat f^k$ with squared error loss
c. Update:
$\hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \hat f^k(\boldsymbol x)$
3. Output $\hat f(\boldsymbol x)$
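
To see this algorithm in action, here is a toy implementation with `rpart` stumps on simulated data (a sketch for illustration only, not how modern boosting libraries are implemented):

```{r}
library(rpart)

set.seed(1)
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$y <- sin(3 * sim$x1) + sim$x2 + rnorm(500, sd = 0.2)

K <- 100       # number of boosting rounds
lambda <- 0.1  # learning rate
pred <- rep(mean(sim$y), nrow(sim))  # 1. initialize with the mean
trees <- vector("list", K)

for (k in 1:K) {
  sim$r <- sim$y - pred                                                         # 2a. residuals
  trees[[k]] <- rpart(r ~ x1 + x2, data = sim, xval = 0, cp = 0, maxdepth = 2)  # 2b. tree on residuals
  pred <- pred + lambda * predict(trees[[k]], sim)                              # 2c. update
}
# (For new data: start from mean(sim$y) and add lambda times the predictions of all K trees.)

sqrt(mean((sim$y - pred)^2))  # training RMSE after K rounds
```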
For general loss functions, the algorithm is a bit more complicated:
- The residuals are replaced by negative gradients of the loss
function $L$.
- Leaf values might be suboptimal $\rightarrow$ replace by optimal
values.
**Algorithm: Gradient boosted trees for general loss functions**
1. Initialize
$\hat f(\boldsymbol x) = \text{argmin}_\gamma \sum_{i = 1}^{n} L(y_i, \gamma)$
2. For $k = 1, \dots, K$ do:
a. For $i = 1, \dots, n$, calculate negative gradients
(pseudo-residuals) $$
r_i = -\left[\frac{\partial L(y_i, f(\boldsymbol x_i))}{\partial f(\boldsymbol x_i)}\right]_{f(\boldsymbol x_i) = \hat f(\boldsymbol x_i)}
$$
b. Model $r_i$ as function of $\boldsymbol x_i$ by regression tree
$\hat f^k$ with terminal regions $R_1, \dots, R_J$, using
squared error loss
c. For each $j = 1, \dots, J$, use line-search to find the optimal
leaf value $$
\gamma_j = \text{argmin}_{\gamma} \sum_{\boldsymbol x_i \in R_j} L(y_i, \hat f(\boldsymbol x_i) + \gamma)
$$
d. Update:
$\hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \underbrace{\sum_{j = 1}^{J} \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}}_{\text{modified tree}}$
3. Output $\hat f(\boldsymbol x)$
**Remarks**
- Predictions are made by the final model $\hat f(\boldsymbol x)$,
which equals a sum of decision trees whose leaf values have been
replaced.
- While a random forest equals the average of many deep trees, a
boosted trees model equals the sum of many shallow trees.
- We will see later how to select the learning rate $\lambda$, the
number of trees $K$, and further parameters.
- For classification models with more than two categories, the
algorithm needs to be slightly extended, but the idea is the same.
- AdaBoost (Freund and Schapire, 1995) is gradient tree boosting with
a special (exponential) loss.
## Modern implementations
Modern implementations of gradient boosted trees include
[XGBoost](https://xgboost.readthedocs.io/en/latest/),
[LightGBM](https://lightgbm.readthedocs.io/en/latest/), and
[CatBoost](https://catboost.ai/). These are the predominant algorithms
in ML competitions on tabular data; check [this
comparison](https://github.com/mayer79/gradient_boosting_comparison) for
differences, with a screenshot as of October 10, 2022:
![](figs/comparison_boosting.PNG)
They differ from Friedman's original approach in multiple ways, e.g.:
- Second order gradients: XGBoost introduced the use of second order
gradients, mainly removing the line-search step for all loss
functions, see @chen2016 or [this slide
deck](https://web.njit.edu/~usman/courses/cs675_spring20/BoostedTree.pdf)
from Tianqi Chen.
- Histogram binning: LightGBM [@ke2017], a very fast competitor of
XGBoost, introduced histogram binning: Continuous covariates are
binned into a histogram (e.g. in $B = 255$ bins) before training,
and their values are replaced by short unsigned integers
representing the corresponding bin number. During tree growth, to
determine the best split position of a given covariate, at most $B$
split positions must be evaluated instead of evaluating the loss
gain for all unique feature values. This significantly speeds up
tree construction. To activate histogram binning also for XGBoost,
set `tree_method = "hist"`. (A small sketch of the binning idea follows after this list.)
- Possibility to penalize the objective function.
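
The binning idea can be illustrated in a few lines (a rough sketch only; the exact binning rules differ between implementations). Here, the continuous variable `price` of the diamonds data is replaced by the index of at most 255 quantile bins:

```{r}
x <- diamonds$price
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 256)))
x_binned <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Far fewer candidate split points need to be evaluated after binning
c(distinct_values = length(unique(x)), distinct_bins = length(unique(x_binned)))
```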
### Example: XGBoost
As an initial example on gradient boosting and XGBoost, we fit a model
for diamond prices with squared error as loss function and RMSE as
corresponding evaluation metric. The number of rounds/trees is chosen by
early stopping on the validation data, i.e., boosting runs until the
validation RMSE stops improving for a couple of rounds. The learning
rate (the weight of each tree) is chosen by trial and error in order to
end up with a reasonable number of trees; see the next section for more
details.
```{r}
library(ggplot2)
library(withr)
library(xgboost)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
x <- c("log_carat", "color", "cut", "clarity")
# Split into train and valid
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[ix, y]
X_train <- diamonds[ix, x]
y_valid <- diamonds[-ix, y]
X_valid <- diamonds[-ix, x]
# XGBoost data interface
# Note: data.matrix() encodes factor/character columns as integers.
# Use the same encoding (i.e., the same factor levels) at training and
# prediction time, otherwise the pipeline is not stable
# (e.g., "female" -> 1, "male" -> 2 at training time must not become
# "male" -> 1 at prediction time).
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)
dvalid <- xgb.DMatrix(data.matrix(X_valid), label = y_valid)
# Minimal set of parameters
params <- list(
  learning_rate = 0.1,             # usually a good start
  objective = "reg:squarederror",  # loss function
  eval_metric = "rmse"             # evaluation metric (optional)
)
# Fit model until validation RMSE stops improving over the last 20 rounds
fit <- xgb.train(
  params = params,
  data = dtrain,
  watchlist = list(train = dtrain, valid = dvalid),  # monitor training and validation performance
  print_every_n = 50,          # report the current performance every 50 rounds
  nrounds = 5000,              # upper limit; early stopping kicks in much earlier
  early_stopping_rounds = 20   # stop if validation RMSE does not improve for 20 rounds
)                              # much better than fixing, say, 100 rounds in advance
```
**Comments:**
- More trees would mean better training performance but poorer
validation performance.
- Compared to simple validation, cross-validation would provide an
even better way to find the optimal number of boosting rounds/trees.
- The validation RMSE is around 9%, better than the random forest
(about 10%).
- If we continued boosting, the training RMSE would keep decreasing,
but the validation RMSE would start to increase (overfitting).
## Parameter tuning
Gradient boosted trees offer many parameters to select from. Unlike with
random forests, they need to be tuned to achieve good results. It would
be naive to use an algorithm like XGBoost without parameter tuning. Here
is a selection:
- **Number of boosting rounds** $K$: In contrast to random forests,
more trees/rounds is not always beneficial because the model begins
to overfit after some time. The optimal number of rounds is usually
found by early stopping, i.e., one lets the algorithm stop as soon
as the (cross-)validation performance stops improving, see the
example above.
- **Learning rate** $\lambda$: The learning rate determines the
training speed and the impact of each tree on the final model.
Typical values are between 0.001 and 1. In practical applications,
it is set to a value that leads to a reasonable number of trees
($100 - 1000$). Usually, halving the learning rate means twice as
many boosting rounds for comparable performance.
- **Regularization parameters:** Depending on the implementation,
additional parameters are
- the tree depth (often $3 - 7$) or the number of leaves (often
$7 - 63$),
- the strength of the L1 and/or L2 penalties added to the
objective function (often between 0 and 5),
- the per-tree row subsampling rate (often between 0.8 and 1),
- the per-tree or per-split column subsampling rate (often between
0.6 and 1),
- ...
Reasonable regularization parameters are chosen by trial and error
or systematically by randomized/grid search CV. Usually, it takes a
couple of manual iterations until the range of the parameter values
has been set appropriately.
Overall, the **modeling strategy** could be summarized as follows:
1. Using default regularization parameters, set the learning rate to
get a reasonable number of trees with CV-based early stopping.
2. Use randomized search CV with early stopping to tune regularization
parameters such as tree depth. Iterate if needed.
3. Use the best parameter combination (incl. number of trees) to fit
the model on the training data. "Best" typically means in terms of
CV performance. As mentioned in the last chapter and depending on
the situation, it can also mean "good CV performance and not too
heavy overfit compared to insample performance" or some other
reasonable criterion.
Remark: Since the learning rate, the number of boosting rounds, and
other regularization parameters are highly interdependent, a large
randomized grid search to select the learning rate, boosting rounds, and
regularization parameters is often not ideal. The above suggestion (fix
the learning rate, select the number of rounds by early stopping, and
grid-search only the regularization parameters) is more focused; see
also the example below.
### Example: XGBoost
We will use XGBoost to fit diamond prices using the squared error as
loss function and RMSE as corresponding evaluation metric, now using a
sophisticated tuning strategy.
```{r}
# This code is pretty long. It may serve as a general template to fit a
# high-performance XGBoost model
library(tidyverse)
library(withr)
library(xgboost)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
x <- c("log_carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[ix, y]
X_train <- diamonds[ix, x]
y_test <- diamonds[-ix, y]
X_test <- diamonds[-ix, x]
# XGBoost data interface