forked from mayer79/statistical_computing_material
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path6_Neural_Nets.Rmd
856 lines (633 loc) · 36 KB
/
6_Neural_Nets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
---
title: "6. Neural Nets"
author: "Michael Mayer"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
toc_float: yes
number_sections: yes
df_print: paged
theme: paper
code_folding: show
math_method: katex
subtitle: "Statistical Computing"
bibliography: biblio.bib
link-citations: yes
editor_options:
chunk_output_type: console
knit: (function(input, ...) {rmarkdown::render(input, output_dir = "docs")})
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
eval = TRUE
)
# Path to conda env with TensorFlow
# keras::use_condaenv(path_to_conda_env)
```
# Introduction
In this chapter, we dive into artificial neural networks, one of the main drivers of artificial intelligence.
Neural networks are around since many decades. (Maybe) the first such model was built by Marvin Minsky in 1951. He called his algorithm SNARC ("stochastic neural-analog reinforcement calculator"). Since then, neural networks have gone through several stages of development. One of the milestones was the idea in @werbos1974 to efficiently calculate gradients in the optimization algorithm by an approach called "backpropagation". Another milestone was the use of GPUs (graphics processing units) to greatly reduce calculation time.
Artificial neural nets are extremely versatile and powerful. They can be used to
1. fit simple models like GLMs,
2. learn interactions and non-linear effects in an automatic way (like tree-based methods),
3. optimize general loss functions,
4. fit data much larger than RAM (e.g. images),
5. learn "online" (update the model with additional data),
6. fit multiple response variables at the same time,
7. model input of dimension higher than two (e.g. images, videos),
8. model input of *different* input dimensions (e.g. text *and* images),
9. fit data with sequential structure in both in- and output (e.g. a text translator),
10. model data with spatial structure (images),
11. fit models with many millions of parameters,
12. do non-linear dimension reduction,
13. and many more.
In this chapter, we will mainly deal with the first three aspects. Since a lot of new terms are being used, a small glossary can be found in Section "Neural Network Slang".
# Understanding Neural Nets
To learn how and why neural networks work, we will go through three steps - each illustrated on the diamonds data:
- Step 1: Linear regression as neural net
- Step 2: Hidden layers
- Step 3: Activation functions
After this, we will be ready to build more complex models.
## Step 1: Linear regression as neural net
Let us revisit the simple linear regression
$$
\mathbb E(\text{price} \mid \text{carat}) = \alpha + \beta \cdot \text{carat}
$$
calculated on the full diamonds data. In Chapter 4 we have found the solution $\hat\alpha = -2256.36$ and $\hat \beta = 7756.43$ by linear least-squares.
Above situation can be viewed as a neural network with
- an input layer with two nodes (`carat` and the intercept called "bias unit" with value 1),
- a "fully connected" (= "dense") output layer with one node (`price`). Fully connected means that each node of a layer is a linear function of all node values of the previous layer. Each linear function has parameters or *weights* to be estimated, in our simple case just $\alpha$ and $\beta$.
Visualized as a graph, the situation looks as follows:
![](figs/nn_simple_linear.PNG)
*Part of the figures were done with this cool [webtool](http://alexlenail.me/NN-SVG/index.html).*
To gain confidence in neural networks, we first show that the parameters estimated by a neural network are quite similar to those learned by the linear least squares method. We will use Google's [TensorFlow](https://www.tensorflow.org/) with its flexible functional [Keras](https://keras.io/) interface. A great book on Keras (and neural networks in general) is @chollet2018.
### Example: Simple linear regression
```{r}
library(tidyverse)
library(keras)
# Input layer: we have 1 covariate
input <- layer_input(shape = 1)
# Output layer connected to the input layer
output <- input %>%
layer_dense(units = 1) # 1 response valiable
# there is another API (non-functional), but this is less powerful
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
# summary(nn)
nn %>% compile(
optimizer = optimizer_adam(learning_rate = 1),
loss = "mse", # mean squared error
metrics = metric_root_mean_squared_error() # monitor performance during fitting
)
# Fit model - naive without validation
history <- nn %>% fit(
x = diamonds$carat,
y = diamonds$price,
epochs = 30, # 30 rounds
batch_size = 100
)
# if you see progress bars -> good
plot(history, metrics = "root_mean_squared_error")
unlist(get_weights(nn)) # fitted parameters, closed to LLS
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) %>%
mutate(price = predict(nn, carat, verbose = 0)) %>%
ggplot(aes(x = carat, y = price)) +
geom_line(color = "chartreuse4") +
geom_point(color = "chartreuse4")
```
**Comment:** The solution of the simple neural network is indeed quite similar to the OLS solution.
### The optimization algorithm
Neural nets are typically fitted by *mini-batch gradient descent*, using *backpropagation* to efficiently calculate gradients. It works as follows:
Let $f_\beta$ denote a neural net with parameters $\beta$, and
$$
Q(f_\beta, D) = \sum_{(y_i, \boldsymbol x_i) \in D} L(y_i, f_\beta(\boldsymbol x_i))
$$
its total loss on a data set $D$ with respect to the loss function $L$.
1. Initialize the parameter vector $\beta$ with random values $\hat \beta$.
2. Forward step: Calculate $Q(f_{\hat\beta}, D_\text{batch})$ on a *batch* $D_\text{batch}$ of observations. This is a small subset of the training data.
3. Backpropagation step: Modify $\hat \beta$ to improve $Q(f_{\hat\beta}, D_\text{batch})$ by gradient descent: Calculate the vector of partial derivatives
$$
\nabla \hat \beta = \frac{\partial Q(f_\beta, D_\text{batch})}{\partial \beta}\large\mid_{\beta = \hat \beta}
$$
at the current estimates $\hat \beta$. Use it to update
$$
\hat \beta \leftarrow \hat \beta - \lambda \nabla \hat \beta,
$$
where $\lambda > 0$ is a sufficiently small learning rate. In neural nets, parameters are organized in multiple layers, which makes it difficult to calculate $\nabla \hat\beta$. This is where backpropagation enters the game: It calculates the partial derivatives layer per layer using the chain rule, starting from the output layer.
4. Repeat Steps 2 and 3 until each observation appeared in a batch. This is called an *epoch*.
5. Repeat Step 4 for multiple epochs until some stopping criterion triggers.
Gradient descent on batches of size 1 is called "stochastic gradient descent" (SGD). Further note that there is no guarantee that the algorithm reaches a global minimum.
## Step 2: Hidden layers
Our first neural network above consisted of only an input layer and an output layer. By adding one or more *hidden* layers between in- and output, the network gains additional parameters, and thus more flexibility. The nodes of a hidden layer can be viewed as latent variables, representing the original covariates. The nodes of a hidden layer are sometimes called *encoding*. The closer a layer is to the output, the better its nodes are suitable to predict the response variable. In this way, a neural network finds the right transformations and interactions of its covariates in an automatic way. The only ingredients are a large data set and a flexible enough network "architecture" (number of layers, nodes per layer).
Neural nets with more than one hidden layer are called "deep neural nets".
We will now add a hidden layer with five nodes $v_1, \dots, v_5$ to our simple linear regression network. The architecture looks as follows:
![](figs/nn_1_hidden.PNG)
This network has 16 parameters. How much better than our simple network with just two parameters will it be?
### Example
The following code is identical to the last one up to one extra line of code specifying the hidden layer.
```{r}
library(tidyverse)
library(keras)
# Input layer: we have 1 covariate
input <- layer_input(shape = 1)
# One hidden layer
output <- input %>%
layer_dense(units = 5) %>% # the only new line of code! (squeeze another dense layer)
layer_dense(units = 1)
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
# summary(nn)
nn %>% compile(
optimizer = optimizer_adam(learning_rate = 1),
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Fit model - naive without validation
nn %>% fit(
x = diamonds$carat,
y = diamonds$price,
epochs = 30,
batch_size = 100
)
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) %>%
mutate(price = predict(nn, carat, verbose = 0)) %>%
ggplot(aes(x = carat, y = price)) +
geom_line(color = "chartreuse4") +
geom_point(color = "chartreuse4")
```
**Comment:** Oops, it seems as if the extra hidden layer had no effect. The reason is that a linear function of a linear function is still a linear function. Adding the hidden layer did not really change the capabilities of the model. It just added a lot of unnecessary parameters.
Why? Linear function of linear function is again a linear function -> we have more params but all are redundant
We need to change something: Non-linear transformations of node values
## Step 3: Activation functions
The missing magic component is the so called [*activation* function](https://en.wikipedia.org/wiki/Activation_function) $\sigma$ after each layer, which transforms the values of the nodes. So far, we have implicitly used "linear activations", which - in neural network slang - is just the identity function.
Applying *non-linear* activation functions after hidden layers have the purpose to introduce non-linear and interaction effects. Typical such functions are
- the hyperbolic tangent $\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ ("S"-shaped function that maps real values to $[-1, 1]$),
- the standard logistic function ("sigmoid") $\sigma(x) = 1 / (1 + e^{-x})$ ("S"-shaped function that maps real values to $[0, 1]$, shifted and scaled hyperbolic tangent),
- the **re**ctangular **l**inear **u**nit "ReLU" $\sigma(x) = \text{max}(0, x)$ that sets negative values to 0.
Activation functions applied to the *output* layer have a different purpose, namely the same as the inverse of the link function of a corresponding GLM. It maps predictions to the scale of the response:
- identity/"linear" activation $\rightarrow$ usual regression
- logistic activation $\rightarrow$ binary logistic regression (one probability)
- softmax activation $\rightarrow$ multinomial logistic regression (one probability per class)
- exponential activation $\rightarrow$ log-linear regression as with Poisson or Gamma regression
Let us add a hyperbolic tangent activation function ($\sigma$) after the hidden layer of our simple example.
![](figs/nn_activation.PNG)
### Example
Again, the code is very similar to the last one, with the exception of using a hyperbolic tangent activation after the hidden layer (and different learning rate and number of epochs).
```{r}
library(tidyverse)
library(keras)
# Input layer: we have 1 covariate
input <- layer_input(shape = 1)
# One hidden layer
output <- input %>%
layer_dense(units = 5, activation = "tanh") %>%
layer_dense(units = 1, activation = "linear") # linear means identity
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
nn %>% compile(
optimizer = optimizer_adam(learning_rate = 0.2),
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Fit model - naive without validation
nn %>% fit(
x = diamonds$carat,
y = diamonds$price,
epochs = 50,
batch_size = 100
)
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) %>%
mutate(price = predict(nn, carat, verbose = 0)) %>%
ggplot(aes(x = carat, y = price)) +
geom_line(color = "chartreuse4") +
geom_point(color = "chartreuse4")
```
**Comment:** Adding the non-linear activation after the hidden layer has changed the model. The effect of carat is now representing the association between carat and price by a non-linear function.
activation functions after hidden layer are essential.
same problem as in trees: no extrapolation -> not so much obs. then it does anything
the more flexibility = the less extrapolation
# Practical Considerations
## Validation and tuning of main parameters
So far, we have naively fitted the neural networks without splitting the data for test and validation. Don't do this! Usually, one sets a small test dataset (e.g. 10% of rows) aside to assess the final model performance and use simple (or cross-)validation for model tuning.
In order to choose the main tuning parameters, namely
- network architecture,
- activation functions,
- learning rate,
- batch size, and
- number of epochs,
one often uses simple validation because cross-validation takes too much time.
## Missing values
A neural net does not accept missing values in the input. They need to be filled, e.g., by a typical value or a value below the minimum.
## Input standardization
Gradient descent starts by random initialization of parameters. This step is optimized for standardized input. Standardization has to be done manually by either
- min/max scale the values of each input to the range $[-1, 1]$,
- standard scale the values of each input to mean 0 and standard deviation 1, or
- use relative ranks.
Note that the scaling transformation is calculated on the training data and then applied to the validation and test data. This usually requires a couple of lines of code.
## Categorical input
There are three common ways to represent categorical input variables in a neural network.
1. Binary and ordinal categoricals are best represented by integers and then treated as numeric.
2. Unordered categoricals are either one-hot-encoded (i.e., each category is represented by a binary variable) or
3. they are represented by a (categorical) embedding. To do so, the $K$ categories are integer encoded and then condensed by a special *embedding layer* to $m << K$ dense features. This requires a more complex network architecture but saves memory and preprocessing. This approach is heavily used when the input consists of words (which is a categorical variable with thousands of levels - one level per word). The embedding is represented by a fully parametrized $(K \times m)$ matrix $\beta$ estimated together with the other network parameters. Instead of representing $X$ by $K$ dummy variables $\tilde X$ and then computing $\tilde X \beta$, the matrix multiplication is done implicitly by index slicing. For instance, if $X_1 = j$, then the first row of $X \beta$ equals the $j$-th row of $\beta$.
For Option 2, input standardization is not required, for Option 3 it *must* not be applied as the embedding layer expects integers.
## Callbacks
Sometimes, we want to take actions during training, such as
- stop training when validation performance starts worsening,
- reduce the learning rate when the optimization is stuck in a "plateau", or
- save the network weights between epochs.
Such monitoring tasks are called *callbacks*. We will see them in the example below.
## Types of layers
So far, we have encountered only dense (= fully connected) layers and activation layers. Here some further types:
- Embedding layers to represent integer encoded categoricals.
- Dropout layers to add regularization.
- Convolutional and pooling layers for image data.
- Recurrent layers (long-short-term memory LSTM, gated recurrent unit GRU) for sequence data.
- Concatenation layers to combine different branches of the network (like in a directed graph).
- Flatten layers to bring higher dimensional layers to dimension 1 (relevant, e.g., for embeddings, image and text data).
## Optimizer
Pure gradient descent is rarely applied without tweaks because it tends to be stuck in local minima, especially for complex networks with many layers. Modern variants are "adam", "nadam" and "RMSProp". These optimizers work usually out-of-the-box, except for the learning rate, which has to be chosen manually.
## Custom losses and evaluation metrics
Frameworks like Keras/TensorFlow offer many predefined loss functions and evaluation metrics. Choosing them is a crucial step, just as with tree boosting.
Using TensorFlow's backend functions, one can define own metrics and loss functions (see exercises).
## Overfitting and regularization
As with linear models, a model with too many parameters will overfit in an undesirable way. With about 50 to 100 observations per parameter, overfitting is usually not problematic. (Different rules apply to image and text data). Besides using fewer parameters, the main ways to reduce overfitting are as follows:
- Pull parameters of a layer slightly toward zero by applying L1 and/or L2 penalties to the objective function.
- Adding dropout layers. A dropout layer randomly sets some node values of the previous layer to 0, turning them off. Dropout is only applied during training.
## Choosing the architecture
How many layers and how many nodes per layer should be selected? For tabular data, up to three hidden layers are usually sufficient. If we start with $p$ input variables, the number of nodes in the first hidden layer is usually higher than $p$ and reduces for later layers. There should not be a "representational bottleneck", i.e., an early hidden layer with too few parameters.
The number of parameters should not be too high compared to the number of rows, see "Overfitting and regularization" above.
## Interpretation
Variable importance of covariates in neural networks can be assessed by permutation importance (see below) or SHAP importance (not covered). Covariate effects can be investigated, e.g., by partial dependence plots or SHAP dependence plots.
# Example: Diamonds
We will now fit a neural net with two hidden layers (30 and 15 nodes) and a total of 631 parameters to model diamond prices. Learning rate, activation functions, and batch size were manually chosen by simple validation. The number of epochs is automatically being chosen by an early stopping callback.
![](figs/nn_2_hidden.PNG)
```{r}
library(tidyverse)
library(withr)
library(keras)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
x <- c("log_carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
train <- diamonds[ix, ]
test <- diamonds[-ix, ]
X_train <- train[, x]
X_test <- test[, x]
y_train <- train[[y]]
y_test <- test[[y]]
# Standardization information using X_train
temp <- scale(data.matrix(X_train))
sc <- list(
center = attr(temp, "scaled:center"),
scale = attr(temp, "scaled:scale")
)
# Function that maps data to scaled network input
# Usually, we have to do this by hand
prep_nn <- function(X, sel = x, scaling = sc) {
X <- data.matrix(X[, sel, drop = FALSE])
scale(X, center = scaling$center, scale = scaling$scale)
}
# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(499)
# Input layer: we have 4 covariates
input <- layer_input(shape = 4)
# Two hidden layers with contracting number of nodes
output <- input %>%
layer_dense(units = 30, activation = "relu") %>%
layer_dense(units = 15, activation = "relu") %>%
layer_dense(units = 1, activation = "linear")
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
summary(nn)
nn %>% compile(
optimizer = optimizer_adam(learning_rate = 0.05), # we need to find this param: training too slow -> increase learning rate, and vice versa. Validation curve should look natural
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Callbacks
cb <- list(
callback_early_stopping(patience = 20), # early stopping
callback_reduce_lr_on_plateau(patience = 5) # reduce learning rate if there is no improvement (trick to get out of local minimum)
)
# Fit model
history <- nn %>% fit(
x = prep_nn(X_train),
y = y_train,
epochs = 100,
batch_size = 400,
validation_split = 0.2,
callbacks = cb
)
plot(history, metrics = "root_mean_squared_error", smooth = FALSE) +
coord_cartesian(xlim = c(2, length(history$metrics$loss)), ylim = c(0.1, 0.2))
# Interpret
library(flashlight)
library(MetricsWeighted)
fl <- flashlight(
model = nn,
y = y,
data = test,
label = "Neural net",
metrics = list(rmse = rmse, `R squared` = r_squared),
predict_function = function(m, X)
predict(m, prep_nn(X), batch_size = 1000, verbose = 0)
)
# Performance on validation data
perf <- light_performance(fl)
perf$data
# Variable importance?
# PDPs
for (v in x) {
p <- light_profile(fl, v = v, n_bins = 40) %>%
plot(color = "chartreuse4") +
ggtitle(paste("PDP for", v))
print(p)
}
```
**Comment:** Performance is a bit lower than of the tree-based models. This might partly be a consequence of effects being smoother, but also because the model has not been refitted on the full training data for simplicity (20% of the training rows are used for validation). Compared to our tree-based models, one important aspect in model interpretation is missing: variable importance.
# Excursion: Permutation Importance
So far, we have measured variable importance by model-specific approaches such as split-gain importance for trees. A *model-agnostic* technique is called *permutation importance* introduced in @breiman2001 for random forests and later generalized by @fisher2018 to other models. It measures by how much a relevant performance measure $S$ (e.g., the average deviance or the RMSE) of a fitted model $\hat f$, evaluated on a dataset $D$, worsens after randomly shuffling the values of the $j$-th feature. The larger the performance change, the more important the feature.
Formally, permutation importance $\text{PVI}(j, D)$ of the $j$-th feature $X^{(j)}$ and data $D$ can be defined as follows:
$$
\text{PVI}(j, D) = S(\hat f, D^{(j)}) - S(\hat f, D),
$$
where $D^{(j)}$ is a version of $D$ with randomly permuted values in the column representing the $j$-th feature $X^{(j)}$. For simplicity, we assume that the performance measure $S$ is defined such that smaller values mean better performance.
If there are $p$ features and $n$ is the sample size of $D$, then $n(p+1)$ predictions have to be calculated. Compared to other methods from XAI, this is a relatively cheap operation. Note that the process can be repeated multiple times to increase (and assess) robustness of the results.
## Remarks
- During the process of calculating permutation importance, the model is never refitted.
- Generally, variable importance measures struggle with strongly collinear (or even strongly causally dependent) features. The problem can usually be reduced by applying clever data transformations during feature construction, i.e., before modeling. Example: the age $v$ of the driver and the "age" $t$ of the driving license can often be decorrelated by considering $v$ and $v-t$ instead of the original features $v$ and $t$. The use of such transformations contrasts with the bad habit of blindly throwing all available columns into the model.
- Calculating permutation importance makes use of the response variable. In order to not influence the results by overfitting, it usually makes sense to evaluate it on a hold-out data set. Can you think of a situation where one would rather consider the in-sample version?
- Different versions of permutation importance exist. For example, instead of studying absolute drops in performance, one could also use relative drops.
Most of the text of this section is taken from our [responsible ML lecture]( https://github.com/lorentzenchr/responsible_ml_material).
## Example: Diamonds (continued)
```{r}
light_importance(fl, v = x) %>%
plot(fill = "chartreuse4")
```
**Comment:** As in all our models on diamond prices, the dominating feature is the diamond size.
Note: All algorithms will produce other variable importance result (no gold standard for this)
# Example: Taxi
The last example on neural networks aims to show
- how strong such models perform on large data,
- how fast neural nets are,
- how flexible they are, and
- how embedding layers are used to represent unordered categorical input.
To do so, we create a model for (log) taxi trip durations, using the same features as before. We use the same train/test split as with the corresponding boosted trees model. Slight tuning (mainly the learning rate and the architecture) is done manually. The number of epochs is chosen by an early stopping callback. Pick-up location is modeled by a ten-dimensional embedding layer.
```{r}
library(arrow)
library(data.table)
library(withr)
library(tidyverse)
library(keras)
df <- read_parquet("taxi/yellow_tripdata_2018-01.parquet")
setDT(df)
head(df)
#=======================================================================
# Prepare data for modeling
#=======================================================================
df[, duration := as.numeric(
difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")
)]
df = df[between(trip_distance, 0.2, 100) & between(duration, 1, 120)]
df[, `:=`(
pu_hour = factor(data.table::hour(tpep_pickup_datetime)),
weekday = factor(data.table::wday(tpep_pickup_datetime)), # 1 = Sunday
pu_loc = forcats::fct_lump_min(factor(PULocationID), 1e5),
log_duration = log(duration),
log_distance = log(trip_distance)
)]
y <- "log_duration"
x_dense <- c("log_distance", "weekday", "pu_hour")
x <- c(x_dense, "pu_loc")
# Random split
with_seed(1,
ix <- sample(nrow(df), 0.8 * nrow(df))
)
train <- as.data.frame(df[ix, c(y, x), with = FALSE])
test <- as.data.frame(df[-ix, c(y, x), with = FALSE])
y_train <- train[[y]]
y_test <- test[[y]]
X_train <- train[, x]
X_test <- test[, x]
# Standardization info using X_train
temp <- scale(data.matrix(X_train[, x_dense]))
sc <- list(
center = attr(temp, "scaled:center"),
scale = attr(temp, "scaled:scale")
)
# Function that maps data.frame to network input
# (a list with a scaled matrix and an integer vector representing pu_loc)
prep_nn <- function(X, dense = x_dense, scaling = sc) {
X_dense <- data.matrix(X[, dense, drop = FALSE])
X_dense <- scale(X_dense, center = scaling$center, scale = scaling$scale)
list(dense1 = X_dense, pu_loc = as.integer(X$pu_loc) - 1)
} # integer encoded version
# Does prep_nn() really work?
# how does it look like?
prep_nn(head(test, 1))
# dense part at first, we now have a list of tables as input!
# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(469)
# Inputs
# dense in the sense: input are doubles
input_dense <- layer_input(shape = length(x_dense), name = "dense1")
# integer encoded pickup-locations
input_pu_loc <- layer_input(shape = 1, name = "pu_loc")
# Embedding of pu_loc
emb_pu_loc <- input_pu_loc %>%
# bug in R implementation (+1 always required)
# aggressive -> output_dim (represented by just a single double)
layer_embedding(input_dim = nlevels(train$pu_loc) + 1, output_dim = 10) %>%
layer_flatten() # this layer reduces the dimension (makes the output tabular)
# Combine dense input and embedding, and add dense layers
# input starts with list!
outputs <- list(input_dense, emb_pu_loc) %>%
layer_concatenate() %>% # connect with other inputs
layer_dense(100, activation = "relu") %>% # parameter explosion
layer_dense(10, activation = "relu") %>% # pull it together
layer_dense(1, activation = "linear")
# Input
inputs <- list(dense1 = input_dense, pu_loc = input_pu_loc)
# Create and compile model
nn <- keras_model(inputs = inputs, outputs = outputs)
summary(nn)
nn %>% compile(
optimizer = optimizer_adam(learning_rate = 0.005), # learning rate: there are tools to help us there
loss = "mse"
)
# Callbacks
cb <- list(
callback_early_stopping(patience = 10),
callback_reduce_lr_on_plateau(patience = 5)
)
# Fit model
if (FALSE) { # Set to true to refit
history <- nn %>% fit(
x = prep_nn(X_train),
y = y_train,
epochs = 100,
batch_size = 50000,
validation_split = 0.2,
callbacks = cb
)
plot(history, metrics = c("loss", "val_loss"), smooth = FALSE) +
coord_cartesian(ylim = c(0, 0.15))
save_model_weights_hdf5(nn, "taxi/taxi_nn.h5")
} else {
load_model_weights_hdf5(nn, "taxi/taxi_nn.h5")
}
#=======================================================================
# Inspect
#=======================================================================
library(flashlight)
library(MetricsWeighted)
fl <- flashlight(
model = nn,
y = y,
data = test,
label = "Neural net",
metrics = list(RMSE = rmse, `R squared` = r_squared),
predict_function = function(m, X)
predict(m, prep_nn(X), batch_size = 50000, verbose = 0)
# we need the prep function, and batch_size increased (just for better results)
)
# Performance on test data
perf <- light_performance(fl)
perf$data
plot(perf, fill = "chartreuse4") +
labs(x = element_blank(), y = element_blank())
# thats quite good, without tuning (just learning rate and activation function)
# Permutation importance
light_importance(fl, v = x) %>%
plot(fill = "chartreuse4") # IMPORTANT FOR EXAM
# we only need a stable predict function (should also work if we restructure the data)
# Partial dependence
for (v in x) {
p <- light_profile(fl, v = v, n_bins = 40) %>%
plot(color = "chartreuse4", rotate_x = (v == "pu_loc")) +
labs(title = paste("PDP for", v), y = "Prediction")
print(p)
}
# PDP for tree-based models: not smooth
# PDP for NN models: smooth .. (why?)
# we see that pickup location is highly unordered! (one special location)
# performance similar to lightgbm
```
**Comment:** The resulting model performs quite similarly to the boosted trees model. As in the diamonds example above, we used an early-stopping callback on 20% of the training rows to automatically find a good number of epochs. And again, for simplicity, we did not refit the model on the full training data. Thus, the final model was trained on less rows than the LightGBM model.
Pickup location not so important -> do we need 10 dimensions in embedding layer?
# Neural Network Slang
Here, we summarize some of the neural network slang.
- Activation function: The transformation applied to the node values.
- Architecture: The layout of layers and nodes.
- Backpropagation: An efficient way to calculate gradients.
- Batch: A couple of data rows used for one mini-batch gradient descent step.
- Callback: An action during training (save weights, reduce learning rate, stop training, ...).
- Epoch: The process of updating the network weights by gradient descent until each observation in the training set was used once.
- Embedding: A numeric representation of categorical input as learned by the neural net.
- Encoding: The values of latent variables of a hidden layer, usually the last.
- Gradient descent: The basic optimization algorithm of neural networks.
- Keras: User-friendly wrapper of TensorFlow.
- Layer: Main organizational unit of a neural network.
- Learning rate: Controls the step size of gradient descent, i.e., how aggressive the network learns.
- Node: Nodes on the input layer are the covariates, nodes on the output layer the response(s) and nodes on a hidden layer are latent variables representing the covariates for the task to predict the response.
- Optimizer: The specific variant of gradient descent.
- PyTorch: An important implementation of neural networks.
- Stochastic gradient descent (SGD): Mini-batch gradient descent with batches of size 1.
- TensorFlow: An important implementation of neural networks.
- Weights: The parameters of a neural net.
# Excursion: Analysis Scheme X
Let $T(Y)$ be a quantity of interest, e.g., an average diamond price, claims frequency, or the success probability of a medical intervention. Many interesting insights can be found by the following analysis schema:
1. Calculate $T(Y)$ on the full data.
2. Calculate $T(Y)$ stratified by each covariate $X^{(j)}$. This will describe the bivariate associations between $Y$ and the $X^{(j)}$. Continuous covariates need to be binned.
3. Build an ML model $T(Y \mid \boldsymbol x) \approx f(\boldsymbol x)$, using features $X^{(1)}, \dots, X^{(p)}$, and a clean validation strategy. It describes the multivariate association between $Y$ and the covariates.
a. Study model performance.
b. Study variable importance and use them to sort the results of Step 2. Which of the associations are very strong/weak?
c. For each feature, plot its main effect (e.g. partial or SHAP dependence plot). Use it to complement the marginal effects from Step 2: Are they quite similar or not? Ideally, search for strong interaction effects as well.
## Example: Diamonds
Let's apply above scheme for diamond prices.
We use trees (faster)
```{r}
library(tidyverse)
library(ranger)
library(flashlight)
library(withr)
# STEP 1
y <- "price"
mean(diamonds[[y]])
# interested in average diamond prices
# STEP 2
x <- c("carat", "color", "cut", "clarity")
fl <- flashlight(model = NULL, y = y, data = diamonds, label = "any")
# average price grouped by different features
# carat is the most difficult one (we need binning)
for (v in x) {
light_profile(fl, v = v, type = "response") %>%
plot(color = "chartreuse4") %>%
print()
}
# STEP 3
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
# train/test split
fit <- ranger(
reformulate(x, y),
num.trees = 500,
data = diamonds[ix, ],
seed = 83
)
fit
fl <- flashlight(
fl,
model = fit, predict_function = function(m, x) predict(m, x)$predictions,
data = diamonds[-ix, ]
)
# Performance -> RMSE
light_performance(fl)$data$value
# Importance
imp <- light_importance(fl, v = x)
plot(imp, fill = "chartreuse4")
# "everything comes from carat"
# PDP/ response for carat is similar > really important
# clarity: strange
# Add PDP to descriptive effect
for (v in most_important(imp)) {
p <- light_effects(fl, v = v) %>%
plot(use = c("response", "pd")) +
ylab("Price")
print(p)
}
```
**Selected insights**
- Step 1: The average price is 3932.8 USD.
- Step 2: The average price heavily depends on carat and the other factors. The associations from Step 2 are not intuitive, except for carat.
- Step 3: The most important feature is carat, followed by clarity, color and cut. Conditional effects complement the marginal descriptive effects. They also explain some of the unintuitive descriptive effects. (Larger diamonds tend to be of reduced quality, introducing confounding.)
# Exercises
1. Explain why a neural network is conceptually closer to a GLM than to a random forest. Why are activation functions important?
2. Fit a neural network to (non-logarithmic) diamond prices by minimizing Gamma deviance with log-link (-> exponential output activation). Use the custom loss function defined below. Tune the model by simple validation and evaluate it on a test dataset. Study test performance, permutation importance, and partial dependence plots. Hints: Play with activation functions. Furthermore, the response needs to be transformed from int to float for certain TensorFlow versions.
```{r, eval=FALSE}
loss_gamma <- function(y_true, y_pred) {
-k_log(y_true / y_pred) + y_true / y_pred
}
```
3. (Optional) Craft a neural net for the "dataCar" dataset. The binary response is "clm", and the features are "veh_value", "veh_age", "gender", "area", "agecat", and "veh_body". Represent the latter with a one-dimensional embedding layer. You can either write the code from scratch or modify above taxi trip code. Work with a clean train/test split. Study test performance, permutation importance, and partial dependence plots.
4. Choose a data set of your interest and apply "Analysis Scheme X" to it.
# Summary
In this chapter, we have glimpsed into the world of neural networks and deep learning. Step by step we have learned how a neural network works. We have used Keras and TensorFlow to build models brick by brick. Furthermore, we met a simple analysis scheme involving ML that can be applied to many analysis situations.
Here a short [comparison](https://github.com/mayer79/ML_Algorithm_Comparison) of some properties of the ML algorithms learned in this lecture (screenshot as per Sept. 7, 2020):
![](figs/comparison_ML.PNG)
# References