-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path06.Rmd
1765 lines (1369 loc) · 81.5 KB
/
06.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Ch. 6 Overfitting, Regularization, and Information Criteria"
author: "A Solomon Kurz"
date: "`r format(Sys.Date())`"
output:
github_document
---
```{r, echo = FALSE, cache = FALSE}
knitr::opts_chunk$set(fig.retina = 2.5)
options(width = 100)
```
# Overfitting, Regularization, and Information Criteria
In this chapter we contend with two contrasting kinds of statistical error:
* overfitting, "which leads to poor prediction by learning too *much* from the data"
* underfitting, "which leads to poor prediction by learning too *little* from the data" (p. 166, *emphasis* added)
Toward that end, we will employ
> two common families of approaches. The first approach is to use a **regularizing prior** to tell the model not to get too excited by the data. This is the same device that non-Bayesian methods refer to as "penalized likelihood." The second approach is to use some scoring device, like **information criteria**, to model the prediction risk and estimate predictive accuracy for some purpose. Both families of approaches are routinely used in the natural and social sciences Furthermore, they can be--maybe should be--used in combination. (p. 167, **emphasis** in the original)
#### Rethinking: Stargazing.
> The most common form of model selection among practicing scientists is a search for a model in which every coefficient is statistically significant. Statisticians sometimes call this **stargazing**, as it is embodied by scanning for asterisks (**) trailing after estimates...
>
> ...Whatever you thing about null hypothesis significance testing in general, using it to select among structurally different models is a mistake--$p$-values are not designed to help you navigate between undercutting and overfitting. (p. 167, **emphasis** in the original)
McElreath spent little time discussing $p$-values and null hypothesis testing in the text. If you'd like to learn more from a Bayesian perspective, you might check out the first several chapters (particularly 10--13) in [Kruschke's (2015) text](https://sites.google.com/site/doingbayesiandataanalysis/) and my [project translating it to brms and the tidyverse](https://bookdown.org/content/3686/). The great [Frank Harrell](https://twitter.com/f2harrell) has complied [*A Litany of Problems With p-values*](https://www.fharrell.com/post/pval-litany/), which you might also find of use.
## The problem with parameters
The $R^2$ is a popular way to measure how well you can retrodict the data. It traditionally follows the form
$$R^2 = \frac{\text{var(outcome)} - \text{var(residuals)}}{\text{var(outcome)}} = 1 - \frac{\text{var(residuals)}}{\text{var(outcome)}}.$$
By $\text{var()}$, of course, we meant variance (i.e., the `var()` function in R).
McElreath is not a fan of the $R^2$. But it's important in my field, so instead of a summary at the [end of the chapter][~~Summary~~ Bonus: $R^2$ talk], we will cover the Bayesian version of $R^2$ and how to use it in brms.
### More parameters always improve fit.
> **Overfitting** occurs when a model learns too much from the sample. What this means is that there are both *regular* and *irregular* features in every sample. The regular features are the targets of our learning, because they generalize well or answer a question of interest. Regular features are useful, given an objective of our choice. The irregular features are instead aspects of the data that do not generalize and so may mislead us. (p. 169, **emphasis** in the original)
There's a lot to like about this section, but my experience as a clinical psychologist inclines me to approach this topic differently. In practice, therapy is often a one-on-one process where the outcome is for a specific person in a specific time and set of circumstances in their life. Therapy research is often the aggregate of many such individual cases. As such, both the regular and irregular features of every therapy trial are of great interest. So yes, I agree with McElreath that overfitting of the kind we're about to entertain is silly and to be avoided. And yet the goal of my line of research is to develop treatments (and to examine those treatments with statistical models) that are robust enough to handle irregularities as well. It's my job to deal with the outliers. For more on what you might think of as robust models (i.e., models that can handle unusual observations), check out my [blog on Student-$t$ regression](https://solomonkurz.netlify.com/post/robust-linear-regression-with-the-robust-student-s-t-distribution/). Though not covered in this edition of the text, McElreath [plans on covering Student-$t$ regression](https://twitter.com/SolomonKurz/status/1207328933164830721) in his upcoming second edition.
In other words, the topics McElreath grappled with in this chapter are difficult and ones I hope you return to again and again during your data analysis career. Back to the text!
We'll start off by making the data with brain size and body size for seven `species` (see [McHenry & Coffing, 2000](https://www.researchgate.net/profile/Henry_Mchenry/publication/228368549_Australopithecus_To_Homo_Transformations_in_Body_and_Mind/links/572a6e1f08ae2efbfdbc20b6/Australopithecus-To-Homo-Transformations-in-Body-and-Mind.pdf)).
```{r, warning = F, message = F}
library(tidyverse)
(
d <-
tibble(species = c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens"),
brain = c(438, 452, 612, 521, 752, 871, 1350),
mass = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5))
)
```
Let's get ready for Figure 6.2. The plots in this chapter will be characterized by `theme_classic() + theme(text = element_text(family = "Courier"))`. Our color palette will come from the [rcartocolor package](https://cran.r-project.org/package=rcartocolor), which provides color schemes [designed by 'CARTO'](https://carto.com/carto-colors/).
```{r, warning = F, message = F}
# install.packages("rcartocolor", dependencies = T)
library(rcartocolor)
```
The specific palette we'll be using is "BurgYl." In addition to palettes, the rcartocolor package offers a few convenience functions which make it easier to use their palettes. The `carto_pal()` function will return the HEX numbers associated with a given palette's colors and the `display_carto_pal()` function will display the actual colors.
```{r, fig.width = 6, fig.height = 2.25}
carto_pal(7, "BurgYl")
display_carto_pal(7, "BurgYl")
```
We'll be using a diluted version of the third color for the panel background (i.e., `theme(panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))`) and the darker purples for other plot elements. Here's the plot.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
library(ggrepel)
d %>%
ggplot(aes(x = mass, y = brain, label = species)) +
geom_point(color = carto_pal(7, "BurgYl")[5]) +
geom_text_repel(size = 3, color = carto_pal(7, "BurgYl")[7], family = "Courier", seed = 438) +
labs(subtitle = "Average brain volume by body\nmass for six hominin species",
x = "body mass (kg)",
y = "brain volume (cc)") +
coord_cartesian(xlim = 30:65) +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
The first six models, which we'll be fitting using OLS regression via the `lm()` function, range in complexity from the simple univariable model
\begin{align*}
\text{brain}_i & \sim \operatorname{Normal} (\mu_i, \sigma) \\
\mu_i & = \beta_0 + \beta_1 \text{mass}_i,
\end{align*}
to the dizzying sixth-degree polynomial model
\begin{align*}
\text{brain}_i & \sim \operatorname{Normal} (\mu_i, \sigma) \\
\mu_i & = \beta_0 + \beta_1 \text{mass}_i + \beta_2 \text{mass}_i^2 + \beta_3 \text{mass}_i^3 + \beta_4 \text{mass}_i^4 + \beta_5 \text{mass}_i^5 + \beta_6 \text{mass}_i^6.
\end{align*}
Let's fit them in bulk. First we'll make a custom function, `fit_lm()`, into which we'll feed the desired names and formulas of our models. We'll make a tibble initially composed of those names (i.e., `model`) and formulas (i.e., `formula`). Via `purrr::map2()` within `mutate()`, we'll then fit the models and save the model objects within the tibble. The [broom package](https://cran.r-project.org/package=broom) provides an array of convenience functions to convert statistical analysis summaries into tidy data objects. We'll employ `broom::tidy()` and `broom::glance()` to extract information from the model fits.
```{r, message = F, warning = F}
library(broom)
fit_lm <- function(model, formula) {
model <- lm(data = d, formula = formula)
}
fits <-
tibble(model = str_c("b6.", 1:6),
formula = c("brain ~ mass",
"brain ~ mass + I(mass^2)",
"brain ~ mass + I(mass^2) + I(mass^3)",
"brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4)",
"brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) + I(mass^5)",
"brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) + I(mass^5) + I(mass^6)")) %>%
mutate(fit = map2(model, formula, fit_lm)) %>%
mutate(tidy = map(fit, tidy),
glance = map(fit, glance))
# what did we just do?
print(fits)
```
Our `fits` object is a [nested tibble](https://tidyr.tidyverse.org/reference/nest.html). To learn more about this bulk approach to fitting models, check out Hadley Wickham's talk, [Managing many models with R](https://www.youtube.com/watch?v=rz3_FDVt9eg&t=2339s&frags=pl%2Cwn). As you might learn in the talk, we can extract the $R^2$ from each model with `map_dbl("r.squared")`, which we'll then display in a plot.
```{r, fig.width = 8, fig.height = 1.5}
fits <-
fits %>%
mutate(r2 = glance %>% map_dbl("r.squared")) %>%
mutate(r2_text = round(r2, digits = 2) %>% as.character() %>% str_replace(., "0.", "."))
fits %>%
ggplot(aes(x = r2, y = formula, label = r2_text)) +
geom_text(color = carto_pal(7, "BurgYl")[7], size = 3.5) +
scale_x_continuous(expression(italic(R)^2), limits = 0:1, breaks = 0:1) +
ylab(NULL) +
theme_classic() +
theme(text = element_text(family = "Courier"),
axis.text.y = element_text(hjust = 0),
axis.ticks.y = element_blank(),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
If we wanted to look at the model coefficients, we could `unnest(tidy)` and wrangle a bit.
```{r}
fits %>%
unnest(tidy) %>%
select(model, term:estimate) %>%
mutate_if(is.double, round, digits = 1) %>%
complete(model, term) %>%
spread(key = term, value = estimate) %>%
select(model, `(Intercept)`, mass, everything()) %>%
knitr::kable()
```
For Figure 6.3, we'll make each plot individually and them glue them together with the patchwork package. Since they all share a common stucture, we'll start by specifying a base plot which we'll save as `p`.
```{r}
p <-
d %>%
ggplot(aes(x = mass, y = brain)) +
geom_point(color = carto_pal(7, "BurgYl")[7]) +
scale_x_continuous("body mass (kg)", limits = c(33, 62), expand = c(0, 0)) +
ylab("brain volume (cc)") +
coord_cartesian(ylim = c(300, 1500)) +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
Now for each subplot, we'll tack the subplot-specific components onto `p`. The main action is in `stat_smooth()`. For each subplot, the first three lines in `stat_smooth()` are identical, with only the bottom `formula` line differing. Like McElreath did in the text, we also adjust the y-axis range for the last two plots.
```{r, warning = F, message = F}
# linear
p1 <-
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89, # note our rare use of 89% intervals
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ x) +
ggtitle(NULL, subtitle = expression(italic(R)^2==".49"))
# quadratic
p2 <-
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ poly(x, 2)) +
ggtitle(NULL, subtitle = expression(italic(R)^2==".54"))
# cubic
p3 <-
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ poly(x, 3)) +
ggtitle(NULL, subtitle = expression(italic(R)^2==".68"))
# fourth-order polynomial
p4 <-
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ poly(x, 4)) +
ggtitle(NULL, subtitle = expression(italic(R)^2==".81"))
# fifth-order polynomial
p5 <-
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ poly(x, 5)) +
# we're adjusting the y-axis range for this plot (and the next)
coord_cartesian(ylim = c(150, 1900)) +
ggtitle(NULL, subtitle = expression(italic(R)^2==".99"))
# sixth-order polynomial
p6 <-
p +
# mark off 0 on the y-axis
geom_hline(yintercept = 0, color = carto_pal(7, "BurgYl")[2], linetype = 2) +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ poly(x, 6)) +
coord_cartesian(ylim = c(-300, 1500)) +
ggtitle(NULL, subtitle = expression(italic(R)^2==1))
```
Okay, now we're ready to combine the six subplots and produce our version of Figure 6.3.
```{r, fig.width = 6, fig.height = 8, warning = F, message = F}
library(patchwork)
(p1 + p2) / (p3 + p4) / (p5 + p6)
```
"If you adopt a model family with enough parameters, you can fit the data exactly. But such a model will make rather absurd predictions for yet-to-be observed cases" (p. 172)
### Too few parameters hurts, too.
> The overfit polynomial models manage to fit the data extremely well, bu they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions. In contract, **underfitting** produces models that are inaccurate both within and out of sample. They have learned too little, failing to recover regular features of the sample. (p. 172, **emphasis** in the original)
Fit the underfit intercept-only model, `b6.7`.
```{r}
b6.7 <- lm(data = d,
brain ~ 1)
summary(b6.7)
```
With the intercept-only model, we didn't even get an $R^2$ value in the summary. The `broom::glance()` function offers a quick way to get one.
```{r}
glance(b6.7)
```
Zero. Our intercept-only `b6.7` explained exactly zero variance in `brain`. All it did was tell us what the unconditional mean and variance (i.e., 'Residual standard error') were. I hope that makes sense. They were the only things in the model: $\text{brain}_i \sim \text{Normal}(\mu = \alpha, \sigma)$. To get the intercept-only model for Figure 6.4, we plug `formula = y ~ 1` into the `stat_smooth()` function.
```{r, fig.width = 3, fig.height = 2.67}
p +
stat_smooth(method = "lm", fullrange = TRUE, level = .89,
color = carto_pal(7, "BurgYl")[6], fill = carto_pal(7, "BurgYl")[6],
size = 1/2, alpha = 1/3,
formula = y ~ 1) +
ggtitle(NULL, subtitle = expression(italic(R)^2==0))
```
Our underfit model `b6.7` didn't learn anything about the relation between `mass` and `brain`. "Such a model not only fails to describe the sample. It would also do a poor job for new data" (p. 173).
#### Overthinking: Dropping rows.
You can `filter()` by `row_number()` to drop rows in a [tidyverse kind of way](https://dplyr.tidyverse.org/reference/slice.html). For example, we can drop the second row of `d` like this.
```{r}
d %>%
filter(row_number() != 2)
```
We can then extend that logic into a custom function, `make_lines()`, that will drop a row from `d`, fit the simple model `brain ~ mass`, and then use base R `predict()` to return the model-implied trajectory over new data values.
```{r}
# because these lines are straight, we only need new data over two points of `mass`
nd <- tibble(mass = c(30, 70))
make_lines <- function(row) {
# fit the model
my_fit <-
d %>%
filter(row_number() != row) %>%
lm(formula = brain ~ mass)
# compute fitted lines
predict(my_fit, nd) %>%
as_tibble() %>%
rename(brain = value) %>%
bind_cols(nd)
}
```
Here we'll make a tibble, `lines`, which will specify rows 1 through 7 in the `row` column. We'll then feed those `row` numbers into our custom `make_lines()` function, which will return the predicted values and their corresponding `mass` values, per model.
```{r, warning = F, message = F}
(
lines <-
tibble(row = 1:7) %>%
mutate(p = map(row, make_lines)) %>%
unnest(p)
)
```
Now we're ready to plot the left panel of Figure 6.5.
```{r, fig.width = 3.5, fig.height = 2.75, warning = F, message = F}
p +
scale_x_continuous(expand = c(0, 0)) +
geom_line(data = lines,
aes(x = mass, y = brain, group = row),
color = carto_pal(7, "BurgYl")[6], alpha = 1/2, size = 1/2)
```
To make the right panel for Figure 6.5, we'll need to increase the number of `mass` points in our `nd` data and redefine the `make_lines()` function to fit the sixth-order-polynomial model.
```{r, fig.width = 3.5, fig.height = 2.75, message = F, warning = F}
# because these lines will be very curvy, we'll need new data over many points of `mass`
nd <- tibble(mass = seq(from = 30, to = 65, length.out = 200))
# redifine the function
make_lines <- function(row) {
my_fit <-
d %>%
filter(row_number() != row) %>%
lm(formula = brain ~ mass + I(mass^2) + I(mass^3) + I(mass^4) + I(mass^5) + I(mass^6))
predict(my_fit, nd) %>%
as_tibble() %>%
rename(brain = value) %>%
bind_cols(nd)
}
# make our new tibble
lines <-
tibble(row = 1:7) %>%
mutate(p = map(row, make_lines)) %>%
unnest(p)
# plot!
p +
geom_line(data = lines,
aes(group = row),
color = carto_pal(7, "BurgYl")[6], alpha = 1/2, size = 1/2) +
coord_cartesian(ylim = -300:2000)
```
## Information theory and model performance
> Whether you end up using regularization or information criteria or both, the first thing you must do is pick a criterion of model performance. What do you want the model to do well at? We'll call this criterion the *target*, and in this section you'll see how information theory provides a common and useful target, the out-of-sample *deviance*. (p. 174, *emphasis* in the original)
### Firing the weatherperson.
If you let rain = 1 and sun = 0, here's a way to make a plot of the first table of page 175, the weatherperson's predictions.
```{r, fig.width = 6, fig.height = 1.5}
weatherperson <-
tibble(day = 1:10,
prediction = rep(c(1, 0.6), times = c(3, 7)),
observed = rep(c(1, 0), times = c(3, 7)))
weatherperson %>%
gather(key, value, -day) %>%
ggplot(aes(x = day, y = key, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = value, color = value == 0)) +
scale_fill_viridis_c(direction = -1) +
scale_color_manual(values = c("white", "black")) +
scale_x_continuous(breaks = 1:10, expand = c(0, 0)) +
scale_y_discrete(NULL, expand = c(0, 0)) +
theme(text = element_text(family = "Courier"),
axis.ticks.y = element_blank(),
legend.position = "none")
```
Here's how the newcomer fared.
```{r, fig.width = 6, fig.height = 1.5}
newcomer <-
tibble(day = 1:10,
prediction = 0,
observed = rep(c(1, 0), times = c(3, 7)))
newcomer %>%
gather(key, value, -day) %>%
ggplot(aes(x = day, y = key, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = value, color = value == 0)) +
scale_fill_viridis_c(direction = -1) +
scale_color_manual(values = c("white", "black")) +
scale_x_continuous(breaks = 1:10, expand = c(0, 0)) +
scale_y_discrete(NULL, expand = c(0, 0)) +
theme(text = element_text(family = "Courier"),
axis.ticks.y = element_blank(),
legend.position = "none")
```
If we do the math entailed in the tibbles, we'll see why the newcomer could boast "I'm the best person for the job" (p. 175).
```{r}
weatherperson %>%
bind_rows(newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n() / 2),
hit = ifelse(prediction == observed, 1, 1 - prediction - observed)) %>%
group_by(person) %>%
summarise(n_days = n(),
n_hits = sum(hit),
hit_rate = mean(hit))
```
#### Costs and benefits.
Our new `points` variable doesn't fit into the nice color-based `geom_tile()` plots from above. But we can still do the math.
```{r}
weatherperson %>%
bind_rows(newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n()/2),
points = ifelse(observed == 1 & prediction != 1, -5,
ifelse(observed == 1 & prediction == 1, -1,
-1 * prediction))) %>%
group_by(person) %>%
summarise(happiness = sum(points))
```
#### Measuring accuracy.
> Consider for example computing the probability of predicting the exact sequence of days. This means computing the probability of a correct prediction for each day. Then multiply all of these probabilities together to get the joint probability of correctly predicting the observed sequence. This is the same thing as the joint likelihood, which you've been using up to this point to fit models with Bayes' theorem.
>
> In this light, the newcomer looks even worse. (p. 176)
```{r}
weatherperson %>%
bind_rows(newcomer) %>%
mutate(person = rep(c("weatherperson", "newcomer"), each = n() / 2),
hit = ifelse(prediction == observed, 1, 1 - prediction - observed)) %>%
group_by(person, hit) %>%
count() %>%
ungroup() %>%
mutate(power_character = str_c(hit, "^", n), # this line is just for pedagogy
power = hit ^ n,
term = rep(letters[1:2], times = 2)) %>%
select(person, term, power) %>%
spread(key = term, value = power) %>%
mutate(probability_correct_sequence = a * b)
```
### Information and uncertainty.
Within the context of information theory, *information* is "the reduction of uncertainty derived from learning an outcome" (p. 177). *Information entropy* is a way of measuring that uncertainty in a way that is (a) continuous, (b) increases as the number of possible events increases, and (c) is additive. The formula for information entropy is:
$$H(p) = - \text E \log (p_i) = - \sum_{i = 1}^n p_i \log (p_i).$$
McElreath put it in words as "the uncertainty contained in a probability distribution is the average log-probability of the event" (p. 178). We'll compute the information entropy for weather at the first unnamed location, which we'll call `McElreath's house`, and `Abu Dhabi` at once.
```{r}
tibble(place = c("McElreath's house", "Abu Dhabi"),
p_rain = c(.3, .01)) %>%
mutate(p_shine = 1 - p_rain) %>%
group_by(place) %>%
mutate(H_p = (p_rain * log(p_rain) + p_shine * log(p_shine)) %>% mean() * -1)
```
Did you catch how we used the equation $H(p) = - \sum_{i = 1}^n p_i \log (p_i)$ in our `mutate()` code, there? Our computation indicated the uncertainty is less in Abu Dhabi because it rarely rains, there. If you have sun, rain and snow, the entropy for weather is:
```{r}
p <- c(.7, .15, .15)
-sum(p * log(p))
```
"These entropy values by themselves don’t mean much to us, though. Instead we can use them to build a measure of accuracy. That comes next" (p. 178).
### From entropy to accuracy.
> How can we use information entropy to say how far a model is from the target? The key lies in **divergence**:
>
>> **Divergence**: The additional uncertainty induced by using probabilities from one distribution to describe another distribution.
>
> This is often known as *Kullback-Leibler divergence* or simply K-L divergence. (p. 179, **emphasis** in the original; see [Kullback & Leibler, 1951](https://projecteuclid.org/download/pdf_1/euclid.aoms/1177729694))
The formula for K-L divergence is
$$D_\text{KL} (p, q) = \sum_i p_i \big ( \log (p_i) - \log (q_i) \big ) = \sum_i p_i \log \bigg ( \frac{p_i}{q_i} \bigg ),$$
which, in plainer language, is what McElreath described as "the average difference in log probability between the target ($p$) and model ($q$)" (p. 179).
In McElreath's example
* $p_1 = .3$,
* $p_2 = .7$,
* $q_1 = .25$, and
* $q_2 = .75$.
With those values, we can compute $D_\text{KL} (p, q)$ within a tibble like so.
```{r}
tibble(p_1 = .3,
p_2 = .7,
q_1 = .25,
q_2 = .75) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
Did you notice how we used the formula $D_\text{KL} (p, q) = \sum_i p_i \log \Big ( \frac{p_i}{q_i} \Big )$ within `mutate()`?
Our systems in this section are binary (e.g., $q = \lbrace q_i, q_2 \rbrace$). Thus if you know $q_1 = .3$ you know of a necessity $q_2 = 1 - q_1$. Therefore we can code the tibble for the next example of when $p = q$ like this.
```{r}
tibble(p_1 = .3) %>%
mutate(p_2 = 1 - p_1,
q_1 = p_1) %>%
mutate(q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
Building off of that, you can make the data required for Figure 6.6 like this.
```{r}
t <-
tibble(p_1 = .3,
p_2 = .7,
q_1 = seq(from = .01, to = .99, by = .01)) %>%
mutate(q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
head(t)
```
Now we have the data, plotting Figure 6.6 is little more than `geom_line()` with stylistic flourishes.
```{r, fig.width = 3, fig.height = 2.75}
t %>%
ggplot(aes(x = q_1, y = d_kl)) +
geom_vline(xintercept = .3, color = carto_pal(7, "BurgYl")[5], linetype = 2) +
geom_line(color = carto_pal(7, "BurgYl")[7], size = 1.5) +
annotate(geom = "text", x = .4, y = 1.5, label = "q = p",
color = carto_pal(7, "BurgYl")[5], family = "Courier", size = 3.5) +
labs(x = expression(italic(q)[1]),
y = "Divergence of q from p") +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
> What divergence can do for us now is help us contrast different approximations to $p$. As an approximating function $q$ becomes more accurate, $D_\text{KL} (p, q)$ will shrink. So if we have a pair of candidate distributions, then the candidate that minimizes the divergence will be closest to the target. Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models. (p. 180)
#### Rethinking: Divergence depends upon direction.
Here we see $H(p, q) \neq H(q, p)$. That is, direction matters.
```{r}
tibble(direction = c("Earth to Mars", "Mars to Earth"),
p_1 = c(.01, .7),
q_1 = c(.7, .01)) %>%
mutate(p_2 = 1 - p_1,
q_2 = 1 - q_1) %>%
mutate(d_kl = (p_1 * log(p_1 / q_1)) + (p_2 * log(p_2 / q_2)))
```
The $D_\text{KL}$ was double when applying Martian estimates to Terran estimates.
### From divergence to deviance.
> The point of all the preceding material about information theory and divergence is to establish both:
>
> 1. How to measure the distance of a model from our target. Information theory gives us the distance measure we need, the K-L divergence.
>
> 2. How to estimate the divergence. Having identified the right measure of distance, we now need a way to estimate it in real statistical modeling tasks. (p. 181)
Now we'll start working on item #2.
Within the context of science, say we've labeled the true model for our topic of interest as $p$. We don't actually know what $p$ is--we wouldn't need the scientific method if we did. But say what we do have are two candidate models $q$ and $r$. We would at least like to know which is closer to $p$. It turns out we don't even need to know the absolute value of $p$ to achieve this. Just the relative values of $q$ and $r$ will suffice. We express model $q$'s average log-probability as $\text E \log (q_i)$. Extrapolating, the difference $\text E \log (q_i) - \text E \log (r_i)$ gives us a sense about the divergence of both $q$ and $r$ from the target $p$. That is,
> we can compare the average log-probability from each model to get an estimate of the relative distance of each model from the target..., [which] delivers us to a very common measure of *relative* model fit, one that also turns out to be an approximation of K-L divergence. To approximate the relative value of we can use a model's **deviance**, which is defined as:
>
> $$D(q) = -2 \sum_i \log (q_i)$$
>
>where $i$ indexes each observation (case), and each $q_i$ is just the likelihood of case $i$. (p. 182, **emphasis** in the original)
Here's the deviance from model `b6.1`.
```{r}
lm(data = d,
brain ~ mass) %>%
logLik() * -2
```
Next we learn how do this by hand.
#### Overthinking: Computing deviance.
To follow along with the text, we'll need to standardize `mass` before we fit our model.
```{r}
d <-
d %>%
mutate(mass_s = (mass - mean(mass)) / sd(mass))
```
Open brms.
```{r, message = F, warning = F}
library(brms)
```
Now we'll specify the initial values and fit the model.
```{r b6.8}
# define the starting values
inits <- list(Intercept = mean(d$brain),
mass_s = 0,
sigma = sd(d$brain))
inits_list <- list(inits, inits, inits, inits)
# The model
b6.8 <-
brm(data = d, family = gaussian,
brain ~ 1 + mass_s,
prior = c(prior(normal(0, 1000), class = Intercept),
prior(normal(0, 1000), class = b),
prior(cauchy(0, 10), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
inits = inits_list, # here we insert our start values
seed = 6,
file = "fits/b06.08")
```
```{r}
print(b6.8)
```
**Details about `inits`**: You don't have to specify your `inits` lists outside of the `brm()` function the way we did, here. This is just how I currently prefer. When you specify start values for the parameters in your Stan models, you need to do so with a list of lists. You need as many lists as HMC chains--four in this example. And then you put your--in this case--four lists inside a list. You need lists within lists. Also, we were lazy and specified the same start values across all our chains. You can mix them up across chains if you want.
Anyway, the brms `log_lik()` function returns a matrix. Each occasion gets a column and each HMC chain iteration gets a row. To make it easier to understand the output, we'll name the columns by `species` using the [`.name_repair` argument](https://tibble.tidyverse.org/reference/name-repair.html) within the `as_tibble()` function.
```{r}
ll <-
b6.8 %>%
log_lik() %>%
as_tibble(.name_repair = ~ d$species)
ll %>%
glimpse()
```
Deviance, recall, is the sum of the occasion-level LLs multiplied by -2: $D(q) = -2 \sum_i \log (q_i)$. Why by -2? "The -2 in front doesn't do anything important. It's there for historical reasons" (p. 182). If you follow footnote 93 at the end of that sentence in the text, you'll learn "under somewhat general conditions, for many common model types, a difference between two deviances has a chi-squared distribution. The factor of 2 is there to scale it that way" (p. 451).
```{r}
ll <-
ll %>%
mutate(sums = rowSums(.),
deviance = -2 * sums)
glimpse(ll)
```
Because we used HMC, deviance is a distribution rather than a single number. That is, we have a deviance $D(q)$ value for each row, for each HMC iteration.
```{r, fig.width = 3.75, fig.height = 2.5, warning = F, message = F}
library(tidybayes)
ll %>%
ggplot(aes(x = deviance, y = 0)) +
geom_halfeyeh(point_interval = median_qi, .width = .95,
fill = carto_pal(7, "BurgYl")[5], color = carto_pal(7, "BurgYl")[7]) +
scale_x_continuous(breaks = quantile(ll$deviance, c(.025, .5, .975)),
labels = quantile(ll$deviance, c(.025, .5, .975)) %>% round(1)) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle("The deviance distribution") +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
But notice our deviance distribution was centered right around the sole value McElreath reported in the text.
### From deviance to out-of-sample.
> Deviance is a principled way to measure distance from the target. But deviance as computed in the previous section has the same flaw as $R^2$: It always improves as the model gets more complex, at least for the types of models we have considered so far. Just like $R^2$, deviance in-sample is a measure of retrodictive accuracy, not predictive accuracy.
In the next subsection, we'll see this in a simulation which will produce the data necessary to make Figure 6.7.
#### Overthinking: Simulated training and testing.
I find the `rethinking::sim.train.test()` function opaque. If you're curious, you can find McElreath's code [here](https://github.com/rmcelreath/rethinking/blob/a309712d904d1db7af1e08a76c521ab994006fd5/R/sim_train_test.R). Let's simulate and see what happens.
```{r, eval = F}
library(rethinking)
n <- 20
kseq <- 1:5
# I've reduced this number by one order of magnitude to reduce computation time
n_sim <- 1e3
n_cores <- 4
# here's our dev object based on `N <- 20`
dev_20 <-
sapply(kseq, function(k) {
print(k);
r <- mcreplicate(n_sim, sim.train.test(N = n, k = k),
mc.cores = n_cores);
c(mean(r[1, ]), mean(r[2, ]), sd(r[1, ]), sd(r[2, ]))
}
)
# here's our dev object based on N <- 100
n <- 100
dev_100 <-
sapply(kseq, function(k) {
print(k);
r <- mcreplicate(n_sim, sim.train.test(N = n, k = k),
mc.cores = n_cores);
c(mean(r[1, ]), mean(r[2, ]), sd(r[1, ]), sd(r[2, ]))
}
)
```
```{r, echo = F}
# Even after reducing `n_sim` by an order of magnitude, that simulation takes 4.528313 mins to
# complete. So instead of completing it on the fly for each new version of this project, I've
# saved the results in an external file.
# save(list = c("dev_100", "dev_20", "kseq", "n_sim", "n_cores"), file = "sims/06.sim_1.rda")
load("sims/06.sim_1.rda")
```
If you didn't quite catch it, the simulation yields `dev_20` and `dev_100`. We'll want to convert them to tibbles, bind them together, and wrangle extensively before we're ready to plot.
```{r, message = F, warning = F}
dev_tibble <-
dev_20 %>%
as_tibble() %>%
bind_rows(
dev_100 %>%
as_tibble()
) %>%
mutate(n = rep(c("n = 20", "n = 100"), each = 4),
statistic = rep(c("mean", "sd"), each = 2) %>% rep(., times = 2),
sample = rep(c("in", "out"), times = 2) %>% rep(., times = 2)) %>%
gather(n_par, value, -n, -statistic, -sample) %>%
spread(key = statistic, value = value) %>%
mutate(n = factor(n, levels = c("n = 20", "n = 100")),
n_par = str_remove(n_par, "V") %>% as.double()) %>%
mutate(n_par = ifelse(sample == "in", n_par - .075, n_par + .075))
head(dev_tibble)
```
Now we're ready to make Figure 6.7.
```{r, fig.width = 6, fig.height = 3}
# this intermediary tibble will make `geom_text()` easier
dev_text <-
dev_tibble %>%
filter(n_par > 1.5,
n_par < 2.5) %>%
mutate(n_par = ifelse(sample == "in", n_par - .2, n_par + .28))
# the plot
dev_tibble %>%
ggplot(aes(x = n_par, y = mean,
ymin = mean - sd, ymax = mean + sd,
group = sample, color = sample, fill = sample)) +
geom_pointrange(shape = 21) +
geom_text(data = dev_text,
aes(label = sample)) +
scale_color_manual(values = c(carto_pal(7, "BurgYl")[7], carto_pal(7, "BurgYl")[5])) +
scale_fill_manual(values = c(carto_pal(7, "BurgYl")[5], carto_pal(7, "BurgYl")[7])) +
labs(x = "number of parameters",
y = "deviance") +
theme_classic() +
theme(text = element_text(family = "Courier"),
legend.position = "none",
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)),
strip.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[1], 1/4), color = "white")) +
facet_wrap(~n, scale = "free_y")
```
Even with a substantially smaller $N$, our simulation results matched up well with those in the text.
> Deviance is an assessment of predictive accuracy, not of truth. The true model, in terms of which predictors are included, is not guaranteed to produce the best predictions. Likewise a false model, in terms of which predictors are included, is not guaranteed to produce poor predictions.
>
> The point of this thought experiment is to demonstrate how deviance behaves, in theory. While deviance on training data always improves with additional predictor variables, deviance on future data may or may not, depending upon both the true data-generating processes and how much data is available to precisely estimate the parameters. These facts form the basis for understanding both regularizing priors and information criteria. (p. 185)
## Regularization
> The root of overfitting is a model's tendency to get overexcited by the training sample... One way to prevent a model from getting too excited by the training sample is to give it a skeptical prior. By "skeptical," I mean a prior that slows the rate of learning from the sample. (p. 186)
In case you were curious, here's how you might do Figure 6.8 with ggplot2. All the action is in the `geom_ribbon()` portions.
```{r, fig.width = 3, fig.height = 3}
tibble(x = seq(from = - 3.5,
to = 3.5,
by = .01)) %>%
ggplot(aes(x = x, ymin = 0)) +
geom_ribbon(aes(ymax = dnorm(x, mean = 0, sd = 0.2)),
fill = carto_pal(7, "BurgYl")[7], alpha = 1/2) +
geom_ribbon(aes(ymax = dnorm(x, mean = 0, sd = 0.5)),
fill = carto_pal(7, "BurgYl")[6], alpha = 1/2) +
geom_ribbon(aes(ymax = dnorm(x, mean = 0, sd = 1)),
fill = carto_pal(7, "BurgYl")[5], alpha = 1/2) +
scale_y_continuous(NULL, breaks = NULL) +
xlab("parameter value") +
coord_cartesian(xlim = c(-3, 3)) +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)))
```
In our version of the plot, darker purple = more regularizing.
But to prepare for Figure 6.9, let's simulate. This time we'll wrap the basic simulation code we used before into a function we'll call `make_sim()`. Our `make_sim()` function has two parameters, `N` and `b_sigma`, both of which come from McElreath's simulation code. So you'll note that instead of hard coding the values for `N` and `b_sigma` within the simulation, we're leaving them adjustable (i.e., `sim.train.test(N = n, k = k, b_sigma = b_sigma)`). Also notice that instead of saving the simulation results as objects, like before, we're just converting them to tibbles with the `as_tibble()` function at the bottom. Our goal is to use `make_sim()` within a `purrr::map2()` statement. The result will be a nested tibble into which we've saved the results of 6 simulations based off of two sample sizes (i.e., `n = c(20, 100)`) and three values of $\sigma$ for our Gaussian $\beta$ prior (i.e., `b_sigma = c(1, .5, .2)`).
```{r, eval = F}
library(rethinking)
# I've reduced this number by one order of magnitude to reduce computation time
n_sim <- 1e3
make_sim <- function(n, b_sigma) {
sapply(kseq, function(k) {
print(k);
# this is an augmented line of code
r <- mcreplicate(n_sim, sim.train.test(N = n, k = k, b_sigma = b_sigma),
mc.cores = n_cores);
c(mean(r[1, ]), mean(r[2, ]), sd(r[1, ]), sd(r[2, ])) }) %>%
# this is a new line of code
as_tibble()
}
s <-
tibble(n = rep(c(20, 100), each = 3),
b_sigma = rep(c(1, .5, .2), times = 2)) %>%
mutate(sim = map2(n, b_sigma, make_sim)) %>%
unnest()
```
```{r, echo = F}
# Even after reducing `n_sim` by an order of magnitude, that simulation takes 11.77091 mins to
# complete. So instead of completing it on the fly for each new version of this project, I’ve
# saved the results in an external file.
# save("s", file = "sims/06.sim_2.rda")
load("sims/06.sim_2.rda")
```
We'll follow the same principles for wrangling these data as we did those from the previous simulation, `dev_tibble`. And after wrangling, we'll feed the data directly into the code for our version of Figure 6.9.
```{r, fig.width = 6, fig.height = 3}
# wrangle the simulation data
s %>%
mutate(statistic = rep(c("mean", "sd"), each = 2) %>% rep(., times = 3 * 2),
sample = rep(c("in", "out"), times = 2) %>% rep(., times = 3 * 2)) %>%
gather(n_par, value, -n, -b_sigma, -statistic, -sample) %>%
spread(key = statistic, value = value) %>%
mutate(n = str_c("n = ", n) %>% factor(., levels = c("n = 20", "n = 100")),
n_par = str_remove(n_par, "V") %>% as.double()) %>%
# now plot
ggplot(aes(x = n_par, y = mean,
group = interaction(sample, b_sigma))) +
geom_line(aes(color = sample, size = b_sigma %>% as.character())) +
# this function contains the data from the previous simulation
geom_point(data = dev_tibble,
aes(group = sample, fill = sample),
color = "black", shape = 21, size = 2.5, stroke = .1) +
scale_fill_manual(values = c(carto_pal(7, "BurgYl")[7], carto_pal(7, "BurgYl")[5])) +
scale_color_manual(values = c(carto_pal(7, "BurgYl")[7], carto_pal(7, "BurgYl")[5])) +
scale_size_manual(values = c(1, .5, .2)) +
labs(x = "number of parameters",
y = "deviance") +
theme_classic() +
theme(text = element_text(family = "Courier"),
legend.position = "none",
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)),
strip.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[1], 1/4), color = "white")) +
facet_wrap(~n, scale = "free_y")
```
Our results don't perfectly align with those in the text. I suspect his is because we used `1e3` iterations, rather than the `1e4` of the text. If you'd like to wait all night long for the simulation to yield more stable results, be my guest.
> Regularizing priors are great, because they reduce overfitting. But if they are too skeptical, they prevent the model from learning from the data. So to use them effectively, you need some way to tune them. Tuning them isn't always easy. (p. 187)
For more on this how to choose your priors, consider Gelman, Simpson, and Betancourt's [*The prior can generally only be understood in the context of the likelihood*](https://arxiv.org/abs/1708.07487), a paper that will probably make more sense after [Chapter 9][Big Entropy and the Generalized Linear Model]. And if you're feeling feisty, also check out Simpson's related blog post [*(It’s never a) Total Eclipse of the Prior*](https://statmodeling.stat.columbia.edu/2017/09/05/never-total-eclipse-prior/).
#### Rethinking: Multilevel models as adaptive regularization.
> When you encounter multilevel models in [Chapter 12][Multilevel Models], you'll see that their central device is to learn the strength of the prior form the data itself. So you can think of multilevel models as adaptive regularization, where the model itself tries to learn how skeptical it should be. (p. 188)
I found this connection difficult to grasp for a long time. Practice now and hopefully it'll sink in for you faster than it did me.
#### Rethinking: Ridge regression.
Within the brms framework, you can do something like this with the horseshoe prior via the `horseshoe()` function. You can learn all about it from the `horseshoe` section of the [brms reference manual (version 2.12.0)](https://cran.r-project.org/package=brms/brms.pdf). Here's an extract from the section:
> The horseshoe prior is a special shrinkage prior initially proposed by [Carvalho et al. (2009)](https://pdfs.semanticscholar.org/edf6/117a38afa73b211259c88e4e5539891f6c2c.pdf?_ga=2.142875701.1091907784.1577308805-576400738.1563831905). It is symmetric around zero with fat tails and an infinitely large spike at zero. This makes it ideal for sparse models that have many regression coefficients, although only a minority of them is non-zero. The horseshoe prior can be applied on all population-level effects at once (excluding the intercept) by using `set_prior("horseshoe(1)")`. (p. 93)
And to dive even deeper into the horseshoe prior, check out Michael Betancourt's tutorial, [*Bayes Sparse Regression*](https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html#35_the_horseshoe).
## Information criteria
The data from our initial simulation isn't formatted well to plot Figure 6.10. We'll have to wrangle a little.
```{r}
(
dev_tibble <-
dev_tibble %>%
select(-sd) %>%
mutate(n_par = ifelse(sample == "in", n_par + .075, n_par - .075)) %>%
spread(key = sample, value = mean) %>%
mutate(height = (out - `in`) %>% round(digits = 1) %>% as.character(),
dash = `in` + 2 * n_par)
)
```
Now we're ready to plot.
```{r, fig.width = 6, fig.height = 3}
dev_tibble %>%
ggplot(aes(x = n_par)) +
geom_line(aes(y = dash),
linetype = 2, color = carto_pal(7, "BurgYl")[5]) +
geom_point(aes(y = `in`),
color = carto_pal(7, "BurgYl")[7], size = 2) +
geom_point(aes(y = out),
color = carto_pal(7, "BurgYl")[5], size = 2) +
geom_errorbar(aes(x = n_par + .15,
ymin = `in`, ymax = out),
width = .1, color = carto_pal(7, "BurgYl")[6]) +
geom_text(aes(x = n_par + .4,
y = (out + `in`) / 2,
label = height),
family = "Courier", size = 3, color = carto_pal(7, "BurgYl")[6]) +
labs(x = "number of parameters",
y = "deviance") +
theme_classic() +
theme(text = element_text(family = "Courier"),
panel.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[3], 1/4)),
strip.background = element_rect(fill = alpha(carto_pal(7, "BurgYl")[1], 1/4), color = "white")) +
facet_wrap(~n, scale = "free_y")
```
Again, our numbers aren't the exact same as McElreath's because a) this is a simulation and b) our number of simulations was an order of magnitude smaller than his. But the overall pattern is the same. More to the point, the distances between the in- and out-of-sample points
> are nearly the same, for each model, at both $N = 20$ (left) and $N = 100$ (right). Each distance is nearly twice the number of parameters, as labeled on the horizontal axis. The dashed lines show exactly the [dark purple] points plus twice the number of parameters, tracing closely along the average out-of-sample deviance for each model.
>
> This is the phenomenon behind information criteria. (p. 189)
*Information criteria estimate out-of-sample deviance*. The frequentist AIC is the oldest and most restrictive. In the text, McElreach focused on the DIC and WAIC. As you'll see, the LOO has increased in popularity since he published the text. Going forward, we'll juggle the WAIC and the LOO in this project. But we will respect the text and work in a little DIC talk.
### DIC.
The DIC has been widely used for some time, now. For a great talk on the DIC, check out the authoritative David Spiegelhalter's [*Retrospective read paper: Bayesian measure of model complexity and fit*](https://www.youtube.com/watch?v=H-59eqmHuuQ&frags=pl%2Cwn). If we define $D$ as the deviance's posterior distribution, $\bar D$ as its mean and $\hat D$ as the deviance when computed at the posterior mean, then we define the DIC as
$$\text{DIC} = \bar D + (\bar D - \hat D) = \bar D + p_D$$
And $p_D$ is the number of effective parameters in the model, which is also sometimes referred to as the penalty term. As you'll see, you can get the $p_D$ for `brms::brm()` models. However, I'm not aware of a way to that brms or the loo package--to be introduced shortly--offer convenience functions that yield the DIC.
I'm also not aware of an easy to compute the DIC by hand the way we've been doing, so far. I'll walk it out with our model `b6.8`. Remember how we had an entire posterior distribution for the deviance?
```{r}
ll %>%
select(deviance) %>%
glimpse()