forked from tidymodels/cloudstart
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_build_a_model.Rmd
189 lines (138 loc) · 4.5 KB
/
01_build_a_model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Build a model"
output:
html_document:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
options(tibble.print_min = 5)
```
Get started with building a model in this R Markdown document that accompanies [Build a model](https://www.tidymodels.org/start/models/) tidymodels start article.
If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.
Take advantage of the RStudio IDE and use "Run All Chunks Above" or "Run Current Chunk" buttons to easily execute code chunks.
## [Introduction](https://www.tidymodels.org/start/models/#intro)
Load necessary packages:
```{r}
library(tidymodels) # for the parsnip package, along with the rest of tidymodels
# Helper packages
library(readr) # for importing data
library(broom.mixed) # for converting bayesian models to tidy tibbles
```
## [The Sea Urchins Data](https://www.tidymodels.org/start/models/#data)
```{r}
urchins <-
# Data were assembled for a tutorial
# at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
read_csv("https://tidymodels.org/start/models/urchins.csv") %>%
# Change the names to be a little more verbose
setNames(c("food_regime", "initial_volume", "width")) %>%
# Factors are very helpful for modeling, so we convert one column
mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))
```
Look at the data:
```{r}
urchins
```
Plot the data:
```{r}
ggplot(urchins,
aes(x = initial_volume,
y = width,
group = food_regime,
col = food_regime)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_color_viridis_d(option = "plasma", end = .7)
```
## [Build and fit a model](https://www.tidymodels.org/start/models/#build-model)
```{r}
linear_reg() %>%
set_engine("lm")
```
Try typing `?linear_reg()` in the console to see all available engines and other details about this model type.
Create model specification:
```{r}
lm_mod <-
linear_reg() %>%
set_engine("lm")
```
Fit model:
```{r}
lm_fit <-
lm_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
lm_fit
```
Present model results in a tidyverse friendly way with `tidy()` from `broom` package.
```{r}
tidy(lm_fit)
```
## [Use a model to predict](https://www.tidymodels.org/start/models/#predict-model)
New example data to predict:
```{r}
new_points <- expand.grid(initial_volume = 20,
food_regime = c("Initial", "Low", "High"))
new_points
```
Generate the mean body width values:
```{r}
mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
```
Get confidence intervals and plot:
```{r}
conf_int_pred <- predict(lm_fit,
new_data = new_points,
type = "conf_int")
conf_int_pred
# Now combine:
plot_data <-
new_points %>%
bind_cols(mean_pred) %>%
bind_cols(conf_int_pred)
# and plot:
ggplot(plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower,
ymax = .pred_upper),
width = .2) +
labs(y = "urchin size")
```
## [Model with a different engine](https://www.tidymodels.org/start/models/#new-engine)
Switch to Bayesian approach by simply changing your engine to **stan**:
```{r}
# set the prior distribution
prior_dist <- rstanarm::student_t(df = 1)
set.seed(123)
# make the parsnip model
bayes_mod <-
linear_reg() %>%
set_engine("stan",
prior_intercept = prior_dist,
prior = prior_dist)
# train the model
bayes_fit <-
bayes_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
print(bayes_fit, digits = 5)
```
To update the parameter table, the `tidy()` method is once again used:
```{r}
tidy(bayes_fit, intervals = TRUE)
```
Get your predictions without changing the syntax you used earlier:
```{r}
bayes_plot_data <-
new_points %>%
bind_cols(predict(bayes_fit, new_data = new_points)) %>%
bind_cols(predict(bayes_fit, new_data = new_points, type = "conf_int"))
ggplot(bayes_plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = .2) +
labs(y = "urchin size") +
ggtitle("Bayesian model with t(1) prior distribution")
```
Think about how we are using the pipe (`%>%`):
+ Use the pipe to pass around the _data_ in the **tidyverse**
+ Use the pipe to pass around the _model object_ with **tidymodels**