---
title: "Preprocess your data with recipes"
output:
  html_document:
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
options(tibble.print_min = 5)
```
Get started with building a model in this R Markdown document, which accompanies the [Preprocess your data with recipes](https://www.tidymodels.org/start/recipes) tidymodels start article.
If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.
Take advantage of the RStudio IDE and use the "Run All Chunks Above" or "Run Current Chunk" buttons to easily execute code chunks.
## [Introduction](https://www.tidymodels.org/start/recipes/#intro)
Load necessary packages:
```{r}
library(tidymodels) # for the recipes package, along with the rest of tidymodels
# Helper packages
library(nycflights13) # for flight data
library(skimr) # for variable summaries
```
Load and wrangle data:
```{r}
flight_data <-
  flights %>%
  mutate(
    # Convert the arrival delay to a factor
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    # We will use the date (not date-time) in the recipe below
    date = as.Date(time_hour)
  ) %>%
  # Include the weather data
  inner_join(weather, by = c("origin", "time_hour")) %>%
  # Only retain the specific columns we will use
  select(dep_time, flight, origin, dest, air_time, distance,
         carrier, date, arr_delay, time_hour) %>%
  # Exclude missing data
  na.omit() %>%
  # For creating models, it is better to have qualitative columns
  # encoded as factors (instead of character strings)
  mutate_if(is.character, as.factor)
```
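
To see what the `ifelse()` and `factor()` calls above do to the outcome, here is a minimal base R sketch on a made-up vector of delays. The `toy_delay` values and the `toy_*` names are illustrative assumptions, not taken from `flights`.

```r
# Hypothetical arrival delays in minutes, including one missing value
toy_delay <- c(45, 10, 30, NA, -5)

# Delays of 30 minutes or more are labelled "late"; NA stays NA
toy_label <- ifelse(toy_delay >= 30, "late", "on_time")

# factor() sorts the levels alphabetically, so "late" becomes the first level
toy_factor <- factor(toy_label)
levels(toy_factor)

# Count each class, keeping the missing value visible
table(toy_factor, useNA = "ifany")
```

The missing delay stays `NA` after the conversion, which is why the chunk above still needs `na.omit()`.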
Before moving forward, let's reduce the size of our data so we can run these analyses with the default computational resources on RStudio Cloud. This keeps the session from being aborted.
Let's sample 20% of the rows and use that sample as our data:
```{r}
# Fix the random numbers by setting the seed
# This enables the analysis to be reproducible when random numbers are used
set.seed(3)
flight_data <- flight_data %>%
  slice_sample(prop = 0.2)
nrow(flight_data)
```
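
As a small illustration of why the seed matters, the sketch below draws the same random sample twice after resetting the seed. It uses base R's `sample()` rather than `slice_sample()`, and the `first_draw` and `second_draw` names are made up for this example.

```r
# Draw 20 values out of 100 after fixing the seed
set.seed(3)
first_draw <- sample(1:100, size = 20)

# Resetting the same seed reproduces exactly the same draw
set.seed(3)
second_draw <- sample(1:100, size = 20)

identical(first_draw, second_draw)
```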
Note that because we are using a subset of the original data set, the results you generate here will differ slightly from those in the *Preprocess your data with recipes* article.
Check the number of delayed flights:
```{r}
flight_data %>%
  count(arr_delay) %>%
  mutate(prop = n / sum(n))
```
For example, the counts of `late` and `on_time` flights you get here are smaller than those in the article. The proportions are very close, though, which suggests that the random sample did not over- or under-represent either category.
Take a look at data types and data points:
```{r}
glimpse(flight_data)
```
Summarise the dataset:
```{r}
flight_data %>%
  skimr::skim(dest, carrier)
```
## [Data splitting](https://www.tidymodels.org/start/recipes/#data-split)
Create training and test sets:
```{r}
# Put 3/4 of the data into the training set
data_split <- initial_split(flight_data, prop = 3/4)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
```
Try typing `?initial_split` in the console to get more details about the splitting function from the `rsample` package.
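
To make the split concrete, here is a minimal sketch on a made-up data frame of 100 rows. The `toy` data and all `toy_*` names are assumptions for illustration; only `initial_split()`, `training()`, and `testing()` come from the chunk above.

```r
library(rsample)

# Hypothetical data: 100 rows with a unique id and a two-class outcome
toy <- data.frame(
  id = 1:100,
  outcome = factor(rep(c("late", "on_time"), times = c(20, 80)))
)

set.seed(123)
toy_split <- initial_split(toy, prop = 3/4)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Three quarters of the rows go to training, the rest to testing
nrow(toy_train)
nrow(toy_test)

# No row ends up in both sets
length(intersect(toy_train$id, toy_test$id))
```

`initial_split()` also accepts a `strata` argument, which may be worth trying when one outcome class is much rarer than the other.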
## [Create recipe and roles](https://www.tidymodels.org/start/recipes/#recipe)
Let's initiate a new recipe:
```{r}
flights_rec <-
  recipe(arr_delay ~ ., data = train_data)
```
You can see more details about how to create **recipes** by typing `?recipe` in the console.
Update variable roles of a recipe with `update_role`:
```{r}
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID")
```
You can also read more about adding/updating/removing roles with `?roles`.
To get the current set of variables and roles, use the `summary()` function:
```{r}
summary(flights_rec)
```
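
The effect of `update_role()` is easiest to see on a tiny example. The sketch below uses a made-up data frame (`toy`) with one outcome, one identifier, and one predictor; the data and the `toy_*` names are assumptions for illustration.

```r
library(recipes)

# Hypothetical data with an outcome, an identifier, and a predictor
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time", "late")),
  flight    = c(1L, 2L, 3L, 4L),
  distance  = c(100, 200, 300, 400)
)

toy_rec <-
  recipe(arr_delay ~ ., data = toy) %>%
  update_role(flight, new_role = "ID")

# summary() returns one row per variable together with its role
toy_roles <- summary(toy_rec)
toy_roles[, c("variable", "role")]
```

Here `flight` is kept in the data with the role `"ID"`, so it is no longer treated as a predictor, while `distance` remains one.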
## [Create features](https://www.tidymodels.org/start/recipes/#features)
What happens if we convert the `date` column to `numeric`?
```{r}
flight_data %>%
  distinct(date) %>%
  mutate(numeric_date = as.numeric(date))
```
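
For a single date, the numeric conversion can be checked by hand in base R. This is only a sketch; the `toy_date` name is made up for the example.

```r
toy_date <- as.Date("2013-01-01")

# A Date is stored as the number of days since 1970-01-01
as.numeric(toy_date)
as.numeric(toy_date) == as.numeric(toy_date - as.Date("1970-01-01"))

# Calendar features carry more meaning than the raw day count
format(toy_date, "%m")       # month as a two-digit string
as.POSIXlt(toy_date)$wday    # day of the week, where 0 is Sunday
```

The raw number only encodes a linear trend over time, whereas the month and the day of the week capture recurring patterns.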
From `date` we can derive more meaningful features such as:
* the day of the week,
* the month, and
* whether or not the date corresponds to a holiday.
Add **steps** to your recipe to generate these features:
```{r}
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date)
```
Check out the help pages for these step functions with `?step_date`, `?step_holiday`, and `?step_rm`.
Create dummy variables using `step_dummy()`:
```{r}
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes())
```
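
To see what `step_dummy()` produces, the sketch below prepares a small recipe on made-up data and returns the processed training set. The `toy` data frame and the `toy_dummies` name are assumptions for illustration; `prep()` and `bake()` are the recipes functions that estimate and apply the steps.

```r
library(recipes)

# Hypothetical data with one three-level factor predictor
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time", "late", "on_time", "late")),
  origin    = factor(c("EWR", "JFK", "LGA", "EWR", "JFK", "LGA")),
  distance  = c(200, 950, 730, 1400, 2475, 185)
)

toy_dummies <-
  recipe(arr_delay ~ ., data = toy) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep() %>%
  bake(new_data = NULL)

# The factor column is replaced by one indicator column per non-reference level
names(toy_dummies)
```

The first level (`EWR`) acts as the reference level, so a three-level factor yields two indicator columns.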
Check whether any destinations present in the test set are missing from the training set:
```{r}
test_data %>%
  distinct(dest) %>%
  anti_join(train_data, by = "dest")
```
Remove variables that contain only a single value with `step_zv()`:
```{r}
flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
```
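
The sketch below shows why `step_zv()` is useful after `step_dummy()`. It uses a made-up data frame in which the factor `dest` has a level (`LEX`) that never occurs in the rows, mimicking a destination that is absent from the training set. The data and the `toy_*`, `without_zv`, and `with_zv` names are assumptions for illustration.

```r
library(recipes)

# Hypothetical training data: "LEX" is a known level but has no rows
toy <- data.frame(
  arr_delay = factor(c("late", "on_time", "on_time", "late")),
  dest      = factor(c("ATL", "BOS", "ATL", "BOS"),
                     levels = c("ATL", "BOS", "LEX")),
  distance  = c(760, 190, 760, 190)
)

toy_rec <-
  recipe(arr_delay ~ ., data = toy) %>%
  step_dummy(all_nominal(), -all_outcomes())

# Without step_zv(), the all-zero indicator for "LEX" is kept
without_zv <- toy_rec %>%
  prep() %>%
  bake(new_data = NULL)

# With step_zv(), columns holding a single value are dropped
with_zv <- toy_rec %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

names(without_zv)
names(with_zv)
```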
## [Fit a model with a recipe](https://www.tidymodels.org/start/recipes/#fit-workflow)
Recall the [Build a model](https://www.tidymodels.org/start/models/) article.
This time we build a model specification for logistic regression using the `glm` engine:
```{r}
lr_mod <-
  logistic_reg() %>%
  set_engine("glm")
```
For more details try typing `?set_engine` and `?glm` in the console.
Bundle the model specification (`lr_mod`) with the recipe (`flights_rec`) to create a *model workflow*:
```{r}
flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)
flights_wflow
```
Prepare the recipe and train the model. Be patient; this step takes a little time to compute:
```{r}
flights_fit <-
  flights_wflow %>%
  fit(data = train_data)
```
Pull the fitted model object, then use the `broom::tidy()` function to get a tidy tibble of model coefficients:
```{r}
flights_fit %>%
  extract_fit_parsnip() %>%
  tidy()
```
## [Use a trained workflow to predict](https://www.tidymodels.org/start/recipes/#predict-workflow)
Apply the fitted workflow to `test_data` to predict outcomes:
```{r}
predict(flights_fit, test_data)
```
Get predicted class probabilities and bind them with some variables from the test data:
```{r}
flights_pred <-
  predict(flights_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(arr_delay, time_hour, flight))
# The data look like:
flights_pred
```
Note that the results you get here will differ from those in the online article because we fitted the model to only a subset of the full data set.
Let's assess model performance with a ROC curve: compute it with `roc_curve()` and plot it by piping the result to `autoplot()`.
```{r}
flights_pred %>%
  roc_curve(truth = arr_delay, .pred_late) %>%
  autoplot()
```
Similarly, `roc_auc()` estimates the area under the curve:
```{r}
flights_pred %>%
  roc_auc(truth = arr_delay, .pred_late)
```
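
One way to interpret the area under the ROC curve is as the probability that a randomly chosen late flight receives a higher predicted probability of being late than a randomly chosen on-time flight. The base R sketch below checks this on four made-up predictions. It is an illustration of the concept, not necessarily how `roc_auc()` computes its estimate, and all `toy_*` names and values are assumptions.

```r
# Hypothetical observed classes and predicted probabilities of "late"
toy_truth     <- factor(c("late", "late", "on_time", "on_time"))
toy_prob_late <- c(0.9, 0.4, 0.6, 0.1)

is_late   <- toy_truth == "late"
n_late    <- sum(is_late)
n_on_time <- sum(!is_late)

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve
ranks <- rank(toy_prob_late)
toy_auc <- (sum(ranks[is_late]) - n_late * (n_late + 1) / 2) /
  (n_late * n_on_time)
toy_auc

# Same quantity by brute force: the share of (late, on_time) pairs
# in which the late flight has the higher predicted probability
toy_pairs <- outer(toy_prob_late[is_late], toy_prob_late[!is_late], ">")
mean(toy_pairs)
```

Three of the four pairs are ordered correctly here, so both calculations give 0.75.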
Good job!
Now it's your turn to test out this workflow [*without*](https://tidymodels.github.io/workflows/reference/add_formula.html) this recipe!
In the [Build a model](https://www.tidymodels.org/start/models/) article, we did not use a recipe but used a **formula** instead.
You can use `workflows::add_formula(arr_delay ~ .)` instead of `add_recipe()` (remember to remove the identification variables first!), and see whether our recipe improved our model's ability to predict late arrivals.