Merge branch 'main' into dark-mode
topepo committed Jun 29, 2024
2 parents c520018 + 7b3731c commit 361cb00
Showing 11 changed files with 297 additions and 53 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Quarto
Package: aml4td
Title: Applied Machine Learning for Tabular Data
Version: 24.01.02
Version: 24.06.17
Authors@R: c(
person("Max", "Kuhn", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2402-136X")),
Binary file modified RData/deliveries_lm.RData
Binary file not shown.
2 changes: 1 addition & 1 deletion _freeze/chapters/introduction/execute-results/html.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions _freeze/index/execute-results/html.json

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions chapters/categorical-predictors.qmd
@@ -381,8 +381,6 @@ hash_256 <-
```{r}
#| label: tbl-feature-hash
remake_name <- function(x) {
  x <- gsub("_", " ", x)
  stringr::str_to_title(x)
226 changes: 226 additions & 0 deletions chapters/interactions-nonlinear.qmd
@@ -1315,6 +1315,232 @@ The categorical encoding approach shown in @eq-effect-posterior from @sec-effect

As mentioned, we’ll return to this topic several times in upcoming sections.


## Discretization

Discretization^[Also known as binning or dichotomization.] is the process of converting quantitative data into a set of qualitative groups (a.k.a. "bins"). The model uses these values instead of the original predictor column (perhaps requiring an additional step to convert them into binary indicator columns). Sadly, @fig-wall-of-pie illustrates a type of analysis that we have witnessed numerous times. This example uses the food delivery data and breaks the order hour and distance predictors into six and three groups, respectively. It also converts the delivery time outcome into three groups. This visualization, colloquially known as the "Wall of Pie," tries to explain how the two predictors affect the outcome categories, often with substantial subjectivity.

```{r}
#| label: fig-wall-of-pie
#| echo: false
#| out-width: 60%
#| fig-width: 6
#| fig-height: 3
#| fig-cap: "An unfortunate visualization of two predictors and the outcome in the food delivery data. The colors of the pie chart reflect the binned delivery times at cut points of 20 and 30 minutes with lighter blue indicating earlier times."
# Bin distance at (rounded) tertiles of the training set values
dist_pctl <- quantile(delivery_train$distance, probs = seq(0, 1, length.out = 4))
dist_pctl <- round(dist_pctl, 1)
dist_pctl[1] <- 1.7

# Discretize the two predictors and the outcome, then count each combination
pie_n <-
  delivery_train %>%
  mutate(
    distance0 = distance,
    distance = cut(distance, breaks = dist_pctl, include.lowest = TRUE),
    distance = factor(as.character(distance), levels = rev(levels(distance))),
    hour = cut(hour, breaks = 2 * (5:11), include.lowest = TRUE),
    delivery_time = cut(
      time_to_delivery,
      breaks = c(10, 20, 30, 100),
      include.lowest = TRUE
    )
  ) %>%
  count(hour, distance, delivery_time, .drop = FALSE)

pie_by_group <-
  pie_n %>%
  summarize(n_group = sum(n), .by = c(hour, distance))

pie_prop <-
  full_join(pie_n, pie_by_group, by = c("hour", "distance")) %>%
  mutate(rate = n / n_group * 100)

# One pie chart per hour/distance cell via stacked bars in polar coordinates
pie_prop %>%
  ggplot(aes(x = 1, y = rate, fill = delivery_time)) +
  geom_bar(stat = "identity", width = 1) +
  scale_x_discrete(drop = FALSE) +
  coord_polar(theta = "y") +
  facet_grid(distance ~ hour, switch = "both") +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    axis.text = element_blank()
  ) +
  labs(x = "Distance", y = "Order Hour")
```

Our general advice, described in more detail in @fes, is that the first inclination should never be to engineer continuous predictors via discretization^[In the immortal words of Lucy D’Agostino McGowan and Nick Strayer ([`https://livefreeordichotomize.com/`](https://livefreeordichotomize.com/)): "Live free or dichotomize."]. Other tools, such as splines, are both theoretically and practically superior to converting quantitative data to qualitative data. In addition, if a predictor can be split into regions that are predictive of the response, then methods that use recursive partitioning will be less arbitrary and more effective.

The literature supporting this is extensive, such as: @Cohen1983mn, @Altman1991ro, @Maxwell1993ig, @Altman1994oa, @Buettner1997bt, @Altman1998vs, @Taylor2002jj, @MacCallum2002ox, @Irwin2003mp, @Owen2005do, @Altman2006gn, @Royston2006md, @VanWalraven2008ne, @Fedorov2009jy, @Naggara2011xu, @Bennette2012ua, @Kuss2013zi, @Kenny2013cf, @BarnwellMenard2015xa, @Fernandes2019na, as well as the references shown in @harrell2015regression. These articles identify the main problems of discretization as follows:

* Arbitrary (non-methodological) choice of breaks for binning can lead to significant bias.
* The predictor suffers a significant loss of information, making it less effective. Moreover, there is reduced statistical power to detect differences between groups when they exist (a short simulation sketch below illustrates this).
* The number of features is increased, thus exacerbating the challenge of feature selection.
* Correlations between predictors are inflated due to the unrealistic reduction in the variance of predictors.
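
To make the information-loss point concrete, here is a minimal simulation sketch (our illustration, not taken from the references above): a continuous predictor with a truly linear effect is split at its median, and the fit of a simple linear model is compared before and after binning.

```{r}
# Hypothetical data: the true relationship is linear in x
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Dichotomize x at its median ("low" vs. "high")
x_binned <- ifelse(x > median(x), "high", "low")

# The fit degrades noticeably when the binned version replaces the original
summary(lm(y ~ x))$r.squared
summary(lm(y ~ x_binned))$r.squared
```

The size of the drop depends on where the cut point falls and on the underlying relationship, but some loss of information (and statistical power) is unavoidable.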

@pettersson2016quantitative shows differences in analyses with and without discretization. Their Fig. 1 shows a common binning analysis: a continuous outcome and one or more predictors are converted to qualitative formats and a grid of pie charts is created. Inferences are made from this visualization. One main problem is related to uncertainty. The noise in the continuous data is squashed so that any visual signal that is seen appears more factual than it is in reality^[@Kenny2013cf does an excellent job illustrating this issue.]. Also, the pie charts do not show measures of uncertainty; how do we know when two pie charts are "significantly different"?

Alternatively, Figs. 4 and 5 of their paper show the results of a logistic regression model where all predictors were left as-is and splines were used to model the probability of the outcome. This has a much simpler interpretation, and the confidence bands give the reader a sense of whether the differences are real.

While it is not advisable to discretize most predictors, there are some cases when discretization can be helpful.
For example, one type of measurement that is often appropriate to discretize is a date. @fes show a data set where daily ridership data were collected for the Chicago elevated train system. The primary trend in the data was whether or not the day was a weekday; ridership was significantly higher when people commuted to work. A simple indicator for Saturday/Sunday (as well as major holiday indicators) was the driving force behind many regression models on those data. In this case, making qualitative versions of the date was rational, non-arbitrary, and driven by data analysis.
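
For instance, an indicator like the one described above can be computed directly from the dates. The sketch below uses base R and made-up dates rather than the actual Chicago ridership data.

```{r}
dates <- as.Date(c("2024-06-14", "2024-06-15", "2024-06-16", "2024-06-17"))

# as.POSIXlt()$wday encodes the day of week as 0 (Sunday) through 6 (Saturday)
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
is_weekend
#> [1] FALSE  TRUE  TRUE FALSE
```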

Note that several models, such as classification/regression trees and multivariate adaptive regression splines, estimate cut points in the model-building process. The difference between these methodologies and manual binning is that the models use all the predictors to derive bins based on a single objective (such as maximizing accuracy). They evaluate many variables simultaneously and are usually based on statistically sound methodologies.

If it is the last resort, how should one go about discretizing predictors? First, topic-specific expertise can be used to create appropriate categories when they are _truly_ merited, as in the example of creating a weekend indicator for the Chicago ridership data. Second, and most importantly, _any_ methodology should be well validated using data that were not used to build the model (or to choose the cut points for binning). To convert data to a qualitative format, there are both supervised and unsupervised methods.

The most reliable unsupervised approach is to choose the number of new features and use an appropriate number of percentiles to bin the data. For example, if four new features are required, the 0th, 25th, 50th, 75th, and 100th percentiles would be used as breakpoints. This ensures that each resulting bin contains about the same number of samples from the training set.
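
In code, this amounts to computing percentiles with `quantile()` and passing them to `cut()`. The sketch below uses simulated values rather than any of the data sets in this book.

```{r}
set.seed(42)
x <- rexp(1000)

# Five breakpoints (the 0th, 25th, 50th, 75th, and 100th percentiles) give four bins
breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
x_binned <- cut(x, breaks = breaks, include.lowest = TRUE)

# Each bin holds roughly the same number of values
table(x_binned)
```

The `step_discretize()` recipe step used later in this section automates the same computation.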

If you noticed that this is basically the same approach suggested for choosing spline knots in the discussion earlier in this chapter, you are correct. This process is very similar to using a zero-order polynomial spline, the minor difference being the placement of the knots. A zero-order model is a simple constant value, usually estimated by the mean. This is theoretically interesting but also enables users to contrast discretization _directly_ with traditional spline basis expansions. For example, if a B-spline was used, the modeler could tune over the number of model terms (i.e., the number of knots) and the spline polynomial degree. If binning is the superior approach, the tuning process would select that approach as optimal. In other words, we can let the data decide if discretization is a good idea^[But is most likely **not** a good idea.].

A supervised approach would, given a specific number of new features to create, determine the breakpoints by optimizing a performance measure (e.g., RMSE, classification accuracy, etc.). A good example is a tree-based model (very similar to the process shown in @fig-collapse). After fitting a single tree or, better yet, an ensemble of trees, the split values in the trees can be used as the breakpoints.
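
As a rough sketch of that idea (using the rpart package directly on simulated data; the analysis below instead uses a recipe step that wraps this logic and tunes it with resampling), the split values from a single regression tree can be harvested as breakpoints:

```{r}
library(rpart)

set.seed(1)
sim_df <- data.frame(x = runif(500))
sim_df$y <- sin(2 * pi * sim_df$x) + rnorm(500, sd = 0.2)

tree_fit <- rpart(y ~ x, data = sim_df, control = rpart.control(cp = 0.01, minsplit = 10))

# For continuous predictors, the "index" column of the splits matrix holds the
# split values; these can serve as data-driven breakpoints for binning x.
breakpoints <- sort(unique(tree_fit$splits[, "index"]))
breakpoints
```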

```{r}
#| label: fossil-bins
#| cache: true
data(fossil)
fossil_df <- tibble(age = fossil$age, ratio = fossil$strontium.ratio)

set.seed(82)
fossil_rs <- vfold_cv(fossil_df, repeats = 5)
rmse_only <- metric_set(rmse)
ctrl <- control_grid(save_workflow = TRUE)

rec <- recipe(ratio ~ age, data = fossil_df)

# Unsupervised binning: equal-frequency bins via percentiles
rec_unsup <-
  rec %>%
  step_discretize(age, num_breaks = tune(), min_unique = 5)

# Supervised binning: breakpoints from a CART regression tree
rec_tree_splits <-
  rec %>%
  step_discretize_cart(
    age,
    outcome = "ratio",
    cost_complexity = tune(),
    min_n = 5,
    tree_depth = 25
  )

unsuper_wflow <- rec_unsup %>% workflow(linear_reg())
unsuper_res <-
  unsuper_wflow %>%
  tune_grid(
    resamples = fossil_rs,
    grid = tibble(num_breaks = 2:10),
    metrics = rmse_only,
    control = ctrl
  )

super_wflow <- rec_tree_splits %>% workflow(linear_reg())
set.seed(724)
super_res <-
  super_wflow %>%
  tune_grid(
    resamples = fossil_rs,
    grid = tibble(cost_complexity = 10^seq(-3, -1, length.out = 20)),
    metrics = rmse_only,
    control = ctrl
  )

bin_lvls <- c("unsupervised", "supervised")
bin_cols <- c("#E31A1C", "#6A3D9A")
names(bin_cols) <- bin_lvls

# Resampled RMSE estimates for both binning methods
bin_resampling <-
  unsuper_res %>%
  collect_metrics() %>%
  mutate(method = "unsupervised") %>%
  bind_rows(
    super_res %>%
      collect_metrics() %>%
      mutate(method = "supervised")
  ) %>%
  mutate(
    lower = mean - qnorm(.95) * std_err,
    upper = mean + qnorm(.95) * std_err,
    method = factor(method, levels = bin_lvls)
  )

# Breakpoints and predictions from the best supervised configuration
super_breaks <-
  fit_best(super_res) %>%
  extract_recipe() %>%
  tidy(number = 1) %>%
  mutate(
    method = "supervised",
    method = factor(method, levels = bin_lvls)
  )
num_super_breaks <- length(unique(super_breaks$value))

super_pred <-
  fit_best(super_res) %>%
  augment(fossil_df) %>%
  mutate(
    method = "supervised",
    method = factor(method, levels = bin_lvls)
  )

# Breakpoints and predictions from the best unsupervised configuration
unsuper_breaks <-
  fit_best(unsuper_res) %>%
  extract_recipe() %>%
  tidy(number = 1) %>%
  mutate(
    method = "unsupervised",
    method = factor(method, levels = bin_lvls)
  ) %>%
  filter(!is.infinite(value))
num_unsuper_breaks <- length(unique(unsuper_breaks$value))

unsuper_pred <-
  fit_best(unsuper_res) %>%
  augment(fossil_df) %>%
  mutate(
    method = "unsupervised",
    method = factor(method, levels = bin_lvls)
  )

both_pred <- bind_rows(unsuper_pred, super_pred)

best_unsuper <- show_best(unsuper_res, metric = "rmse")
best_super <- show_best(super_res, metric = "rmse")
```

Let's again use the fossil data from @bralower1997mid illustrated in the previous section on basis functions. We can fit a linear regression with qualitative terms for age derived using:

- An unsupervised approach using percentiles as the cut points.
- A supervised approach where a regression tree model is used to set the breakpoints for the bins.

In each case, the number of new features requires tuning. Using the basic grid search tools described in @sec-grid, the number of required terms was set for each method (ranging from `r min(collect_metrics(unsuper_res)$num_breaks)` to `r max(collect_metrics(unsuper_res)$num_breaks)` terms) by minimizing the RMSE from a simple linear regression model. Both approaches required the same number of new features and produced about the same level of performance: the unsupervised approach required `r num_unsuper_breaks` breaks to achieve an RMSE of `r format(best_unsuper$mean[1], digits = 3, scientific = FALSE)`, and the supervised model achieved an RMSE of `r format(best_super$mean[1], digits = 3, scientific = FALSE)` with `r num_super_breaks` cut points. @fig-fossil-bins shows the fitted models from both approaches.

```{r}
#| label: fig-fossil-bins
#| out-width: 80%
#| fig-width: 8
#| fig-height: 4
#| fig-cap: The estimated relationship between fossil age and the isotope ratio previously shown in @fig-piecewise-polynomials, now using discretization methods.
fossil_df %>%
  ggplot() +
  geom_point(aes(age, ratio), alpha = 3 / 4, pch = 1, cex = 3) +
  geom_step(
    data = both_pred,
    aes(age, .pred, col = method),
    lwd = 1,
    alpha = 1 / 2
  ) +
  labs(x = "Age", y = "Isotope Ratio") +
  scale_color_manual(values = bin_cols)
```

The results are remarkably similar to one another. The blocky nature of the fitted trend reflects that, within each bin, a simple mean is used to estimate the isotope ratio.

Again, we want to emphasize that arbitrary or subjective discretization is almost always suboptimal.

## Chapter References {.unnumbered}

```{r}
4 changes: 4 additions & 0 deletions chapters/news.qmd
@@ -6,6 +6,10 @@ No reported edits (yet).

## Changelog {.unlisted .unnumbered}

### 2024-06-17 {.unlisted .unnumbered}

Several new chapters on embeddings, splines/interactions/discretization, and overfitting. Also, shinylive is used for interactive visualizations of concepts.

### 2024-01-02 {.unlisted .unnumbered}

Fixed various typos.
15 changes: 13 additions & 2 deletions chapters/whole-game.qmd
@@ -574,6 +574,14 @@ final_hidden_units <-
best_nnet <- select_best(nnet_res, metric = "mae")
nnet_mae_vals <-
  nnet_res %>%
  collect_metrics() %>%
  filter(.metric == "mae") %>%
  arrange(desc(mean)) %>%
  slice(-1) %>%
  arrange(hidden_units)
nnet_mae <-
  nnet_res %>%
  collect_metrics() %>%
@@ -623,7 +631,11 @@ Unfortunately, no equation can directly compute an estimate of the number of units
#| out-width: "60%"
#| fig-cap: "The relationship between the number of hidden units and the mean absolute deviation from the validation set data."
autoplot(nnet_res, metric = "mae") + ylab("MAE")
nnet_mae_vals %>%
  ggplot(aes(hidden_units, mean)) +
  geom_point() +
  geom_line() +
  labs(x = hidden_units()$label, y = "MAE")
```
A small number of hidden units performs poorly due to underfitting (i.e., insufficient complexity). Adding more improves the situation and, while the numerically best result corresponds to a value of `r best_nnet$hidden_units`, there is appreciable noise in the MAE values^[The noisy profile suggests that our validation set might not be large enough to differentiate between subtle differences within or between models. If the problem required more precision in the performance statistics, a more thorough resampling method would be appropriate, such as cross-validation. These are described in Chapter @sec-resampling.]. Eventually, adding too much complexity will either result in the MAE values becoming larger due to overfitting or will drastically increase the time to fit each model. The MAE pattern in @fig-delivery-nnet-tune shows a plateau of values with a similar MAE once enough hidden units are added. Given the noise in the system, we'll select a value of `r final_hidden_units` hidden units by determining the least complex model that is within 1% of the numerically best MAE. The MAE associated with the selected candidate value^[To some degree, this is an artificially optimistic estimate since we are using the same data to pick the best result and estimate the model's overall performance. This is called optimization bias and it is discussed more in @sec-nested-resampling. For the characteristics of these data, the optimism is likely to be small, especially compared to the noise in the MAE results.] was `r dec_to_time(nnet_mae)` (but is likely to be closer to `r dec_to_time(nnet_mae_median)`, given the noise in @fig-delivery-nnet-tune).
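
The "within 1% of the numerically best MAE" rule can also be expressed programmatically. Below is a hedged sketch using the tune package's selection helpers; it is our illustration of the rule rather than necessarily the exact code used for this analysis.

```{r}
# Sort the candidates from least to most complex (fewest hidden units first),
# then take the simplest one whose MAE is within 1% of the best MAE.
least_complex <- select_by_pct_loss(nnet_res, hidden_units, metric = "mae", limit = 1)
least_complex
```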
@@ -676,7 +688,6 @@ For our test set predictions, we'll remove this trend and, hopefully, this will
lm_cal <- cal_estimate_linear(lin_reg_res)
```
## Test Set Results {#sec-test-results-whole-game}
```{r}
4 changes: 2 additions & 2 deletions chapters/whole-game_files/execute-results/html.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion index.qmd
@@ -37,7 +37,7 @@ cite <- glue::glue("
}
", .open = "[", .close = "]")
cite <- paste("```{bib}", cite, "```", sep = "\n")
cite <- paste("```", cite, "```", sep = "\n")
cat(cite)
```
