---
title: "03: Tuning & SVMs"
date: "`r Sys.time()`"
output:
  html_notebook:
    toc: yes
    theme: flatly
    number_sections: yes
editor_options:
  chunk_output_type: console
---
```{r setup}
library(mlr3verse) # All the mlr3 things
library(ggplot2) # For plotting
# Spam Task setup
spam_task <- tsk("spam")
set.seed(26)
# train/test split
spam_split <- partition(spam_task, ratio = 2/3)
```
Goals of this part:
1. Introduce hyperparameter tuning
2. Experiment with tuning different learners
3. Introduce SVMs
4. Tune an SVM with a more complex setup
# Hyperparameter Tuning
So far we've seen four learners:
1. kNN via `{kknn}`
2. Decision Trees via `{rpart}`
3. Random Forest via `{ranger}`
4. Gradient Boosting via `{xgboost}`
We've gotten to know the first two a little, and now we'll also take a closer look at the second two.
First we'll start doing some tuning with `{mlr3}` based on the **kNN** learner because it's nice and simple.
We saw that `k` is an important hyperparameter, and that it has to be a positive integer. To tune it, we have to make a few other decisions, drawing on what we learned about (nested) resampling:
- What's our resampling strategy?
- What measure do we tune on?
- What does the parameter search space look like?
- How long do we tune? What's our *budget*?
- What's our tuning strategy?
We'll use `{mlr3}`'s [`auto_tuner`](https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html) for this because it's just so convenient:
```{r knn-tuning-setup}
# Defining a search space: k is an integer, we look in range 3 to 51
search_space_knn = ps(
k = p_int(lower = 3, upper = 51)
)
tuned_knn <- auto_tuner(
# The base learner we want to tune, optionally setting other parameters
learner = lrn("classif.kknn", predict_type = "prob"),
# Resampling strategy for the tuning (inner resampling)
resampling = rsmp("cv", folds = 3),
# Tuning measure: Minimize the binary Brier score
measure = msr("classif.bbrier"),
# Setting the search space we defined above
search_space = search_space_knn,
# Budget: Try n_evals different values
terminator = trm("evals", n_evals = 20),
# Strategy: Randomly try parameter values in the space
# (not ideal in this case but in general a good place to start)
tuner = tnr("random_search")
)
# Take a look at the new tuning learner
tuned_knn
```
Now `tuned_knn` behaves the same way as any other Learner that has not been trained on any data yet --- first we have to train (and tune!) it on our spam training data. The result will be the best hyperparameter configuration of those we tried:
```{r tune-knn}
# If you're on your own machine with multiple cores available, you can parallelize:
future::plan("multisession")
# Setting a seed so we get the same results -- train on train set only!
set.seed(2398)
tuned_knn$train(spam_task, row_ids = spam_split$train)
```
In this case, we get a `k` of 9, with an accuracy of about 90% (keep in mind that we tuned on the Brier score here, not on accuracy).
(Side task: Try the same but with the AUC measure --- do you get the same `k`?)
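A minimal sketch for that side task, assuming you reuse the `search_space_knn` from above; only the tuning measure changes. The chunk is not evaluated here, so the comparison is yours to make:

```{r knn-tuning-auc-sketch, eval=FALSE}
# Same auto_tuner setup as above, but tuning on the AUC instead of the Brier score
tuned_knn_auc <- auto_tuner(
  learner = lrn("classif.kknn", predict_type = "prob"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.auc"),
  search_space = search_space_knn,
  terminator = trm("evals", n_evals = 20),
  tuner = tnr("random_search")
)

set.seed(2398)
tuned_knn_auc$train(spam_task, row_ids = spam_split$train)
tuned_knn_auc$tuning_result
```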
We can visualize the performance across all values of `k` we tried by accessing the tuning instance now included in the `tuned_knn` learner object:
```{r plot-knn-tuning-result}
autoplot(tuned_knn$tuning_instance)
tuned_knn$tuning_instance
```
(See also docs at `?mlr3viz:::autoplot.TuningInstanceSingleCrit`)
And we can get the hyperparameter results that worked best in the end:
```{r knn-tuning-param-results}
tuned_knn$tuning_result
```
Now that we've tuned on the training set, it's time to evaluate on the test set:
```{r knn-eval}
tuned_knn_pred <- tuned_knn$predict(spam_task, row_ids = spam_split$test)
# Accuracy, AUC, and Brier score
tuned_knn_pred$score(msrs(c("classif.acc", "classif.auc", "classif.bbrier")))
```
That was basically the manual way of doing nested resampling with an inner resampling strategy (CV) and an outer resampling strategy of holdout (the `spam_split$train` and `spam_split$test` sets).
To be more robust we can use CV in the outer resampling as well:
```{r knn-nested-resampling}
# Reset the learner so we start fresh
tuned_knn$reset()
# Set up resampling with the ready-to-be-tuned learner and outer resampling: CV
knn_nested_tune <- resample(
task = spam_task,
learner = tuned_knn,
resampling = rsmp("cv", folds = 5), # this is effectively the outer resampling
store_models = TRUE
)
# Extract inner tuning results, since we now tuned in multiple CV folds
# Different folds might find different optimal values of k
# The tuning measure (the Brier score) is averaged over the inner folds
extract_inner_tuning_results(knn_nested_tune)
# The individual per-fold results combined above are also accessible via e.g.
knn_nested_tune$learners[[2]]$tuning_result
# Plot the inner tuning results of the first three outer folds
autoplot(knn_nested_tune$learners[[1]]$tuning_instance)
autoplot(knn_nested_tune$learners[[2]]$tuning_instance)
autoplot(knn_nested_tune$learners[[3]]$tuning_instance)
# Get AUC and acc for outer folds
knn_nested_tune$score(msrs(c("classif.acc", "classif.auc")))[, .(iteration, classif.auc, classif.acc)]
# Average over outer folds - our "final result" performance for kNN
# AUC is averaged over outer folds
knn_nested_tune$aggregate(msrs(c("classif.acc", "classif.auc")))
```
Seems like a decent result?
Let's try to beat it with some other learner!
## Your Turn!
Above you have a boilerplate to tune your own learner.
Start with any of the other three learners we've seen, pick one or two hyperparameters to tune with a reasonable budget (note we have limited time and resources), tune on the training set, and evaluate via AUC on the test set.
Some pointers:
- Consult the Learner docs to see tuning-worthy parameters:
- `lrn("classif.xgboost")$help()` links to the `xgboost` help
- `lrn("classif.rpart")$help()` analogously for the decision tree
- You can also see the documentation online, e.g. <https://mlr3learners.mlr-org.com/reference/mlr_learners_classif.xgboost.html>
- Parameter search spaces in `ps()` have different types, see the help at `?paradox::Domain`
- Use `p_int()` for integers, `p_dbl()` for real-valued params etc.
- If you don't know which parameters to tune, try the following (a sketch for one of them follows after this list):
- `classif.xgboost`:
- Important: `nrounds` (integer) (>= 1)
- Important: `eta` (double) (0 < eta < 1)
- Maybe: `max_depth` (integer) (< 10)
- `classif.rpart`:
- `cp` (double)
- Maybe: `maxdepth` (integer) (< 30)
- `classif.ranger`:
- `num.trees` (integer) (default is 500)
- `max.depth` (integer) (ranger has no limit here aside from the data itself :)
Note: Instead of randomly picking parameters from the design space, we can also generate a grid of parameters and try those.
We won't try that here for now, but you can read up on how to do that at `?mlr_tuners_grid_search` and via `generate_design_grid(search_space, resolution = 5)`; a short sketch follows below.
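A minimal sketch of what that could look like with the kNN search space from above; setting the terminator to `trm("none")` means grid search simply stops once it has evaluated every grid point:

```{r grid-search-sketch, eval=FALSE}
# Inspect the grid a resolution of 5 would produce for our kNN search space
generate_design_grid(search_space_knn, resolution = 5)

# Use grid search instead of random search in the auto_tuner
tuned_knn_grid <- auto_tuner(
  learner = lrn("classif.kknn", predict_type = "prob"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.bbrier"),
  search_space = search_space_knn,
  # No explicit budget: grid search stops once all grid points are evaluated
  terminator = trm("none"),
  tuner = tnr("grid_search", resolution = 5)
)
```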
Also note that the cool thing about the `auto_tuner()` is that it behaves just like any other `mlr3` learner, but it automatically tunes itself. You can plug it into `resample()` or `benchmark_grid()` just fine!
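For example, here is a sketch (not evaluated here) that compares the auto-tuned kNN against a kNN with default settings under the same outer cross-validation:

```{r autotuner-benchmark-sketch, eval=FALSE}
# Benchmark the auto-tuned kNN against a kNN with default settings
design <- benchmark_grid(
  tasks = spam_task,
  learners = list(tuned_knn, lrn("classif.kknn", predict_type = "prob")),
  resamplings = rsmp("cv", folds = 3)
)

bmr <- benchmark(design)
bmr$aggregate(msrs(c("classif.acc", "classif.auc")))
```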
# Support Vector Machines
Let's circle back to new learners and explore SVMs a little by trying out different kernels on the penguin dataset we used in the beginning:
```{r penguins}
penguins <- na.omit(palmerpenguins::penguins)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Palmer Penguins",
x = "Flipper Length [mm]", y = "Body Mass [g]",
color = "Species"
) +
theme_minimal()
```
Since we don't care about prediction accuracy for now, we'll use the whole dataset for training and prediction. Please only do this with toy data 🙃.
For the SVM algorithm itself, we use the `svm` learner from the `e1071` package (great name, I know), but once again use `{mlr3}`'s interface.
According to the docs (`?e1071::svm`) we have the choice of the following kernels:
- `"linear"`: $u'v$
- `"polynomial"`: $(\mathtt{gamma} \cdot u' \cdot v + \mathtt{coef0})^\mathtt{degree}$
- `"radial"`: $\exp(-\mathtt{gamma} \cdot |u-v|^2)$
- `"sigmoid"`: $\tanh(\mathtt{gamma} \cdot u'v + \mathtt{coef0})$
Where `gamma`, `degree`, and `coef0` are further hyperparameters.
```{r svm-learner-default}
svm_learner <- lrn("classif.svm")
# What parameters do we have?
svm_learner$param_set$ids()
# Default kernel
svm_learner$param_set$default$kernel
```
## Your Turn!
Below you have boilerplate code for
a) Creating an SVM learner and training it on the penguin dataset with 2 predictors
b) Plotting decision boundaries with it (using the `{mlr3}` helper function)
Run the code below once to see what linear decision boundaries look like, then pick different kernels from the list above and run it again.
- What kernel would you pick just by the looks of the boundaries?
- How do the boundaries change if you also adjust the other hyperparameters?
- Try picking any other two variables as features (`penguin_task$col_info`)
```{r penguin-task-2-predictors}
penguin_task <- TaskClassif$new(
id = "penguins",
backend = na.omit(palmerpenguins::penguins),
target = "species"
)
penguin_task$col_roles$feature <- c("body_mass_g", "flipper_length_mm")
```
```{r svm-decision-boundaries}
# Create the learner, picking a kernel and/or other hyperparameters
# Start with the linear kernel...
svm_learner <- lrn("classif.svm", predict_type = "prob", kernel = "linear")
# ...then swap in other kernels, e.g.:
# svm_learner <- lrn("classif.svm", predict_type = "prob", kernel = "polynomial", degree = 3)
# Train the learner
svm_learner$train(penguin_task)
# Plot decision boundaries
plot_learner_prediction(
learner = svm_learner,
task = penguin_task
)
```
## SVM-Tuning
Let's try a more complex tuning experiment, based on the spam task from before.
We'll create a new SVM learner object and this time explicitly tell it which type of classification to do; `"C-classification"` is the default anyway, but `{mlr3}` wants us to be explicit here for tuning:
```{r svm-learner}
svm_learner <- lrn("classif.svm", predict_type = "prob", type = "C-classification")
```
First up we'll define our search space, meaning the range of parameters we want to test out. Since `kernel` is a categorical parameter (i.e. no numbers, just names of kernels), we'll define the search space for that parameter by just passing the names of the kernels to the `p_fct()` helper function that defines `factor`-parameters in `{mlr3}`.
The interesting thing here is that some parameters are only relevant for some kernels, which we can declare via a `depends` argument:
```{r svm-search-space-short}
search_space_svm = ps(
kernel = p_fct(c("linear", "polynomial", "radial", "sigmoid")),
# Degree is only relevant if "kernel" is "polynomial"
degree = p_int(lower = 1, upper = 7, depends = kernel == "polynomial")
)
```
We can create an example design grid to inspect our setup and see that `degree` is `NA` for cases where `kernel` is not `"polynomial"`, just as we expected:
```{r svm-search-space-inspect}
generate_design_grid(search_space_svm, resolution = 3)
```
## Your Turn!
The above should get you started to...
1. Create a `search_space_svm` like above, tuning...
- `cost` from 0.1 to 1 (hint: `trafo = function(x) 10^x`)
- `kernel`, (like above example)
- `degree`, as above, **only if** `kernel == "polynomial"`
- `gamma`, from e.g. 0.01 to 0.2, **only if** `kernel` is polynomial, radial, or sigmoid (hint: you can't use `kernel != "linear"` unfortunately, but `kernel %in% c(...)` works)
2. Use the `auto_tuner` function as previously seen with
- `svm_learner` (see above)
- A resampling strategy (use `"holdout"` if runtime is an issue)
- A measure (e.g. `classif.acc` or `classif.auc`)
- The search space you created in 1.
- A termination criterion (e.g. 40 evaluations)
- Random search as your tuning strategy
3. Train the auto-tuned learner and evaluate on the test set
What parameter settings worked best?
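If you get stuck, here is a sketch of one possible setup. The ranges follow the hints above; the seed, budget, and measure are just one choice among many:

```{r svm-tuning-sketch, eval=FALSE}
# Sketch only: one possible search space following the hints above
search_space_svm <- ps(
  # Tune cost on a log scale: 10^x for x in [-1, 0] covers 0.1 to 1
  cost = p_dbl(lower = -1, upper = 0, trafo = function(x) 10^x),
  kernel = p_fct(c("linear", "polynomial", "radial", "sigmoid")),
  # degree only matters for the polynomial kernel
  degree = p_int(lower = 1, upper = 7, depends = kernel == "polynomial"),
  # gamma matters for all kernels except the linear one
  gamma = p_dbl(
    lower = 0.01, upper = 0.2,
    depends = kernel %in% c("polynomial", "radial", "sigmoid")
  )
)

tuned_svm <- auto_tuner(
  learner = svm_learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.auc"),
  search_space = search_space_svm,
  terminator = trm("evals", n_evals = 40),
  tuner = tnr("random_search")
)

set.seed(26)
tuned_svm$train(spam_task, row_ids = spam_split$train)
tuned_svm$tuning_result

# Evaluate on the test set
tuned_svm$predict(spam_task, row_ids = spam_split$test)$score(msr("classif.auc"))
```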