Commit

Merge pull request #852 from mrapp-ke/testbed-documentation
Add section "Prediction Characteristics" to the documentation
michael-rapp authored Apr 24, 2024
2 parents 1b14d5f + c99922a commit ac1b2cd
Showing 7 changed files with 66 additions and 29 deletions.
6 changes: 4 additions & 2 deletions doc/developer_guide/coding_standards.md
@@ -153,9 +153,11 @@ To enable releasing new major, feature, or bugfix releases at any time, we maint
- `feature` comes with the changes that will be part of an upcoming feature release (including changes on the bugfix branch).
- `bugfix` is restricted to minor changes that will be published as a bugfix release.

- We do not allow directly pushing to the above branches. Instead, all changes must be submitted via pull requests and require certain checks to pass. Once modifications to one of the branches have been merged, {ref}`ci` jobs are used to automatically update downstream branches via pull requests. If all checks run for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature brach are merged into the main branch (see `merge_feature.yml`). Changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`).
+ We do not allow directly pushing to the above branches. Instead, all changes must be submitted via pull requests and require certain checks to pass.

- Whenever a new release has been published, the release branch is merged into the upstream branches (see `release.yml`), i.e., major releases result in the feature and bugfix branches being updated, whereas minor releases result in the bugfix branch to be updated. The version of the release branch and the affected branches are updated accordingly. The version of a branch is specified in the file `VERSION` in the project's root directory. Similarly, the file `VERSION.dev` is used to keep track of the version number used for development releases (see `release_development.yml`).
+ Once modifications to one of the branches have been merged, {ref}`ci` jobs are used to automatically update downstream branches via pull requests. If all checks for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature brach are merged into the main branch (see `merge_feature.yml`), whereas changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`).
+
+ Whenever a new release has been published, the release branch is merged into the upstream branches (see `release.yml`), i.e., major releases result in the feature and bugfix branches being updated, whereas minor releases result in the bugfix branch being updated. The version of the release branch and the affected branches are updated accordingly. The version of a branch is specified in the file `VERSION` in the project's root directory. Similarly, the file `VERSION.dev` is used to keep track of the version number used for development releases (see `release_development.yml`).

(dependencies)=

4 changes: 2 additions & 2 deletions doc/quickstart/testbed.md
@@ -44,13 +44,13 @@ In addition to the mandatory arguments that must be provided to the command line

````{tab} BOOMER
```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/output/
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/output/
```
````

````{tab} SeCo
```text
- seco --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/output/
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/output/
```
````

24 changes: 12 additions & 12 deletions doc/user_guide/testbed/evaluation.md
@@ -15,15 +15,15 @@ Several strategies for splitting the available data into distinct training and t
The simplest and computationally least demanding strategy for obtaining training and test sets is to randomly split the available data into two mutually exclusive parts. This strategy, which is used by default, can be selected explicitly by providing the argument `--data-split train-test` to the command line API:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split train-test
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split train-test
```

Following the argument `--dataset`, the program will load the training data from a file named `dataset-name_training.arff`. Similarly, it will expect the test data to be stored in a file named `dataset-name_test.arff`. If these files are not available, the program will look for a file with the name `dataset-name.arff` and split it into training and test data automatically.
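The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not the project's actual loading code; the function name `resolve_dataset_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def resolve_dataset_files(data_dir: Path, dataset: str) -> List[Path]:
    """Return the ARFF files to load for a dataset (hypothetical helper)."""
    training = data_dir / f"{dataset}_training.arff"
    test = data_dir / f"{dataset}_test.arff"
    if training.is_file() and test.is_file():
        # A predefined split exists: use both files as they are.
        return [training, test]
    # Otherwise fall back to a single file that must be split automatically.
    return [data_dir / f"{dataset}.arff"]
```

If the function returns two paths, the split is predefined. If it returns a single path, the data still has to be divided into training and test sets.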

When it is the responsibility of the command line API to split a given dataset into training and test sets, roughly 66% of the data will be included in the training set, whereas the remaining data (roughly 33%) will be part of the test set. Although this ratio is frequently used in machine learning, you can easily adjust it by providing the option `test_size`:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}'
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}'
```

This command will tell the command line API to include 75% of the available data in the training set and use the remaining 25% for the test set.
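To make the effect of `test_size` concrete, here is a minimal sketch of a random split in plain Python. It assumes a simple shuffle of example indices; the function name, the `seed` parameter, and the rounding rule are illustrative assumptions, not the command line API's actual implementation.

```python
import random
from typing import List, Tuple


def train_test_split_indices(num_examples: int, test_size: float,
                             seed: int = 1) -> Tuple[List[int], List[int]]:
    """Randomly split example indices into disjoint training and test parts."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_test = round(num_examples * test_size)
    # The first `num_test` shuffled indices form the test set, the rest is used for training.
    return indices[num_test:], indices[:num_test]


train, test = train_test_split_indices(num_examples=1000, test_size=0.25)
print(len(train), len(test))  # 750 250
```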
@@ -35,19 +35,19 @@ This command will tell the command line API to include 75% of the available data
A more elaborate strategy for splitting data into training and test sets, which results in more realistic performance estimates, but also entails greater computational costs, is referred to as [cross validation](<https://en.wikipedia.org/wiki/Cross-validation_(statistics)>) (CV). The basic idea is to split the available data into several, equally-sized, parts. Afterwards, several machine learning models are trained and evaluated on different portions of the data using the same learning method. Each of these parts will be used for testing exactly once, whereas the remaining ones make up the training set. The performance estimates that are obtained for each of these subsequent runs, referred to as *folds*, are finally averaged to obtain a single score and corresponding [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). The command line API can be instructed to perform a cross validation using the argument `--data-split cv`:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split cv
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv
```

By default, a 10-fold cross validation, where ten models are trained and evaluated, will be performed. The number of folds can easily be adjusted via the option `num_folds`. For example, the following command results in a 5-fold CV being used:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'cv{num_folds=5}'
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5}'
```

```{tip}
When providing the option `current_fold`, only a single fold is executed instead of the entire procedure. This is particularly useful if one intends to train and evaluate the models for each individual fold in parallel on different machines. For example, the following command executes only the second fold of a 5-fold CV:
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}'
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}'
```
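The interplay of `num_folds` and `current_fold` can be illustrated with a short Python sketch. It is a simplified stand-in that assigns examples to folds in a round-robin fashion without shuffling; the generator `cv_folds` is hypothetical, and the 1-based fold numbering is inferred from the example above, where `current_fold=2` selects the second fold.

```python
from typing import Iterator, List, Optional, Tuple


def cv_folds(num_examples: int, num_folds: int,
             current_fold: Optional[int] = None) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (fold number, training indices, test indices) for each fold to run."""
    indices = list(range(num_examples))
    for fold in range(num_folds):
        # If a single fold was requested (1-based), skip all others.
        if current_fold is not None and fold + 1 != current_fold:
            continue
        test = indices[fold::num_folds]  # every num_folds-th example is held out
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield fold + 1, train, test


for fold, train, test in cv_folds(num_examples=10, num_folds=5, current_fold=2):
    print(fold, test)  # 2 [1, 6]
```

Each example ends up in the test set of exactly one fold, so running all folds on separate machines and averaging their scores corresponds to one full cross validation.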

(evaluating-training-data)=
@@ -61,13 +61,13 @@ The configuraton described in this section should only be used for testing purpo
Sometimes, evaluating the performance of a model on the data it has been trained on can be helpful for analyzing the behavior of a machine learning algorithm, e.g., if one needs to check whether the approach is able to fit the data accurately. For this purpose, the command line API accepts the argument `--data-split none`, in which case the given data is not split at all. Instead, the learning algorithm will be applied to the entire dataset and predictions will be obtained from the resulting model for the exact same data points. The argument can be specified as follows:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split none
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split none
```

```{tip}
If you are interested in obtaining evaluation results for the training data in addition to the test data when using a train-test-split or a cross validation, as discussed above, the argument `--evaluate-training-data true` may be used:
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split cv --evaluate-training-data true
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv --evaluate-training-data true
```

(prediction-types)=
@@ -83,7 +83,7 @@ The metrics for evaluating the quality of predictions that have been obtained fo
We refer to real-valued predictions, which may be positive or negative, as *regression scores*. In the context of multi-label classification, positive scores indicate a preference towards predicting a label as relevant, whereas negative scores are predicted for labels that are more likely to be irrelevant. The absolute size of the scores corresponds to the confidence of the predictions, i.e., if a large value is predicted for a label, the model is more certain about the correctness of the predicted outcome. Unlike {ref}`probability-estimates`, regression scores are not bound to a certain interval and can be arbitrary positive or negative values. The BOOMER algorithm uses regression scores as a basis for predicting probabilities or binary labels. If you want to evaluate the quality of the regression scores directly, instead of transforming them into probabilities or binary predictions, the argument `--prediction-type scores` may be passed to the command line API:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type scores
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type scores
```

For evaluating the quality of regression scores, [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics) provided by the [scikit-learn](https://scikit-learn.org) framework are used.
@@ -95,7 +95,7 @@ For evaluating the quality of regression scores, [multi-label ranking measures](
Probability estimates are given as real values between zero and one. In the context of multi-label classification, they express the probability of a label being relevant. If the predicted probability is close to zero, the corresponding label is more likely to be irrelevant, whereas a probability close to one is predicted for labels that are likely to be relevant. If you intend to evaluate the quality of probabilistic predictions, the argument `--prediction-type probabilities` should be used:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type probabilities
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type probabilities
```

Similar to {ref}`regression-scores`, the command line API relies on [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics), as implemented by the [scikit-learn](https://scikit-learn.org) framework, for evaluating probability estimates.
@@ -105,7 +105,7 @@ Similar to {ref}`regression-scores`, the command line API relies on [multi-label
The most common type of prediction used for multi-label classification are binary predictions that directly indicate whether a label is considered as irrelevant or relevant. Irrelevant labels are represented by the value `0`, whereas the value `1` is predicted for relevant labels. By default, the command line API instructs the learning method to provide binary predictions. If you want to explicitly instruct it to use this particular type of predictions, you can use the argument `--prediction-type binary`:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type binary
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type binary
```

In a multi-label setting, the quality of binary predictions is assessed in terms of commonly used [multi-label classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) implemented by the [scikit-learn](https://scikit-learn.org) framework. If a dataset contains only a single label, the evaluation will be restricted to classification metrics that are suited for single-label classification problems.
@@ -115,7 +115,7 @@ In a multi-label setting, the quality of binary predictions is assessed in terms
When evaluating the predictive performance of an [ensemble method](https://en.wikipedia.org/wiki/Ensemble_learning), i.e., models that consist of several weak predictors, also referred to as *ensemble members*, the command line API supports evaluating these models incrementally. In particular, rule-based machine learning algorithms like the ones implemented by this project are often considered as ensemble methods, where each rule in a model can be viewed as a weak predictor. Adding more rules to a model typically results in better predictive performance. However, adding too many rules may result in overfitting the training data and therefore achieving subpar performance on the test data. For analyzing such behavior, the argument `--incremental-evaluation true` may be passed to the command line API:

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --incremental-evaluation true
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation true
```

When using the above command, the rule-based model that is learned by the BOOMER algorithm will be evaluated repeatedly as more rules are added to it. Evaluation results will be obtained for a model consisting of a single rule, two rules, three rules, and so on. Of course, because the evaluation is performed multiple times, this evaluation strategy comes with a large computational overhead. Therefore, depending on the size of the final model, it might be necessary to limit the number of evaluations via the following options:
@@ -127,5 +127,5 @@ When using the above command, the rule-based model that is learned by the BOOMER
For example, the following command may be used for the incremental evaluation of a BOOMER model that consists of up to 1000 rules. The model will be evaluated for the first time after 200 rules have been added. Subsequent evaluations will be performed when the model comprises 400, 600, 800, and 1000 rules.

```text
- boomer --data-dir /path/to/datsets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}'
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}'
```
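The model sizes at which evaluations take place in this example can be reproduced with a small Python sketch. The helper `evaluation_sizes` is hypothetical, and the option semantics are inferred solely from the example above (the list that documents the options is not part of this diff hunk). In particular, the sketch makes no claim about what happens when `max_size` is not reachable from `min_size` in whole steps.

```python
from typing import List


def evaluation_sizes(min_size: int, max_size: int, step_size: int) -> List[int]:
    """Return the numbers of rules at which the model would be evaluated."""
    # Start at min_size and advance by step_size, including max_size if it is reached.
    return list(range(min_size, max_size + 1, step_size))


print(evaluation_sizes(min_size=200, max_size=1000, step_size=200))
# [200, 400, 600, 800, 1000]
```

With these settings the model is evaluated five times instead of once per rule, which keeps the overhead of incremental evaluation manageable.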