From 0938a306160efb5b90106d862b00516e97c7bfbe Mon Sep 17 00:00:00 2001 From: Michael Rapp Date: Wed, 24 Apr 2024 23:13:08 +0200 Subject: [PATCH 1/3] Revise section "Release Process" of the documentation. --- doc/developer_guide/coding_standards.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/developer_guide/coding_standards.md b/doc/developer_guide/coding_standards.md index 452893ac03..33421c28c4 100644 --- a/doc/developer_guide/coding_standards.md +++ b/doc/developer_guide/coding_standards.md @@ -153,9 +153,11 @@ To enable releasing new major, feature, or bugfix releases at any time, we maint - `feature` comes with the changes that will be part of an upcoming feature release (including changes on the bugfix branch). - `bugfix` is restricted to minor changes that will be published as a bugfix release. -We do not allow directly pushing to the above branches. Instead, all changes must be submitted via pull requests and require certain checks to pass. Once modifications to one of the branches have been merged, {ref}`ci` jobs are used to automatically update downstream branches via pull requests. If all checks run for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature brach are merged into the main branch (see `merge_feature.yml`). Changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`). +We do not allow directly pushing to the above branches. Instead, all changes must be submitted via pull requests and require certain checks to pass. -Whenever a new release has been published, the release branch is merged into the upstream branches (see `release.yml`), i.e., major releases result in the feature and bugfix branches being updated, whereas minor releases result in the bugfix branch to be updated. 
The version of the release branch and the affected branches are updated accordingly. The version of a branch is specified in the file `VERSION` in the project's root directory. Similarly, the file `VERSION.dev` is used to keep track of the version number used for development releases (see `release_development.yml`). +Once modifications to one of the branches have been merged, {ref}`ci` jobs are used to automatically update downstream branches via pull requests. If all checks for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature branch are merged into the main branch (see `merge_feature.yml`), whereas changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`). + +Whenever a new release has been published, the release branch is merged into the upstream branches (see `release.yml`), i.e., major releases result in the feature and bugfix branches being updated, whereas minor releases result in the bugfix branch being updated. The version of the release branch and the affected branches are updated accordingly. The version of a branch is specified in the file `VERSION` in the project's root directory. Similarly, the file `VERSION.dev` is used to keep track of the version number used for development releases (see `release_development.yml`). (dependencies)= From efecb5ac5366db9404a16252c987e2fd934f7836 Mon Sep 17 00:00:00 2001 From: Michael Rapp Date: Wed, 24 Apr 2024 23:33:32 +0200 Subject: [PATCH 2/3] Fix typos in code examples. 
--- doc/quickstart/testbed.md | 4 ++-- doc/user_guide/testbed/evaluation.md | 24 +++++++++---------- .../testbed/experimental_results.md | 16 ++++++------- doc/user_guide/testbed/model_persistence.md | 2 +- .../testbed/parameter_persistence.md | 4 ++-- doc/user_guide/testbed/pre_processing.md | 2 +- 6 files changed, 26 insertions(+), 26 deletions(-) diff --git a/doc/quickstart/testbed.md b/doc/quickstart/testbed.md index 2561a03535..078fec11dc 100644 --- a/doc/quickstart/testbed.md +++ b/doc/quickstart/testbed.md @@ -44,13 +44,13 @@ In addition to the mandatory arguments that must be provided to the command line ````{tab} BOOMER ```text - boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/output/ + boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/output/ ``` ```` ````{tab} SeCo ```text - seco --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/output/ + seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/output/ ``` ```` diff --git a/doc/user_guide/testbed/evaluation.md b/doc/user_guide/testbed/evaluation.md index b7bcf77a95..7dd35f8f9f 100644 --- a/doc/user_guide/testbed/evaluation.md +++ b/doc/user_guide/testbed/evaluation.md @@ -15,7 +15,7 @@ Several strategies for splitting the available data into distinct training and t The simplest and computationally least demanding strategy for obtaining training and tests is to randomly split the available data into two, mutually exclusive, parts. This strategy, which is used by default, if not specified otherwise, can be used by providing the argument `--data-split train-test` to the command line API: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split train-test +boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split train-test ``` Following the argument `--dataset`, the program will load the training data from a file named `dataset-name_training.arff`. 
Similarly, it will expect the test data to be stored in a file named `dataset-name_test.arff`. If these files are not available, the program will look for a file with the name `dataset-name.arff` and split it into training and test data automatically. @@ -23,7 +23,7 @@ Following the argument `--dataset`, the program will load the training data from When it is the responsibility of the command line API to split a given dataset into training and test tests, 66% of the data will be included in the training set, whereas the remaining 33% will be part of the test set. Although this ratio is frequently used in machine learning, you can easily adjust it by providing the option `test_size`: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}' +boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}' ``` This command will tell the command line API to include 75% of the available data in the training set and use the remaining 25% for the test set. @@ -35,19 +35,19 @@ This command will tell the command line API to include 75% of the available data A more elaborate strategy for splitting data into training and test sets, which results in more realistic performance estimates, but also entails greater computational costs, is referred to as [cross validation]() (CV). The basic idea is to split the available data into several, equally-sized, parts. Afterwards, several machine learning models are trained and evaluated on different portions of the data using the same learning method. Each of these parts will be used for testing exactly once, whereas the remaining ones make up the training set. The performance estimates that are obtained for each of these subsequent runs, referred to as *folds*, are finally averaged to obtain a single score and corresponding [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). 
The command line API can be instructed to perform a cross validation using the argument `--data-split cv`: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split cv +boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv ``` By default, a 10-fold cross validation, where ten models are trained and evaluated, will be performed. The number of folds can easily be adjusted via the option `num_folds`. For example, the following command results in a 5-fold CV being used: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'cv{num_folds=5}' +boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5}' ``` ```{tip} When providing the option `current_fold`, only a single fold, instead of the entire procedure, will be performed. This is particularly useful, if one intends to train and evaluate the models for each individual fold in parallel on different machines. For example, the following command does only execute the second fold of a 5-fold CV: - boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}' + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}' ``` (evaluating-training-data)= @@ -61,13 +61,13 @@ The configuraton described in this section should only be used for testing purpo Sometimes, evaluating the performance of a model on the data it has been trained on can be helpful for analyzing the behavior of a machine learning algorithm, e.g., if one needs to check if the approach is able to fit the data accurately. For this purpose, the command line API allows to use the argument `--data-split none`, which will not result in the given data to be split at all. Instead, the learning algorithm will be applied to the entire dataset and predictions will be obtained from the resulting model for the exact same data points. 
The argument can be specified as follows: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split none +boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split none ``` ```{tip} If you are interested in obtaining evaluation results for the training data in addition to the test data when using a train-test-split or a cross validation, as discussed above, the argument `--evaluate-training-data true` may be used: - boomer --data-dir /path/to/datsets/ --dataset dataset-name --data-split cv --evaluate-training-data true + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv --evaluate-training-data true ``` (prediction-types)= @@ -83,7 +83,7 @@ The metrics for evaluating the quality of predictions that have been obtained fo We refer to real-valued predictions, which may be positive or negative, as *regression scores*. In the context of multi-label classification, positive scores indicate a preference towards predicting a label as relevant, whereas negative scores are predicted for labels that are more likely to be irrelevant. The absolute size of the scores corresponds to the confidence of the predictions, i.e., if a large value is predicted for a label, the model is more certain about the correctness of the predicted outcome. Unlike {ref}`probability-estimates`, regression scores are not bound to a certain interval and can be arbitrary positive or negative values. The BOOMER algorithm uses regression scores as a basis for predicting probabilities or binary labels. 
If you want to evaluate the quality of the regression scores directly, instead of transforming them into probabilities or binary predictions, the argument `--prediction-type scores` may be passed to the command line API: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type scores +boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type scores ``` For evaluating the quality of regression scores, [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics) provided by the [scikit-learn](https://scikit-learn.org) framework are used. @@ -95,7 +95,7 @@ For evaluating the quality of regression scores, [multi-label ranking measures]( Probability estimates are given as real values between zero and one. In the context of multi-label classification, they express the probability of a label being relevant. If the predicted probability is close to zero, the corresponding label is more likely to be irrelevant, whereas a probability close to one is predicted for labels that are likely to be relevant. If you intend to evaluate the quality of probabilistic predictions, the argument `--prediction-type probabilities` should be used: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type probabilities +boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type probabilities ``` Similar to {ref}`regression-scores`, the command line API relies on [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics), as implemented by the [scikit-learn](https://scikit-learn.org) framework, for evaluating probability estimates. 
@@ -105,7 +105,7 @@ Similar to {ref}`regression-scores`, the command line API relies on [multi-label The most common type of prediction used for multi-label classification are binary predictions that directly indicate whether a label is considered as irrelevant or relevant. Irrelevant labels are represented by the value `0`, whereas the value `1` is predicted for relevant labels. By default, the command line API instructs the learning method to provide binary predictions. If you want to explicitly instruct it to use this particular type of predictions, you can use the argument `--prediction-type binary`: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --prediction-type binary +boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type binary ``` In a multi-label setting, the quality of binary predictions is assessed in terms of commonly used [multi-label classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) implemented by the [scikit-learn](https://scikit-learn.org) framework. If a dataset contains only a single label, the evaluation will be restricted to classification metrics that are suited for single-label classification problems. @@ -115,7 +115,7 @@ In a multi-label setting, the quality of binary predictions is assessed in terms When evaluating the predictive performance of an [ensemble method](https://en.wikipedia.org/wiki/Ensemble_learning), i.e., models that consist of several weak predictors, also referred to as *ensemble members*, the command line API supports to evaluate these models incrementally. In particular, rule-based machine learning algorithms like the ones implemented by this project are often considered as ensemble methods, where each rule in a model can be viewed as a weak predictor. Adding more rules to a model typically results in better predictive performance. 
However, adding too many rules may result in overfitting the training data and therefore achieving subpar performance on the test data. For analyzing such behavior, the arugment `--incremental-evaluation true` may be passed to the command line API: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --incremental-evaluation true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation true ``` When using the above command, the rule-based model that is learned by the BOOMER algorithm will be evaluated repeatedly as more rules are added to it. Evaluation results will be obtained for a model consisting of a single rule, two rules, three rules, and so on. Of course, because the evaluation is performed multiple times, this evaluation strategy comes with a large computational overhead. Therefore, depending on the size of the final model, it might be necessary to limit the number of evaluations via the following options: @@ -127,5 +127,5 @@ When using the above command, the rule-based model that is learned by the BOOMER For example, the following command may be used for the incremental evaluation of a BOOMER model that consists of up to 1000 rules. The model will be evaluated for the first time after 200 rules have been added. Subsequent evaluations will be perfomed when the model comprises 400, 600, 800, and 1000 rules. 
```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}' +boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}' ``` diff --git a/doc/user_guide/testbed/experimental_results.md b/doc/user_guide/testbed/experimental_results.md index 3a8b0ffa41..f3aeab3f70 100644 --- a/doc/user_guide/testbed/experimental_results.md +++ b/doc/user_guide/testbed/experimental_results.md @@ -23,13 +23,13 @@ TODO In cases where the performance metrics obtained via the arguments ``--print-evaluation`` or ``--store-evaluation`` are not sufficient for a detailed analysis, it may be desired to directly inspect the predictions provided by the evaluated models. They can be printed on the console, together with the ground truth labels, by proving the argument ``--print-predictions``: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --print-predictions true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-predictions true ``` Alternatively, the argument ``--store-predictions`` can be used to save the predictions, as well as the ground truth labels, to [.arff](http://weka.wikispaces.com/ARFF) files: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/results/ --store-predictions true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-predictions true ``` ```{tip} @@ -67,13 +67,13 @@ TODO To obtain insightful statistics regarding the characteristics of a data set, the command line argument ``--print-data-characteristics`` may be helpful: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --print-data-characteristics true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-data-characteristics true ``` If you prefer to write the statistics into a 
[.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file, the argument ``--store-data-characteristics`` can be used: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/results/ --store-data-characteristics true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-data-characteristics true ``` ```{tip} @@ -118,13 +118,13 @@ TODO To obtain a quick overview of some statistics that characterize a rule-based model learned by one of the algorithms provided by this project, the command line argument ``--print-model-characteristics`` can be useful: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --print-model-characteristics true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-model-characteristics true ``` The above command results in a tabular representation of the characteristics being printed on the console. If one intends to write them into a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file instead, the argument ``--store-model-characteristics`` may be used: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/results/ --store-model-characteristics true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-model-characteristics true ``` Model characteristics are obtained for each model training during an experiment. This means that a single output file will be created when using on {ref}`train-test-split`: @@ -152,13 +152,13 @@ The statistics captured by the previous commands include the following: It is considered one of the advantages of rule-based machine learning models that they capture patterns found in the training data in a human-comprehensible form. This enables to manually inspect the models and reason about their predictive behavior. 
To help with this task, the command line API allows to output the rules in a model using a textual representation. If the text should be printed on the console, the following command specifying the argument ``--print-rules`` can be used: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --print-rules true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-rules true ``` Alternatively, by using the argument ``--store-rules``, a textual representation of models can be written into a text file in the specifed output directory: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --output-dir /path/to/results/ --store-rules true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-rules true ``` ```{tip} diff --git a/doc/user_guide/testbed/model_persistence.md b/doc/user_guide/testbed/model_persistence.md index b538996add..d23f92d704 100644 --- a/doc/user_guide/testbed/model_persistence.md +++ b/doc/user_guide/testbed/model_persistence.md @@ -5,7 +5,7 @@ Because the training of machine learning models can be time-consuming, they are usually trained once and then reused later for making predictions. For this purpose, the command line API provides means to store models on disk and load them from the created files later on. 
This requires to specify the path of a directory, where models should be saved, via the command line argument `--model-dir`: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --model-dir /path/to/models +boomer --data-dir /path/to/datasets/ --dataset dataset-name --model-dir /path/to/models ``` ```{note} diff --git a/doc/user_guide/testbed/parameter_persistence.md b/doc/user_guide/testbed/parameter_persistence.md index abd3217ec0..fb5eb685dd 100644 --- a/doc/user_guide/testbed/parameter_persistence.md +++ b/doc/user_guide/testbed/parameter_persistence.md @@ -7,7 +7,7 @@ To remember the parameters that have been used for training a model, it might be On the one hand, this requires to specify a directory where parameter settings should be saved via the command line argument `--parameter-dir`. On the other hand, the argument `--store-parameters true` instructs the program to save custom parameters that are set via command line argments (see {ref}`setting-algorithmic-parameters`). 
For example, the following command sets a custom value for the parameter `shrinkage`, which will be stored in an output file: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --parameter-dir /path/to/parameters --store-parameters true --shrinkage 0.5 +boomer --data-dir /path/to/datasets/ --dataset dataset-name --parameter-dir /path/to/parameters --store-parameters true --shrinkage 0.5 ``` ```{note} @@ -35,5 +35,5 @@ When executing the previously mentioned command again, the program will restore If you want to print all custom parameters that are used by a learning algorithm on the console, you can specify the argument `--print-parameters true`: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --print-parameters true --shrinkage 0.5 +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-parameters true --shrinkage 0.5 ``` diff --git a/doc/user_guide/testbed/pre_processing.md b/doc/user_guide/testbed/pre_processing.md index f0538e3251..9d6fa48510 100644 --- a/doc/user_guide/testbed/pre_processing.md +++ b/doc/user_guide/testbed/pre_processing.md @@ -15,7 +15,7 @@ Not all machine learning methods can deal with nominal or binary features out-of Even though nominal and binary features are natively supported in an efficient way by all algorithms provided by this project, it might still be useful to use one-hot-encoding if one seeks for a fair comparison with machine learning approaches that cannot deal with such features. In such cases, you can provide the argument `--one-hot-encoding true` to the command line API: ```text -boomer --data-dir /path/to/datsets/ --dataset dataset-name --one-hot-encoding true +boomer --data-dir /path/to/datasets/ --dataset dataset-name --one-hot-encoding true ``` Under the hood, the program will make use of scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for pre-processing the data. 
From c99922acd07c5734993231b79ec3e7e8b1da0431 Mon Sep 17 00:00:00 2001 From: Michael Rapp Date: Thu, 25 Apr 2024 00:07:50 +0200 Subject: [PATCH 3/3] Add section "Prediction Characteristics" to the documentation of the command line API. --- .../testbed/experimental_results.md | 37 ++++++++++++++++++- 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/doc/user_guide/testbed/experimental_results.md b/doc/user_guide/testbed/experimental_results.md index f3aeab3f70..8eec30cd49 100644 --- a/doc/user_guide/testbed/experimental_results.md +++ b/doc/user_guide/testbed/experimental_results.md @@ -58,7 +58,42 @@ When using a {ref}`cross-validation` for performance evaluation, a model is trai ## Prediction Characteristics -TODO +By using the command line argument ``--print-prediction-characteristics``, characteristics regarding a model's predictions can be printed: + +```text +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-prediction-characteristics true +``` + +Alternatively, the statistics can be written into a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file by using the argument ``--store-prediction-characteristics``: + +```text +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-prediction-characteristics true +``` + +```{tip} +The output produced by the arguments ``--print-prediction-characteristics`` and ``--store-prediction-characteristics`` can be customized via several options described {ref}`here`. It is possible to exclude certain statistics from the output, to specify whether they should be given as percentages, and how many decimal places should be used. +``` + +The statistics obtained via the arguments given above correspond to the test data for which predictions are obtained from the model. Consequently, they depend on the strategy used for splitting a dataset into training and test sets. 
When using {ref}`train-test-split`, predictions for a single test set are obtained and their characteristics are written into a file. In addition, statistics for the training data are written into an additional output file when using an {ref}`evaluating-training-data`: + +- `prediction_characteristics_train_overall.csv` +- `prediction_characteristics_test_overall.csv` + +When using a {ref}`cross-validation`, the data is split into several parts of which each one is used once for prediction. Multiple output files are needed to save the statistics for different cross validation folds. For example, a 5-fold cross validation results in the following files: + +- `prediction_characteristics_fold-1.csv` +- `prediction_characteristics_fold-2.csv` +- `prediction_characteristics_fold-3.csv` +- `prediction_characteristics_fold-4.csv` +- `prediction_characteristics_fold-5.csv` + +The statistics obtained via the previous commands include the following: + +- The number of labels for which predictions have been obtained. +- The percentage of labels predicted as irrelevant for all examples, indicating the sparsity of the prediction matrix. +- The average label cardinality, i.e., the average number of labels predicted as relevant for each example. +- The number of distinct label vectors, i.e., the number of unique label combinations, predicted for different examples. +- The *label imbalance ratio* [^charte2013] that measures the imbalance between labels predicted as relevant and irrelevant, respectively. (output-data-characteristics)=
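Reviewer note on patch 3/3: the statistics listed in the new "Prediction Characteristics" section can be illustrated with a small, self-contained sketch. It is not the project's implementation. The function name `prediction_characteristics`, the dictionary keys, and the example matrix are all hypothetical, and sparsity is assumed to mean the share of zero entries in a binary prediction matrix (rows are examples, columns are labels). The label imbalance ratio is left out because the patch does not spell out its formula.

```python
from typing import Dict, Sequence


def prediction_characteristics(predictions: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Summarize a binary prediction matrix (rows = examples, columns = labels).

    Hypothetical helper for illustration only; names and definitions are assumptions.
    """
    num_examples = len(predictions)
    num_labels = len(predictions[0]) if num_examples > 0 else 0
    num_cells = num_examples * num_labels
    # Total number of (example, label) pairs predicted as relevant (value 1)
    num_relevant = sum(sum(row) for row in predictions)
    return {
        # Number of labels for which predictions have been obtained
        "num_labels": num_labels,
        # Assumed definition: percentage of entries predicted as irrelevant (value 0)
        "sparsity_percent": 100.0 * (num_cells - num_relevant) / num_cells if num_cells else 0.0,
        # Average number of labels predicted as relevant per example
        "label_cardinality": num_relevant / num_examples if num_examples else 0.0,
        # Number of unique label combinations among the predicted label vectors
        "distinct_label_vectors": len({tuple(row) for row in predictions}),
    }


# Four examples, four labels; the first and third rows are identical
example = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
stats = prediction_characteristics(example)
print(stats)
```

For the example matrix, 5 of the 16 entries are relevant, so the sketch reports a sparsity of 68.75%, a label cardinality of 1.25, and 3 distinct label vectors.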