From f12a418702a67b8a7f90676f95c14819182352f4 Mon Sep 17 00:00:00 2001 From: Michael Rapp Date: Sun, 5 May 2024 00:13:38 +0200 Subject: [PATCH 1/3] Add section "Evaluation Results" to the documentation of the command line API. --- .../testbed/experimental_results.md | 36 +++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/doc/user_guide/testbed/experimental_results.md b/doc/user_guide/testbed/experimental_results.md index 33ddfdf048..1be6a18933 100644 --- a/doc/user_guide/testbed/experimental_results.md +++ b/doc/user_guide/testbed/experimental_results.md @@ -14,13 +14,45 @@ The path of the directory, where experimental results should be saved, can be ei ## Evaluation Results -TODO +By default, the predictive performance of all models trained during an experiment is evaluated in terms of commonly used evaluation metrics and the evaluation results are printed to the console. In addition, if the argument `--output-dir` is given, the evaluation results are also written into output files. The command line argument `--print-evaluation` can be used to explicitly enable or disable printing the evaluation results: + +```text +boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-evaluation true +``` + +Accordingly, the argument `--store-evaluation` allows to enable or disable saving the evaluation results to [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) files: + +``` +boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-evaluation true +``` + +```{tip} +The command line arguments ``--print-evaluation`` and ``--store-evaluation`` come with several options for customization described {ref}`here`. It is possible to specify the performance metrics that should be used for evaluation by providing a black- or whitelist. Moreover, one can specify whether performance scores should be given as percentages and the number of decimals used for these scores can be chosen freely. +``` + +The number of models evaluated during an experiment varies depending on the strategy used for splitting the available data into training and test sets. When using {ref}`train-test-split`, only a single model is evaluated. The performance scores according to different metrics that assess the quality of the model's predictions are saved to a single output file. In addition, if an {ref}`evaluating-training-data` is used, the performance scores for the model's predictions on the training set are also evaluated and written to a file. As shown below, the names of the output files specify whether predictions for the training or test set have been evaluated: + +- `evaluation_train_overall.csv` +- `evaluation_test_overall.csv` + +When using a {ref}`cross-validation`, a model is trained and evaluated for each fold. Again, the names of the output files specify whether predictions for the training or test data have been evaluated: + +- `evaluation_train_fold-1.csv` +- `evaluation_test_fold-1.csv` +- `evaluation_train_fold-2.csv` +- `evaluation_test_fold-2.csv` +- `evaluation_train_fold-3.csv` +- `evaluation_test_fold-3.csv` +- `evaluation_train_fold-4.csv` +- `evaluation_test_fold-4.csv` +- `evaluation_train_fold-5.csv` +- `evaluation_test_fold-5.csv` (output-predictions)= ## Predictions -In cases where the performance metrics obtained via the arguments ``--print-evaluation`` or ``--store-evaluation`` are not sufficient for a detailed analysis, it may be desired to directly inspect the predictions provided by the evaluated models. 
They can be printed on the console, together with the ground truth labels, by providing the argument ``--print-predictions``: +In cases where the {ref}`output-evaluation-results` obtained via the arguments ``--print-evaluation`` or ``--store-evaluation`` are not sufficient for a detailed analysis, it may be desired to directly inspect the predictions provided by the evaluated models. They can be printed on the console, together with the ground truth labels, by providing the argument ``--print-predictions``:

```text
boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-predictions true

From 5af763b0e20e720c817a4de6253b0121961f54e4 Mon Sep 17 00:00:00 2001 From: Michael Rapp Date: Sun, 5 May 2024 00:30:54 +0200 Subject: [PATCH 2/3] Add code examples for SeCo algorithm to documentation. --- doc/user_guide/testbed/arguments.md | 28 ++- doc/user_guide/testbed/evaluation.md | 172 +++++++++++--- .../testbed/experimental_results.md | 224 ++++++++++++++---- doc/user_guide/testbed/model_persistence.md | 14 +- .../testbed/parameter_persistence.md | 30 ++- doc/user_guide/testbed/pre_processing.md | 14 +- 6 files changed, 379 insertions(+), 103 deletions(-) diff --git a/doc/user_guide/testbed/arguments.md b/doc/user_guide/testbed/arguments.md index 93b6d4d1af..2e5379f3fb 100644 --- a/doc/user_guide/testbed/arguments.md +++ b/doc/user_guide/testbed/arguments.md @@ -425,12 +425,28 @@ In accordance with the syntax that is typically used by command line programs, t For example, the value of the parameter `feature_binning` may be set as follows:

-```text
-boomer --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width
+ ```
+````

Some algorithmic parameters, including the parameter `feature_binning`, allow to specify additional options as key-value pairs by using a {ref}`bracket-notation`. This is also supported by the command line API, where the options may not contain any spaces and special characters like `{` or `}` must be escaped by using single-quotes (`'`):

-```text
-boomer --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width'{bin_ratio=0.33,min_bins=2,max_bins=64}'
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width'{bin_ratio=0.33,min_bins=2,max_bins=64}'
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset name --feature-binning equal-width'{bin_ratio=0.33,min_bins=2,max_bins=64}'
+ ```
+````
diff --git a/doc/user_guide/testbed/evaluation.md b/doc/user_guide/testbed/evaluation.md index 83e3c14e56..61fedea2b7 100644 --- a/doc/user_guide/testbed/evaluation.md +++ b/doc/user_guide/testbed/evaluation.md @@ -14,17 +14,33 @@ Several strategies for splitting the available data into distinct training and t The simplest and computationally least demanding strategy for obtaining training and test sets is to randomly split the available data into two, mutually exclusive, parts. 
This strategy, which is used by default, if not specified otherwise, can be used by providing the argument `--data-split train-test` to the command line API: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split train-test -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split train-test + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split train-test + ``` +```` Following the argument `--dataset`, the program loads the training data from a file named `dataset-name_training.arff`. Similarly, it expects the test data to be stored in a file named `dataset-name_test.arff`. If these files are not available, the program searches for a file with the name `dataset-name.arff` and splits it into training and test data automatically. When it is the responsibility of the command line API to split a given dataset into training and test tests, 66% of the data are included in the training set, whereas the remaining 33% are part of the test set. Although this ratio is frequently used in machine learning, you can easily adjust it by providing the option `test_size`: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}' -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}' + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'train-test{test_size=0.25}' + ``` +```` This command instructs the command line API to include 75% of the available data in the training set and use the remaining 25% for the test set. @@ -34,21 +50,47 @@ This command instructs the command line API to include 75% of the available data A more elaborate strategy for splitting data into training and test sets, which results in more realistic performance estimates, but also entails greater computational costs, is referred to as [cross validation]() (CV). The basic idea is to split the available data into several, equally-sized, parts. Afterwards, several machine learning models are trained and evaluated on different portions of the data using the same learning method. Each of these parts are used for testing exactly once, whereas the remaining ones make up the training set. The performance estimates that are obtained for each of these subsequent runs, referred to as *folds*, are finally averaged to obtain a single score and corresponding [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). The command line API can be instructed to perform a cross validation using the argument `--data-split cv`: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv + ``` +```` By default, a 10-fold cross validation, where ten models are trained and evaluated, is performed. The number of folds can easily be adjusted via the option `num_folds`. 
For example, the following command results in a 5-fold CV being used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5}' -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5}' + ``` +```` -```{tip} +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5}' + ``` +```` + +`````{tip} When providing the option `current_fold`, only a single fold, instead of the entire procedure, is performed. This is particularly useful, if one intends to train and evaluate the models for each individual fold in parallel on different machines. For example, the following command does only execute the second fold of a 5-fold CV: - boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}' -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}' + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split 'cv{num_folds=5,current_fold=2}' + ``` +```` +````` (evaluating-training-data)= @@ -60,15 +102,33 @@ The configuraton described in this section should only be used for testing purpo Sometimes, evaluating the performance of a model on the data it has been trained on can be helpful for analyzing the behavior of a machine learning algorithm, e.g., if one needs to check if the approach is able to fit the data accurately. For this purpose, the command line API allows to use the argument `--data-split none`, which results in the given data not being split at all. Instead, the learning algorithm is applied to the entire dataset and predictions are be obtained from the resulting model for the exact same data points. The argument can be specified as follows: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split none -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split none + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split none + ``` +```` -```{tip} +`````{tip} If you are interested in obtaining evaluation results for the training data in addition to the test data when using a train-test-split or a cross validation, as discussed above, the argument `--evaluate-training-data true` may be used: - boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv --evaluate-training-data true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv --evaluate-training-data true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --data-split cv --evaluate-training-data true + ``` +```` +````` (prediction-types)= @@ -82,9 +142,17 @@ The metrics for evaluating the quality of predictions that have been obtained fo We refer to real-valued predictions, which may be positive or negative, as *regression scores*. In the context of multi-label classification, positive scores indicate a preference towards predicting a label as relevant, whereas negative scores are predicted for labels that are more likely to be irrelevant. The absolute size of the scores corresponds to the confidence of the predictions, i.e., if a large value is predicted for a label, the model is more certain about the correctness of the predicted outcome. 
Unlike {ref}`probability-estimates`, regression scores are not bound to a certain interval and can be arbitrary positive or negative values. The BOOMER algorithm uses regression scores as a basis for predicting probabilities or binary labels. If you want to evaluate the quality of the regression scores directly, instead of transforming them into probabilities or binary predictions, the argument `--prediction-type scores` may be passed to the command line API: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type scores -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type scores + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type scores + ``` +```` For evaluating the quality of regression scores, [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics) provided by the [scikit-learn](https://scikit-learn.org) framework are used. @@ -94,9 +162,17 @@ For evaluating the quality of regression scores, [multi-label ranking measures]( Probability estimates are given as real values between zero and one. In the context of multi-label classification, they express the probability of a label being relevant. If the predicted probability is close to zero, the corresponding label is more likely to be irrelevant, whereas a probability close to one is predicted for labels that are likely to be relevant. If you intend to evaluate the quality of probabilistic predictions, the argument `--prediction-type probabilities` should be used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type probabilities -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type probabilities + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type probabilities + ``` +```` Similar to {ref}`regression-scores`, the command line API relies on [multi-label ranking measures](https://scikit-learn.org/stable/modules/model_evaluation.html#multilabel-ranking-metrics), as implemented by the [scikit-learn](https://scikit-learn.org) framework, for evaluating probability estimates. @@ -104,9 +180,17 @@ Similar to {ref}`regression-scores`, the command line API relies on [multi-label The most common type of prediction used for multi-label classification are binary predictions that directly indicate whether a label is considered as irrelevant or relevant. Irrelevant labels are represented by the value `0`, whereas the value `1` is predicted for relevant labels. By default, the command line API instructs the learning method to provide binary predictions. 
If you want to explicitly instruct it to use this particular type of predictions, you can use the argument `--prediction-type binary`: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type binary -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type binary + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --prediction-type binary + ``` +```` In a multi-label setting, the quality of binary predictions is assessed in terms of commonly used [multi-label classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) implemented by the [scikit-learn](https://scikit-learn.org) framework. If a dataset contains only a single label, the evaluation is restricted to classification metrics that are suited for single-label classification problems. @@ -114,9 +198,17 @@ In a multi-label setting, the quality of binary predictions is assessed in terms When evaluating the predictive performance of an [ensemble method](https://en.wikipedia.org/wiki/Ensemble_learning), i.e., models that consist of several weak predictors, also referred to as *ensemble members*, the command line API supports to evaluate these models incrementally. In particular, rule-based machine learning algorithms like the ones implemented by this project are often considered as ensemble methods, where each rule in a model can be viewed as a weak predictor. Adding more rules to a model typically results in better predictive performance. However, adding too many rules may result in overfitting the training data and therefore achieving subpar performance on the test data. For analyzing such behavior, the arugment `--incremental-evaluation true` may be passed to the command line API: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation true + ``` +```` When using the above command, the rule-based model that is learned by the BOOMER algorithm is evaluated repeatedly as more rules are added to it. Evaluation results are obtained for a model consisting of a single rule, two rules, three rules, and so on. Of course, because the evaluation is performed multiple times, this evaluation strategy comes with a large computational overhead. Therefore, depending on the size of the final model, it might be necessary to limit the number of evaluations via the following options: @@ -126,6 +218,14 @@ When using the above command, the rule-based model that is learned by the BOOMER For example, the following command may be used for the incremental evaluation of a BOOMER model that consists of up to 1000 rules. The model is evaluated for the first time after 200 rules have been added. Subsequent evaluations are perfomed when the model comprises 400, 600, 800, and 1000 rules. 
-```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}' -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}' + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --incremental-evaluation 'true{min_size=200,max_size=1000,step_size=200}' + ``` +```` diff --git a/doc/user_guide/testbed/experimental_results.md b/doc/user_guide/testbed/experimental_results.md index 1be6a18933..c28d53230a 100644 --- a/doc/user_guide/testbed/experimental_results.md +++ b/doc/user_guide/testbed/experimental_results.md @@ -16,15 +16,31 @@ The path of the directory, where experimental results should be saved, can be ei By default, the predictive performance of all models trained during an experiment is evaluated in terms of commonly used evaluation metrics and the evaluation results are printed to the console. In addition, if the argument `--output-dir` is given, the evaluation results are also written into output files. The command line argument `--print-evaluation` can be used to explicitly enable or disable printing the evaluation results: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-evaluation true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-evaluation true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --print-evaluation true + ``` +```` Accordingly, the argument `--store-evaluation` allows to enable or disable saving the evaluation results to [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) files: -``` -boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-evaluation true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-evaluation true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-evaluation true + ``` +```` ```{tip} The command line arguments ``--print-evaluation`` and ``--store-evaluation`` come with several options for customization described {ref}`here`. It is possible to specify the performance metrics that should be used for evaluation by providing a black- or whitelist. Moreover, one can specify whether performance scores should be given as percentages and the number of decimals used for these scores can be chosen freely. @@ -54,15 +70,31 @@ When using a {ref}`cross-validation`, a model is trained and evaluated for each In cases where the {ref}`output-evaluation-results` obtained via the arguments ``--print-evaluation`` or ``--store-evaluation`` are not sufficient for a detailed analysis, it may be desired to directly inspect the predictions provided by the evaluated models. 
They can be printed on the console, together with the ground truth labels, by providing the argument ``--print-predictions``:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-predictions true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-predictions true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --print-predictions true
+ ```
+````

Alternatively, the argument ``--store-predictions`` can be used to save the predictions, as well as the ground truth labels, to [.arff](http://weka.wikispaces.com/ARFF) files:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-predictions true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-predictions true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-predictions true
+ ```
+````

```{tip}
Depending on the {ref}`prediction-types`, the machine learning models used in an experiment are supposed to provide, the predictions stored in the resulting output files are either binary values (if binary predictions are provided), or real values (if regression scores or probability estimates are provided). When working with real-valued predictions, the option ``decimals`` may be supplied to the arguments ``--print-predictions`` and ``--store-predictions`` to specify the number of decimals that should be included in the output (see {ref}`here` for more information).
@@ -92,15 +124,31 @@ When using a {ref}`cross-validation` for performance evaluation, a model is trai By using the command line argument ``--print-prediction-characteristics``, characteristics regarding a model's predictions can be printed:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-prediction-characteristics true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-prediction-characteristics true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --print-prediction-characteristics true
+ ```
+````

Alternatively, these statistics can be written into a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file by using the argument ``--store-prediction-characteristics``:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-prediction-characteristics true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-prediction-characteristics true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-prediction-characteristics true
+ ```
+````

```{tip}
The output produced by the arguments ``--print-data-characteristics`` and ``--store-data-characteristics`` can be customized via several options described {ref}`here`. It is possible to exclude certain statistics from the output, to specify whether they should be given as percentages, and how many decimal places should be used. 
@@ -133,15 +181,31 @@ The statistics obtained via the previous commands include the following: To obtain insightful statistics regarding the characteristics of a data set, the command line argument ``--print-data-characteristics`` may be helpful: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-data-characteristics true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-data-characteristics true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --print-data-characteristics true + ``` +```` If you prefer to write the statistics into a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file, the argument ``--store-data-characteristics`` can be used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-data-characteristics true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-data-characteristics true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-data-characteristics true + ``` +```` ```{tip} As shown {ref}`here`, the arguments ``--print-data-characteristics`` and ``--store-data-characteristics`` come with several options that allow to exclude specific statistics from the respective output. It is also possible to specify whether percentages should be prefered for presenting the statistics. Additionally, the number of decimals to be included in the output can be limited. @@ -178,15 +242,31 @@ In addition, the following statistics regarding the labels in a dataset are prov We refer to the unique labels combinations present for different examples in a dataset as label vectors. They can be printed by using the command line argument ``--print-label-vectors``: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-label-vectors true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-label-vectors true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --print-label-vectors true + ``` +```` If you prefer writing the label vectors into an output file, the argument ``--store-label-vectors`` can be used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --store-label-vectors true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --store-label-vectors true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --store-label-vectors true + ``` +```` When using {ref}`train-test-split` for splitting the available data into distinct training and test sets, a single output file is created. 
It stores the label vectors present in the training data: @@ -218,15 +298,31 @@ By setting the option ``sparse`` to the value ``true``, an alternative represent To obtain a quick overview of some statistics that characterize a rule-based model learned by one of the algorithms provided by this project, the command line argument ``--print-model-characteristics`` can be useful: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-model-characteristics true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-model-characteristics true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --print-model-characteristics true + ``` +```` The above command results in a tabular representation of the characteristics being printed on the console. If one intends to write them into a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file instead, the argument ``--store-model-characteristics`` may be used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-model-characteristics true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-model-characteristics true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-model-characteristics true + ``` +```` Model characteristics are obtained for each model training during an experiment. This means that a single output file is created when using on {ref}`train-test-split`: @@ -252,15 +348,31 @@ The statistics captured by the previous commands include the following: It is considered one of the advantages of rule-based machine learning models that they capture patterns found in the training data in a human-comprehensible form. This enables to manually inspect the models and reason about their predictive behavior. To help with this task, the command line API allows to output the rules in a model using a textual representation. If the text should be printed on the console, the following command specifying the argument ``--print-rules`` can be used: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-rules true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-rules true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --print-rules true + ``` +```` Alternatively, by using the argument ``--store-rules``, a textual representation of models can be written into a text file in the specifed output directory: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-rules true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-rules true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --output-dir /path/to/results/ --store-rules true + ``` +```` ```{tip} Both, the ``--print-rules`` and ``--store-rules`` arguments, come with several options that allow to customize the textual representation of models. An overview of these options is provided {ref}`here`. 
@@ -302,15 +414,31 @@ Examples that satisfy all conditions in a rule's body are said to be "covered" b Some machine learning algorithms provided by this project allow to obtain probabilistic predictions. These predictions can optionally be fine-tuned via calibration models to improve the reliability of the probability estimates. We support two types of calibration models for tuning marginal and joint probabilities, respectively. If one needs to inspect these calibration models, the command line arguments ``--print-marginal-probability-calibration-model`` and ``--print-joint-probability-calibration-model`` may be helpful:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-marginal-probability-calibration-model true --print-joint-probabiliy-calibration-model true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-marginal-probability-calibration-model true --print-joint-probability-calibration-model true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --print-marginal-probability-calibration-model true --print-joint-probability-calibration-model true
+ ```
+````

Alternatively, representations of the calibration models can be written into [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) files by using the arguments ``--store-marginal-probability-calibration-model`` and ``--store-joint-probability-calibration-model``:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --store-marginal-probability-calibration-model true --store-joint-probabiliy-calibration-model true
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --store-marginal-probability-calibration-model true --store-joint-probability-calibration-model true
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --store-marginal-probability-calibration-model true --store-joint-probability-calibration-model true
+ ```
+````

```{tip}
All of the above commands come with options for customizing the textual representation of models. A more detailed description of these options is available {ref}`here`.
diff --git a/doc/user_guide/testbed/model_persistence.md b/doc/user_guide/testbed/model_persistence.md index 9ebe99f66f..b70ceee754 100644 --- a/doc/user_guide/testbed/model_persistence.md +++ b/doc/user_guide/testbed/model_persistence.md @@ -4,9 +4,17 @@ Because the training of machine learning models can be time-consuming, they are usually trained once and then reused later for making predictions. For this purpose, the command line API provides means to store models on disk and load them from the created files later on. This requires to specify the path of a directory, where models should be saved, via the command line argument `--model-dir`:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --model-dir /path/to/models
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --model-dir /path/to/models
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --model-dir /path/to/models
+ ```
+````

```{note}
The path of the directory, where models should be saved, can be either absolute or relative to the working directory. 
diff --git a/doc/user_guide/testbed/parameter_persistence.md b/doc/user_guide/testbed/parameter_persistence.md index def523bff5..b817465da5 100644 --- a/doc/user_guide/testbed/parameter_persistence.md +++ b/doc/user_guide/testbed/parameter_persistence.md @@ -4,11 +4,19 @@ To remember the parameters that have been used for training a model, it might be useful to save them to disk. Similar to {ref}`model-persistence`, keeping the resulting files allows to load a previously used configuration and reuse it at a later point in time.

-On the one hand, this requires to specify a directory where parameter settings should be saved via the command line argument `--parameter-dir`. On the other hand, the argument `--store-parameters true` instructs the program to save custom parameters that are set via command line argments (see {ref}`setting-algorithmic-parameters`). For example, the following command sets a custom value for the parameter `shrinkage`, which is stored in an output file:
+On the one hand, this requires to specify a directory where parameter settings should be saved via the command line argument `--parameter-dir`. On the other hand, the argument `--store-parameters true` instructs the program to save custom parameters that are set via command line arguments (see {ref}`setting-algorithmic-parameters`). For example, the following command sets a custom value for a parameter, which is stored in an output file:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --parameter-dir /path/to/parameters --store-parameters true --shrinkage 0.5
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --parameter-dir /path/to/parameters --store-parameters true --shrinkage 0.5
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --parameter-dir /path/to/parameters --store-parameters true --heuristic precision
+ ```
+````

```{note}
The path of the directory, where parameter settings should be saved, can be either absolute or relative to the working directory.
@@ -34,6 +42,14 @@ When executing the previously mentioned command again, the program restores the If you want to print all custom parameters that are used by a learning algorithm on the console, you can specify the argument `--print-parameters true`:

-```text
-boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-parameters true --shrinkage 0.5
-```
+````{tab} BOOMER
+ ```text
+ boomer --data-dir /path/to/datasets/ --dataset dataset-name --print-parameters true --shrinkage 0.5
+ ```
+````
+
+````{tab} SeCo
+ ```text
+ seco --data-dir /path/to/datasets/ --dataset dataset-name --print-parameters true --heuristic precision
+ ```
+````
diff --git a/doc/user_guide/testbed/pre_processing.md b/doc/user_guide/testbed/pre_processing.md index 345ef03941..6b822ce4dd 100644 --- a/doc/user_guide/testbed/pre_processing.md +++ b/doc/user_guide/testbed/pre_processing.md @@ -14,8 +14,16 @@ Not all machine learning methods can deal with nominal or binary features out-of Even though nominal and binary features are natively supported in an efficient way by all algorithms provided by this project, it might still be useful to use one-hot-encoding if one seeks a fair comparison with machine learning approaches that cannot deal with such features. 
In such cases, you can provide the argument `--one-hot-encoding true` to the command line API: -```text -boomer --data-dir /path/to/datasets/ --dataset dataset-name --one-hot-encoding true -``` +````{tab} BOOMER + ```text + boomer --data-dir /path/to/datasets/ --dataset dataset-name --one-hot-encoding true + ``` +```` + +````{tab} SeCo + ```text + seco --data-dir /path/to/datasets/ --dataset dataset-name --one-hot-encoding true + ``` +```` Under the hood, the program makes use of scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for pre-processing the data. From dc153c255c0f84f4a1833709221ad93efc2ddb60 Mon Sep 17 00:00:00 2001 From: michael-rapp <6638695+michael-rapp@users.noreply.github.com> Date: Sat, 4 May 2024 22:40:58 +0000 Subject: [PATCH 3/3] [Bot] Merge bugfix into feature branch. --- VERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/VERSION b/VERSION index f514a2f0bd..2774f8587f 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.9.1 \ No newline at end of file +0.10.0 \ No newline at end of file