
Version 0.9.0

@michael-rapp michael-rapp released this 02 Jul 16:39

This is a major update to the BOOMER algorithm that introduces the following changes.

This release comes with several API changes. For an updated overview of the available parameters and command line arguments, please refer to the documentation.

Algorithmic Enhancements

  • Sparse matrices can now be used to store gradients and Hessians if supported by the loss function. The desired behavior can be specified via a new parameter --statistic-format.
  • Rules with partial heads can now be learned by setting the parameter --head-type to the value partial-fixed (if the number of predicted labels should be predefined) or partial-dynamic (if the subset of predicted labels should be determined dynamically).
  • A beam search can now be used for the induction of individual rules by setting the parameter --rule-induction to the value top-down-beam-search.
  • Variants of the squared error loss and squared hinge loss, which take all labels of an example into account at the same time, can now be used by setting the parameter --loss to the value squared-error-example-wise or squared-hinge-example-wise.
  • Probability estimates can be obtained for each label independently or via marginalization over the label vectors encountered in the training data by setting the new parameter --probability-predictor to the value label-wise or marginalized.
  • Predictions that maximize the example-wise F1-measure can now be obtained by setting the parameter --classification-predictor to the value gfm.
  • Binary predictions can now be derived from probability estimates by specifying the new option based_on_probabilities.
  • Isotonic regression models can now be used to calibrate marginal and joint probabilities predicted by a model via the new parameters --marginal-probability-calibration and --joint-probability-calibration.
  • The rules in a previously learned model can now be post-optimized by reconstructing each one of them in the context of the other rules via the new parameter --sequential-post-optimization.
  • Early stopping or post-pruning can now be used by setting the new parameter --global-pruning to the value pre-pruning or post-pruning.
  • Single labels can now be sampled in a round-robin fashion by setting the parameter --feature-sampling to the new value round-robin.
  • A fixed number of trailing features can now be retained when the parameter --feature-sampling is set to the value without-replacement by specifying the option num_retained.
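Several of the new parameters can be combined in a single run. The following is a hypothetical invocation sketching how they might be used together; the executable name `boomer`, the data set paths, and the particular argument values are placeholders for illustration, not defaults, and only the argument names themselves are taken from the list above. Please consult the documentation for the values each parameter actually accepts.

```sh
# Hypothetical example combining several new 0.9.0 parameters.
# Executable name, paths, and values are illustrative placeholders.
boomer --data-dir /path/to/data --dataset example \
    --statistic-format sparse \
    --head-type partial-dynamic \
    --rule-induction top-down-beam-search \
    --loss squared-error-example-wise \
    --global-pruning post-pruning
```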

Additions to the Command Line API

  • Data sets in the MEKA format are now supported.
  • Certain characteristics of binary predictions can be printed or written to output files via the new arguments --print-prediction-characteristics and --store-prediction-characteristics.
  • Unique label vectors contained in the training data can be printed or written to output files via the new arguments --print-label-vectors and --store-label-vectors.
  • Models for the calibration of marginal or joint probabilities can be printed or written to output files via the new arguments --print-marginal-probability-calibration-model, --store-marginal-probability-calibration-model, --print-joint-probability-calibration-model and --store-joint-probability-calibration-model.
  • Models can now be evaluated repeatedly, using a subset of their rules with increasing size, by specifying the argument --incremental-prediction.
  • More fine-grained control over how data is split into training and test sets is now provided by the argument --data-split, which replaces the arguments --folds and --current-fold.
  • Binary labels, regression scores, or probabilities can now be predicted, depending on the value of the new argument --prediction-type, which can be set to the values binary, scores, or probabilities.
  • Individual evaluation measures can now be enabled or disabled via additional options that have been added to the arguments --print-evaluation and --store-evaluation.
  • The presentation of values printed on the console has been vastly improved. In addition, options for controlling how values are presented when printed or written to output files have been added to various command line arguments.
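As a hypothetical sketch, the new output-related arguments might be combined as follows; the executable name `boomer`, the paths, and the values shown are illustrative placeholders, and only the argument names are taken from the list above.

```sh
# Hypothetical example of the new output-related arguments.
# Executable name, paths, and values are illustrative placeholders.
boomer --data-dir /path/to/data --dataset example \
    --data-split cross-validation \
    --prediction-type binary \
    --print-prediction-characteristics true \
    --store-label-vectors true \
    --output-dir /path/to/output
```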

Bugfixes

  • The behavior of the parameter --label-format has been fixed when set to the value auto.
  • The behavior of the parameters --holdout and --instance-sampling has been fixed when set to the value stratified-label-wise.
  • The behavior of the parameter --binary-predictor has been fixed when set to the value example-wise and using a model that has been loaded from disk.
  • Rules are now guaranteed not to cover fewer examples than specified via the option min_coverage. The option is now also taken into account when using feature binning. Alternatively, the minimum coverage of rules can now be specified as a fraction via the option min_support.

API Changes

  • The parameter --early-stopping has been replaced with a new parameter --global-pruning.
  • The parameter --pruning has been renamed to --rule-pruning.
  • The parameter --classification-predictor has been renamed to --binary-predictor.
  • The command line argument --predict-probabilities has been replaced with a new argument --prediction-type.
  • The command line argument --predicted-label-format has been renamed to --prediction-format.
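The renamed and replaced arguments above amount to a mechanical migration of existing command lines. The following before/after sketch is hypothetical; the executable name `boomer` and the argument values are placeholders chosen for illustration, and only the mapping of argument names follows from the list above.

```sh
# Hypothetical migration example; executable name and values are placeholders.

# 0.8.x
boomer --pruning irep --classification-predictor label-wise \
    --predict-probabilities true

# 0.9.0 equivalent
boomer --rule-pruning irep --binary-predictor label-wise \
    --prediction-type probabilities
```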

Quality-of-Life Improvements

  • Continuous integration is now used to test the most common functionalities of the BOOMER algorithm and the corresponding command line API.
  • Successful generation of the documentation is now tested via continuous integration.
  • Style definitions for Python and C++ code are now enforced by applying the tools clang-format, yapf, and isort via continuous integration.