diff --git a/CHANGELOG.md b/CHANGELOG.md index 0c9eea4944..33cc4e0c86 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -92,7 +92,7 @@ This release comes with several API changes. For an updated overview of the avai - Development versions of wheel packages are now regularly built via continuous integration, uploaded as artifacts, and published on [Test-PyPI](https://test.pypi.org/). - Continuous integration is now used to maintain separate branches for major, feature, and bugfix releases and keep them up-to-date. - The runtime of continuous integration jobs has been optimized by running individual steps only if necessary, caching files across subsequent runs, and making use of parallelization. -- When tests are run via continuous integration, a summary of the test results is now added to merge requests and Github workflows. +- When tests are run via continuous integration, a summary of the test results is now added to merge requests and GitHub workflows. - Markdown files are now used for writing the documentation. - A consistent style is now enforced for Markdown files by applying the tool `mdformat` via continuous integration. - C++ 17 or newer is now required for compiling the project. @@ -149,7 +149,7 @@ This release comes with several API changes. For an updated overview of the avai ### Quality-of-Life Improvements -- Continuous integration is now used to test the most common functionalites of the BOOMER algorithm and the corresponding command line API. +- Continuous integration is now used to test the most common functionalities of the BOOMER algorithm and the corresponding command line API. - Successful generation of the documentation is now tested via continuous integration. - Style definitions for Python and C++ code are now enforced by applying the tools `clang-format`, `yapf`, and `isort` via continuous integration. @@ -185,8 +185,8 @@ This release comes with changes to the command line API. For an updated overview A bugfix release that solves the following issues: - Fixes an issue preventing the use of dense representations of ground truth label matrices that was introduced in version 0.7.0. -- Pre-built packages for MacOS systems are now available at [PyPI](https://pypi.org/project/mlrl-boomer/). -- Linux and MacOS packages for Python 3.10 are now provided. +- Pre-built packages for macOS systems are now available at [PyPI](https://pypi.org/project/mlrl-boomer/). +- Linux and macOS packages for Python 3.10 are now provided. ## Version 0.7.0 (Dec. 5, 2021) @@ -266,7 +266,7 @@ A major update to the BOOMER algorithm that features the following changes: - Includes many refactorings and quality of live improvements. Code that is not directly related with the algorithm, such as the implementation of baselines, has been removed. - The algorithm is now able to natively handle nominal features without the need for pre-processing techniques such as one-hot encoding. - Sparse feature matrices can now be used for training and prediction, which reduces the memory footprint and results in a significant speed-up of training times on some data sets. -- Additional hyper-parameters (`min_coverage`, `max_conditions` and `max_head_refinements`) that provide fine-grained control over the specificity/generality of rules have been added. +- Additional hyperparameters (`min_coverage`, `max_conditions` and `max_head_refinements`) that provide fine-grained control over the specificity/generality of rules have been added. ## Version 0.1.0 (Jun. 
22, 2020) diff --git a/README.md b/README.md index 1eeaba9e9c..56d1b023b1 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ The algorithm that is provided by this project currently supports the following - **Rules can be constructed via a greedy search or a beam search.** The latter may help to improve the quality of individual rules. - **Sampling techniques and stratification methods** can be used for learning new rules on a subset of the available training examples, features, or output variables. - **Shrinkage (a.k.a. the learning rate) can be adjusted** for controlling the impact of individual rules on the overall ensemble. -- **Fine-grained control over the specificity/generality of rules** is provided via hyper-parameters. +- **Fine-grained control over the specificity/generality of rules** is provided via hyperparameters. - **Incremental reduced error pruning** can be used for removing overly specific conditions from rules and preventing overfitting. - **Post- and pre-pruning (a.k.a. early stopping)** allows to determine the optimal number of rules to be included in an ensemble. - **Sequential post-optimization** may help improving the predictive performance of a model by reconstructing each rule in the context of the other rules. diff --git a/doc/developer_guide/coding_standards.md b/doc/developer_guide/coding_standards.md index 350d8edc1b..db1e4c4f88 100644 --- a/doc/developer_guide/coding_standards.md +++ b/doc/developer_guide/coding_standards.md @@ -8,23 +8,23 @@ As it is common for Open Source projects, where everyone is invited to contribut ## Continuous Integration -We make use of [Github Actions](https://docs.github.com/en/actions) as a [Continuous Integration](https://en.wikipedia.org/wiki/Continuous_integration) (CI) server for running predefined jobs, such as automated tests, in a controlled environment. Whenever certain parts of the project's repository have changed, relevant jobs are automatically executed. +We make use of [GitHub Actions](https://docs.github.com/en/actions) as a [Continuous Integration](https://en.wikipedia.org/wiki/Continuous_integration) (CI) server for running predefined jobs, such as automated tests, in a controlled environment. Whenever certain parts of the project's repository have changed, relevant jobs are automatically executed. ```{tip} -A track record of past runs can be found on Github in the [Actions](https://github.com/mrapp-ke/MLRL-Boomer/actions) tab. +A track record of past runs can be found on GitHub in the [Actions](https://github.com/mrapp-ke/MLRL-Boomer/actions) tab. ``` The workflow definitions of individual CI jobs can be found in the directory [.github/workflows/](https://github.com/mrapp-ke/MLRL-Boomer/tree/8ed4f36af5e449c5960a4676bc0a6a22de195979/.github/workflows). Currently, the following jobs are used in the project: -- `publish.yml` is used for publishing pre-built packages on [PyPI](https://pypi.org/) (see {ref}`installation`). For this purpose, the project is built from source for each of the target platforms and architectures, using virtualization in some cases. The job is run automatically when a new release was published on [Github](https://github.com/mrapp-ke/MLRL-Boomer/releases). It does also increment the project's major version number and merge the release branch into its upstream branches (see {ref}`release-process`). +- `publish.yml` is used for publishing pre-built packages on [PyPI](https://pypi.org/) (see {ref}`installation`). 
For this purpose, the project is built from source for each of the target platforms and architectures, using virtualization in some cases. The job is run automatically when a new release has been published on [GitHub](https://github.com/mrapp-ke/MLRL-Boomer/releases). It also increments the project's major version number and merges the release branch into its upstream branches (see {ref}`release-process`). - `publish_development.yml` publishes development versions of packages on [Test-PyPI](https://test.pypi.org/) whenever changes to the project's source code have been pushed to the main branch. The packages built by each of these runs are also saved as [artifacts](https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts) and can be downloaded as zip archives. - `test_publish.yml` ensures that the packages to be released for different architectures and Python versions can be built. The job is run for pull requests that modify relevant parts of the source code. -- `test_build.yml` builds the project for each of the supported target platforms, i.e., Linux, Windows, and MacOS (see {ref}`compilation`). In the Linux environment, this job does also execute all available unit and integration tests (see {ref}`testing`). It is run for pull requests whenever relevant parts of the project's source code have been modified. +- `test_build.yml` builds the project for each of the supported target platforms, i.e., Linux, Windows, and macOS (see {ref}`compilation`). In the Linux environment, this job also executes all available unit and integration tests (see {ref}`testing`). It is run for pull requests whenever relevant parts of the project's source code have been modified. - `test_doc.yml` generates the latest documentation (see {ref}`documentation`) whenever relevant parts of the source code are affected by a pull request. - `test_format.yml` ensures that all source files in the project adhere to our coding style guidelines (see {ref}`code-style`). This job is run automatically for pull requests whenever they include any changes affecting the relevant source files. - `test_changelog.yml` ensures that all changelog files in the project adhere to the structure that is necessary to be processed automatically when publishing a new release. This job is run for pull requests if they modify one of the changelog files. - `merge_feature.yml` and `merge_bugfix.yml` are used to merge changes that have been pushed to the feature or bugfix branch into downstream branches via pull requests (see {ref}`release-process`). -- `merge_release.yml` is used to merge all changes included in a new release published on [Github](https://github.com/mrapp-ke/MLRL-Boomer/releases) into upstream branches and update the version numbers of these branches. +- `merge_release.yml` is used to merge all changes included in a new release published on [GitHub](https://github.com/mrapp-ke/MLRL-Boomer/releases) into upstream branches and update the version numbers of these branches. (testing)= @@ -38,7 +38,7 @@ To be able to detect problems with the project's source code early during develo ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build tests ``` ```` @@ -58,7 +58,7 @@ This will result in all tests being run and their results being reported.
If the ``` ```` -````{tab} MacOS +````{tab} macOS ```text SKIP_EARLY=true ./build tests ``` @@ -100,7 +100,7 @@ If you have modified the project's source code, you can check whether it adheres ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build test_format ``` @@ -124,7 +124,7 @@ In order to automatically format the project's source files according to our sty ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build format ``` @@ -178,7 +178,7 @@ To enable releasing new major, feature, or bugfix releases at any time, we maint We do not allow directly pushing to the above branches. Instead, all changes must be submitted via pull requests and require certain checks to pass. -Once modifications to one of the branches have been merged, {ref}`Continuous Integration ` jobs are used to automatically update downstream branches via pull requests. If all checks for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature brach are merged into the main branch (see `merge_feature.yml`), whereas changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`). +Once modifications to one of the branches have been merged, {ref}`Continuous Integration ` jobs are used to automatically update downstream branches via pull requests. If all checks for such pull requests are successful, they are merged automatically. If there are any merge conflicts, they must be resolved manually. Following this procedure, changes to the feature branch are merged into the main branch (see `merge_feature.yml`), whereas changes to the bugfix branch are first merged into the feature branch and then into the main branch (see `merge_bugfix.yml`). Whenever a new release has been published, the release branch is merged into the upstream branches (see `merge_release.yml`), i.e., major releases result in the feature and bugfix branches being updated, whereas minor releases result in the bugfix branch being updated. The version of the release branch and the affected branches are updated accordingly. The version of a branch is specified in the file `.version` in the project's root directory. Similarly, the file `.version-dev` is used to keep track of the version number used for development releases (see `publish_development.yml`). @@ -200,7 +200,7 @@ To ease the life of developers, the following command provided by the project's ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build check_dependencies ``` diff --git a/doc/developer_guide/compilation.md b/doc/developer_guide/compilation.md index 07bb8e422f..3c17af0042 100644 --- a/doc/developer_guide/compilation.md +++ b/doc/developer_guide/compilation.md @@ -8,7 +8,7 @@ Unlike pure Python programs, the C++ and Cython source files must be compiled fo ## Prerequisites -As a prerequisite, a supported version of Python, a suitable C++ compiler, as well as optional libraries for multi-threading and GPU support, must be available on the host system. The installation of these software components depends on the operation system at hand. In the following, we provide installation instructions for the supported platforms. +As a prerequisite, a supported version of Python, a suitable C++ compiler, as well as optional libraries for multi-threading and GPU support, must be available on the host system. The installation of these software components depends on the operating system at hand. 
In the following, we provide installation instructions for the supported platforms. ```{tip} This project uses [Meson](https://mesonbuild.com/) as a build system for compiling C++ code. If available on the system, Meson automatically utilizes [Ccache](https://ccache.dev/) for caching previous compilations and detecting when the same compilation is being done again. Compared to the runtime without Ccache, where changes are only detected at the level of entire files, usage of this compiler cache can significantly speed up recompilation and therefore is strongly adviced. @@ -29,12 +29,12 @@ This project uses [Meson](https://mesonbuild.com/) as a build system for compili ``` ```` -````{tab} MacOS +````{tab} macOS ```{list-table} * - **Python** - - Recent versions of MacOS do not include Python by default. A suitable Python version can manually be downloaded from the [project's website](https://www.python.org/downloads/macos/). Alternatively, the package manager [Homebrew]() can be used for installation via the command `brew install python`. + - Recent versions of macOS do not include Python by default. A suitable Python version can manually be downloaded from the [project's website](https://www.python.org/downloads/macos/). Alternatively, the package manager [Homebrew]() can be used for installation via the command `brew install python`. * - **C++ compiler** - - MacOS relies on the [Clang](https://en.wikipedia.org/wiki/Clang) compiler for building C++ code. It is part of the [Xcode](https://developer.apple.com/support/xcode/) developer toolset. + - macOS relies on the [Clang](https://en.wikipedia.org/wiki/Clang) compiler for building C++ code. It is part of the [Xcode](https://developer.apple.com/support/xcode/) developer toolset. * - **GoogleTest** - The [GoogleTest](https://github.com/google/googletest) framework must optionally be installed in order to compile the project with {ref}`testing support ` enabled. It can easily be installed via [Homebrew]() by runnig the command `brew install googletest`. * - **OpenMP** @@ -70,7 +70,7 @@ Instead of following the instructions below step by step, the following command, ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build ``` @@ -94,7 +94,7 @@ As shown in the section {ref}`project-structure`, this project is organized in t ``` ```` -````{tab} MacOS +````{tab} macOS ```text SUBPROJECTS=common,boosting ./build ``` @@ -110,7 +110,7 @@ As shown in the section {ref}`project-structure`, this project is organized in t ## Creating a Virtual Environment -The build process is based on an virtual Python environment that allows to install build- and run-time dependencies in an isolated manner and independently from the host system. Once the build process was completed, the resulting Python packages are installed into the virtual environment. To create new virtual environment and install all necessarily run-time dependencies, the following command must be executed: +The build process is based on a virtual Python environment that allows to install build- and run-time dependencies in an isolated manner and independently of the host system. Once the build process was completed, the resulting Python packages are installed into the virtual environment. 
To create a new virtual environment and install all necessary run-time dependencies, the following command must be executed: ````{tab} Linux ```text ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build venv ``` ```` @@ -142,7 +142,7 @@ Once a new virtual environment has successfully been created, the compilation of ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build compile_cpp ``` ```` @@ -166,7 +166,7 @@ Once the compilation of the C++ code has completed, the Cython code, which allow ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build compile_cython ``` ```` @@ -186,7 +186,7 @@ Instead of performing the previous steps one after the other, the build target ` ## Installing Shared Libraries -The shared libraries that have been created in the previous steps from the C++ source files must afterwards be copied into the Python source tree. This can be achieved by executing the following command: +The shared libraries that have been created in the previous steps from the C++ source files must afterward be copied into the Python source tree. This can be achieved by executing the following command: ````{tab} Linux ```text ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build install_cpp ``` ```` @@ -222,7 +222,7 @@ Similar to the previous step, the Python extension modules that have been built ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build install_cython ``` ```` @@ -250,7 +250,7 @@ Once the compilation files have been copied into the Python source tree, wheel p ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build build_wheels ``` ```` @@ -274,7 +274,7 @@ The wheel packages that have previously been created can finally be installed in ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build install_wheels ``` ```` @@ -298,7 +298,7 @@ It is possible to delete the compilation files that result from an individual st ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build --clean compile_cpp ``` ```` @@ -318,7 +318,7 @@ If you want to delete all compilation files that have previously been created, i ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build --clean ``` ```` diff --git a/doc/developer_guide/documentation.md b/doc/developer_guide/documentation.md index c81a84fccf..59fa832283 100644 --- a/doc/developer_guide/documentation.md +++ b/doc/developer_guide/documentation.md @@ -21,7 +21,7 @@ It is not necessary to execute the steps below one after the other. Instead, run ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build doc ``` ```` @@ -46,7 +46,7 @@ By running the following command, the C++ API documentation is generated via Dox ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build apidoc_cpp ``` ```` @@ -70,7 +70,7 @@ Similarly, the following command generates an API documentation from the project ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build apidoc_python ``` ```` @@ -96,7 +96,7 @@ To generate the final documentation's HTML files via [sphinx](https://www.sphinx ``` ```` -````{tab} MacOS +````{tab} macOS ```text ./build doc ``` ```` @@ -108,7 +108,7 @@ To generate the final documentation's HTML files via [sphinx](https://www.sphinx ``` ```` -Afterwards, the generated files can be found in the directory `doc/_build/html/`. +Afterward, the generated files can be found in the directory `doc/_build/html/`.
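Since the generated documentation consists of static HTML files, a quick way to preview it locally is Python's built-in web server. This is only a sketch, not part of the project's build system; it assumes Python 3.7 or newer, where the `http.server` module accepts the `--directory` flag:

```text
python -m http.server --directory doc/_build/html 8000
```

The rendered pages are then available at `http://localhost:8000` in a browser.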
## Cleaning up Build Files diff --git a/doc/index.md b/doc/index.md index fead866c92..82793cdb23 100644 --- a/doc/index.md +++ b/doc/index.md @@ -14,14 +14,14 @@ class: only-dark --- ``` -BOOMER is an algorithm for learning ensembles of gradient boosted multi-output rules that integrates with the popular [scikit-learn](https://scikit-learn.org) machine learning framework. It allows to train a machine learning model on labeled training data, which can afterwards be used to make predictions for unseen data. In contrast to prominent boosting algorithms like [XGBoost](https://xgboost.readthedocs.io/en/latest/) or [LightGBM](https://lightgbm.readthedocs.io/en/latest/), the algorithm is aimed at multi-output problems. On the one hand, this includes [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification) problems, where individual data examples do not only correspond to a single class, but may be associated with several labels at the same time. Real-world applications of this problem domain include the assignment of keywords to text documents, the annotation of multimedia data, such as images, videos or audio recordings, as well as applications in the field of biology, chemistry and more. On the other hand, multi-output [regression](https://en.wikipedia.org/wiki/Regression_analysis) problems require to predict for more than a single numerical output variable. +BOOMER is an algorithm for learning ensembles of gradient boosted multi-output rules that integrates with the popular [scikit-learn](https://scikit-learn.org) machine learning framework. It allows to train a machine learning model on labeled training data, which can afterward be used to make predictions for unseen data. In contrast to prominent boosting algorithms like [XGBoost](https://xgboost.readthedocs.io/en/latest/) or [LightGBM](https://lightgbm.readthedocs.io/en/latest/), the algorithm is aimed at multi-output problems. On the one hand, this includes [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification) problems, where individual data examples do not only correspond to a single class, but may be associated with several labels at the same time. Real-world applications of this problem domain include the assignment of keywords to text documents, the annotation of multimedia data, such as images, videos or audio recordings, as well as applications in the field of biology, chemistry and more. On the other hand, multi-output [regression](https://en.wikipedia.org/wiki/Regression_analysis) problems require predicting more than a single numerical output variable. To provide a versatile tool for different use cases, great emphasis is put on the *efficiency* of the implementation. Moreover, to ensure its *flexibility*, it is designed in a modular fashion and can therefore easily be adjusted to different requirements. This modular approach enables implementing different kind of rule learning algorithms. For example, this project does also provide a Separate-and-Conquer (SeCo) algorithm based on traditional rule learning techniques that are particularly well-suited for learning interpretable models. This document is intended for end users of our algorithms and developers who are interested in their implementation. In addition, the following links might be of interest: - For a detailed description of the methodology used by the algorithms, please refer to the {ref}`list of publications `.
-- The source code maintained by this project can be found in the [Github repository](https://github.com/mrapp-ke/MLRL-Boomer). +- The source code maintained by this project can be found in the [GitHub repository](https://github.com/mrapp-ke/MLRL-Boomer). - Issues with the software, feature requests, or questions to the developers should be posted via the project's [issue tracker](https://github.com/mrapp-ke/MLRL-Boomer/issues). ```{toctree} diff --git a/doc/misc/references.md b/doc/misc/references.md index 96fe68a978..403309ff47 100644 --- a/doc/misc/references.md +++ b/doc/misc/references.md @@ -115,7 +115,7 @@ The BOOMER algorithm was used as a baseline in the experimental study that is in ### Correlation-based Discovery of Disease Patterns for Syndromic Surveillance -In the following [paper](https://www.frontiersin.org/article/10.3389/fdata.2021.784159), a novel rule learning approach for discovering syndrome definitions for the early detection of infectious diseases is presented. The implementation of the proposed method, which is available at [Github](https://github.com/mrapp-ke/SyndromeLearner), is based on this project's source code. A preprint of the paper is available at [arxiv.org](https://arxiv.org/pdf/2110.09208.pdf). +In the following [paper](https://www.frontiersin.org/article/10.3389/fdata.2021.784159), a novel rule learning approach for discovering syndrome definitions for the early detection of infectious diseases is presented. The implementation of the proposed method, which is available at [GitHub](https://github.com/mrapp-ke/SyndromeLearner), is based on this project's source code. A preprint of the paper is available at [arxiv.org](https://arxiv.org/pdf/2110.09208.pdf). *Michael Rapp, Moritz Kulessa, Eneldo Loza Mencía and Johannes Fürnkranz. Correlation-based Discovery of Disease Patterns for Syndromic Surveillance. In: Frontiers in Big Data (4), 2021, Frontiers Media SA.* diff --git a/doc/quickstart/installation.md b/doc/quickstart/installation.md index bca875ed52..a39cefc6f2 100644 --- a/doc/quickstart/installation.md +++ b/doc/quickstart/installation.md @@ -5,7 +5,7 @@ All algorithms provided by this project are published on [PyPi](https://pypi.org/). As shown below, they can easily be installed via the Python package manager [pip](). Unless you intend to modify the algorithms' source code, in which case you should have a look at the section {ref}`compilation`, this is the recommended way for installing the software. ```{note} -Currently, the packages mentioned below are available for Linux (x86_64 and aarch64), MacOS (arm64) and Windows (AMD64). +Currently, the packages mentioned below are available for Linux (x86_64 and aarch64), macOS (arm64) and Windows (AMD64). ``` Examples of how to use the algorithms in your own Python programs can be found in the section {ref}`usage`. @@ -28,7 +28,7 @@ In addition to the BOOMER algorithm, this project does also provide a Separate-a pip install mlrl-seco ``` -In {ref}`this` section, we elaborate on the techiques utilized by the SeCo algorithm and discuss its parameters. +In {ref}`this` section, we elaborate on the techniques utilized by the SeCo algorithm and discuss its parameters. 
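After installing either package, standard pip commands can be used to verify which versions are present in the environment. This sketch assumes nothing beyond pip itself and the package names mentioned above:

```text
pip show mlrl-boomer
pip show mlrl-seco
```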
## Installing the Command Line API diff --git a/doc/quickstart/testbed.md b/doc/quickstart/testbed.md index a5261a08b7..55537765f9 100644 --- a/doc/quickstart/testbed.md +++ b/doc/quickstart/testbed.md @@ -129,7 +129,7 @@ Some algorithmic parameters, including the parameter `feature_binning`, come wit ## Bracket Notation -Each algorithmic parameter is identified by an unique name. Depending on the type of a parameter, it either accepts numbers as possible values or allows to specify a string that corresponds to a predefined set of possible values (boolean values are also represented as strings). +Each algorithmic parameter is identified by a unique name. Depending on the type of parameter, it either accepts numbers as possible values or allows to specify a string that corresponds to a predefined set of possible values (boolean values are also represented as strings). In addition to the specified value, some parameters allow to provide additional options as key-value pairs. These options must be provided by using the following bracket notation: diff --git a/doc/quickstart/usage.md b/doc/quickstart/usage.md index b9ce73ef86..63478327a6 100644 --- a/doc/quickstart/usage.md +++ b/doc/quickstart/usage.md @@ -41,7 +41,7 @@ An illustration of how the classification algorithms can be fit to exemplary tra The `fit` method accepts two inputs, `x` and `y`: - A two-dimensional feature matrix `x`, where each row corresponds to a training example and each column corresponds to a particular feature. -- An one- or two-dimensional binary feature matrix `y`, where each row corresponds to a training example and each column corresponds to a label. If an element in the matrix is unlike zero, it indicates that the respective label is relevant to an example. Elements that are equal to zero denote irrelevant labels. In multi-label classification, where each example may be associated with several labels, the label matrix is two-dimensional. However, the algorithms are also capable of dealing with traditional binary classification problems, where an one-dimensional vector of ground truth labels is provided to the learning algorithm. +- A one- or two-dimensional binary label matrix `y`, where each row corresponds to a training example and each column corresponds to a label. If an element in the matrix is non-zero, it indicates that the respective label is relevant to an example. Elements that are equal to zero denote irrelevant labels. In multi-label classification, where each example may be associated with several labels, the label matrix is two-dimensional. However, the algorithms are also capable of dealing with traditional binary classification problems, where a one-dimensional vector of ground truth labels is provided to the learning algorithm. Both, `x` and `y`, are expected to be [numpy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html) or equivalent [array-like](https://scikit-learn.org/stable/glossary.html#term-array-like) data types. @@ -64,7 +64,7 @@ The arguments that must be passed to the `fit` method are similar to the ones us ### Using Sparse Matrices -In addition to dense matrices like [numpy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html), the algorithms also support to use [scipy sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html). If certain cases, where the feature matrices consists mostly of zeros (or any other value), this can require significantly less amounts of memory and may speed up training.
Sparse matrices can be provided to the `fit` method via the arguments `x` and `y` just as before. Optionally, the value that should be used for sparse elements in the feature matrix `x` can be specified via the keyword argument `sparse_feature_value`: +In addition to dense matrices like [numpy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html), the algorithms also support the use of [scipy sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html). In certain cases, where the feature matrices consist mostly of zeros (or any other value), this can require significantly less memory and may speed up training. Sparse matrices can be provided to the `fit` method via the arguments `x` and `y` just as before. Optionally, the value that should be used for sparse elements in the feature matrix `x` can be specified via the keyword argument `sparse_feature_value`: ```python clf.fit(x, y, sparse_feature_value = 0.0) ``` diff --git a/doc/user_guide/boosting/parameters.md b/doc/user_guide/boosting/parameters.md index 1ff33ba229..cddac6cbec 100644 --- a/doc/user_guide/boosting/parameters.md +++ b/doc/user_guide/boosting/parameters.md @@ -239,7 +239,7 @@ The duration in seconds after which the induction of rules should be canceled. T ## Pruning and Post-Optimization -The following parameters provide fine-grain control over the techniques that should be used for pruning rules or optimizing them after they have been learned. These techniques can help to prevent overfitting and may be helpful if one strives for simple models without any superflous rules. +The following parameters provide fine-grained control over the techniques that should be used for pruning rules or optimizing them after they have been learned. These techniques can help to prevent overfitting and may be helpful if one strives for simple models without any superfluous rules. ### `holdout` @@ -383,7 +383,7 @@ The following parameters provide fine-grain control over the techniques that sho ## Sampling Techniques -The following parameters allow to employ various sampling techniques that may help reducing computational costs when dealing with large datasets. Moreover, they may be used to ensure that a diverse set of rules is learned, which mcanay lead to better generalization when dealing with large models. +The following parameters allow to employ various sampling techniques that may help reduce computational costs when dealing with large datasets. Moreover, they may be used to ensure that a diverse set of rules is learned, which may lead to better generalization when dealing with large models. ### `random_state` diff --git a/doc/user_guide/testbed/arguments.md b/doc/user_guide/testbed/arguments.md index d750efe37f..b35082def1 100644 --- a/doc/user_guide/testbed/arguments.md +++ b/doc/user_guide/testbed/arguments.md @@ -38,7 +38,7 @@ The following optional arguments allow additional control over the loading mecha - `-r` or `--runnable` (Default value = `Runnable`) The name of the class extending {py:class}`mlrl.testbed.runnables.Runnable` that resides within the module or source file specified via the argument ``. -The arguments given above can be used to integrate any scikit-learn compatible machine learning algorithm with the comman line API. You can learn about this {ref}`here`. +The arguments given above can be used to integrate any scikit-learn compatible machine learning algorithm with the command line API. You can learn about this {ref}`here`.
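Tying together the `fit` discussion from `doc/quickstart/usage.md` above, a minimal end-to-end sketch of training on a sparse feature matrix could look as follows. The import of `BoomerClassifier` from `mlrl.boosting` is an assumption that may need to be adapted to the installed version; the `sparse_feature_value` keyword argument is taken from the snippet above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical import; the exact class name depends on the installed version
from mlrl.boosting import BoomerClassifier

# Sparse feature matrix: each row is a training example, each column a feature
x = csr_matrix(np.array([[0.0, 1.2, 0.0],
                         [3.4, 0.0, 0.0],
                         [0.0, 0.0, 5.6]]))

# Two-dimensional binary label matrix: each row is an example, each column a label
y = np.array([[1, 0],
              [0, 1],
              [1, 1]])

clf = BoomerClassifier()
clf.fit(x, y, sparse_feature_value=0.0)  # 0.0 is treated as the sparse element
predictions = clf.predict(x)
```

Because the estimators are scikit-learn compatible, the fitted model supports the usual `predict` idiom shown in the last line.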
### Dataset @@ -64,7 +64,7 @@ The command line API can conduct experiments for classification and regression p > A more detailed description of the following arguments can be found {ref}`here`. -One of the most important capabilities of the command line API is to train machine learning models and obtain an unbiased estimate of their predictive performance. For this purpose, the available data must be split into training and test data. The former is used to train models and the latter is used for evaluation afterwards, whereas the evaluation metrics depend on the type of predictions provided by a model. +One of the most important capabilities of the command line API is to train machine learning models and obtain an unbiased estimate of their predictive performance. For this purpose, the available data must be split into training and test data. The former is used to train models and the latter is used for evaluation afterward, where the evaluation metrics depend on the type of predictions provided by a model. ### Strategies for Data Splitting @@ -74,7 +74,7 @@ One of the most important capabilities of the command line API is to train machi - `test_size` (Default value = `0.33`) The fraction of the available data to be included in the test set, if the training and test set are not provided as separate files. Must be in (0, 1). - - `cross-validation` A cross validation is performed. Given that `dataset-name` is provided as the value of the argument `--dataset`, the data for individual folds must be stored in files named `dataset-name_fold-1`, `dataset-name_fold-2`, etc.. If no such files are available, the program searches for a file with the name `dataset-name.arff` and splits it into training and test data for the individual folds automatically. The following options may be specified using the {ref}`bracket notation`: + - `cross-validation` A cross validation is performed. Given that `dataset-name` is provided as the value of the argument `--dataset`, the data for individual folds must be stored in files named `dataset-name_fold-1`, `dataset-name_fold-2`, etc. If no such files are available, the program searches for a file with the name `dataset-name.arff` and splits it into training and test data for the individual folds automatically. The following options may be specified using the {ref}`bracket notation`: - `num_folds` (Default value = `10`) The total number of cross validation folds to be performed. Must be at least 2. - `current_fold` (Default value = `0`) The cross validation fold to be performed. Must be in \[1, `num_folds`\] or 0, if all folds should be performed. @@ -92,7 +92,7 @@ One of the most important capabilities of the command line API is to train machi - `scores` The learner is instructed to predict scores. In this case, ranking measures are used for evaluation. - `probabilities` The learner is instructed to predict probability estimates. In this case, ranking measures are used for evaluation. - - `binary` The learner is instructed to predict binary labels. In this case, bipartition evaluation measures are used for evaluation. + - `binary` The learner is instructed to predict binary labels. In this case, bi-partition evaluation measures are used for evaluation. ### Incremental Evaluation @@ -332,7 +332,7 @@ To provide valuable insights into the models learned by an algorithm, the predic - `percentage` (Default value = `true`) `true`, if the characteristics should be given as a percentage, if possible, `false` otherwise.
- `outputs` (Default value = `true`) `true`, if the number of outputs should be stored, `false` otherwise. - `output_density` (Default value = `true`) `true`, if the density of the ground truth matrix should be stored, `false` otherwise. - - `output_sparsity` (Default value = `true`) `true`, if the sparsity of the groun dtruth matrix should be stored, `false` otherwise. + - `output_sparsity` (Default value = `true`) `true`, if the sparsity of the ground truth matrix should be stored, `false` otherwise. - `label_imbalance_ratio` (Default value = `true`, *classification only*) `true`, if the label imbalance ratio should be stored, `false` otherwise. - `label_cardinality` (Default value = `true`, *classification only*) `true`, if the average label cardinality should be stored, `false` otherwise. - `distinct_label_vectors` (Default value = `true`, *classification only*) `true`, if the number of distinct label vectors should be stored, `false` otherwise. @@ -451,7 +451,7 @@ To provide valuable insights into the models learned by an algorithm, the predic ## Setting Algorithmic Parameters -In addition to the command line arguments that are discussed above, it is often desirable to not rely on the default configuration of the BOOMER algorithm in an experiment, but to use a custom configuration. For this purpose, all of the algorithmic parameters that are discussed in the section {ref}`parameters` may be set by providing corresponding arguments to the command line API. +In addition to the command line arguments that are discussed above, it is often desirable to not rely on the default configuration of the BOOMER algorithm in an experiment, but to use a custom configuration. For this purpose, all the algorithmic parameters that are discussed in the section {ref}`parameters` may be set by providing corresponding arguments to the command line API. In accordance with the syntax that is typically used by command line programs, the parameter names must be given according to the following syntax that slightly differs from the names that are used by the programmatic Python API: diff --git a/doc/user_guide/testbed/evaluation.md b/doc/user_guide/testbed/evaluation.md index 9f06c6d807..a2485a6131 100644 --- a/doc/user_guide/testbed/evaluation.md +++ b/doc/user_guide/testbed/evaluation.md @@ -2,7 +2,7 @@ # Performance Evaluation -A major task in machine learning is to assess the predictive performance of different learning approaches, compare them to each other, and decide for the best approach suitable for a particular problem. The command line API provided by this project helps with these tasks by implementing several strategies for splitting available data into training and test sets, which is crucial to obtain unbiased estimates of a method's performance. In accordance with established practices, a machine learning model that is trained on a test set is afterwards applied to the corresponding test set to obtain predictions for data that was not included in the training process. The metrics that are used for evaluating the quality of these predictions are automatically chosen, depending on the type of predictions (binary predictions, probability estimates, etc.) provided by the tested method. +A major task in machine learning is to assess the predictive performance of different learning approaches, compare them to each other, and decide on the best approach suitable for a particular problem.
The command line API provided by this project helps with these tasks by implementing several strategies for splitting available data into training and test sets, which is crucial to obtain unbiased estimates of a method's performance. In accordance with established practices, a machine learning model that is trained on a training set is afterward applied to the corresponding test set to obtain predictions for data that was not included in the training process. The metrics that are used for evaluating the quality of these predictions are automatically chosen, depending on the type of predictions (binary predictions, probability estimates, etc.) provided by the tested method. ## Strategies for Data Splitting @@ -12,7 +12,7 @@ Several strategies for splitting the available data into distinct training and t ### Train-Test-Splits -The simplest and computationally least demanding strategy for obtaining training and tests is to randomly split the available data into two, mutually exclusive, parts. This strategy, which is used by default, if not specified otherwise, can be used by providing the argument `--data-split train-test` to the command line API: +The simplest and computationally least demanding strategy for obtaining training and test sets is to randomly split the available data into two mutually exclusive parts. This strategy, which is used by default, if not specified otherwise, can be used by providing the argument `--data-split train-test` to the command line API: ````{tab} BOOMER ```text @@ -60,7 +60,7 @@ This command instructs the command line API to include 75% of the available data ### Cross Validation -A more elaborate strategy for splitting data into training and test sets, which results in more realistic performance estimates, but also entails greater computational costs, is referred to as [cross validation]() (CV). The basic idea is to split the available data into several, equally-sized, parts. Afterwards, several machine learning models are trained and evaluated on different portions of the data using the same learning method. Each of these parts are used for testing exactly once, whereas the remaining ones make up the training set. The performance estimates that are obtained for each of these subsequent runs, referred to as *folds*, are finally averaged to obtain a single score and corresponding [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation). The command line API can be instructed to perform a cross validation using the argument `--data-split cv`: +A more elaborate strategy for splitting data into training and test sets, which results in more realistic performance estimates, but also entails greater computational costs, is referred to as [cross validation]() (CV). The basic idea is to split the available data into several equally-sized parts. Afterward, several machine learning models are trained and evaluated on different portions of the data using the same learning method. Each of these parts is used for testing exactly once, whereas the remaining ones make up the training set. The performance estimates that are obtained for each of these subsequent runs, referred to as *folds*, are finally averaged to obtain a single score and corresponding [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation).
The command line API can be instructed to perform a cross validation using the argument `--data-split cv`: ````{tab} BOOMER ```text @@ -258,7 +258,7 @@ In a multi-label setting, the quality of binary predictions is assessed in terms ## Incremental Evaluation -When evaluating the predictive performance of an [ensemble method](https://en.wikipedia.org/wiki/Ensemble_learning), i.e., models that consist of several weak predictors, also referred to as *ensemble members*, the command line API supports to evaluate these models incrementally. In particular, rule-based machine learning algorithms like the ones implemented by this project are often considered as ensemble methods, where each rule in a model can be viewed as a weak predictor. Adding more rules to a model typically results in better predictive performance. However, adding too many rules may result in overfitting the training data and therefore achieving subpar performance on the test data. For analyzing such behavior, the arugment `--incremental-evaluation true` may be passed to the command line API: +When evaluating the predictive performance of an [ensemble method](https://en.wikipedia.org/wiki/Ensemble_learning), i.e., models that consist of several weak predictors, also referred to as *ensemble members*, the command line API supports evaluating these models incrementally. In particular, rule-based machine learning algorithms like the ones implemented by this project are often considered as ensemble methods, where each rule in a model can be viewed as a weak predictor. Adding more rules to a model typically results in better predictive performance. However, adding too many rules may result in overfitting the training data and therefore achieving subpar performance on the test data. For analyzing such behavior, the argument `--incremental-evaluation true` may be passed to the command line API: ````{tab} BOOMER ```text @@ -282,9 +282,9 @@ When using the above command, the rule-based model that is learned by the BOOMER - `min_size` specifies the minimum number of ensemble members that must be included in a model for the first evaluation to be performed. - `max_size` specifies the maximum number of ensemble members to be evaluated. -- `step_size` allows to to specify after how many additional ensemble members the evaluation should be repeated. +- `step_size` allows to specify after how many additional ensemble members the evaluation should be repeated. -For example, the following command may be used for the incremental evaluation of a BOOMER model that consists of up to 1000 rules. The model is evaluated for the first time after 200 rules have been added. Subsequent evaluations are perfomed when the model comprises 400, 600, 800, and 1000 rules. +For example, the following command may be used for the incremental evaluation of a BOOMER model that consists of up to 1000 rules. The model is evaluated for the first time after 200 rules have been added. Subsequent evaluations are performed when the model comprises 400, 600, 800, and 1000 rules.
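In terms of the options listed above, such an invocation could take the following shape. The `mlrl-testbed mlrl.boosting` prefix and the `--data-dir` argument are assumptions made for illustration only, whereas `--incremental-evaluation` and the `min_size`, `max_size`, and `step_size` options are the documented ones, passed using the bracket notation:

```text
mlrl-testbed mlrl.boosting --data-dir /path/to/datasets --dataset dataset-name \
    --incremental-evaluation true{min_size=200,max_size=1000,step_size=200}
```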
````{tab} BOOMER ```text diff --git a/doc/user_guide/testbed/experimental_results.md b/doc/user_guide/testbed/experimental_results.md index b5b7bb5443..65314c7ae8 100644 --- a/doc/user_guide/testbed/experimental_results.md +++ b/doc/user_guide/testbed/experimental_results.md @@ -352,7 +352,7 @@ When using a {ref}`cross validation`, several models are train - `label_vectors_fold-4.csv` - `label_vectors_fold-5.csv` -The above commands output each label vector present in a dataset, as well as their frequency, i.e., the number of examples they are associated with. Moreover, each label vector is assigned an unique index. By default, feature vectors are given in the following format, where the n-th element indicates whether the n-th label is relevant (1) or not (0): +The above commands output each label vector present in a dataset, as well as its frequency, i.e., the number of examples it is associated with. Moreover, each label vector is assigned a unique index. By default, label vectors are given in the following format, where the n-th element indicates whether the n-th label is relevant (1) or not (0): ```text [0 0 1 1 1 0] ``` @@ -424,7 +424,7 @@ When using a {ref}`cross validation`, several models are train The statistics captured by the previous commands include the following: -- **Statistics about conditions:** Information about the number of rules in a model, as well as the different types of conditons contained in their bodies. +- **Statistics about conditions:** Information about the number of rules in a model, as well as the different types of conditions contained in their bodies. - **Statistics about predictions:** The distribution of positive and negative predictions provided by the rules in a model. - **Statistics per local rule:** The minimum, average, and maximum number of conditions and predictions the rules in a model entail in their bodies and heads, respectively. @@ -452,7 +452,7 @@ It is considered one of the advantages of rule-based machine learning models tha ``` ```` -Alternatively, by using the argument `--store-rules`, a textual representation of models can be written into a text file in the specifed output directory: +Alternatively, by using the argument `--store-rules`, a textual representation of models can be written into a text file in the specified output directory: ````{tab} BOOMER ```text @@ -490,7 +490,7 @@ A {ref}`cross validation` results in multiple output files, ea - `rules_fold-4.csv` - `rules_fold-5.csv` -Each rule in a model consists of a *body* and a *head* (we use the notation `body => head`). The body specifies to which examples a rule applies. It consist of one or several conditions that compare the feature values of given examples to thresholds derived from the training data. The head of a rule consists of the predictions it provides for individual outputs. The predictions provided by a head may be restricted to a subset of the available output or even a single one. +Each rule in a model consists of a *body* and a *head* (we use the notation `body => head`). The body specifies to which examples a rule applies. It consists of one or several conditions that compare the feature values of given examples to thresholds derived from the training data. The head of a rule consists of the predictions it provides for individual outputs. The predictions provided by a head may be restricted to a subset of the available outputs or even a single one. If not configured otherwise, the first rule in a model is a *default rule*.
Unlike the other rules, it does not contain any conditions in its body and therefore applies to any given example. As shown in the following example, it always provides predictions for all available labels: @@ -500,7 +500,7 @@ If not configured otherwise, the first rule in a model is a *default rule*. Unli In regression models, the predictions of individual rules sum up to the regression scores predicted by the overall model. In classification models, a rule's prediction for a particular label is positive, if most examples it covers are associated with the respective label, otherwise it is negative. The ratio between the number of examples that are associated with a label, and those that are not, affects the absolute size of the default prediction. Large values indicate a stong preference towards predicting a particular label as relevant or irrelevant, depending on the sign. -The remaining rules only apply to examples that satisfy all of the conditions in their bodies. For example, the body of the following rule consists of two conditions: +The remaining rules only apply to examples that satisfy all the conditions in their bodies. For example, the body of the following rule consists of two conditions: ```text {feature1 <= 1.53 & feature2 > 7.935} => (output1 = -0.31) diff --git a/doc/user_guide/testbed/parameter_persistence.md b/doc/user_guide/testbed/parameter_persistence.md index dea8c4499c..d0d8c1a8eb 100644 --- a/doc/user_guide/testbed/parameter_persistence.md +++ b/doc/user_guide/testbed/parameter_persistence.md @@ -4,7 +4,7 @@ To remember the parameters that have been used for training a model, it might be useful to save them to disk. Similar to {ref}`saving models`, keeping the resulting files allows to load a previously used configuration and reuse it at a later point in time. -On the one hand, this requires to specify a directory where parameter settings should be saved via the command line argument `--parameter-dir`. On the other hand, the argument `--store-parameters true` instructs the program to save custom parameters that are set via command line argments (see {ref}`setting-algorithmic-parameters`). For example, the following command sets a custom value for a parameter, which is stored in an output file: +On the one hand, this requires to specify a directory where parameter settings should be saved via the command line argument `--parameter-dir`. On the other hand, the argument `--store-parameters true` instructs the program to save custom parameters that are set via command line arguments (see {ref}`setting-algorithmic-parameters`). For example, the following command sets a custom value for a parameter, which is stored in an output file: ````{tab} BOOMER ```text