diff --git a/DESCRIPTION b/DESCRIPTION index 336c8172..55d30bd0 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -54,30 +54,30 @@ Language: en-US NeedsCompilation: no Roxygen: list(markdown = TRUE) RoxygenNote: 7.2.1 -Collate: - 'assertions.R' - 'AutoFSelector.R' +Collate: 'ArchiveFSelect.R' - 'ObjectiveFSelect.R' - 'helper.R' + 'AutoFSelector.R' + 'FSelectInstanceSingleCrit.R' + 'FSelectInstanceMultiCrit.R' 'mlr_fselectors.R' - 'auto_fselector.R' - 'extract_inner_fselect_archives.R' - 'extract_inner_fselect_results.R' - 'fselect.R' - 'fselect_nested.R' 'FSelector.R' - 'FSelectorFromOptimizer.R' + 'FSelectorDesignPoints.R' 'FSelectorExhaustiveSearch.R' + 'FSelectorFromOptimizer.R' + 'FSelectorGeneticSearch.R' 'FSelectorRFE.R' 'FSelectorRandomSearch.R' 'FSelectorSequential.R' 'FSelectorShadowVariableSearch.R' - 'FSelectorDesignPoints.R' - 'FSelectorGeneticSearch.R' - 'FSelectInstanceMultiCrit.R' - 'FSelectInstanceSingleCrit.R' + 'ObjectiveFSelect.R' + 'assertions.R' + 'auto_fselector.R' + 'bibentries.R' + 'extract_inner_fselect_archives.R' + 'extract_inner_fselect_results.R' + 'fselect.R' + 'fselect_nested.R' + 'helper.R' 'reexports.R' 'sugar.R' - 'bibentries.R' 'zzz.R' diff --git a/NAMESPACE b/NAMESPACE index 0ba34852..7c62f04d 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -26,6 +26,7 @@ export(extract_inner_fselect_results) export(fs) export(fselect) export(fselect_nested) +export(fsi) export(fss) export(mlr_fselectors) export(mlr_terminators) diff --git a/NEWS.md b/NEWS.md index c5a43846..f73a5f00 100644 --- a/NEWS.md +++ b/NEWS.md @@ -4,6 +4,8 @@ * refactor: The `AutoFSelector` stores the instance and benchmark result if `store_models = TRUE`. * refactor: The `AutoFSelector` stores the instance if `store_benchmark_result = TRUE`. * feat: Add missing parameters from `AutoFSelector` to `auto_fselect()`. +* feat: Add `fsi()` function to create a `FSelectInstanceSingleCrit` or `FSelectInstanceMultiCrit`. +* refactor: Remove `unnest` option from `as.data.table.ArchiveFSelect()` function. # mlr3fselect 0.7.2 diff --git a/R/ArchiveFSelect.R b/R/ArchiveFSelect.R index 860a72b9..5157b91c 100644 --- a/R/ArchiveFSelect.R +++ b/R/ArchiveFSelect.R @@ -1,8 +1,16 @@ -#' @title Logging Object for Evaluated Feature Sets +#' @title Class for Logging Evaluated Feature Sets #' #' @description -#' Container around a [data.table::data.table()] which stores all evaluated -#' feature sets and performance scores. +#' The [ArchiveFSelect] stores all evaluated feature sets and performance scores. +#' +#' @details +#' The [ArchiveFSelect] is a container around a [data.table::data.table()]. +#' Each row corresponds to a single evaluation of a feature set. +#' See the section on Data Structure for more information. +#' The archive stores additionally a [mlr3::BenchmarkResult] (`$benchmark_result`) that records the resampling experiments. +#' Each experiment corresponds to to a single evaluation of a feature set. +#' The table (`$data`) and the benchmark result (`$benchmark_result`) are linked by the `uhash` column. +#' If the archive is passed to `as.data.table()`, both are joined automatically. #' #' @section Data structure: #' @@ -11,52 +19,33 @@ #' * One column for each feature of the task (`$search_space`). #' * One column for each performance measure (`$codomain`). #' * `runtime_learners` (`numeric(1)`)\cr -#' Sum of training and predict times logged in learners per -#' [mlr3::ResampleResult] / evaluation. This does not include potential -#' overhead time. 
+#' Sum of training and predict times logged in learners per [mlr3::ResampleResult] / evaluation. +#' This does not include potential overhead time. #' * `timestamp` (`POSIXct`)\cr #' Time stamp when the evaluation was logged into the archive. #' * `batch_nr` (`integer(1)`)\cr -#' Feature sets are evaluated in batches. Each batch has a unique batch -#' number. +#' Feature sets are evaluated in batches. Each batch has a unique batch number. #' * `uhash` (`character(1)`)\cr -#' Connects each feature set to the resampling experiment -#' stored in the [mlr3::BenchmarkResult]. -#' -#' Each row corresponds to a single evaluation of a feature set. -#' -#' The archive stores additionally a [mlr3::BenchmarkResult] -#' (`$benchmark_result`) that records the resampling experiments. Each -#' experiment corresponds to to a single evaluation of a feature set. The table -#' (`$data`) and the benchmark result (`$benchmark_result`) are linked by the -#' `uhash` column. If the results are viewed with `as.data.table()`, both are -#' joined automatically. +#' Connects each feature set to the resampling experiment stored in the [mlr3::BenchmarkResult]. #' #' @section Analysis: -#' -#' For analyzing the feature selection results, it is recommended to pass the archive to -#' `as.data.table()`. The returned data table is joined with the benchmark result -#' which adds the [mlr3::ResampleResult] for each feature set. +#' For analyzing the feature selection results, it is recommended to pass the archive to `as.data.table()`. +#' The returned data table is joined with the benchmark result which adds the [mlr3::ResampleResult] for each feature set. #' #' The archive provides various getters (e.g. `$learners()`) to ease the access. -#' All getters extract by position (`i`) or unique hash (`uhash`). For a -#' complete list of all getters see the methods section. +#' All getters extract by position (`i`) or unique hash (`uhash`). +#' For a complete list of all getters see the methods section. #' -#' The benchmark result (`$benchmark_result`) allows to score the feature sets -#' again on a different measure. Alternatively, measures can be supplied to -#' `as.data.table()`. +#' The benchmark result (`$benchmark_result`) allows to score the feature sets again on a different measure. +#' Alternatively, measures can be supplied to `as.data.table()`. #' #' @section S3 Methods: -#' * `as.data.table.ArchiveFSelect(x, unnest = NULL, exclude_columns = "uhash", measures = NULL)`\cr +#' * `as.data.table.ArchiveFSelect(x, exclude_columns = "uhash", measures = NULL)`\cr #' Returns a tabular view of all evaluated feature sets.\cr #' [ArchiveFSelect] -> [data.table::data.table()]\cr #' * `x` ([ArchiveFSelect]) -#' * `unnest` (`character()`)\cr -#' Transforms list columns to separate columns. Set to `NULL` if no column -#' should be unnested. #' * `exclude_columns` (`character()`)\cr -#' Exclude columns from table. Set to `NULL` if no column should be -#' excluded. +#' Exclude columns from table. Set to `NULL` if no column should be excluded. #' * `measures` (list of [mlr3::Measure])\cr #' Score feature sets on additional measures. #' @@ -67,14 +56,33 @@ ArchiveFSelect = R6Class("ArchiveFSelect", public = list( #' @field benchmark_result ([mlr3::BenchmarkResult])\cr - #' Stores benchmark result. + #' Benchmark result. benchmark_result = NULL, #' @description - #' Retrieve [mlr3::Learner] of the i-th evaluation, by position - #' or by unique hash `uhash`. `i` and `uhash` are mutually exclusive. - #' Learner does not contain a model. 
Use `$learners()` to get learners with - #' models. + #' Creates a new instance of this [R6][R6::R6Class] class. + #' + #' @param search_space ([paradox::ParamSet])\cr + #' Search space. + #' Internally created from provided [mlr3::Task] by instance. + #' + #' @param codomain ([bbotk::Codomain])\cr + #' Specifies codomain of objective function i.e. a set of performance measures. + #' Internally created from provided [mlr3::Measure]s by instance. + #' + #' @param check_values (`logical(1)`)\cr + #' If `TRUE` (default), hyperparameter configurations are check for validity. + initialize = function(search_space, codomain, check_values = TRUE) { + super$initialize(search_space, codomain, check_values) + + # initialize empty benchmark result + self$benchmark_result = BenchmarkResult$new() + }, + + #' @description + #' Retrieve [mlr3::Learner] of the i-th evaluation, by position or by unique hash `uhash`. + #' `i` and `uhash` are mutually exclusive. + #' Learner does not contain a model. Use `$learners()` to get learners with models. #' #' @param i (`integer(1)`)\cr #' The iteration value to filter for. @@ -138,20 +146,16 @@ ArchiveFSelect = R6Class("ArchiveFSelect", ) #' @export -as.data.table.ArchiveFSelect = function(x, ..., unnest = NULL, exclude_columns = "uhash", measures = NULL) { +as.data.table.ArchiveFSelect = function(x, ..., exclude_columns = "uhash", measures = NULL) { if (nrow(x$data) == 0) return(data.table()) # always ignore x_domain column exclude_columns = c("x_domain", exclude_columns) # default value for exclude_columns might be not present in archive - if (is.null(x$benchmark_result)) exclude_columns = exclude_columns[exclude_columns %nin% "uhash"] - - assert_subset(unnest, names(x$data)) + if (!x$benchmark_result$n_resample_results) exclude_columns = exclude_columns[exclude_columns %nin% "uhash"] cols_y_extra = NULL + tab = copy(x$data) - # unnest data - tab = unnest(copy(x$data), unnest, prefix = "{col}_") - - if (!is.null(x$benchmark_result)) { + if (x$benchmark_result$n_resample_results) { # add extra measures if (!is.null(measures)) { measures = assert_measures(as_measures(measures), learner = x$learners(1)[[1]], task = x$resample_result(1)$task) diff --git a/R/AutoFSelector.R b/R/AutoFSelector.R index 63187d0e..33393f99 100644 --- a/R/AutoFSelector.R +++ b/R/AutoFSelector.R @@ -1,4 +1,4 @@ -#' @title AutoFSelector +#' @title Class for Automatic Feature Selection #' #' @description #' The [AutoFSelector] wraps a [mlr3::Learner] and augments it with an automatic feature selection. 
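A brief sketch of the revised `as.data.table()` interface described above, after the removal of the `unnest` argument; it is not part of the patch and the additional measure is an illustrative choice.

library(mlr3)
library(mlr3fselect)

instance = fselect(
  method = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4)

# the archive and the benchmark result are joined automatically;
# score the archived feature sets on an additional measure
as.data.table(instance$archive, measures = msrs("classif.acc"))

# keep the uhash column instead of excluding it
as.data.table(instance$archive, exclude_columns = NULL)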
@@ -34,12 +34,14 @@ #' #' @export #' @examples -#' # Automafsic Feafsure Selection +#' # Automatic Feature Selection +#' \donttest{ #' +#' # split to train and external set #' task = tsk("penguins") -#' train_set = sample(task$nrow, 0.8 * task$nrow) -#' test_set = setdiff(seq_len(task$nrow), train_set) +#' split = partition(task, ratio = 0.8) #' +#' # create auto fselector #' afs = auto_fselector( #' method = fs("random_search"), #' learner = lrn("classif.rpart"), @@ -47,13 +49,13 @@ #' measure = msr("classif.ce"), #' term_evals = 4) #' -#' # optimize feafsure subset and fit final model -#' afs$train(task, row_ids = train_set) +#' # optimize feature subset and fit final model +#' afs$train(task, row_ids = split$train) #' #' # predict with final model -#' afs$predict(task, row_ids = test_set) +#' afs$predict(task, row_ids = split$test) #' -#' # show fselect result +#' # show result #' afs$fselect_result #' #' # model slot contains trained learner and fselect instance @@ -84,8 +86,9 @@ #' # performance scores estimated on the outer resampling #' rr$score() #' -#' # unbiased performance of the final model trained on the full dafsa set +#' # unbiased performance of the final model trained on the full data set #' rr$aggregate() +#' } AutoFSelector = R6Class("AutoFSelector", inherit = Learner, public = list( diff --git a/R/FSelectInstanceMultiCrit.R b/R/FSelectInstanceMultiCrit.R index 405968a3..0a115165 100644 --- a/R/FSelectInstanceMultiCrit.R +++ b/R/FSelectInstanceMultiCrit.R @@ -1,23 +1,14 @@ -#' @title Multi Criterion Feature Selection Instance +#' @title Class for Multi Criteria Feature Selection #' -#' @description -#' Specifies a general feature selection scenario, including objective function -#' and archive for feature selection algorithms to act upon. This class stores -#' an [ObjectiveFSelect] object that encodes the black box objective function -#' which an [FSelector] has to optimize. It allows the basic operations of -#' querying the objective at feature subsets (`$eval_batch()`), storing the -#' evaluations in the internal [bbotk::Archive] and accessing the final result -#' (`$result`). +#' @include FSelectInstanceSingleCrit.R ArchiveFSelect.R #' -#' Evaluations of feature subsets are performed in batches by calling -#' [mlr3::benchmark()] internally. Before a batch is evaluated, the -#' [bbotk::Terminator] is queried for the remaining budget. If the available -#' budget is exhausted, an exception is raised, and no further evaluations can -#' be performed from this point on. +#' @description +#' The [FSelectInstanceMultiCrit] specifies a feature selection problem for [FSelectors][FSelector]. +#' The function [fsi()] creates a [FSelectInstanceMultiCrit] and the function [fselect()] creates an instance internally. #' -#' The [FSelector] is also supposed to store its final result, consisting -#' of the selected feature subsets and associated estimated performance values, by -#' calling the method `instance$assign_result()`. 
+#' @inherit FSelectInstanceSingleCrit details +#' @inheritSection FSelectInstanceSingleCrit Resources +#' @inheritSection ArchiveFSelect Analysis #' #' @template param_task #' @template param_learner @@ -31,63 +22,66 @@ #' #' @export #' @examples -#' library(mlr3) -#' library(data.table) -#' -#' # Objects required to define the performance evaluator -#' task = tsk("iris") -#' measures = msrs(c("classif.ce", "classif.acc")) -#' learner = lrn("classif.rpart") -#' resampling = rsmp("cv") -#' terminator = trm("evals", n_evals = 8) +#' # Feature selection on Palmer Penguins data set +#' task = tsk("penguins") #' -#' inst = FSelectInstanceMultiCrit$new( +#' # Construct feature selection instance +#' instance = fsi( #' task = task, -#' learner = learner, -#' resampling = resampling, -#' measures = measures, -#' terminator = terminator +#' learner = lrn("classif.rpart"), +#' resampling = rsmp("cv", folds = 3), +#' measures = msrs(c("classif.ce", "time_train")), +#' terminator = trm("evals", n_evals = 4) #' ) #' -#' # Try some feature subsets -#' xdt = data.table( -#' Petal.Length = c(TRUE, FALSE), -#' Petal.Width = c(FALSE, TRUE), -#' Sepal.Length = c(TRUE, FALSE), -#' Sepal.Width = c(FALSE, TRUE) -#' ) +#' # Choose optimization algorithm +#' fselector = fs("random_search", batch_size = 2) +#' +#' # Run feature selection +#' fselector$optimize(instance) #' -#' inst$eval_batch(xdt) +#' # Optimal feature sets +#' instance$result_feature_set #' -#' # Get archive data -#' as.data.table(inst$archive) +#' # Inspect all evaluated sets +#' as.data.table(instance$archive) FSelectInstanceMultiCrit = R6Class("FSelectInstanceMultiCrit", inherit = OptimInstanceMultiCrit, public = list( #' @description #' Creates a new instance of this [R6][R6::R6Class] class. - initialize = function(task, learner, resampling, measures, terminator, - store_models = FALSE, check_values = TRUE, store_benchmark_result = TRUE) { - obj = ObjectiveFSelect$new(task = task, learner = learner, - resampling = resampling, measures = measures, + initialize = function(task, learner, resampling, measures, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE) { + # initialized specialized fselect archive and objective + archive = ArchiveFSelect$new( + search_space = task_to_domain(assert_task(task)), + codomain = measures_to_codomain(assert_measures(measures)), + check_values = check_values) + + objective = ObjectiveFSelect$new( + task = task, + learner = learner, + resampling = resampling, + measures = measures, store_benchmark_result = store_benchmark_result, - store_models = store_models, check_values = check_values) - super$initialize(obj, obj$domain, terminator) + store_models = store_models, + check_values = check_values, + archive = archive) - self$archive = ArchiveFSelect$new(search_space = self$objective$domain, codomain = self$objective$codomain, - check_values = check_values) - self$objective$archive = self$archive + super$initialize(objective, objective$domain, terminator) + + # super class of instance initializes default archive, overwrite with fselect archive + self$archive = archive private$.objective_function = objective_function }, #' @description - #' The [FSelector] object writes the best found feature subsets - #' and estimated performance values here. For internal use. + #' The [FSelector] object writes the best found feature subsets and estimated performance values here. + #' For internal use. #' #' @param ydt (`data.table::data.table()`)\cr - #' Optimal outcomes, e.g. the Pareto front. 
+ #' Optimal outcomes, e.g. the Pareto front. assign_result = function(xdt, ydt) { # Add feature names to result for easy task subsetting features = map(transpose_list(xdt), function(x) { @@ -103,7 +97,7 @@ FSelectInstanceMultiCrit = R6Class("FSelectInstanceMultiCrit", ), active = list( - #' @field result_feature_set (`list()` of `character()`)\cr + #' @field result_feature_set (list of `character()`)\cr #' Feature sets for task subsetting. result_feature_set = function() { map(self$result$features, function(x) { diff --git a/R/FSelectInstanceSingleCrit.R b/R/FSelectInstanceSingleCrit.R index 16ec368d..c1a2060e 100644 --- a/R/FSelectInstanceSingleCrit.R +++ b/R/FSelectInstanceSingleCrit.R @@ -1,23 +1,26 @@ -#' @title Single Criterion Feature Selection Instance +#' @title Class for Single Criterion Feature Selection +#' +#' @include ArchiveFSelect.R +#' +#' @description +#' The [FSelectInstanceSingleCrit] specifies a feature selection problem for [FSelectors][FSelector]. +#' The function [fsi()] creates a [FSelectInstanceSingleCrit] and the function [fselect()] creates an instance internally. #' #' @description -#' Specifies a general feature selection scenario, including objective function -#' and archive for feature selection algorithms to act upon. This class stores -#' an [ObjectiveFSelect] object that encodes the black box objective function -#' which an [FSelector] has to optimize. It allows the basic operations of -#' querying the objective at feature subsets (`$eval_batch()`), storing the -#' evaluations in the internal [bbotk::Archive] and accessing the final result -#' (`$result`). +#' The instance contains an [ObjectiveFSelect] object that encodes the black box objective function a [FSelector] has to optimize. +#' The instance allows the basic operations of querying the objective at design points (`$eval_batch()`). +#' This operation is usually done by the [FSelector]. +#' Evaluations of feature subsets are performed in batches by calling [mlr3::benchmark()] internally. +#' The evaluated feature subsets are stored in the [Archive][ArchiveFSelect] (`$archive`). +#' Before a batch is evaluated, the [bbotk::Terminator] is queried for the remaining budget. +#' If the available budget is exhausted, an exception is raised, and no further evaluations can be performed from this point on. +#' The [FSelector] is also supposed to store its final result, consisting of a selected feature subset and associated estimated performance values, by calling the method `instance$assign_result()`. #' -#' Evaluations of feature subsets are performed in batches by calling -#' [mlr3::benchmark()] internally. Before a batch is evaluated, the -#' [bbotk::Terminator] is queried for the remaining budget. If the available -#' budget is exhausted, an exception is raised, and no further evaluations can -#' be performed from this point on. +#' @inheritSection ArchiveFSelect Analysis #' -#' The [FSelector] is also supposed to store its final result, consisting -#' of a selected feature subset and associated estimated performance values, by -#' calling the method `instance$assign_result()`. +#' @section Resources: +#' * [book chapter](https://mlr3book.mlr-org.com/feature-selection.html#fs-wrapper) on feature selection. +#' * [gallery post](https://mlr-org.com/gallery/2020-09-14-mlr3fselect-basic/) on feature selection on the Titanic data set. 
#' #' @template param_task #' @template param_learner @@ -31,63 +34,70 @@ #' #' @export #' @examples -#' library(mlr3) -#' library(data.table) -#' -#' # Objects required to define the objective function -#' task = tsk("iris") -#' measure = msr("classif.ce") +#' # Feature selection on Palmer Penguins data set +#' task = tsk("penguins") #' learner = lrn("classif.rpart") -#' resampling = rsmp("cv") #' -#' # Create instance -#' terminator = trm("evals", n_evals = 8) -#' inst = FSelectInstanceSingleCrit$new( +#' # Construct feature selection instance +#' instance = fsi( #' task = task, #' learner = learner, -#' resampling = resampling, -#' measure = measure, -#' terminator = terminator +#' resampling = rsmp("cv", folds = 3), +#' measures = msr("classif.ce"), +#' terminator = trm("evals", n_evals = 4) #' ) #' -#' # Try some feature subsets -#' xdt = data.table( -#' Petal.Length = c(TRUE, FALSE), -#' Petal.Width = c(FALSE, TRUE), -#' Sepal.Length = c(TRUE, FALSE), -#' Sepal.Width = c(FALSE, TRUE) -#' ) +#' # Choose optimization algorithm +#' fselector = fs("random_search", batch_size = 2) +#' +#' # Run feature selection +#' fselector$optimize(instance) #' -#' inst$eval_batch(xdt) +#' # Subset task to optimal feature set +#' task$select(instance$result_feature_set) #' -#' # Get archive data -#' as.data.table(inst$archive) +#' # Train the learner with optimal feature set on the full data set +#' learner$train(task) +#' +#' # Inspect all evaluated sets +#' as.data.table(instance$archive) FSelectInstanceSingleCrit = R6Class("FSelectInstanceSingleCrit", inherit = OptimInstanceSingleCrit, public = list( #' @description #' Creates a new instance of this [R6][R6::R6Class] class. - initialize = function(task, learner, resampling, measure, terminator, - store_models = FALSE, check_values = TRUE, store_benchmark_result = TRUE) { - obj = ObjectiveFSelect$new(task = task, learner = learner, - resampling = resampling, measures = measure, + initialize = function(task, learner, resampling, measure, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE) { + # initialized specialized fselect archive and objective + archive = ArchiveFSelect$new( + search_space = task_to_domain(assert_task(task)), + codomain = measures_to_codomain(assert_measure(measure)), + check_values = check_values) + + objective = ObjectiveFSelect$new( + task = task, + learner = learner, + resampling = resampling, + measures = measure, store_benchmark_result = store_benchmark_result, - store_models = store_models, check_values = check_values) - super$initialize(obj, obj$domain, terminator) + store_models = store_models, + check_values = check_values, + archive = archive) - self$archive = ArchiveFSelect$new(search_space = self$objective$domain, codomain = self$objective$codomain, - check_values = check_values) - self$objective$archive = self$archive + super$initialize(objective, objective$domain, terminator) + + # super class of instance initializes default archive, overwrite with fselect archive + self$archive = archive private$.objective_function = objective_function }, #' @description - #' The [FSelector] writes the best found feature subset - #' and estimated performance value here. For internal use. + #' The [FSelector] writes the best found feature subset and estimated performance value here. + #' For internal use. + #' #' @param y (`numeric(1)`)\cr - #' Optimal outcome. + #' Optimal outcome. 
    assign_result = function(xdt, y) {
      # Add feature names to result for easy task subsetting
      features = list(self$objective$task$feature_names[as.logical(xdt)])
diff --git a/R/FSelector.R b/R/FSelector.R
index 5f95ede4..a301dbf5 100644
--- a/R/FSelector.R
+++ b/R/FSelector.R
@@ -1,56 +1,42 @@
-#' @title FSelector
+#' @title Class for Feature Selection Algorithms
+#'
+#' @include mlr_fselectors.R
 #'
 #' @description
-#' Abstract `FSelector` class that implements the base functionality each
-#' fselector must provide. A `FSelector` object describes the feature selection
-#' strategy, i.e. how to optimize the black-box function and its feasible set
-#' defined by the [FSelectInstanceSingleCrit] / [FSelectInstanceMultiCrit] object.
+#' The [FSelector] implements the optimization algorithm.
 #'
-#' A fselector must write its result into the [FSelectInstanceSingleCrit] /
-#' [FSelectInstanceMultiCrit] using the `assign_result` method of the
-#' [bbotk::OptimInstance] at the end of its selection in order to store the best
-#' selected feature subset and its estimated performance vector.
+#' @details
+#' [FSelector] is an abstract base class that implements the base functionality each fselector must provide.
+#' A subclass is implemented in the following way:
+#' * Inherit from FSelector.
+#' * Specify the private abstract method `$.optimize()` and use it to call into your optimizer.
+#' * You need to call `instance$eval_batch()` to evaluate design points.
+#' * The batch evaluation is requested at the [FSelectInstanceSingleCrit]/[FSelectInstanceMultiCrit] object `instance`, so each batch is possibly executed in parallel via [mlr3::benchmark()], and all evaluations are stored inside of `instance$archive`.
+#' * Before the batch evaluation, the [bbotk::Terminator] is checked, and if it is positive, an exception of class `"terminated_error"` is generated.
+#' In the latter case the current batch of evaluations is still stored in `instance`, but the numeric scores are not sent back to the handling optimizer as it has lost execution control.
+#' * After such an exception was caught we select the best set from `instance$archive` and return it.
+#' * Note that therefore more points than specified by the [bbotk::Terminator] may be evaluated, as the Terminator is only checked before a batch evaluation, and not in-between evaluations in a batch.
+#' How many more depends on the setting of the batch size.
+#' * Overwrite the private super-method `.assign_result()` if you want to decide yourself how to estimate the final set in the instance and its estimated performance.
+#' The default behavior is: we pick the best resample experiment, according to the given measure, and assign its feature set and aggregated performance to the instance.
 #'
 #' @section Private Methods:
 #' * `.optimize(instance)` -> `NULL`\cr
-#'   Abstract base method. Implement to specify feature selection of your
-#'   subclass. See technical details sections.
+#'   Abstract base method. Implement to specify feature selection of your subclass.
+#'   See technical details sections.
 #' * `.assign_result(instance)` -> `NULL`\cr
-#'   Abstract base method. Implement to specify how the final feature subset is
-#'   selected. See technical details sections.
+#'   Abstract base method. Implement to specify how the final feature subset is selected.
+#'   See technical details sections.
 #'
-#' @section Technical Details and Subclasses:
-#' A subclass is implemented in the following way:
-#' * Inherit from `FSelector`.
-#' * Specify the private abstract method `$.optimize()` and use it to call into -#' your optimizer. -#' * You need to call `instance$eval_batch()` to evaluate feature subsets. -#' * The batch evaluation is requested at the [FSelectInstanceSingleCrit] / -#' [FSelectInstanceMultiCrit] object `instance`, so each batch is possibly -#' executed in parallel via [mlr3::benchmark()], and all evaluations are stored -#' inside of `instance$archive`. -#' * Before the batch evaluation, the [bbotk::Terminator] is checked, and if it is -#' positive, an exception of class `"terminated_error"` is generated. In the -#' later case the current batch of evaluations is still stored in `instance`, -#' but the numeric scores are not sent back to the handling optimizer as it has -#' lost execution control. -#' * After such an exception was caught we select the best feature subset from -#' `instance$archive` and return it. -#' * Note that therefore more points than specified by the [bbotk::Terminator] -#' may be evaluated, as the Terminator is only checked before a batch -#' evaluation, and not in-between evaluation in a batch. How many more depends -#' on the setting of the batch size. -#' * Overwrite the private super-method `.assign_result()` if you want to decide -#' yourself how to estimate the final feature subset in the instance and its -#' estimated performance. The default behavior is: We pick the best -#' resample-experiment, regarding the given measure, then assign its -#' feature subset and aggregated performance to the instance. +#' @section Resources: +#' * [book section](https://mlr3book.mlr-org.com/feature-selection.html#the-fselector-class) on feature selection algorithms. #' #' @template param_man #' #' @export FSelector = R6Class("FSelector", public = list( + #' @field id (`character(1)`)\cr #' Identifier of the object. #' Used in tables, plot and text output. @@ -89,6 +75,7 @@ FSelector = R6Class("FSelector", #' @description #' Helper for print outputs. + #' #' @return (`character()`). format = function() { sprintf("<%s>", class(self)[1L]) @@ -96,6 +83,7 @@ FSelector = R6Class("FSelector", #' @description #' Print method. + #' #' @return (`character()`). print = function() { catn(format(self), if (is.na(self$label)) "" else paste0(": ", self$label)) @@ -112,15 +100,13 @@ FSelector = R6Class("FSelector", }, #' @description - #' Performs the feature selection on a [FSelectInstanceSingleCrit] or - #' [FSelectInstanceMultiCrit] until termination. - #' The single evaluations will be written into the [ArchiveFSelect] that resides in the - #' [FSelectInstanceSingleCrit] / [FSelectInstanceMultiCrit]. + #' Performs the feature selection on a [FSelectInstanceSingleCrit] or [FSelectInstanceMultiCrit] until termination. + #' The single evaluations will be written into the [ArchiveFSelect] that resides in the [FSelectInstanceSingleCrit] / [FSelectInstanceMultiCrit]. #' The result will be written into the instance object. #' - #' @param inst ([FSelectInstanceSingleCrit]|[FSelectInstanceMultiCrit]). + #' @param inst ([FSelectInstanceSingleCrit] | [FSelectInstanceMultiCrit]). #' - #' @return [data.table::data.table]. + #' @return [data.table::data.table()]. optimize = function(inst) { assert_multi_class(inst, c("FSelectInstanceSingleCrit", "FSelectInstanceMultiCrit")) optimize_default(inst, self, private) @@ -130,7 +116,7 @@ FSelector = R6Class("FSelector", active = list( #' @field param_set [paradox::ParamSet]\cr - #' Set of control parameters. + #' Set of control parameters. 
param_set = function(rhs) { if (!missing(rhs) && !identical(rhs, private$.param_set)) { stop("$param_set is read-only.") @@ -139,8 +125,8 @@ FSelector = R6Class("FSelector", }, #' @field properties (`character()`)\cr - #' Set of properties of the fselector. - #' Must be a subset of [`mlr_reflections$fselect_properties`][mlr3::mlr_reflections]. + #' Set of properties of the fselector. + #' Must be a subset of [`mlr_reflections$fselect_properties`][mlr3::mlr_reflections]. properties = function(rhs) { if (!missing(rhs) && !identical(rhs, private$.properties)) { stop("$properties is read-only.") @@ -149,8 +135,8 @@ FSelector = R6Class("FSelector", }, #' @field packages (`character()`)\cr - #' Set of required packages. - #' Note that these packages will be loaded via [requireNamespace()], and are not attached. + #' Set of required packages. + #' Note that these packages will be loaded via [requireNamespace()], and are not attached. packages = function(rhs) { if (!missing(rhs) && !identical(rhs, private$.packages)) { stop("$packages is read-only.") @@ -159,8 +145,8 @@ FSelector = R6Class("FSelector", }, #' @field label (`character(1)`)\cr - #' Label for this object. - #' Can be used in tables, plot and text output instead of the ID. + #' Label for this object. + #' Can be used in tables, plot and text output instead of the ID. label = function(rhs) { if (!missing(rhs) && !identical(rhs, private$.param_set)) { stop("$label is read-only.") @@ -169,8 +155,8 @@ FSelector = R6Class("FSelector", }, #' @field man (`character(1)`)\cr - #' String in the format `[pkg]::[topic]` pointing to a manual page for this object. - #' The referenced help package can be opened via method `$help()`. + #' String in the format `[pkg]::[topic]` pointing to a manual page for this object. + #' The referenced help package can be opened via method `$help()`. man = function(rhs) { if (!missing(rhs) && !identical(rhs, private$.man)) { stop("$man is read-only.") diff --git a/R/FSelectorDesignPoints.R b/R/FSelectorDesignPoints.R index 3427f9cd..70c0efa3 100644 --- a/R/FSelectorDesignPoints.R +++ b/R/FSelectorDesignPoints.R @@ -1,11 +1,14 @@ -#' @title Feature Selection via Design Points +#' @title Feature Selection with Design Points #' +#' @include mlr_fselectors.R #' @name mlr_fselectors_design_points #' #' @description -#' Design points uses feature sets specified by the user. +#' Feature selection using user-defined feature sets. #' +#' @details #' The feature sets are evaluated in order as given. +#' #' The feature selection terminates itself when all feature sets are evaluated. #' It is not necessary to set a termination criterion. 
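Following the subclassing recipe from the [FSelector] details above (and independent of the design-points changes that follow), a minimal hypothetical subclass; the class name, id, and constructor arguments are illustrative assumptions, not part of the patch.

library(R6)
library(paradox)
library(mlr3fselect)

FSelectorRandomBatch = R6Class("FSelectorRandomBatch",
  inherit = FSelector,
  public = list(
    initialize = function() {
      # constructor arguments mirror the fields documented in this file;
      # exact defaults may differ in the package
      super$initialize(
        id = "random_batch",
        param_set = ps(),
        properties = c("single-crit", "multi-crit"),
        label = "Random Batch Search")
    }
  ),
  private = list(
    .optimize = function(inst) {
      repeat {
        # draw one random feature subset; a real implementation should also
        # guard against the empty subset
        xdt = generate_design_random(inst$search_space, 1)$data
        # eval_batch() checks the terminator and raises a terminated_error
        # once the budget is exhausted
        inst$eval_batch(xdt)
      }
    }
  )
)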
#' @@ -14,18 +17,18 @@ #' #' @inheritSection bbotk::OptimizerDesignPoints Parameters #' +#' @family FSelector #' @export #' @examples -#' library(mlr3misc) +#' # Feature Selection +#' \donttest{ #' -#' # retrieve task +#' # retrieve task and load learner #' task = tsk("pima") -#' -#' # load learner #' learner = lrn("classif.rpart") #' #' # create design -#' design = rowwise_table( +#' design = mlr3misc::rowwise_table( #' ~age, ~glucose, ~insulin, ~mass, ~pedigree, ~pregnant, ~pressure, ~triceps, #' TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, #' TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, @@ -33,21 +36,19 @@ #' TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE #' ) #' -#' \donttest{ -#' # feature selection on the pima indians diabetes data set +#' # run feature selection on the Pima Indians diabetes data set #' instance = fselect( -#' method = "design_points", +#' method = fs("design_points", design = design), #' task = task, #' learner = learner, -#' resampling = rsmp("cv", folds = 3), -#' measure = msr("classif.ce"), -#' design = design +#' resampling = rsmp("holdout"), +#' measure = msr("classif.ce") #' ) #' -#' # best performing feature subset +#' # best performing feature set #' instance$result #' -#' # all evaluated feature subsets +#' # all evaluated feature sets #' as.data.table(instance$archive) #' #' # subset the task and fit the final model diff --git a/R/FSelectorExhaustiveSearch.R b/R/FSelectorExhaustiveSearch.R index dc9c452b..40b686e5 100644 --- a/R/FSelectorExhaustiveSearch.R +++ b/R/FSelectorExhaustiveSearch.R @@ -1,22 +1,27 @@ -#' @title Feature Selection via Exhaustive Search +#' @title Feature Selection with Exhaustive Search #' +#' @include mlr_fselectors.R #' @name mlr_fselectors_exhaustive_search #' #' @description -#' Exhaustive search generates all possible feature sets. +#' Feature Selection using the Exhaustive Search Algorithm. +#' Exhaustive Search generates all possible feature sets. #' +#' @details #' The feature selection terminates itself when all feature sets are evaluated. #' It is not necessary to set a termination criterion. #' #' @templateVar id exhaustive_search #' @template section_dictionary_fselectors #' -#' @section Parameters: +#' @section Control Parameters: #' \describe{ #' \item{`max_features`}{`integer(1)`\cr -#' Maximum number of features. By default, number of features in [mlr3::Task].} +#' Maximum number of features. +#' By default, number of features in [mlr3::Task].} #' } #' +#' @family FSelector #' @export #' @template example FSelectorExhaustiveSearch = R6Class("FSelectorExhaustiveSearch", diff --git a/R/FSelectorGeneticSearch.R b/R/FSelectorGeneticSearch.R index 365a1928..e4016313 100644 --- a/R/FSelectorGeneticSearch.R +++ b/R/FSelectorGeneticSearch.R @@ -1,30 +1,21 @@ -#' @title Feature Selection via Genetic Search +#' @title Feature Selection with Genetic Search #' +#' @include mlr_fselectors.R #' @name mlr_fselectors_genetic_search #' #' @description -#' Genetic search imitates the process of natural selection to generate feature sets. -#' -#' Calls [genalg::rbga.bin()] from package \CRANpkg{genalg}. +#' Feature selection using the Genetic Algorithm from the package \CRANpkg{genalg}. 
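Because the explicit parameter list is dropped in favor of a pointer to [genalg::rbga.bin()], a short sketch of how the control parameters are still set through the sugar function; the values are illustrative and not part of the patch.

fselector = fs("genetic_search", popSize = 50, mutationChance = 0.05)
fselector$param_set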
 #'
 #' @templateVar id genetic_search
 #' @template section_dictionary_fselectors
 #'
-#' @section Parameters:
-#' \describe{
-#' \item{`suggestions`}{`list()`}
-#' \item{`popSize`}{`integer(1)`}
-#' \item{`mutationChance`}{`numeric(1)`}
-#' \item{`elitism`}{`integer(1)`}
-#' \item{`zeroToOneRatio`}{`integer(1)`}
-#' \item{`iters`}{`integer(1)`}
-#' }
-#'
+#' @section Control Parameters:
 #' For the meaning of the control parameters, see [genalg::rbga.bin()].
 #' [genalg::rbga.bin()] internally terminates after `iters` iterations.
 #' We set `iters = 100000` to allow the termination via our terminators.
 #' If more iterations are needed, set `iters` to a higher value in the parameter set.
 #'
+#' @family FSelector
 #' @export
 #' @template example
 FSelectorGeneticSearch = R6Class("FSelectorGeneticSearch",
diff --git a/R/FSelectorRFE.R b/R/FSelectorRFE.R
index 0caa0e59..5fe594a2 100644
--- a/R/FSelectorRFE.R
+++ b/R/FSelectorRFE.R
@@ -1,8 +1,10 @@
-#' @title Feature Selection via Recursive Feature Elimination
+#' @title Feature Selection with Recursive Feature Elimination
 #'
+#' @include mlr_fselectors.R
 #' @name mlr_fselectors_rfe
 #'
 #' @description
+#' Feature selection using the Recursive Feature Elimination Algorithm (RFE).
 #' Recursive feature elimination iteratively removes features with a low importance score.
 #' Only works with [Learner]s that can calculate importance scores (see section on optional extractors in [Learner]).
 #'
@@ -18,12 +20,14 @@
 #' @templateVar id rfe
 #' @template section_dictionary_fselectors
 #'
-#' @section Parameters:
+#' @section Control Parameters:
 #' \describe{
 #' \item{`n_features`}{`integer(1)`\cr
-#'   The number of features to select. By default half of the features are selected.}
+#'   The number of features to select.
+#'   By default half of the features are selected.}
 #' \item{`feature_fraction`}{`double(1)`\cr
-#'   Fraction of features to retain in each iteration, The default 0.5 retrains half of the features.}
+#'   Fraction of features to retain in each iteration.
+#'   The default of 0.5 retains half of the features.}
 #' \item{`feature_number`}{`integer(1)`\cr
 #'   Number of features to remove in each iteration.}
 #' \item{`subset_sizes`}{`integer()`\cr
@@ -35,18 +39,19 @@
 #'
 #' The parameters `feature_fraction`, `feature_number` and `subset_sizes` are mutually exclusive.
 #'
+#' @family FSelector
 #' @export
 #' @examples
-#' # retrieve task
-#' task = tsk("pima")
+#' # Feature Selection
+#' \donttest{
 #'
-#' # load learner
+#' # retrieve task and load learner
+#' task = tsk("penguins")
 #' learner = lrn("classif.rpart")
 #'
-#' \donttest{
-#' # feature selection on the pima indians diabetes data set
+#' # run feature selection on the Palmer Penguins data set
 #' instance = fselect(
-#'   method = "rfe",
+#'   method = fs("rfe"),
 #'   task = task,
 #'   learner = learner,
 #'   resampling = rsmp("holdout"),
diff --git a/R/FSelectorRandomSearch.R b/R/FSelectorRandomSearch.R
index 0f48f2b2..5f5ccec1 100644
--- a/R/FSelectorRandomSearch.R
+++ b/R/FSelectorRandomSearch.R
@@ -1,20 +1,24 @@
-#' @title Feature Selection via Random Search
+#' @title Feature Selection with Random Search
 #'
+#' @include mlr_fselectors.R
 #' @name mlr_fselectors_random_search
 #'
 #' @description
-#' Random search randomly draws feature sets.
+#' Feature selection using the Random Search Algorithm.
 #'
-#' Feature sets are evaluated in batches of size `batch_size`.
+#' @details
+#' The feature sets are randomly drawn.
+#' The sets are evaluated in batches of size `batch_size`.
#' Larger batches mean we can parallelize more, smaller batches imply a more fine-grained checking of termination criteria. #' #' @templateVar id random_search #' @template section_dictionary_fselectors #' -#' @section Parameters: +#' @section Control Parameters: #' \describe{ #' \item{`max_features`}{`integer(1)`\cr -#' Maximum number of features. By default, number of features in [mlr3::Task].} +#' Maximum number of features. +#' By default, number of features in [mlr3::Task].} #' \item{`batch_size`}{`integer(1)`\cr #' Maximum number of feature sets to try in a batch.} #' } @@ -22,23 +26,24 @@ #' @source #' `r format_bib("bergstra_2012")` #' +#' @family FSelector #' @export #' @examples -#' # retrieve task -#' task = tsk("pima") +#' # Feature Selection +#' \donttest{ #' -#' # load learner +#' # retrieve task and load learner +#' task = tsk("penguins") #' learner = lrn("classif.rpart") #' -#' \donttest{ -#' # feature selection on the pima indians diabetes data set +#' # run feature selection on the Palmer Penguins data set #' instance = fselect( -#' method = "random_search", +#' method = fs("random_search"), #' task = task, #' learner = learner, #' resampling = rsmp("holdout"), #' measure = msr("classif.ce"), -#' term_evals = 100 +#' term_evals = 10 #' ) #' #' # best performing feature subset diff --git a/R/FSelectorSequential.R b/R/FSelectorSequential.R index d0dc7e23..e013a356 100644 --- a/R/FSelectorSequential.R +++ b/R/FSelectorSequential.R @@ -1,10 +1,12 @@ -#' @title Feature Selection via Sequential Search +#' @title Feature Selection with Sequential Search #' +#' @include mlr_fselectors.R #' @name mlr_fselectors_sequential #' #' @description -#' Sequential search iteratively adds features to the set. +#' Feature selection using Sequential Search Algorithm. #' +#' @details #' Sequential forward selection (`strategy = fsf`) extends the feature set in each iteration with the feature that increases the models performance the most. #' Sequential backward selection (`strategy = fsb`) follows the same idea but starts with all features and removes features from the set. #' @@ -14,16 +16,17 @@ #' @templateVar id sequential #' @template section_dictionary_fselectors #' -#' @section Parameters: +#' @section Control Parameters: #' \describe{ #' \item{`min_features`}{`integer(1)`\cr -#' Minimum number of features. By default, 1.} +#' Minimum number of features. By default, 1.} #' \item{`max_features`}{`integer(1)`\cr -#' Maximum number of features. By default, number of features in [mlr3::Task].} +#' Maximum number of features. By default, number of features in [mlr3::Task].} #' \item{`strategy`}{`character(1)`\cr -#' Search method `sfs` (forward search) or `sbs` (backward search).} +#' Search method `sfs` (forward search) or `sbs` (backward search).} #' } #' +#' @family FSelector #' @export #' @template example FSelectorSequential = R6Class("FSelectorSequential", diff --git a/R/FSelectorShadowVariableSearch.R b/R/FSelectorShadowVariableSearch.R index fc41421b..6995d39b 100644 --- a/R/FSelectorShadowVariableSearch.R +++ b/R/FSelectorShadowVariableSearch.R @@ -1,10 +1,13 @@ -#' @title Feature Selection via Shadow Variable Search +#' @title Feature Selection with Shadow Variable Search #' +#' @include mlr_fselectors.R #' @name mlr_fselectors_shadow_variable_search #' #' @description +#' Feature selection using the Shadow Variable Search Algorithm. #' Shadow variable search creates for each feature a permutated copy and stops when one of them is selected. 
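A conceptual sketch, not part of the patch, of what a shadow variable is: a permuted copy of a feature keeps the marginal distribution but carries no signal, so once such a copy would be selected, no informative feature is left.

set.seed(1)
x = iris$Sepal.Length
shadow_x = sample(x)  # permuted copy of the feature

cor(x, as.integer(iris$Species))         # informative feature
cor(shadow_x, as.integer(iris$Species))  # shadow variable, correlation near zero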
#' +#' @details #' The feature selection terminates itself when the first shadow variable is selected. #' It is not necessary to set a termination criterion. #' @@ -14,18 +17,19 @@ #' @source #' `r format_bib("thomas2017", "wu2007")` #' +#' @family FSelector #' @export #' @examples -#' # retrieve task -#' task = tsk("pima") +#' # Feature Selection +#' \donttest{ #' -#' # load learner +#' # retrieve task and load learner +#' task = tsk("penguins") #' learner = lrn("classif.rpart") #' -#' \donttest{ -#' # feature selection on the pima indians diabetes data set +#' # run feature selection on the Palmer Penguins data set #' instance = fselect( -#' method = "shadow_variable_search", +#' method = fs("shadow_variable_search"), #' task = task, #' learner = learner, #' resampling = rsmp("holdout"), diff --git a/R/ObjectiveFSelect.R b/R/ObjectiveFSelect.R index c12acc4b..217835a6 100644 --- a/R/ObjectiveFSelect.R +++ b/R/ObjectiveFSelect.R @@ -1,9 +1,8 @@ -#' @title ObjectiveFSelect +#' @title Class for Feature Selection Objective #' #' @description -#' Stores the objective function that estimates the performance of feature -#' subsets. This class is usually constructed internally by by the -#' [FSelectInstanceSingleCrit] / [FSelectInstanceMultiCrit]. +#' Stores the objective function that estimates the performance of feature subsets. +#' This class is usually constructed internally by by the [FSelectInstanceSingleCrit] / [FSelectInstanceMultiCrit]. #' #' @template param_task #' @template param_learner @@ -16,21 +15,18 @@ #' @export ObjectiveFSelect = R6Class("ObjectiveFSelect", inherit = Objective, - - #' @description - #' Creates a new instance of this [R6][R6::R6Class] class. public = list( - #' @field task ([mlr3::Task]) + #' @field task ([mlr3::Task]). task = NULL, - #' @field learner ([mlr3::Learner]) + #' @field learner ([mlr3::Learner]). learner = NULL, - #' @field resampling ([mlr3::Resampling]) + #' @field resampling ([mlr3::Resampling]). resampling = NULL, - #' @field measures (list of [mlr3::Measure]) + #' @field measures (list of [mlr3::Measure]). measures = NULL, #' @field store_models (`logical(1)`). @@ -44,38 +40,32 @@ ObjectiveFSelect = R6Class("ObjectiveFSelect", #' @description #' Creates a new instance of this [R6][R6::R6Class] class. - initialize = function(task, learner, resampling, measures, - check_values = TRUE, store_benchmark_result = TRUE, - store_models = FALSE) { - + #' + #' @param archive ([ArchiveFSelect])\cr + #' Reference to the archive of [FSelectInstanceSingleCrit] | [FSelectInstanceMultiCrit]. + #' If `NULL` (default), benchmark result and models cannot be stored. 
+ initialize = function(task, learner, resampling, measures, check_values = TRUE, store_benchmark_result = TRUE, store_models = FALSE, archive = NULL) { self$task = assert_task(as_task(task, clone = TRUE)) - self$learner = assert_learner(as_learner(learner, clone = TRUE), - task = self$task) - self$resampling = assert_resampling(as_resampling(resampling, - clone = TRUE)) - self$measures = assert_measures(as_measures(measures, clone = TRUE), - task = self$task, learner = self$learner) - self$store_benchmark_result = assert_logical(store_benchmark_result) - self$store_models = assert_logical(store_models) - if (!resampling$is_instantiated) { - self$resampling$instantiate(self$task) - } - - domain = ParamSet$new(map(self$task$feature_names, - function(s) ParamLgl$new(id = s))) - - codomain = ParamSet$new(map(self$measures, function(s) { - ParamDbl$new(id = s$id, - tags = ifelse(s$minimize, "minimize", "maximize")) - })) - - super$initialize(id = sprintf("%s_on_%s", self$learner$id, self$task$id), - domain = domain, codomain = codomain, check_values = check_values) + self$learner = assert_learner(as_learner(learner, clone = TRUE), task = self$task) + self$resampling = assert_resampling(as_resampling(resampling, clone = TRUE)) + self$measures = assert_measures(as_measures(measures, clone = TRUE), task = self$task, learner = self$learner) + + self$archive = assert_r6(archive, "ArchiveFSelect", null.ok = TRUE) + if (is.null(self$archive)) store_benchmark_result = store_models = FALSE + self$store_models = assert_flag(store_models) + self$store_benchmark_result = assert_flag(store_benchmark_result) || self$store_models + + if (!resampling$is_instantiated) self$resampling$instantiate(self$task) + + super$initialize( + id = sprintf("%s_on_%s", self$learner$id, self$task$id), + domain = task_to_domain(self$task), + codomain = measures_to_codomain(self$measures), + check_values = check_values) } ), private = list( - .eval_many = function(xss) { learners = map(xss, function(x) { state = self$task$feature_names[unlist(x)] @@ -84,28 +74,25 @@ ObjectiveFSelect = R6Class("ObjectiveFSelect", GraphLearner$new(graph) }) + # benchmark feature subsets design = benchmark_grid(self$task, learners, self$resampling) - bmr = benchmark(design, store_models = self$store_models) - aggr = bmr$aggregate(self$measures) - y = map_chr(self$measures, "id") + benchmark_result = benchmark(design, store_models = self$store_models) + + # aggregate performance scores + aggregated_performance = benchmark_result$aggregate(self$measures, conditions = TRUE)[, c(self$codomain$target_ids, "warnings", "errors"), with = FALSE] - # add runtime - time = map_dbl(bmr$resample_results$resample_result, function(rr) { - sum(map_dbl(rr$learners, function(l) sum(l$timings))) + # add runtime to evaluations + time = map_dbl(benchmark_result$resample_results$resample_result, function(rr) { + sum(map_dbl(get_private(rr)$.data$learner_states(get_private(rr)$.view), function(state) state$train_time + state$predict_time)) }) - aggr[, "runtime_learners" := time] + set(aggregated_performance, j = "runtime_learners", value = time) + # store benchmark result in archive if (self$store_benchmark_result) { - self$archive$benchmark_result = - if (is.null(self$archive$benchmark_result)) { - self$archive$benchmark_result = bmr - } else { - self$archive$benchmark_result$combine(bmr) - } - cbind(aggr[, c(y, "runtime_learners"), with = FALSE], uhash = bmr$uhashes) - } else { - aggr[, c(y, "runtime_learners"), with = FALSE] + 
self$archive$benchmark_result$combine(benchmark_result) + set(aggregated_performance, j = "uhash", value = benchmark_result$uhashes) } + aggregated_performance } ) ) diff --git a/R/auto_fselector.R b/R/auto_fselector.R index 49f01dfe..4549f594 100644 --- a/R/auto_fselector.R +++ b/R/auto_fselector.R @@ -26,15 +26,7 @@ #' @template param_check_values #' #' @export -#' @examples -#' at = auto_fselector( -#' method = "random_search", -#' learner = lrn("classif.rpart"), -#' resampling = rsmp ("holdout"), -#' measure = msr("classif.ce"), -#' term_evals = 4) -#' -#' at$train(tsk("pima")) +#' @inherit AutoFSelector examples auto_fselector = function(method, learner, resampling, measure = NULL, term_evals = NULL, term_time = NULL, terminator = NULL, store_fselect_instance = TRUE, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, ...) { fselector = if (is.character(method)) { assert_choice(method, mlr_fselectors$keys()) diff --git a/R/extract_inner_fselect_archives.R b/R/extract_inner_fselect_archives.R index 296c5602..f5d4397a 100644 --- a/R/extract_inner_fselect_archives.R +++ b/R/extract_inner_fselect_archives.R @@ -1,12 +1,10 @@ #' @title Extract Inner Feature Selection Archives #' #' @description -#' Extract inner feature selection archives of nested resampling. Implemented for -#' [mlr3::ResampleResult] and [mlr3::BenchmarkResult]. The function iterates -#' over the [AutoFSelector] objects and binds the archives to a -#' [data.table::data.table()]. [AutoFSelector] must be initialized with -#' `store_fselect_instance = TRUE` and `resample()` or `benchmark()` must be -#' called with `store_models = TRUE`. +#' Extract inner feature selection archives of nested resampling. +#' Implemented for [mlr3::ResampleResult] and [mlr3::BenchmarkResult]. +#' The function iterates over the [AutoFSelector] objects and binds the archives to a [data.table::data.table()]. +#' [AutoFSelector] must be initialized with `store_fselect_instance = TRUE` and `resample()` or `benchmark()` must be called with `store_models = TRUE`. #' #' @section Data structure: #' @@ -21,7 +19,7 @@ #' * `runtime_learners` (`numeric(1)`)\cr #' Sum of training and predict times logged in learners per #' [mlr3::ResampleResult] / evaluation. This does not include potential -#' overhead time. +#' overhead time. #' * `timestamp` (`POSIXct`)\cr #' Time stamp when the evaluation was logged into the archive. #' * `batch_nr` (`integer(1)`)\cr @@ -34,9 +32,6 @@ #' * `resampling_id` (`character(1)`). #' #' @param x ([mlr3::ResampleResult] | [mlr3::BenchmarkResult]). -#' @param unnest (`character()`)\cr -#' Transforms list columns to separate columns. Set to `NULL` if no column -#' should be unnested. #' @param exclude_columns (`character()`)\cr #' Exclude columns from result table. Set to `NULL` if no column should be #' excluded. 
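A short sketch of the reduced extraction interface after dropping `unnest`; it is not part of the patch and the extra excluded column is an illustrative choice.

at = auto_fselector(
  method = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

rr = resample(tsk("penguins"), at, rsmp("cv", folds = 2), store_models = TRUE)

# columns are now controlled solely via exclude_columns
extract_inner_fselect_archives(rr, exclude_columns = c("uhash", "timestamp"))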
@@ -44,29 +39,33 @@ #' #' @export #' @examples +#' # Nested Resampling on Palmer Penguins Data Set +#' +#' # create auto fselector #' at = auto_fselector( -#' method = "random_search", +#' method = fs("random_search"), #' learner = lrn("classif.rpart"), #' resampling = rsmp ("holdout"), #' measure = msr("classif.ce"), #' term_evals = 4) #' #' resampling_outer = rsmp("cv", folds = 2) -#' rr = resample(tsk("iris"), at, resampling_outer, store_models = TRUE) +#' rr = resample(tsk("penguins"), at, resampling_outer, store_models = TRUE) #' +#' # extract inner archives #' extract_inner_fselect_archives(rr) -extract_inner_fselect_archives = function (x, unnest = NULL, exclude_columns = "uhash") { +extract_inner_fselect_archives = function (x, exclude_columns = "uhash") { UseMethod("extract_inner_fselect_archives") } #' @export -extract_inner_fselect_archives.ResampleResult = function(x, unnest = NULL, exclude_columns = "uhash") { +extract_inner_fselect_archives.ResampleResult = function(x, exclude_columns = "uhash") { rr = assert_resample_result(x) if (is.null(rr$learners[[1]]$model$fselect_instance)) { return(data.table()) } tab = imap_dtr(rr$learners, function(learner, i) { - data = as.data.table(learner$archive, unnest, exclude_columns) + data = as.data.table(learner$archive, exclude_columns) set(data, j = "iteration", value = i) }) tab[, "task_id" := rr$task$id] @@ -79,10 +78,10 @@ extract_inner_fselect_archives.ResampleResult = function(x, unnest = NULL, exclu } #' @export -extract_inner_fselect_archives.BenchmarkResult = function(x, unnest = NULL, exclude_columns = "uhash") { +extract_inner_fselect_archives.BenchmarkResult = function(x, exclude_columns = "uhash") { bmr = assert_benchmark_result(x) tab = imap_dtr(bmr$resample_results$resample_result, function(rr, i) { - data = extract_inner_fselect_archives(rr, unnest, exclude_columns) + data = extract_inner_fselect_archives(rr, exclude_columns) if (nrow(data) > 0) set(data, j = "experiment", value = i) }, .fill = TRUE) @@ -93,4 +92,4 @@ extract_inner_fselect_archives.BenchmarkResult = function(x, unnest = NULL, excl setcolorder(tab, c("experiment", "iteration", cols_x, cols_y)) } tab -} \ No newline at end of file +} diff --git a/R/extract_inner_fselect_results.R b/R/extract_inner_fselect_results.R index ca27df8f..ca089a90 100644 --- a/R/extract_inner_fselect_results.R +++ b/R/extract_inner_fselect_results.R @@ -1,12 +1,13 @@ #' @title Extract Inner Feature Selection Results #' #' @description -#' Extract inner feature selection results of nested resampling. Implemented for -#' [mlr3::ResampleResult] and [mlr3::BenchmarkResult]. The function iterates -#' over the [AutoFSelector] objects and binds the feature selection results to a -#' [data.table::data.table()]. [AutoFSelector] must be initialized with -#' `store_fselect_instance = TRUE` and `resample()` or `benchmark()` must be -#' called with `store_models = TRUE`. +#' Extract inner feature selection results of nested resampling. +#' Implemented for [mlr3::ResampleResult] and [mlr3::BenchmarkResult]. +#' +#' @details +#' The function iterates over the [AutoFSelector] objects and binds the feature selection results to a [data.table::data.table()]. +#' [AutoFSelector] must be initialized with `store_fselect_instance = TRUE` and `resample()` or `benchmark()` must be called with `store_models = TRUE`. +#' Optionally, the instance can be added for each iteration. #' #' @section Data structure: #' @@ -25,12 +26,20 @@ #' * `resampling_id` (`character(1)`). 
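The optional instance column mentioned above is requested with the new `fselect_instance` flag; a one-line sketch, assuming a nested resampling result `rr` as in the example further below.

extract_inner_fselect_results(rr, fselect_instance = TRUE)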
#' #' @param x ([mlr3::ResampleResult] | [mlr3::BenchmarkResult]). +#' @param fselect_instance (`logical(1)`)\cr +#' If `TRUE`, instances are added to the table. +#' @param ... (any)\cr +#' Additional arguments. +#' #' @return [data.table::data.table()]. #' #' @export #' @examples +#' # Nested Resampling on Palmer Penguins Data Set +#' +#' # create auto fselector #' at = auto_fselector( -#' method = "random_search", +#' method = fs("random_search"), #' learner = lrn("classif.rpart"), #' resampling = rsmp ("holdout"), #' measure = msr("classif.ce"), @@ -39,13 +48,14 @@ #' resampling_outer = rsmp("cv", folds = 2) #' rr = resample(tsk("iris"), at, resampling_outer, store_models = TRUE) #' +#' # extract inner results #' extract_inner_fselect_results(rr) -extract_inner_fselect_results = function (x) { +extract_inner_fselect_results = function (x, fselect_instance, ...) { UseMethod("extract_inner_fselect_results", x) } #' @export -extract_inner_fselect_results.ResampleResult = function(x) { +extract_inner_fselect_results.ResampleResult = function(x, fselect_instance = FALSE, ...) { rr = assert_resample_result(x) if (is.null(rr$learners[[1]]$model$fselect_instance)) { return(data.table()) @@ -53,6 +63,8 @@ extract_inner_fselect_results.ResampleResult = function(x) { tab = imap_dtr(rr$learners, function(learner, i) { data = setalloccol(learner$fselect_result) set(data, j = "iteration", value = i) + if (fselect_instance) set(data, j = "fselect_instance", value = list(learner$fselect_instance)) + data }) tab[, "task_id" := rr$task$id] tab[, "learner_id" := rr$learner$id] @@ -64,10 +76,10 @@ extract_inner_fselect_results.ResampleResult = function(x) { } #' @export -extract_inner_fselect_results.BenchmarkResult = function(x) { +extract_inner_fselect_results.BenchmarkResult = function(x, fselect_instance = FALSE, ...) { bmr = assert_benchmark_result(x) tab = imap_dtr(bmr$resample_results$resample_result, function(rr, i) { - data = extract_inner_fselect_results(rr) + data = extract_inner_fselect_results(rr, fselect_instance = fselect_instance) if (nrow(data) > 0) set(data, j = "experiment", value = i) }, .fill = TRUE) # reorder dt @@ -77,4 +89,4 @@ extract_inner_fselect_results.BenchmarkResult = function(x) { setcolorder(tab, unique(c("experiment", "iteration", cols_x, cols_y))) } tab -} \ No newline at end of file +} diff --git a/R/fselect.R b/R/fselect.R index 6ee06d03..fa3bbd3b 100644 --- a/R/fselect.R +++ b/R/fselect.R @@ -1,10 +1,29 @@ #' @title Function for Feature Selection #' +#' @include FSelectInstanceSingleCrit.R ArchiveFSelect.R +#' #' @description -#' Function to optimize the feature set of a [mlr3::Learner]. +#' Function to optimize the features of a [mlr3::Learner]. +#' The function internally creates a [FSelectInstanceSingleCrit] or [FSelectInstanceMultiCrit] which describe the feature selection problem. +#' It executes the feature selection with the [FSelector] (`method`) and returns the result with the fselect instance (`$result`). +#' The [ArchiveFSelect] (`$archive`) stores all evaluated hyperparameter configurations and performance scores. +#' +#' @details +#' The [mlr3::Task], [mlr3::Learner], [mlr3::Resampling], [mlr3::Measure] and [Terminator] are used to construct a [FSelectInstanceSingleCrit]. +#' If multiple performance [Measures][Measure] are supplied, a [FSelectInstanceMultiCrit] is created. +#' The parameter `term_evals` and `term_time` are shortcuts to create a [Terminator]. +#' If both parameters are passed, a [TerminatorCombo] is constructed. 
+#' For other [Terminators][Terminator], pass one with `terminator`.
+#' If no termination criterion is needed, set `term_evals`, `term_time` and `terminator` to `NULL`.
+#'
+#' @inheritSection FSelectInstanceSingleCrit Resources
+#' @inheritSection ArchiveFSelect Analysis
 #'
 #' @param method (`character(1)` | [FSelector])\cr
 #'   Key to retrieve fselector from [mlr_fselectors] dictionary or [FSelector] object.
+#' @param measures ([mlr3::Measure] or list of [mlr3::Measure])\cr
+#'   A single measure creates a [FSelectInstanceSingleCrit] and multiple measures a [FSelectInstanceMultiCrit].
+#'   If `NULL`, the default measure is used.
 #' @param term_evals (`integer(1)`)\cr
 #'   Number of allowed evaluations.
 #' @param term_time (`integer(1)`)\cr
@@ -12,39 +31,58 @@
 #' @param ... (named `list()`)\cr
 #'   Named arguments to be set as parameters of the fselector.
 #'
-#' @return `FSelectInstanceSingleCrit` | `FSelectInstanceMultiCrit`
+#' @return [FSelectInstanceSingleCrit] | [FSelectInstanceMultiCrit]
 #'
 #' @template param_task
 #' @template param_learner
 #' @template param_resampling
-#' @template param_measures
+#' @template param_terminator
+#' @template param_store_benchmark_result
 #' @template param_store_models
+#' @template param_check_values
 #'
 #' @export
 #' @examples
+#' # Feature selection on the Palmer Penguins data set
-#' task = tsk("pima")
+#' task = tsk("penguins")
+#' learner = lrn("classif.rpart")
 #'
+#' # Run feature selection
 #' instance = fselect(
 #'   method = "random_search",
 #'   task = task,
-#'   learner = lrn("classif.rpart"),
+#'   learner = learner,
 #'   resampling = rsmp ("holdout"),
 #'   measures = msr("classif.ce"),
 #'   term_evals = 4)
 #'
-#' # subset task to optimized feature set
+#' # Subset task to optimized feature set
 #' task$select(instance$result_feature_set)
-fselect = function(method, task, learner, resampling, measures, term_evals = NULL, term_time = NULL, store_models = FALSE, ...) {
+#'
+#' # Train the learner with optimal feature set on the full data set
+#' learner$train(task)
+#'
+#' # Inspect all evaluated feature sets
+#' as.data.table(instance$archive)
+fselect = function(method, task, learner, resampling, measures = NULL, term_evals = NULL, term_time = NULL, terminator = NULL, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE, ...) {
   fselector = if (is.character(method)) {
     assert_choice(method, mlr_fselectors$keys())
     fs(method, ...)
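# --- Sketch of the `terminator` shortcut described in the details above:
# --- instead of `term_evals`/`term_time`, a Terminator object can be passed
# --- directly (a run time budget is shown here; task, learner and the
# --- 10-second budget are illustrative, assumes mlr3/mlr3fselect attached).
instance = fselect(
  method = "random_search",
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("run_time", secs = 10))

instance$result_feature_set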
   } else {
     assert_fselector(method)
   }
 
-  terminator = terminator_selection(term_evals, term_time)
+  terminator = terminator %??% terminator_selection(term_evals, term_time)
 
   FSelectInstance = if (!is.list(measures)) FSelectInstanceSingleCrit else FSelectInstanceMultiCrit
-  instance = FSelectInstance$new(task, learner, resampling, measures, terminator, store_models = store_models)
+  instance = FSelectInstance$new(
+    task = task,
+    learner = learner,
+    resampling = resampling,
+    measures = measures,
+    terminator = terminator,
+    store_benchmark_result = store_benchmark_result,
+    store_models = store_models,
+    check_values = check_values)
   fselector$optimize(instance)
   instance
diff --git a/R/fselect_nested.R b/R/fselect_nested.R
index f5edbc3f..4d689917 100644
--- a/R/fselect_nested.R
+++ b/R/fselect_nested.R
@@ -24,19 +24,20 @@
 #'
 #' @export
 #' @examples
+#' # Nested resampling on Palmer Penguins data set
 #' rr = fselect_nested(
 #'   method = "random_search",
-#'   task = tsk("pima"),
+#'   task = tsk("penguins"),
 #'   learner = lrn("classif.rpart"),
 #'   inner_resampling = rsmp ("holdout"),
 #'   outer_resampling = rsmp("cv", folds = 2),
 #'   measure = msr("classif.ce"),
 #'   term_evals = 4)
 #'
-#' # performance scores estimated on the outer resampling
+#' # Performance scores estimated on the outer resampling
 #' rr$score()
 #'
-#' # unbiased performance of the final model trained on the full data set
+#' # Unbiased performance of the final model trained on the full data set
 #' rr$aggregate()
 fselect_nested = function(method, task, learner, inner_resampling, outer_resampling, measure,
   term_evals = NULL, term_time = NULL, ...) {
@@ -46,4 +47,4 @@ fselect_nested = function(method, task, learner, inner_resampl
 
   afs = auto_fselector(method, learner, inner_resampling, measure, term_evals, term_time, ...)
   resample(task, afs, outer_resampling, store_models = TRUE)
-}
\ No newline at end of file
+}
diff --git a/R/helper.R b/R/helper.R
index bdd299a8..cb87618c 100644
--- a/R/helper.R
+++ b/R/helper.R
@@ -1,3 +1,9 @@
-catn = function(..., file = "") {
-  cat(paste0(..., collapse = "\n"), "\n", sep = "", file = file)
+task_to_domain = function(task) {
+  ParamSet$new(map(task$feature_names, function(s) ParamLgl$new(id = s)))
+}
+
+measures_to_codomain = function(measures) {
+  Codomain$new(map(as_measures(measures), function(s) {
+    ParamDbl$new(id = s$id, tags = ifelse(s$minimize, "minimize", "maximize"))
+  }))
 }
diff --git a/R/sugar.R b/R/sugar.R
index 4832282b..3eed14f1 100644
--- a/R/sugar.R
+++ b/R/sugar.R
@@ -1,14 +1,24 @@
 #' @title Syntactic Sugar for FSelect Construction
 #'
 #' @description
-#' This function complements [mlr_fselectors] with functions in the spirit
-#' of [mlr3::mlr_sugar].
+#' Functions to retrieve objects, set parameters and assign to fields in one go.
+#' Relies on [mlr3misc::dictionary_sugar_get()] to extract objects from the respective [mlr3misc::Dictionary]:
+#'
+#' * `fs()` for a [FSelector] from [mlr_fselectors].
+#' * `fss()` for a list of [FSelectors][FSelector] from [mlr_fselectors].
+#' * `trm()` for a [Terminator] from [mlr_terminators].
+#' * `trms()` for a list of [Terminators][Terminator] from [mlr_terminators].
 #'
 #' @inheritParams mlr3::mlr_sugar
-#' @return [FSelector].
+#' @return [R6::R6Class] object of the respective type, or a list of [R6::R6Class] objects for the plural versions.
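# --- Sketch of what the new internal helpers in R/helper.R construct: the
# --- search space has one logical parameter per feature, the codomain one
# --- numeric parameter per measure tagged with its optimization direction.
# --- A plain ParamSet stands in for the internal bbotk Codomain class here;
# --- the task and measure are illustrative, not prescribed by the patch.
library(mlr3)
library(paradox)

task = tsk("penguins")
domain = ParamSet$new(lapply(task$feature_names, function(s) ParamLgl$new(id = s)))

measure = msr("classif.ce")
codomain = ParamSet$new(list(
  ParamDbl$new(id = measure$id, tags = if (measure$minimize) "minimize" else "maximize")))

domain
codomain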
+#' #' @export #' @examples -#' fs("sequential", max_features = 4) +#' # random search with batch size of 5 +#' fs("random_search", batch_size = 5) +#' +#' # run time terminator with 20 seconds +#' trm("run_time", secs = 20) fs = function(.key, ...) { dictionary_sugar_get(mlr_fselectors, .key, ...) } @@ -18,3 +28,29 @@ fs = function(.key, ...) { fss = function(.keys, ...) { dictionary_sugar_mget(mlr_fselectors, .keys, ...) } + +#' @title Syntactic Sugar for Instance Construction +#' +#' @description +#' Function to construct a [FSelectInstanceSingleCrit] or [FSelectInstanceMultiCrit]. +#' +#' @param measures ([mlr3::Measure] or list of [mlr3::Measure])\cr +#' A single measure creates a [FSelectInstanceSingleCrit] and multiple measures a [FSelectInstanceMultiCrit]. +#' If `NULL`, default measure is used. +#' +#' @template param_task +#' @template param_learner +#' @template param_resampling +#' @template param_terminator +#' @template param_store_benchmark_result +#' @template param_store_models +#' @template param_check_values +#' +#' @inheritSection FSelectInstanceSingleCrit Resources +#' +#' @export +#' @inherit FSelectInstanceSingleCrit examples +fsi = function(task, learner, resampling, measures = NULL, terminator, store_benchmark_result = TRUE, store_models = FALSE, check_values = FALSE) { + FSelectInstance = if (!is.list(measures)) FSelectInstanceSingleCrit else FSelectInstanceMultiCrit + FSelectInstance$new(task, learner, resampling, measures, terminator, store_benchmark_result, store_models, check_values) +} diff --git a/man-roxygen/example.R b/man-roxygen/example.R index 810ce700..be96f193 100644 --- a/man-roxygen/example.R +++ b/man-roxygen/example.R @@ -1,12 +1,12 @@ #' @examples -#' # retrieve task -#' task = tsk("pima") +#' # Feature Selection +#' \donttest{ #' -#' # load learner +#' # retrieve task and load learner +#' task = tsk("penguins") #' learner = lrn("classif.rpart") #' -#' \donttest{ -#' # feature selection on the pima indians diabetes data set +#' # run feature selection on the Palmer Penguins data set #' instance = fselect( #' method = "<%= id %>", #' task = task, @@ -16,10 +16,10 @@ #' term_evals = 10 #' ) #' -#' # best performing feature subset +#' # best performing feature set #' instance$result #' -#' # all evaluated feature subsets +#' # all evaluated feature sets #' as.data.table(instance$archive) #' #' # subset the task and fit the final model diff --git a/man-roxygen/section_dictionary_fselectors.R b/man-roxygen/section_dictionary_fselectors.R index 92448353..384aa101 100644 --- a/man-roxygen/section_dictionary_fselectors.R +++ b/man-roxygen/section_dictionary_fselectors.R @@ -1,7 +1,5 @@ #' @section Dictionary: -#' This [FSelector] can be instantiated via the [dictionary][mlr3misc::Dictionary] -#' [mlr_fselectors] or with the associated sugar function [fs()]: +#' This [FSelector] can be instantiated with the associated sugar function [fs()]: #' ``` -#' mlr_fselectors$get("<%= id %>") #' fs("<%= id %>") #' ``` diff --git a/man/ArchiveFSelect.Rd b/man/ArchiveFSelect.Rd index 83e2b1ef..c2eaef82 100644 --- a/man/ArchiveFSelect.Rd +++ b/man/ArchiveFSelect.Rd @@ -2,10 +2,18 @@ % Please edit documentation in R/ArchiveFSelect.R \name{ArchiveFSelect} \alias{ArchiveFSelect} -\title{Logging Object for Evaluated Feature Sets} +\title{Class for Logging Evaluated Feature Sets} \description{ -Container around a \code{\link[data.table:data.table]{data.table::data.table()}} which stores all evaluated -feature sets and performance scores. 
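# --- Sketch of the new fsi() sugar defined above: build a fselect instance
# --- and run a FSelector on it step by step. The terminator, batch size and
# --- task/learner choices are illustrative; assumes mlr3/mlr3fselect attached.
instance = fsi(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 4))

fselector = fs("random_search", batch_size = 2)
fselector$optimize(instance)

instance$result_feature_set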
+The \link{ArchiveFSelect} stores all evaluated feature sets and performance scores. +} +\details{ +The \link{ArchiveFSelect} is a container around a \code{\link[data.table:data.table]{data.table::data.table()}}. +Each row corresponds to a single evaluation of a feature set. +See the section on Data Structure for more information. +The archive stores additionally a \link[mlr3:BenchmarkResult]{mlr3::BenchmarkResult} (\verb{$benchmark_result}) that records the resampling experiments. +Each experiment corresponds to to a single evaluation of a feature set. +The table (\verb{$data}) and the benchmark result (\verb{$benchmark_result}) are linked by the \code{uhash} column. +If the archive is passed to \code{as.data.table()}, both are joined automatically. } \section{Data structure}{ @@ -15,59 +23,40 @@ The table (\verb{$data}) has the following columns: \item One column for each feature of the task (\verb{$search_space}). \item One column for each performance measure (\verb{$codomain}). \item \code{runtime_learners} (\code{numeric(1)})\cr -Sum of training and predict times logged in learners per -\link[mlr3:ResampleResult]{mlr3::ResampleResult} / evaluation. This does not include potential -overhead time. +Sum of training and predict times logged in learners per \link[mlr3:ResampleResult]{mlr3::ResampleResult} / evaluation. +This does not include potential overhead time. \item \code{timestamp} (\code{POSIXct})\cr Time stamp when the evaluation was logged into the archive. \item \code{batch_nr} (\code{integer(1)})\cr -Feature sets are evaluated in batches. Each batch has a unique batch -number. +Feature sets are evaluated in batches. Each batch has a unique batch number. \item \code{uhash} (\code{character(1)})\cr -Connects each feature set to the resampling experiment -stored in the \link[mlr3:BenchmarkResult]{mlr3::BenchmarkResult}. +Connects each feature set to the resampling experiment stored in the \link[mlr3:BenchmarkResult]{mlr3::BenchmarkResult}. } - -Each row corresponds to a single evaluation of a feature set. - -The archive stores additionally a \link[mlr3:BenchmarkResult]{mlr3::BenchmarkResult} -(\verb{$benchmark_result}) that records the resampling experiments. Each -experiment corresponds to to a single evaluation of a feature set. The table -(\verb{$data}) and the benchmark result (\verb{$benchmark_result}) are linked by the -\code{uhash} column. If the results are viewed with \code{as.data.table()}, both are -joined automatically. } \section{Analysis}{ - -For analyzing the feature selection results, it is recommended to pass the archive to -\code{as.data.table()}. The returned data table is joined with the benchmark result -which adds the \link[mlr3:ResampleResult]{mlr3::ResampleResult} for each feature set. +For analyzing the feature selection results, it is recommended to pass the archive to \code{as.data.table()}. +The returned data table is joined with the benchmark result which adds the \link[mlr3:ResampleResult]{mlr3::ResampleResult} for each feature set. The archive provides various getters (e.g. \verb{$learners()}) to ease the access. -All getters extract by position (\code{i}) or unique hash (\code{uhash}). For a -complete list of all getters see the methods section. +All getters extract by position (\code{i}) or unique hash (\code{uhash}). +For a complete list of all getters see the methods section. -The benchmark result (\verb{$benchmark_result}) allows to score the feature sets -again on a different measure. 
Alternatively, measures can be supplied to -\code{as.data.table()}. +The benchmark result (\verb{$benchmark_result}) allows to score the feature sets again on a different measure. +Alternatively, measures can be supplied to \code{as.data.table()}. } \section{S3 Methods}{ \itemize{ -\item \code{as.data.table.ArchiveFSelect(x, unnest = NULL, exclude_columns = "uhash", measures = NULL)}\cr +\item \code{as.data.table.ArchiveFSelect(x, exclude_columns = "uhash", measures = NULL)}\cr Returns a tabular view of all evaluated feature sets.\cr \link{ArchiveFSelect} -> \code{\link[data.table:data.table]{data.table::data.table()}}\cr \itemize{ \item \code{x} (\link{ArchiveFSelect}) -\item \code{unnest} (\code{character()})\cr -Transforms list columns to separate columns. Set to \code{NULL} if no column -should be unnested. \item \code{exclude_columns} (\code{character()})\cr -Exclude columns from table. Set to \code{NULL} if no column should be -excluded. +Exclude columns from table. Set to \code{NULL} if no column should be excluded. \item \code{measures} (list of \link[mlr3:Measure]{mlr3::Measure})\cr Score feature sets on additional measures. } @@ -81,13 +70,14 @@ Score feature sets on additional measures. \if{html}{\out{