Merged origin/main into Dpananos-cli_caret

tidymodels · Sep 11, 2024 · ec755a6
2 parents 01c9341 + 5c84b97

Showing 60 changed files with 388 additions and 303 deletions.
diff --git a/NAMESPACE b/NAMESPACE
@@ -416,6 +416,7 @@ export(validation_time_split)
 export(vfold_cv)
 import(vctrs)
 importFrom(cli,cli_abort)
+importFrom(cli,cli_warn)
 importFrom(dplyr,"%>%")
 importFrom(dplyr,arrange)
 importFrom(dplyr,arrange_)

diff --git a/NEWS.md b/NEWS.md
@@ -2,7 +2,21 @@
 
 * The new `inner_split()` function and its methods for various resamples is for usage in tune to create a inner resample of the analysis set to fit the preprocessor and model on one part and the post-processor on the other part (#483, #488, #489).
 
-* Started moving error messages to cli (#499, #502).
+* Started moving error messages to cli (#499, #502). With contributions from @PriKalra (#523, #526, #528, #530, #531, #532), @Dpananos (#516), and @JamesHWade (#518).
+
+* Fixed example for `nested_cv()` (@seb09, #520).
+
+* `rolling_origin()` is now superseded by `sliding_window()`, `sliding_index()`, and `sliding_period()` which provide more flexibility and control (@nmercadeb, #524).
+
+* Removed trailing space in printing of `mc_cv()` objects (@ccani007, #464).
+
+* Improved documentation for `initial_split()` and friends (@laurabrianna, #519).
+
+* Formatting improvement: package names are now not in backticks anymore (@agmurray, #525).
+
+* Improved documentation and formatting: function names are now more easily identifiable through either `()` at the end or being links to the function documentation (@brshallo , #521).
+
+* `vfold_cv()` and `clustering_cv()` now error on implicit leave-one-out cross-validation (@seb09, #527).
 
 ## Bug fixes
 
@@ -144,7 +158,7 @@
 
 * The `reg_intervals()` function is a convenience function for `lm()`, `glm()`, `survreg()`, and `coxph()` models (#206). 
 
-* A few internal functions were exported so that `rsample`-adjacent packages can use the same underlying code. 
+* A few internal functions were exported so that rsample-adjacent packages can use the same underlying code. 
 
 * The `obj_sum()` method for `rsplit` objects was updated (#215).
 
@@ -165,11 +179,11 @@
 
 * The `print()` methods for `rsplit` and `val_split` objects were adjusted to show `"<Analysis/Assess/Total>"` and `<Training/Validation/Total>`, respectively. 
 
-* The `drinks`, `attrition`, and `two_class_dat` data sets were removed. They are in the `modeldata` package. 
+* The `drinks`, `attrition`, and `two_class_dat` data sets were removed. They are in the modeldata package. 
 
-* Compatability with `dplyr` 1.0.0.
+* Compatability with dplyr 1.0.0.
 
-# `rsample` 0.0.6
+# rsample 0.0.6
 
 * Added `validation_set()` for making a single resample.
 
@@ -181,24 +195,24 @@
 
 * `initial_time_split()` and `rolling_origin()` now have a `lag` parameter that ensures that previous data are available so that lagged variables can be calculated. (#135, #136)
 
-# `rsample` 0.0.5
+# rsample 0.0.5
 
 * Added three functions to compute different bootstrap confidence intervals. 
 * A new function (`add_resample_id()`) augments a data frame with columns for the resampling identifier. 
 * Updated `initial_split()`, `mc_cv()`, `vfold_cv()`, `bootstraps()`, and `group_vfold_cv()` to use tidyselect on the stratification variable.
 * Updated `initial_split()`, `mc_cv()`, `vfold_cv()`, `bootstraps()` with new `breaks` parameter that specifies the number of bins to stratify by for a numeric stratification variable.
 
 
-# `rsample` 0.0.4
+# rsample 0.0.4
 
 Small maintenance release. 
 
 ## Minor improvements and fixes
 
  * `fill()` was removed per the deprecation warning. 
- * Small changes were made for the new version of `tibble`. 
+ * Small changes were made for the new version of tibble. 
 
-# `rsample` 0.0.3
+# rsample 0.0.3
 
 ## New features
 
@@ -210,25 +224,25 @@ Small maintenance release.
 
 * Changed the R version requirement to be R >= 3.1 instead of 3.3.3. 
 
-* The `recipes`-related `prepper` function was [moved to the `recipes` package](https://github.com/tidymodels/rsample/issues/48). This makes the `rsample` install footprint much smaller.
+* The recipes-related `prepper()` function was [moved to the recipes package](https://github.com/tidymodels/rsample/issues/48). This makes the rsample install footprint much smaller.
 
 * `rsplit` objects are shown differently inside of a tibble.
 
-* Moved from the `broom` package to the `generics` package.
+* Moved from the broom package to the generics package.
 
 
-# `rsample` 0.0.2
+# rsample 0.0.2
 
 * `initial_split`, `training`, and `testing` were added to do training/testing splits prior to resampling. 
 * Another resampling method, `group_vfold_cv`, was added. 
 * `caret2rsample` and `rsample2caret` can convert `rset` objects to those used by `caret::trainControl` and vice-versa. 
 * A function called `form_pred` can be used to determine the original names of the predictors in a formula or `terms` object. 
-* A vignette and a function (`prepper`) were included to facilitate using the `recipes` with `rsample`.
+* A vignette and a function (`prepper`) were included to facilitate using the recipes with rsample.
 * A `gather` method was added for `rset` objects.
 * A `labels` method was added for `rsplit` objects. This can help identify which resample is being used even when the whole `rset` object is not available. 
-* A variety of `dplyr` methods were added (e.g. `filter`, `mutate`, etc) that work without dropping classes or attributes of the `rsample` objects. 
+* A variety of dplyr methods were added (e.g. `filter()`, `mutate()`, etc) that work without dropping classes or attributes of the rsample objects. 
 
-# `rsample` 0.0.1 (2017-07-08)
+# rsample 0.0.1 (2017-07-08)
 
 Initial public version on CRAN
 
diff --git a/R/boot.R b/R/boot.R
@@ -17,7 +17,7 @@
 #' @param times The number of bootstrap samples.
 #' @param apparent A logical. Should an extra resample be added where the
 #'  analysis and holdout subset are the entire data set. This is required for
-#'  some estimators used by the `summary` function that require the apparent
+#'  some estimators used by the [summary()] function that require the apparent
 #'  error rate.
 #' @export
 #' @return A tibble with classes `bootstraps`, `rset`, `tbl_df`, `tbl`, and

diff --git a/R/bootci.R b/R/bootci.R
@@ -6,39 +6,37 @@
 
 check_rset <- function(x, app = TRUE) {
   if (!inherits(x, "bootstraps")) {
-    rlang::abort("`.data` should be an `rset` object generated from `bootstraps()`")
+    cli_abort("{.arg .data} should be an {.cls rset} object generated from {.fn bootstraps}.")
   }
 
   if (app) {
     if (x %>% dplyr::filter(id == "Apparent") %>% nrow() != 1) {
-      rlang::abort("Please set `apparent = TRUE` in `bootstraps()` function")
+      cli_abort("Please set {.code apparent = TRUE} in {.fn bootstraps} function.")
     }
   }
   invisible(NULL)
 }
 
 
-stat_fmt_err <- paste("`statistics` should select a list column of tidy results.")
+stat_fmt_err <- "{.arg statistics} should select a list column of tidy results."
 stat_nm_err <- paste(
-  "The tibble in `statistics` should have columns for",
-  "'estimate' and 'term`"
+  "The tibble in {.arg statistics} should have columns for",
+  "'estimate' and 'term'."
 )
 std_exp <- c("std.error", "robust.se")
 
 check_tidy_names <- function(x, std_col) {
   # check for proper columns
   if (sum(colnames(x) == "estimate") != 1) {
-    rlang::abort(stat_nm_err)
+    cli_abort(stat_nm_err)
   }
   if (sum(colnames(x) == "term") != 1) {
-    rlang::abort(stat_nm_err)
+    cli_abort(stat_nm_err)
   }
   if (std_col) {
     std_candidates <- colnames(x) %in% std_exp
     if (sum(std_candidates) != 1) {
-      rlang::abort(
-        "`statistics` should select a single column for the standard error."
-      )
+      cli_abort("{.arg statistics} should select a single column for the standard error.")
     }
   }
   invisible(TRUE)
@@ -59,7 +57,7 @@ check_tidy <- function(x, std_col = FALSE) {
   }
 
   if (inherits(x, "try-error")) {
-    rlang::abort(stat_fmt_err)
+    cli_abort(stat_fmt_err)
   }
 
   check_tidy_names(x, std_col)
@@ -117,7 +115,7 @@ new_stats <- function(x, lo, hi) {
 has_dots <- function(x) {
   nms <- names(formals(x))
   if (!any(nms == "...")) {
-    rlang::abort("`.fn` must have an argument `...`.")
+    cli_abort("{.arg .fn} must have an argument {.arg ...}.")
   }
   invisible(NULL)
 }
@@ -130,15 +128,8 @@ check_num_resamples <- function(x, B = 1000) {
     dplyr::filter(n < B)
 
   if (nrow(x) > 0) {
-    terms <- paste0("`", x$term, "`")
-    msg <-
-      paste0(
-        "Recommend at least ", B, " non-missing bootstrap resamples for ",
-        ifelse(length(terms) > 1, "terms: ", "term "),
-        paste0(terms, collapse = ", "),
-        "."
-      )
-    rlang::warn(msg)
+    terms <- x$term
+    cli_warn("Recommend at least {B} non-missing bootstrap resamples for {cli::qty(terms)} term{?s} {.code {terms}}.")
   }
   invisible(NULL)
 }
@@ -149,11 +140,11 @@ check_num_resamples <- function(x, B = 1000) {
 
 pctl_single <- function(stats, alpha = 0.05) {
   if (all(is.na(stats))) {
-    rlang::abort("All statistics have missing values..")
+    cli_abort("All statistics have missing values.")
   }
 
   if (!is.numeric(stats)) {
-    rlang::abort("`stats` must be a numeric vector.")
+    cli_abort("{.arg stats} must be a numeric vector.")
   }
 
   # stats is a numeric vector of values
@@ -258,7 +249,7 @@ int_pctl.bootstraps <- function(.data, statistics, alpha = 0.05, ...) {
   check_dots_empty()
   check_rset(.data, app = FALSE)
   if (length(alpha) != 1 || !is.numeric(alpha)) {
-    abort("`alpha` must be a single numeric value.")
+    cli_abort("{.arg alpha} must be a single numeric value.")
   }
 
   .data <- .data %>% dplyr::filter(id != "Apparent")
@@ -289,19 +280,20 @@ t_single <- function(stats, std_err, is_orig, alpha = 0.05) {
   # which_orig is the index of stats and std_err that has the original result
 
   if (all(is.na(stats))) {
-    rlang::abort("All statistics have missing values.")
+    cli_abort("All statistics have missing values.")
   }
 
   if (!is.logical(is_orig) || any(is.na(is_orig))) {
-    rlang::abort(
-      "`is_orig` should be a logical column the same length as `stats` with no missing values."
+    cli_abort(
+      "{.arg is_orig} should be a logical column the same length as {.arg stats} with no missing values."
     )
   }
   if (length(stats) != length(std_err) && length(stats) != length(is_orig)) {
-    rlang::abort("`stats`, `std_err`, and `is_orig` should have the same length.")
+    function_args <- c('stats', 'std_err', 'is_orig')
+    cli_abort("{.arg {function_args}} should have the same length.")
   }
   if (sum(is_orig) != 1) {
-    rlang::abort("The original statistic must be in a single row.")
+    cli_abort("The original statistic must be in a single row.")
   }
 
   theta_obs <- stats[is_orig]
@@ -339,12 +331,12 @@ int_t.bootstraps <- function(.data, statistics, alpha = 0.05, ...) {
   check_dots_empty()
   check_rset(.data)
   if (length(alpha) != 1 || !is.numeric(alpha)) {
-    abort("`alpha` must be a single numeric value.")
+    cli_abort("{.arg alpha} must be a single numeric value.")
   }
 
   column_name <- tidyselect::vars_select(names(.data), !!enquo(statistics))
   if (length(column_name) != 1) {
-    rlang::abort(stat_fmt_err)
+    cli_abort(stat_fmt_err)
   }
   stats <- .data %>% dplyr::select(!!column_name, id)
   stats <- check_tidy(stats, std_col = TRUE)
@@ -366,7 +358,7 @@ bca_calc <- function(stats, orig_data, alpha = 0.05, .fn, ...) {
 
   # TODO check per term
   if (all(is.na(stats$estimate))) {
-    rlang::abort("All statistics have missing values.")
+    cli_abort("All statistics have missing values.")
   }
 
   ### Estimating Z0 bias-correction
@@ -381,7 +373,7 @@ bca_calc <- function(stats, orig_data, alpha = 0.05, .fn, ...) {
   if (inherits(loo_test, "try-error")) {
     cat("Running `.fn` on the LOO resamples produced an error:\n")
     print(loo_test)
-    rlang::abort("`.fn` failed.")
+    cli_abort("{.arg .fn} failed.")
   }
 
   loo_res <- furrr::future_map(loo_rs$splits, .fn, ...) %>% list_rbind()
@@ -440,14 +432,14 @@ int_bca <- function(.data, ...) {
 int_bca.bootstraps <- function(.data, statistics, alpha = 0.05, .fn, ...) {
   check_rset(.data)
   if (length(alpha) != 1 || !is.numeric(alpha)) {
-    abort("`alpha` must be a single numeric value.")
+    cli_abort("{.arg alpha} must be a single numeric value.")
   }
 
   has_dots(.fn)
 
   column_name <- tidyselect::vars_select(names(.data), !!enquo(statistics))
   if (length(column_name) != 1) {
-    rlang::abort(stat_fmt_err)
+    cli_abort(stat_fmt_err)
   }
   stats <- .data %>% dplyr::select(!!column_name, id)
   stats <- check_tidy(stats)

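The `R/bootci.R` hunks above replace hand-built `paste0()` messages with cli's inline markup and pluralization. The following is a minimal sketch of that pattern, assuming only that the cli package is installed. `warn_few_resamples()` and `check_lengths_demo()` are hypothetical stand-ins written for illustration and are not part of rsample.

```r
library(cli)

# Hypothetical helper mirroring the message used in check_num_resamples().
# cli::qty() sets the quantity for the following {?s} from the length of
# `terms`, and {.code {terms}} collapses the vector into a readable list.
warn_few_resamples <- function(terms, B = 1000) {
  cli_warn(
    "Recommend at least {B} non-missing bootstrap resamples for {cli::qty(terms)} term{?s} {.code {terms}}."
  )
}

# Hypothetical helper mirroring the length check in t_single():
# a character vector inside {.arg ...} is styled and collapsed by cli.
check_lengths_demo <- function() {
  function_args <- c("stats", "std_err", "is_orig")
  cli_abort("{.arg {function_args}} should have the same length.")
}

# Capture the conditions instead of signalling them, so the messages
# can be inspected.
w_one <- tryCatch(warn_few_resamples("mpg"), warning = function(w) w)
w_two <- tryCatch(warn_few_resamples(c("mpg", "wt")), warning = function(w) w)
e_len <- tryCatch(check_lengths_demo(), error = function(e) e)

cat(conditionMessage(w_one), "\n")
cat(conditionMessage(w_two), "\n")
cat(conditionMessage(e_len), "\n")
```

With one term the message should use the singular "term", and with two it should switch to "terms" and list both names, which is what the removed `ifelse()`/`paste0()` code did by hand.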
diff --git a/R/caret.R b/R/caret.R
@@ -4,10 +4,10 @@
 #'  \pkg{rsample} and \pkg{caret}.
 #'
 #' @param object An `rset` object. Currently,
-#'  `nested_cv` is not supported.
-#' @return `rsample2caret` returns a list that mimics the
+#'  [nested_cv()] is not supported.
+#' @return `rsample2caret()` returns a list that mimics the
 #'  `index` and `indexOut` elements of a
-#'  `trainControl` object. `caret2rsample` returns an
+#'  `trainControl` object. `caret2rsample()` returns an
 #'  `rset` object of the appropriate class.
 #' @export
 rsample2caret <- function(object, data = c("analysis", "assessment")) {
@@ -23,7 +23,7 @@ rsample2caret <- function(object, data = c("analysis", "assessment")) {
 }
 
 #' @rdname rsample2caret
-#' @param ctrl An object produced by `trainControl` that has
+#' @param ctrl An object produced by `caret::trainControl()` that has
 #'  had the `index` and `indexOut` elements populated by
 #'  integers. One method of getting this is to extract the
 #'  `control` objects from an object produced by `train`.

diff --git a/R/form_pred.R b/R/form_pred.R
@@ -1,6 +1,6 @@
 #' Extract Predictor Names from Formula or Terms
 #'
-#' `all.vars` returns all variables used in a formula. This
+#' While [all.vars()] returns all variables used in a formula, this
 #'  function only returns the variables explicitly used on the
 #'  right-hand side (i.e., it will not resolve dots unless the
 #'  object is terms with a data set specified).

diff --git a/R/initial_split.R b/R/initial_split.R
@@ -1,18 +1,20 @@
 #' Simple Training/Test Set Splitting
 #'
-#' `initial_split` creates a single binary split of the data into a training
-#'  set and testing set. `initial_time_split` does the same, but takes the
+#' `initial_split()` creates a single binary split of the data into a training
+#'  set and testing set. `initial_time_split()` does the same, but takes the
 #'  _first_ `prop` samples for training, instead of a random selection.
-#'  `group_initial_split` creates splits of the data based
+#'  `group_initial_split()` creates splits of the data based
 #'  on some grouping variable, so that all data in a "group" is assigned to
-#'  the same split.
-#'  `training` and `testing` are used to extract the resulting data.
+#'  the same split. 
+#' 
+#' @details `training()` and `testing()` are used to extract the resulting data.
+#'
 #' @template strata_details
 #' @inheritParams vfold_cv
 #' @inheritParams make_strata
 #' @param prop The proportion of data to be retained for modeling/analysis.
 #' @export
-#' @return An `rsplit` object that can be used with the `training` and `testing`
+#' @return An `rsplit` object that can be used with the `training()` and `testing()`
 #'  functions to extract the data in each split.
 #' @examplesIf rlang::is_installed("modeldata")
 #' set.seed(1353)
@@ -176,12 +178,12 @@ group_initial_split <- function(data, group, prop = 3 / 4, ..., strata = NULL, p
   attrib <- .get_split_args(res, allow_strata_false = TRUE)
 
   res <- res$splits[[1]]
-  
+
   attrib$times <- NULL
   for (i in names(attrib)) {
     attr(res, i) <- attrib[[i]]
   }
   class(res) <- c("group_initial_split", "initial_split", class(res))
-  
+
   res
 }