Commit

Merged origin/main into Dpananos-cli_caret
hfrick committed Sep 11, 2024
2 parents 01c9341 + 5c84b97 commit ec755a6
Showing 60 changed files with 388 additions and 303 deletions.
1 change: 1 addition & 0 deletions NAMESPACE
@@ -416,6 +416,7 @@ export(validation_time_split)
export(vfold_cv)
import(vctrs)
importFrom(cli,cli_abort)
importFrom(cli,cli_warn)
importFrom(dplyr,"%>%")
importFrom(dplyr,arrange)
importFrom(dplyr,arrange_)
44 changes: 29 additions & 15 deletions NEWS.md
@@ -2,7 +2,21 @@

* The new `inner_split()` function and its methods for various resamples is for usage in tune to create a inner resample of the analysis set to fit the preprocessor and model on one part and the post-processor on the other part (#483, #488, #489).

* Started moving error messages to cli (#499, #502).
* Started moving error messages to cli (#499, #502). With contributions from @PriKalra (#523, #526, #528, #530, #531, #532), @Dpananos (#516), and @JamesHWade (#518).

* Fixed example for `nested_cv()` (@seb09, #520).

* `rolling_origin()` is now superseded by `sliding_window()`, `sliding_index()`, and `sliding_period()` which provide more flexibility and control (@nmercadeb, #524).

* Removed trailing space in printing of `mc_cv()` objects (@ccani007, #464).

* Improved documentation for `initial_split()` and friends (@laurabrianna, #519).

* Formatting improvement: package names are now not in backticks anymore (@agmurray, #525).

* Improved documentation and formatting: function names are now more easily identifiable through either `()` at the end or being links to the function documentation (@brshallo , #521).

* `vfold_cv()` and `clustering_cv()` now error on implicit leave-one-out cross-validation (@seb09, #527).

## Bug fixes

Expand Down Expand Up @@ -144,7 +158,7 @@

* The `reg_intervals()` function is a convenience function for `lm()`, `glm()`, `survreg()`, and `coxph()` models (#206).

* A few internal functions were exported so that `rsample`-adjacent packages can use the same underlying code.
* A few internal functions were exported so that rsample-adjacent packages can use the same underlying code.

* The `obj_sum()` method for `rsplit` objects was updated (#215).

@@ -165,11 +179,11 @@

* The `print()` methods for `rsplit` and `val_split` objects were adjusted to show `"<Analysis/Assess/Total>"` and `<Training/Validation/Total>`, respectively.

* The `drinks`, `attrition`, and `two_class_dat` data sets were removed. They are in the `modeldata` package.
* The `drinks`, `attrition`, and `two_class_dat` data sets were removed. They are in the modeldata package.

* Compatability with `dplyr` 1.0.0.
* Compatability with dplyr 1.0.0.

# `rsample` 0.0.6
# rsample 0.0.6

* Added `validation_set()` for making a single resample.

@@ -181,24 +195,24 @@

* `initial_time_split()` and `rolling_origin()` now have a `lag` parameter that ensures that previous data are available so that lagged variables can be calculated. (#135, #136)

# `rsample` 0.0.5
# rsample 0.0.5

* Added three functions to compute different bootstrap confidence intervals.
* A new function (`add_resample_id()`) augments a data frame with columns for the resampling identifier.
* Updated `initial_split()`, `mc_cv()`, `vfold_cv()`, `bootstraps()`, and `group_vfold_cv()` to use tidyselect on the stratification variable.
* Updated `initial_split()`, `mc_cv()`, `vfold_cv()`, `bootstraps()` with new `breaks` parameter that specifies the number of bins to stratify by for a numeric stratification variable.


# `rsample` 0.0.4
# rsample 0.0.4

Small maintenance release.

## Minor improvements and fixes

* `fill()` was removed per the deprecation warning.
* Small changes were made for the new version of `tibble`.
* Small changes were made for the new version of tibble.

# `rsample` 0.0.3
# rsample 0.0.3

## New features

@@ -210,25 +224,25 @@ Small maintenance release.

* Changed the R version requirement to be R >= 3.1 instead of 3.3.3.

* The `recipes`-related `prepper` function was [moved to the `recipes` package](https://github.com/tidymodels/rsample/issues/48). This makes the `rsample` install footprint much smaller.
* The recipes-related `prepper()` function was [moved to the recipes package](https://github.com/tidymodels/rsample/issues/48). This makes the rsample install footprint much smaller.

* `rsplit` objects are shown differently inside of a tibble.

* Moved from the `broom` package to the `generics` package.
* Moved from the broom package to the generics package.


# `rsample` 0.0.2
# rsample 0.0.2

* `initial_split`, `training`, and `testing` were added to do training/testing splits prior to resampling.
* Another resampling method, `group_vfold_cv`, was added.
* `caret2rsample` and `rsample2caret` can convert `rset` objects to those used by `caret::trainControl` and vice-versa.
* A function called `form_pred` can be used to determine the original names of the predictors in a formula or `terms` object.
* A vignette and a function (`prepper`) were included to facilitate using the `recipes` with `rsample`.
* A vignette and a function (`prepper`) were included to facilitate using the recipes with rsample.
* A `gather` method was added for `rset` objects.
* A `labels` method was added for `rsplit` objects. This can help identify which resample is being used even when the whole `rset` object is not available.
* A variety of `dplyr` methods were added (e.g. `filter`, `mutate`, etc) that work without dropping classes or attributes of the `rsample` objects.
* A variety of dplyr methods were added (e.g. `filter()`, `mutate()`, etc) that work without dropping classes or attributes of the rsample objects.

# `rsample` 0.0.1 (2017-07-08)
# rsample 0.0.1 (2017-07-08)

Initial public version on CRAN
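
The NEWS entry above notes that `rolling_origin()` is superseded by the `sliding_*()` functions. A minimal sketch of how the two interfaces might line up, assuming rsample is installed; the toy data frame `df` and the parameter mapping (`initial = 5` corresponding to `lookback = 4`) are illustrative assumptions, not taken from the commit:

```r
library(rsample)

# Toy data, made up for illustration.
df <- data.frame(x = 1:10)

# Superseded interface: 5 rows for analysis, 1 for assessment, non-cumulative.
old <- rolling_origin(df, initial = 5, assess = 1, cumulative = FALSE)

# Newer interface: `lookback` counts rows *before* the current one,
# so lookback = 4 gives an analysis window of 5 rows.
new <- sliding_window(df, lookback = 4, assess_stop = 1)

# Compare the first resample from each approach.
analysis(old$splits[[1]])$x
analysis(new$splits[[1]])$x
assessment(new$splits[[1]])$x
```

`sliding_index()` and `sliding_period()` follow the same idea but define the window relative to an index column rather than row position.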

2 changes: 1 addition & 1 deletion R/boot.R
@@ -17,7 +17,7 @@
#' @param times The number of bootstrap samples.
#' @param apparent A logical. Should an extra resample be added where the
#' analysis and holdout subset are the entire data set. This is required for
#' some estimators used by the `summary` function that require the apparent
#' some estimators used by the [summary()] function that require the apparent
#' error rate.
#' @export
#' @return A tibble with classes `bootstraps`, `rset`, `tbl_df`, `tbl`, and
62 changes: 27 additions & 35 deletions R/bootci.R
@@ -6,39 +6,37 @@

check_rset <- function(x, app = TRUE) {
if (!inherits(x, "bootstraps")) {
rlang::abort("`.data` should be an `rset` object generated from `bootstraps()`")
cli_abort("{.arg .data} should be an {.cls rset} object generated from {.fn bootstraps}.")
}

if (app) {
if (x %>% dplyr::filter(id == "Apparent") %>% nrow() != 1) {
rlang::abort("Please set `apparent = TRUE` in `bootstraps()` function")
cli_abort("Please set {.code apparent = TRUE} in {.fn bootstraps} function.")
}
}
invisible(NULL)
}


stat_fmt_err <- paste("`statistics` should select a list column of tidy results.")
stat_fmt_err <- "{.arg statistics} should select a list column of tidy results."
stat_nm_err <- paste(
"The tibble in `statistics` should have columns for",
"'estimate' and 'term`"
"The tibble in {.arg statistics} should have columns for",
"'estimate' and 'term'."
)
std_exp <- c("std.error", "robust.se")

check_tidy_names <- function(x, std_col) {
# check for proper columns
if (sum(colnames(x) == "estimate") != 1) {
rlang::abort(stat_nm_err)
cli_abort(stat_nm_err)
}
if (sum(colnames(x) == "term") != 1) {
rlang::abort(stat_nm_err)
cli_abort(stat_nm_err)
}
if (std_col) {
std_candidates <- colnames(x) %in% std_exp
if (sum(std_candidates) != 1) {
rlang::abort(
"`statistics` should select a single column for the standard error."
)
cli_abort("{.arg statistics} should select a single column for the standard error.")
}
}
invisible(TRUE)
@@ -59,7 +57,7 @@ check_tidy <- function(x, std_col = FALSE) {
}

if (inherits(x, "try-error")) {
rlang::abort(stat_fmt_err)
cli_abort(stat_fmt_err)
}

check_tidy_names(x, std_col)
@@ -117,7 +115,7 @@ new_stats <- function(x, lo, hi) {
has_dots <- function(x) {
nms <- names(formals(x))
if (!any(nms == "...")) {
rlang::abort("`.fn` must have an argument `...`.")
cli_abort("{.arg .fn} must have an argument {.arg ...}.")
}
invisible(NULL)
}
@@ -130,15 +128,8 @@ check_num_resamples <- function(x, B = 1000) {
dplyr::filter(n < B)

if (nrow(x) > 0) {
terms <- paste0("`", x$term, "`")
msg <-
paste0(
"Recommend at least ", B, " non-missing bootstrap resamples for ",
ifelse(length(terms) > 1, "terms: ", "term "),
paste0(terms, collapse = ", "),
"."
)
rlang::warn(msg)
terms <- x$term
cli_warn("Recommend at least {B} non-missing bootstrap resamples for {cli::qty(terms)} term{?s} {.code {terms}}.")
}
invisible(NULL)
}
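
The hunk above replaces hand-built singular/plural text with cli's pluralization markup. A minimal sketch of that mechanism, assuming the cli package is installed; `warn_few_resamples()` is a hypothetical helper written for illustration and is not part of rsample:

```r
# Hypothetical helper (not part of rsample) isolating the pluralized warning.
warn_few_resamples <- function(terms, B = 1000) {
  # qty() sets the quantity for {?s} from the length of `terms` and prints nothing;
  # {.code {terms}} collapses the vector into a single styled list.
  cli::cli_warn(
    "Recommend at least {B} non-missing bootstrap resamples for {cli::qty(terms)}term{?s} {.code {terms}}."
  )
}

# Capture the warning text instead of emitting it.
one <- tryCatch(
  warn_few_resamples("mean"),
  warning = function(w) conditionMessage(w)
)
many <- tryCatch(
  warn_few_resamples(c("mean", "sd")),
  warning = function(w) conditionMessage(w)
)

one
many
```

The `ifelse(length(terms) > 1, ...)` branch and the manual `paste0()` collapsing from the old code are no longer needed because cli derives both from the vector itself.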
@@ -149,11 +140,11 @@ check_num_resamples <- function(x, B = 1000) {

pctl_single <- function(stats, alpha = 0.05) {
if (all(is.na(stats))) {
rlang::abort("All statistics have missing values..")
cli_abort("All statistics have missing values.")
}

if (!is.numeric(stats)) {
rlang::abort("`stats` must be a numeric vector.")
cli_abort("{.arg stats} must be a numeric vector.")
}

# stats is a numeric vector of values
@@ -258,7 +249,7 @@ int_pctl.bootstraps <- function(.data, statistics, alpha = 0.05, ...) {
check_dots_empty()
check_rset(.data, app = FALSE)
if (length(alpha) != 1 || !is.numeric(alpha)) {
abort("`alpha` must be a single numeric value.")
cli_abort("{.arg alpha} must be a single numeric value.")
}

.data <- .data %>% dplyr::filter(id != "Apparent")
@@ -289,19 +280,20 @@ t_single <- function(stats, std_err, is_orig, alpha = 0.05) {
# which_orig is the index of stats and std_err that has the original result

if (all(is.na(stats))) {
rlang::abort("All statistics have missing values.")
cli_abort("All statistics have missing values.")
}

if (!is.logical(is_orig) || any(is.na(is_orig))) {
rlang::abort(
"`is_orig` should be a logical column the same length as `stats` with no missing values."
cli_abort(
"{.arg is_orig} should be a logical column the same length as {.arg stats} with no missing values."
)
}
if (length(stats) != length(std_err) && length(stats) != length(is_orig)) {
rlang::abort("`stats`, `std_err`, and `is_orig` should have the same length.")
function_args <- c('stats', 'std_err', 'is_orig')
cli_abort("{.arg {function_args}} should have the same length.")
}
if (sum(is_orig) != 1) {
rlang::abort("The original statistic must be in a single row.")
cli_abort("The original statistic must be in a single row.")
}

theta_obs <- stats[is_orig]
@@ -339,12 +331,12 @@ int_t.bootstraps <- function(.data, statistics, alpha = 0.05, ...) {
check_dots_empty()
check_rset(.data)
if (length(alpha) != 1 || !is.numeric(alpha)) {
abort("`alpha` must be a single numeric value.")
cli_abort("{.arg alpha} must be a single numeric value.")
}

column_name <- tidyselect::vars_select(names(.data), !!enquo(statistics))
if (length(column_name) != 1) {
rlang::abort(stat_fmt_err)
cli_abort(stat_fmt_err)
}
stats <- .data %>% dplyr::select(!!column_name, id)
stats <- check_tidy(stats, std_col = TRUE)
@@ -366,7 +358,7 @@ bca_calc <- function(stats, orig_data, alpha = 0.05, .fn, ...) {

# TODO check per term
if (all(is.na(stats$estimate))) {
rlang::abort("All statistics have missing values.")
cli_abort("All statistics have missing values.")
}

### Estimating Z0 bias-correction
@@ -381,7 +373,7 @@ bca_calc <- function(stats, orig_data, alpha = 0.05, .fn, ...) {
if (inherits(loo_test, "try-error")) {
cat("Running `.fn` on the LOO resamples produced an error:\n")
print(loo_test)
rlang::abort("`.fn` failed.")
cli_abort("{.arg .fn} failed.")
}

loo_res <- furrr::future_map(loo_rs$splits, .fn, ...) %>% list_rbind()
@@ -440,14 +432,14 @@ int_bca <- function(.data, ...) {
int_bca.bootstraps <- function(.data, statistics, alpha = 0.05, .fn, ...) {
check_rset(.data)
if (length(alpha) != 1 || !is.numeric(alpha)) {
abort("`alpha` must be a single numeric value.")
cli_abort("{.arg alpha} must be a single numeric value.")
}

has_dots(.fn)

column_name <- tidyselect::vars_select(names(.data), !!enquo(statistics))
if (length(column_name) != 1) {
rlang::abort(stat_fmt_err)
cli_abort(stat_fmt_err)
}
stats <- .data %>% dplyr::select(!!column_name, id)
stats <- check_tidy(stats)
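
The pattern applied throughout R/bootci.R in this commit swaps `rlang::abort()` with backtick-quoted names for `cli_abort()` with inline markup such as `{.arg}`, `{.cls}`, and `{.fn}`. A minimal sketch of the idea, assuming the cli package is installed; `check_alpha()` is a hypothetical helper that mirrors the `alpha` check in the diff and is not a function in rsample:

```r
# Hypothetical helper (not part of rsample) mirroring the alpha check in the diff.
check_alpha <- function(alpha) {
  if (length(alpha) != 1 || !is.numeric(alpha)) {
    # {.arg ...} marks an argument name; cli handles the quoting and styling.
    cli::cli_abort("{.arg alpha} must be a single numeric value.")
  }
  invisible(alpha)
}

# Capture the condition instead of stopping the session.
err <- tryCatch(check_alpha(c(0.05, 0.10)), error = function(e) e)

inherits(err, "error")
conditionMessage(err)
```

Because the markup is interpreted when `cli_abort()` is called, a message can also be stored in a plain string first (as `stat_fmt_err` is in the diff) and passed in later.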
8 changes: 4 additions & 4 deletions R/caret.R
@@ -4,10 +4,10 @@
#' \pkg{rsample} and \pkg{caret}.
#'
#' @param object An `rset` object. Currently,
#' `nested_cv` is not supported.
#' @return `rsample2caret` returns a list that mimics the
#' [nested_cv()] is not supported.
#' @return `rsample2caret()` returns a list that mimics the
#' `index` and `indexOut` elements of a
#' `trainControl` object. `caret2rsample` returns an
#' `trainControl` object. `caret2rsample()` returns an
#' `rset` object of the appropriate class.
#' @export
rsample2caret <- function(object, data = c("analysis", "assessment")) {
@@ -23,7 +23,7 @@ rsample2caret <- function(object, data = c("analysis", "assessment")) {
}

#' @rdname rsample2caret
#' @param ctrl An object produced by `trainControl` that has
#' @param ctrl An object produced by `caret::trainControl()` that has
#' had the `index` and `indexOut` elements populated by
#' integers. One method of getting this is to extract the
#' `control` objects from an object produced by `train`.
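
The documentation above describes `rsample2caret()` as returning a list that mimics the `index` and `indexOut` elements of a `trainControl` object. A minimal sketch of that conversion, assuming rsample is installed; the choice of `mtcars` and five folds is illustrative only:

```r
library(rsample)

set.seed(1)
folds <- vfold_cv(mtcars, v = 5)

# Convert the rset to the list structure described in the documentation.
ctrl_idx <- rsample2caret(folds)

names(ctrl_idx)           # element names of the returned list
lengths(ctrl_idx$index)   # analysis-set sizes, one per fold
```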
2 changes: 1 addition & 1 deletion R/form_pred.R
@@ -1,6 +1,6 @@
#' Extract Predictor Names from Formula or Terms
#'
#' `all.vars` returns all variables used in a formula. This
#' While [all.vars()] returns all variables used in a formula, this
#' function only returns the variables explicitly used on the
#' right-hand side (i.e., it will not resolve dots unless the
#' object is terms with a data set specified).
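
The contrast drawn in the revised description, `all.vars()` versus this function, can be seen on a small formula. A minimal sketch, assuming rsample is installed; the formula is made up for illustration:

```r
library(rsample)

f <- mpg ~ cyl + log(disp)

all.vars(f)    # every variable in the formula, outcome included
form_pred(f)   # only the variables used on the right-hand side
```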
18 changes: 10 additions & 8 deletions R/initial_split.R
@@ -1,18 +1,20 @@
#' Simple Training/Test Set Splitting
#'
#' `initial_split` creates a single binary split of the data into a training
#' set and testing set. `initial_time_split` does the same, but takes the
#' `initial_split()` creates a single binary split of the data into a training
#' set and testing set. `initial_time_split()` does the same, but takes the
#' _first_ `prop` samples for training, instead of a random selection.
#' `group_initial_split` creates splits of the data based
#' `group_initial_split()` creates splits of the data based
#' on some grouping variable, so that all data in a "group" is assigned to
#' the same split.
#' `training` and `testing` are used to extract the resulting data.
#' the same split.
#'
#' @details `training()` and `testing()` are used to extract the resulting data.
#'
#' @template strata_details
#' @inheritParams vfold_cv
#' @inheritParams make_strata
#' @param prop The proportion of data to be retained for modeling/analysis.
#' @export
#' @return An `rsplit` object that can be used with the `training` and `testing`
#' @return An `rsplit` object that can be used with the `training()` and `testing()`
#' functions to extract the data in each split.
#' @examplesIf rlang::is_installed("modeldata")
#' set.seed(1353)
@@ -176,12 +178,12 @@ group_initial_split <- function(data, group, prop = 3 / 4, ..., strata = NULL, p
attrib <- .get_split_args(res, allow_strata_false = TRUE)

res <- res$splits[[1]]

attrib$times <- NULL
for (i in names(attrib)) {
attr(res, i) <- attrib[[i]]
}
class(res) <- c("group_initial_split", "initial_split", class(res))

res
}
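
The reworded documentation in this file describes `initial_split()`, `initial_time_split()`, and the `training()`/`testing()` extractors. A minimal sketch of how they fit together, assuming rsample is installed; `mtcars` and `prop = 3 / 4` are illustrative choices:

```r
library(rsample)

set.seed(1353)

# Random split: rows are sampled into the training set.
split_random <- initial_split(mtcars, prop = 3 / 4)
train_random <- training(split_random)
test_random <- testing(split_random)

# Time-ordered split: the first `prop` rows go to training, in order.
split_time <- initial_time_split(mtcars, prop = 3 / 4)
train_time <- training(split_time)

# Every row lands in exactly one of the two sets.
nrow(train_random) + nrow(test_random) == nrow(mtcars)

# The time split keeps the original leading rows.
head(rownames(train_time))
```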