reduce the object size by trimming the split elements #930

Open
topepo opened this issue Aug 15, 2024 · 2 comments
Labels
feature a feature request or enhancement

Comments

@topepo
Member

topepo commented Aug 15, 2024

I had a conversation at conf with someone who mentioned an issue I’ve had.

When you have a large data set or a workflow set with many different workflows, the resulting object can be very large in memory and on disk. Even though the tune_results object only keeps the original data once, that might be excessive (especially in a workflow map).

I want to test out an option to our control functions called trim_split (or similar) that can replace the data slot in the split objects with a zero-row slice and additionally make the integer indices integer(0). That should significantly reduce the size (barring a lot of out-of-sample predictions that might be saved). The split column stays a split column, and no classes are dropped from it (or the tune_results object).
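To make the idea concrete, here is a minimal sketch of the trimming step. The helper name `trim_split()` is hypothetical (it is not in rsample or tune), and the sketch assumes an `rsplit` is a classed list with `data`, `in_id`, and `out_id` elements, which may not hold for every split type:

```r
library(rsample)

# Hypothetical helper, not part of rsample or tune.
# Assumes the split stores `data`, `in_id`, and `out_id` as list elements.
trim_split <- function(split) {
  # Keep the column names and types but drop every row
  split$data <- split$data[0, , drop = FALSE]
  # Empty out the row indices
  split$in_id <- integer(0)
  split$out_id <- integer(0)
  # The class attribute is left alone, so this is still an rsplit
  split
}

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

# Work on the bare list of splits rather than modifying the rset in place
splits <- folds$splits
trimmed <- lapply(splits, trim_split)

# object.size() counts the data once per split, so it overstates the
# in-memory footprint of the untrimmed list; it is only a rough comparison
object.size(splits)
object.size(trimmed)
```

The classes are untouched, so the list can still sit in a `splits` column, but functions such as `analysis()` would have nothing to return.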

This means that users would be unable to do anything meaningful with the split objects, but it is very unlikely that they would. Also, since it copies the original rset, they could fix this by replacing the altered split column with the one from the rset.
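As a rough illustration of that recovery path, the sketch below uses a plain tibble as a stand-in for a results object with trimmed splits (an assumption; a real `tune_results` object has more columns and attributes) and swaps the original splits back in after checking that the fold ids line up:

```r
library(rsample)
library(tibble)

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

# Stand-in for a results object whose splits were trimmed (illustrative only)
res <- tibble(
  id = folds$id,
  splits = lapply(folds$splits, function(s) {
    s$data <- s$data[0, , drop = FALSE]
    s$in_id <- integer(0)
    s
  })
)

# Make sure the rows correspond to the same folds before replacing anything
stopifnot(identical(res$id, folds$id))

# Restore the full split objects from the original rset
res$splits <- folds$splits

# The splits are usable again
nrow(analysis(res$splits[[1]]))
```

This only works if the original rset is still available and in the same row order as the results.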

I don't see much downside.

Should the code to clean the split objects go into rsample?

@topepo topepo added the feature a feature request or enhancement label Aug 15, 2024
@topepo
Member Author

topepo commented Aug 15, 2024

A quick example from an initial implementation:

library(tidymodels)

set.seed(6735)
folds <- vfold_cv(mtcars, v = 5)

spline_rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_ns(disp) %>%
  step_ns(wt)

lin_mod <- linear_reg() %>%
  set_engine("lm")

control <- control_resamples(save_pred = TRUE, trim_splits = TRUE, save_workflow = TRUE)

spline_res <- fit_resamples(lin_mod, spline_rec, folds, control = control)

spline_res
#> # Resampling results
#> # 5-fold cross-validation 
#> # A tibble: 5 × 5
#>   splits        id    .metrics         .notes           .predictions    
#>   <list>        <chr> <list>           <list>           <list>          
#> 1 <split [0/0]> Fold1 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [7 × 4]>
#> 2 <split [0/0]> Fold2 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [7 × 4]>
#> 3 <split [0/0]> Fold3 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>
#> 4 <split [0/0]> Fold4 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>
#> 5 <split [0/0]> Fold5 <tibble [2 × 4]> <tibble [0 × 4]> <tibble [6 × 4]>

# etc etc
collect_metrics(spline_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   3.11      5   0.168 Preprocessor1_Model1
#> 2 rsq     standard   0.651     5   0.135 Preprocessor1_Model1

# However: 

fit_best(spline_res)
#> Error in `was_split_trimmed()` at tune/R/fit_best.R:148:3:
#> ! The split contains no `data` object. Was `trim_splits` set to `TRUE`
#>   in the control function?

Created on 2024-08-14 with reprex v2.1.0

@jrosell

jrosell commented Oct 9, 2024

In fact, when saving tune results, I sometimes want to be able to reconstruct the split elements as a resample object again. See #947

So I think both cases come up: sometimes the splits need to be trimmed, and sometimes one wants to reuse them.

Projects
Status: Backlog