group_bootstraps takes unusually long when there are rare classes #466
Hello @StevenWallaert, do you mind clarifying a bit? Below is what I was able to run:
library(rsample)
library(modeldata)
library(dplyr)
ames <- slice_sample(ames, n = 10000, replace = TRUE)
ames$group <- sample(1:500, size = 10000, replace = TRUE)
tictoc::tic()
group_bootstraps(ames, group)
#> # Group bootstrap sampling
#> # A tibble: 25 × 2
#> splits id
#> <list> <chr>
#> 1 <split [9991/3667]> Bootstrap01
#> 2 <split [9998/3715]> Bootstrap02
#> 3 <split [10007/3741]> Bootstrap03
#> 4 <split [9991/3528]> Bootstrap04
#> 5 <split [9996/3837]> Bootstrap05
#> 6 <split [10005/3593]> Bootstrap06
#> 7 <split [9998/3668]> Bootstrap07
#> 8 <split [10007/3974]> Bootstrap08
#> 9 <split [10001/3729]> Bootstrap09
#> 10 <split [10001/3612]> Bootstrap10
#> # ℹ 15 more rows
tictoc::toc()
#> 7.219 sec elapsed
Hey, thanks for your reply.
However, when I do the following with df as my dataframe, it takes roughly ten times longer. My df is nothing special: 10607 rows, 17 columns (ames has 75 columns). The problem is that I need to go to 2000 resamples and I have no idea how long that will take; I usually abort because I'm afraid it will take too long. Maybe a progress bar option would be helpful?
Do you have a ton of groups? It seems like something in the per-group implementation is inefficient (as the person who wrote that code, I'd believe it 😛) and the function takes dramatically longer as the number of groups increases:
bench::press(
n = c(2, 5, 10, 50, 100, 1000, 2000),
{
df <- modeldata::sim_noise(
num_samples = 10607,
num_vars = 16,
outcome = "classification",
num_classes = n
)
bench::mark(
rsample::group_bootstraps(df, class)
)
}
)
#> # A tibble: 7 × 7
#> expression n min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rsample::group_bootstrap… 2 61.68ms 77.85ms 10.1 78.75MB 28.5
#> 2 rsample::group_bootstrap… 5 71.09ms 143.22ms 8.43 75.5MB 30.3
#> 3 rsample::group_bootstrap… 10 55.7ms 82.39ms 10.9 75.76MB 30.8
#> 4 rsample::group_bootstrap… 50 70.44ms 83.85ms 11.2 82.37MB 28.0
#> 5 rsample::group_bootstrap… 100 227.88ms 256.82ms 3.89 103.97MB 17.5
#> 6 rsample::group_bootstrap… 1000 1.55m 1.55m 0.0107 7.27GB 0.430
#> 7 rsample::group_bootstrap… 2000 12.1m 12.1m 0.00138 46.19GB 0.0565
Created on 2024-02-13 with reprex v2.0.2
(I see now you said 500 groups -- I'm wondering if they're unbalanced, and the issues come up when some groups are particularly small)
Yeah, I think the cause might be unbalanced classes (look at the difference in memory usage, wowza):
set.seed(1107)
df <- modeldata::sim_noise(
num_samples = 10607,
num_vars = 16
)
df$group_1 <- sample.int(500, 10607, TRUE)
df$group_2 <- sample(c(
rep(1, 4243),
rep(2, 4243),
sample.int(2121, 2121, TRUE)
), 10607)
bench::mark(
rsample::group_bootstraps(df, group_1),
rsample::group_bootstraps(df, group_2),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 rsample::group_bootstraps(df, group… 12.2s 12.2s 0.0820 1.69GB 0.164
#> 2 rsample::group_bootstraps(df, group… 7.4m 7.4m 0.00225 29.02GB 0.0901
Created on 2024-02-13 with reprex v2.0.2
Attached is a profile. Looking at the source of balance_prop_helper:
balance_prop_helper <- function(prop, data_ind, v, replace) {
freq_table <- vec_count(data_ind$..group, sort = "location")
# Calculate how many groups to sample each iteration
# If sampling with replacement,
# set `n` to the number of resamples we'd need
# if we somehow got the smallest group every time.
# If sampling without replacement, just reshuffle all the groups.
n <- nrow(freq_table)
if (replace) n <- n * prop * sum(freq_table$count) / min(freq_table$count)
n <- ceiling(n)
purrr::map(
seq_len(v),
function(x) {
row_idx <- sample.int(nrow(freq_table), n, replace = replace)
work_table <- freq_table[row_idx, ]
cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)
crosses_target <- which(cumulative_proportion > prop)[[1]]
is_closest <- cumulative_proportion[c(crosses_target, crosses_target - 1)]
is_closest <- which.min(abs(is_closest - prop)) - 1
crosses_target <- crosses_target - is_closest
out <- work_table[seq_len(crosses_target), ]
out$assignment <- x
out
}
) %>%
list_rbind()
}
That n calculation is the problem: when the smallest group has only a handful of rows, min(freq_table$count) is tiny, so n becomes enormous and every iteration samples (and allocates) far more group indices than it will ever need. So I think it would make sense to not do this -- get rid of the worst-case n inflation.
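To see how badly that worst-case n can blow up, here is a back-of-the-envelope calculation. The numbers are hypothetical but chosen to match the reported dataset (roughly 500 groups, 10607 rows, prop = 1) with a single singleton group:

```r
# Worst-case `n` from the helper: n * prop * sum(count) / min(count),
# where the initial n is the number of groups.
n_groups  <- 500
total     <- 10607  # sum(freq_table$count)
prop      <- 1
min_count <- 1      # one singleton group
n <- ceiling(n_groups * prop * total / min_count)
n
#> [1] 5303500
```

So each of the v bootstrap iterations samples over five million group indices and subsets the frequency table accordingly, which lines up with the multi-gigabyte allocations in the benchmarks above.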
Changing the helper to append rows is dramatically faster:
balance_prop_helper <- function(prop, data_ind, v, replace) {
freq_table <- vec_count(data_ind$..group, sort = "location")
n <- nrow(freq_table)
purrr::map(
seq_len(v),
function(x) {
row_idx <- sample.int(nrow(freq_table), n, replace = replace)
work_table <- freq_table[row_idx, ]
cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)
# if bootstrapping with small groups,
# it's possible to repeatedly sample only small groups,
# meaning you construct bootstrap sets with too few observations
#
# so here we loop until we've gotten the right number of groups
while (max(cumulative_proportion) <= 1) {
row_idx <- sample.int(nrow(freq_table), n, replace = replace)
work_table <- rbind(work_table, freq_table[row_idx, ])
cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)
}
crosses_target <- which(cumulative_proportion > prop)[[1]]
is_closest <- cumulative_proportion[c(crosses_target, crosses_target - 1)]
is_closest <- which.min(abs(is_closest - prop)) - 1
crosses_target <- crosses_target - is_closest
out <- work_table[seq_len(crosses_target), ]
out$assignment <- x
out
}
) %>%
list_rbind()
}
That said, I haven't checked this to make sure it actually works in all the ways we care about -- CI suggests it maybe makes us slightly worse at matching strata, and it triggers a new "many-to-many" join warning.
Yes indeed, I have many small groups (between 1 and 120 members, 90% between 3 and 40 members). For now I work around it by sampling group ids with replacement and then joining the original tibble.
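A minimal sketch of that workaround (the toy data and column names are illustrative, not from the thread; the relationship argument assumes dplyr >= 1.1.1):

```r
library(dplyr)

# toy data: 5 groups of very different sizes, like the unbalanced case above
df <- tibble(
  group = rep(1:5, times = c(1, 3, 10, 40, 120)),
  y     = rnorm(174)
)

# sample group ids with replacement, then join back so each sampled id
# pulls in all of its rows (a deliberate many-to-many join)
ids  <- unique(df$group)
boot <- tibble(group = sample(ids, length(ids), replace = TRUE)) |>
  inner_join(df, by = "group", relationship = "many-to-many")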
I have observed the same behaviour; my workaround is to use tidyr's nest() and unnest().
Based on my understanding of group_bootstraps, both versions should be equivalent, but the nest/unnest version is roughly 20 times quicker.
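For reference, a sketch of the nest/unnest approach (toy data and names are illustrative): nest collapses the data to one row per group, so resampling rows of the nested tibble resamples whole groups, and unnest expands the selection back out.

```r
library(dplyr)
library(tidyr)

# toy data: 50 groups of 10 rows each
df <- tibble(group = rep(1:50, each = 10), x = rnorm(500))

nested <- nest(df, data = -group)                    # one row per group
boot   <- nested |>
  slice_sample(n = nrow(nested), replace = TRUE) |>  # resample groups
  unnest(data)                                       # expand back to rows
nrow(boot)
#> [1] 500
```

Because every group here has exactly 10 rows, the bootstrap set always has 500 rows; with unequal group sizes the row count varies from resample to resample, as in the group_bootstraps output above.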
Not really a bug per se, but I want to report that I found it impossible to use group_bootstraps with a dataset of about 10000 rows and around 500 groups, because it takes so long. I'm not sure, but I have a feeling it shouldn't take that long.