
survey_count() is very slow when applied on a lot of groups #164

Open · etiennebacher opened this issue Aug 1, 2023 · 1 comment

etiennebacher (Contributor) commented Aug 1, 2023

Hello, I have some survey data with a few tens of thousands of observations and several grouping variables. I'd like to compute the survey count per combination of groups, but it is much slower than the "non-survey" count with dplyr. I understand that extra steps are needed and that the survey package is called under the hood, but it seems to me that this difference in timing is due to the way the groups of data are passed to survey_total().

In the example below, dplyr::count() is near instantaneous, but survey_count() takes more than a minute:

library(srvyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

packageVersion("srvyr")
#> [1] '1.2.0'

N <- 50000
set.seed(123)

test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
) |> 
  arrange(grp1, grp2, grp3)

# dplyr is fast  
system.time({
  test |> 
    group_by(grp1, grp2, grp3) |> 
    count() |> 
    ungroup()
})
#>    user  system elapsed 
#>    0.03    0.00    0.04

test_sv <- as_survey_design(test, weights = weight)

# srvyr is much slower
system.time({
  test_sv |> 
    group_by(grp1, grp2, grp3) |> 
    survey_count() |> 
    ungroup()
})
#>    user  system elapsed 
#>   81.16    2.71   84.76

Is this something that could be improved? Or maybe I missed something?

Thanks,

bschneidr (Contributor) commented

I think it's a good idea to look into ways to speed up grouped operations, if possible. But fundamentally, the srvyr code is doing many, many more calculations than the dplyr code. In srvyr, you're computing ~7,000 point estimates (one per combination of grp1 × grp2 × grp3) and then also computing estimated standard errors for those point estimates. Calculating the standard errors is the hard part that requires a lot of computation; getting the point estimates is easy.

Is there something you noticed inside the 'srvyr' code that you think is making it especially slow?
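To make the distinction concrete: a minimal sketch (not part of srvyr's API, and it is an assumption that this matches your use case) showing that the point estimate behind survey_count() is just the sum of the weights in each group, which dplyr can compute almost instantly. What this skips is exactly the expensive part, the design-based standard errors.

```r
library(dplyr, warn.conflicts = FALSE)

# Same simulated data as in the benchmark above
set.seed(123)
N <- 50000
test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
)

# Weighted count per group = sum of weights per group.
# This reproduces survey_count()'s point estimates, without the SEs.
pt_est <- test |>
  group_by(grp1, grp2, grp3) |>
  summarise(n = sum(weight), .groups = "drop")
```

If you only need the counts themselves, this runs in milliseconds; the per-group variance estimation that survey/srvyr perform on top of it is where the extra minute goes.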
