Hello, I have some survey data with a few tens of thousands of observations and several grouping variables. I'd like to compute the survey count per combination of groups, but it is much slower than the "non-survey" count using dplyr. I understand that extra steps are required and that survey is called under the hood, but it seems to me that this difference in timing comes from the way the groups of data are passed to survey_total().
In the example below, dplyr::count() is nearly instantaneous, but survey_count() takes more than a minute:
library(srvyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
packageVersion("srvyr")
#> [1] '1.2.0'

N <- 50000
set.seed(123)
test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
) |>
  arrange(grp1, grp2, grp3)
# dplyr is fast
system.time({
  test |>
    group_by(grp1, grp2, grp3) |>
    count() |>
    ungroup()
})
#>    user  system elapsed 
#>    0.03    0.00    0.04

test_sv <- as_survey_design(test, weights = weight)
# srvyr is much slower
system.time({
  test_sv |>
    group_by(grp1, grp2, grp3) |>
    survey_count() |>
    ungroup()
})
#>    user  system elapsed 
#>   81.16    2.71   84.76
Is this something that could be improved, or have I missed something?
Thanks,
I think it's a good idea to look into ways to speed up grouped operations, if possible. But fundamentally, the 'srvyr' code is doing many, many more calculations than the 'dplyr' code. In 'srvyr', you're computing ~7,000 point estimates and then also computing estimated standard errors for each of those estimates. Estimating the standard errors is the hard part and requires a lot of computation; getting the point estimates is easy.
Is there something you noticed inside the 'srvyr' code that you think is making it especially slow?
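If only the weighted counts themselves are needed (without standard errors), the cheap part of the calculation can be reproduced directly. This is a minimal sketch, not what srvyr does internally, and it assumes the test and test_sv objects from the example above:

# Point estimates only: the weighted count per group is just the sum of the
# weights, which dplyr computes almost instantly (no variance estimation).
library(dplyr, warn.conflicts = FALSE)

test |>
  group_by(grp1, grp2, grp3) |>
  summarise(n = sum(weight), .groups = "drop")

# survey::svytable() also returns weighted cell counts without standard errors;
# it should accept test_sv because srvyr design objects inherit from survey's classes.
library(survey)
svytable(~ grp1 + grp2 + grp3, design = test_sv)

The gap between these timings and survey_count()'s is essentially the cost of estimating the ~7,000 standard errors.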