Hello, I have some survey data with a few tens of thousands of observations and several grouping variables. I'd like to compute the survey count per combination of groups, but it is much slower than the "non-survey" count using dplyr. I understand that extra steps are required and that survey is called under the hood, but it seems to me that this difference in timing comes from the way the groups of data are passed to survey_total().
In the example below, dplyr::count() is nearly instantaneous, but survey_count() takes more than a minute:
library(srvyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
packageVersion("srvyr")
#> [1] '1.2.0'

N <- 50000
set.seed(123)
test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
) |>
  arrange(grp1, grp2, grp3)
# dplyr is fast
system.time({
  test |>
    group_by(grp1, grp2, grp3) |>
    count() |>
    ungroup()
})
#>    user  system elapsed 
#>    0.03    0.00    0.04

test_sv <- as_survey_design(test, weights = weight)
# srvyr is much slower
system.time({
  test_sv |>
    group_by(grp1, grp2, grp3) |>
    survey_count() |>
    ungroup()
})
#>    user  system elapsed 
#>   81.16    2.71   84.76
Is this something that could be improved, or have I missed something?
Thanks,
I think it's a good idea to look into ways to speed up grouped operations, if possible. But fundamentally, the 'srvyr' code is doing many, many more calculations than the 'dplyr' code. In 'srvyr', you're computing ~7,000 point estimates and then also computing estimated standard errors for each of those estimates. Estimating the standard errors is the hard part and requires a lot of computation; getting the point estimates is easy.
Is there something you noticed inside the 'srvyr' code that you think is making it especially slow?
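If only the weighted counts themselves are needed (without standard errors), the cheap part of the calculation can be reproduced directly. This is a minimal sketch, not what srvyr does internally, and it assumes the test and test_sv objects from the example above:

# Point estimates only: the weighted count per group is just the sum of the
# weights, which dplyr computes almost instantly (no variance estimation).
library(dplyr, warn.conflicts = FALSE)

test |>
  group_by(grp1, grp2, grp3) |>
  summarise(n = sum(weight), .groups = "drop")

# survey::svytable() also returns weighted cell counts without standard errors;
# it should accept test_sv because srvyr design objects inherit from survey's classes.
library(survey)
svytable(~ grp1 + grp2 + grp3, design = test_sv)

The gap between these timings and survey_count()'s is essentially the cost of estimating the ~7,000 standard errors.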