Allow `as_survey_design` objects to work inside tidymodels #140
This is a cool idea! There's a part of me that wants to use this as an excuse to learn tidymodels, but I have no idea when I'll have time to do so.
I would love to help work on it (I'll be doing it as part of my PhD work anyway), and I think it would be a great integration.
I have a vague memory that splitting surveys into training and testing data sets is non-trivial because the data are not iid. For example, if by chance the training data set omitted a stratum, then the models fit on the training set would be biased.
If you end up remembering where you came across this, I would love the resource! Thanks for your advice.
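The risk described above can be quantified: for a simple random train/test split, the chance that a small stratum is entirely left out of training is a straightforward combinatorial calculation. A minimal base-R sketch with synthetic sizes (not taken from the thread):

```r
# Probability that a naive random split omits an entire stratum from training.
# Suppose a survey of N = 100 units contains a small stratum of n_h = 2 units,
# and 25 units are held out as a test set, chosen uniformly at random.
# The stratum is missing from training exactly when both of its units
# land in the test set.
N <- 100; n_h <- 2; n_test <- 25

p_omitted <- choose(N - n_h, n_test - n_h) / choose(N, n_test)

# Equivalent hypergeometric form:
p_check <- choose(n_test, n_h) / choose(N, n_h)

p_omitted
#> [1] 0.06060606
```

So even with a modest 2-unit stratum, roughly a 6% chance exists that a naive 75/25 split produces training data with that stratum missing entirely.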
This is an intriguing idea, and the R community could really use some tools for incorporating complex designs into modeling/machine learning. The big challenges here are much more statistical ("what's the right thing to do?") than about API design ("how do we write tidymodels S3 methods for survey design objects?"). To be clearer, I've tried to describe the two big statistical challenges below.

For the model validation and machine learning algorithms users are turning to {tidymodels} for, the {survey} package doesn't implement methods, so the {srvyr} package doesn't have anything it can simply wrap to give correct results. Some statistical decisions therefore have to be made, and so I think {srvyr} is probably not the best fit for this. Nonetheless, I think it's a great idea to provide a {tidymodels}-compatible interface to survey design objects. The end of this ridiculously long comment has some suggestions about ways to go forward.

### Interfacing with modeling packages when only a handful do the right thing for complex surveys

The main challenge, to my mind, is that very few modeling packages correctly take complex survey design features into account when producing standard errors, confidence intervals, and the like. {survey}, {svyVGAM}, and {rpms} are the only such packages I'm aware of that account for complex design features in regression or tree-based models. Otherwise, modeling packages would interface with a survey design object simply by accessing the data frame of variables from the survey and maybe the weights. But the inferential statistics (p-values, AUROC, etc.) would not actually take into account things like stratification, clustering, raking, etc. Even for modeling functions that accept a weights argument, it's not clear that the weights are used correctly in a survey context.
So if {srvyr} had an interface to {tidymodels} model-fitting functionality, in most cases it would likely just pass along the data and weights to a modeling function, with loud warnings to alert the user that the complex design is being ignored when fitting models.

### Figuring out what to do for splitting/cross-validation

The other challenge that @carlganz brings up is that it's not clear how to appropriately do training/testing/cross-validation for complex designs; this is an unsettled problem with some active research going on. There's an interesting paper (not yet fully published) and a nice, accessible accompanying presentation on the topic. Just last month, the authors published an R package for cross-validation with surveys on CRAN, {surveyCV}.

### Some ideas for taking this idea forward

I think it would be very helpful to have a {tidymodels}-compatible package that provides methods for {tidymodels} functions, draws on the {surveyCV} package to handle splitting/CV, and provides interfaces to the modeling functions of {survey}, {srvyr}, {svyVGAM}, and {rpms}. If you want to take the lead on a package like that, @themichjam, I'd be happy to contribute, and I'm sure there are things the {srvyr} package can add to help.
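The core idea behind design-based cross-validation folds (the problem {surveyCV} addresses, e.g. via its `folds.svy()` function) can be sketched in a few lines of base R: assign fold IDs within each stratum so that every fold contains units from every stratum. This is a minimal illustration with synthetic data, not {surveyCV}'s actual implementation, which also handles clusters:

```r
# Assign K cross-validation folds within each stratum, so no fold
# omits an entire stratum (assuming each stratum has at least K units).
stratified_folds <- function(strata, nfolds = 5) {
  folds <- integer(length(strata))
  for (h in unique(strata)) {
    idx <- which(strata == h)
    # Cycle fold IDs through the stratum, then shuffle within it
    folds[idx] <- sample(rep(seq_len(nfolds), length.out = length(idx)))
  }
  folds
}

# Synthetic strata: one large and one small stratum
strata <- rep(c("A", "B"), times = c(40, 10))
folds  <- stratified_folds(strata, nfolds = 5)

# Every (stratum, fold) cell is non-empty
table(strata, folds)
```

Because fold IDs are cycled before shuffling, every fold is guaranteed at least one unit from each stratum, which a simple random fold assignment cannot promise.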
Thanks so much Ben! You make some really great points, and you're right about the lack of tools for integrating complex survey designs into machine learning and modelling. (I've only started asking this question in the last month, with only 5 months left on my PhD work, in which I wanted to feature this exact thing.) I'd be more than happy to take the lead on a package, and it would be great to work with you (and anyone who finds this and wants to get involved).
Cool! And yeah, thanks so much Ben! I can't really help with the "should this be done" question or any of the statistical knowledge; I can only help with syntactic issues and R coding bugs. FWIW, I did a quick scan through tidymodels, and there is unfortunately not much use of generics that would allow one to write a package whose commands have literally the same names as their tidymodels equivalents. Instead, I think the value would be in making functions that are easy to pick up for someone who is used to the tidymodels workflow. Also, the reason the command fails is that the survey package's objects (and therefore srvyr's) are not true data.frames; instead, the data.frame is stored in the object's `variables` element:

```r
library(srvyr, warn.conflicts = FALSE)

data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

names(apistrat)
#> [1] "cds" "stype" "name" "sname" "snum" "dname"
#> [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
#> [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
#> [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
#> [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
#> [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
#> [37] "api.stu" "pw" "fpc"

name(dstrata)
#> Error in name(dstrata): could not find function "name"

names(dstrata$variables)
#> [1] "cds" "stype" "name" "sname" "snum" "dname"
#> [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
#> [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
#> [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
#> [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
#> [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
#> [37] "api.stu" "pw" "fpc"
```
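To make that structural point concrete: any tidymodels-facing wrapper would need an accessor that pulls the analysis variables out of the design object. Below is a minimal sketch using a mock list that mimics only the relevant part of the structure (real `survey.design` objects carry many more components, and `survey_variables()` is a hypothetical helper, not part of any package):

```r
# A design object stores its data frame in the `variables` element
# (mocked here with a plain list so the sketch is self-contained).
mock_design <- list(
  variables = data.frame(api00 = c(693, 570), pw = c(44.2, 44.2)),
  strata    = data.frame(stype = c("E", "E"))
)

# Hypothetical accessor a wrapper package might provide
survey_variables <- function(design) design$variables

names(survey_variables(mock_design))
#> [1] "api00" "pw"
```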
Out of curiosity @gergness, is there any way to circumvent this for now, before the development of @bschneidr's package idea (e.g. weight the data without it being a survey design object, so that it could then be plugged into {tidymodels})?
Just playing around with the structure that's returned from `initial_split()`.

EDIT: Actually no, this doesn't work; the training and testing sets aren't subset.

```r
library(srvyr)
library(tidymodels)
library(dplyr)

data(api, package = "survey")

data_split <- initial_split(apistrat)

# The data is stored here:
all.equal(data_split$data, apistrat)
#> [1] TRUE

# So maybe you can do this?
data_split$data <- data_split$data %>%
  as_survey(strata = stype, weights = pw)

# Doesn't actually work: both "splits" still contain all 200 rows.
nrow(training(data_split))
#> 200
nrow(testing(data_split))
#> 200
```

Created on 2022-02-22 by the reprex package (v2.0.1)
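A workaround consistent with the discussion above is to split the plain data frame first and only then declare the design on each piece, rather than replacing the data inside the rsample split object. Given the earlier caveat that a simple random split can drop small strata, a stratified split is safer. Here is a self-contained base-R sketch with synthetic data standing in for `apistrat`; in practice each resulting piece would then be passed to `as_survey_design()`:

```r
# Split a data frame into training/testing while sampling within each
# stratum, so both pieces retain every stratum. In practice you would
# then declare the design on each piece, e.g.
#   as_survey_design(pieces$training, strata = stype, weights = pw)
stratified_split <- function(df, strata_col, prop = 0.75) {
  train_idx <- unlist(lapply(
    split(seq_len(nrow(df)), df[[strata_col]]),
    function(idx) sample(idx, ceiling(prop * length(idx)))
  ))
  list(training = df[train_idx, , drop = FALSE],
       testing  = df[-train_idx, , drop = FALSE])
}

# Synthetic stratified data (three strata of very different sizes)
df <- data.frame(stype = rep(c("E", "H", "M"), times = c(20, 8, 4)),
                 api00 = rnorm(32, mean = 650, sd = 50))

pieces <- stratified_split(df, "stype")
sapply(pieces, nrow)   # 24 training rows, 8 testing rows
table(pieces$testing$stype)  # all three strata survive in the test set
```

This sidesteps the problem in the reprex above: `training()`/`testing()` never see a design object, and the design is declared only after subsetting.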
I'm wondering if it would be possible to allow `srvyr` survey design objects to work with `tidymodels`. Example below (with error):