Implement slice(), *_join(), and other dplyr methods for tbl_svy. #65

krivit · 2020-01-15T06:16:41Z

There is a filter() method for tbl_svy, but there isn't a slice() method, or any of the *_join() methods, as far as I can tell. Would it be possible to implement them? Thanks in advance!

gergness · 2020-01-15T22:51:32Z

slice isn't in because it kind of messes with the database-backed surveys, though I can't remember exactly how, and it seems like it should be possible.

The *_join() family of functions (and bind_rows() for that matter) makes me nervous because they can have implications for the weights and design that aren't knowable just from the data itself. I recommend you perform these data manipulations before setting the survey design, while the data is still in traditional data.frames.
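A minimal sketch of that workflow, assuming made-up data: the `respondents` and `regions` tables and all of their column names are hypothetical. The join happens on plain data frames, the row count is checked, and only then is the design declared.

```r
library(dplyr)
library(srvyr)

# Hypothetical respondent-level data with strata and weights
respondents <- tibble(
  id = 1:6,
  stratum = rep(c("a", "b"), each = 3),
  wt = rep(c(10, 20), each = 3),
  region_code = c(1, 2, 1, 2, 1, 2)
)

# Hypothetical lookup table with one row per key
regions <- tibble(region_code = c(1, 2), region = c("North", "South"))

# Join while the data is still an ordinary data frame
joined <- left_join(respondents, regions, by = "region_code")

# Confirm that the join neither added nor duplicated rows
stopifnot(nrow(joined) == nrow(respondents))

# Declare the design only after the data manipulation is done
svy <- as_survey_design(joined, strata = stratum, weights = wt)

svy %>%
  group_by(region) %>%
  summarise(n_hat = survey_total())
```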

krivit · 2020-01-15T23:56:34Z

I agree that joins can mess with the design, but the package already handles indexing that duplicates rows (e.g., x[rep(1:2,each=2),]) intelligently by treating them as a cluster sample; I would think that joins would work along similar lines.

tzoltak · 2020-04-09T16:38:30Z

Another approach to joins would be to check whether a join adds or duplicates any rows and simply not allow the join in such cases.
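A rough sketch of such a check on ordinary data frames. `checked_left_join()` is a hypothetical helper written for illustration; it is not part of dplyr or srvyr. Because a left join never drops rows of `x`, comparing row counts is enough to detect added or duplicated rows.

```r
library(dplyr)

# Hypothetical helper: refuse a left join that changes the number of rows
checked_left_join <- function(x, y, by) {
  out <- left_join(x, y, by = by)
  if (nrow(out) != nrow(x)) {
    stop(
      "Join changed the number of rows from ", nrow(x), " to ", nrow(out),
      "; check that the join keys in `y` are unique.",
      call. = FALSE
    )
  }
  out
}

x <- tibble(id = 1:3, value = c("a", "b", "c"))
y_unique <- tibble(id = 1:3, extra = c(10, 20, 30))
y_dup <- tibble(id = c(1, 2, 2), extra = c(10, 20, 21))

ok <- checked_left_join(x, y_unique, by = "id")  # same number of rows as x

# The duplicated key in y_dup would duplicate a row of x, so this errors
res <- tryCatch(checked_left_join(x, y_dup, by = "id"), error = function(e) e)
```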

bschneidr · 2021-05-12T01:00:51Z

I think it would make sense to add filtering joins (anti_join() and semi_join()), since those won't accidentally alter the survey design. I've opened pull request #120 to implement them, if you think that's a good idea.
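For illustration, a sketch of how a filtering join on a tbl_svy might be used, assuming a srvyr version that includes the filtering joins from #120 and the `apistrat` example data from the survey package. The `keep` table is made up for this example.

```r
library(dplyr)
library(srvyr)

data(api, package = "survey")

# Stratified design from the survey package's example data
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Hypothetical lookup of school types to keep
keep <- tibble(stype = c("E", "M"))

# semi_join() only restricts which rows of the survey are retained;
# no columns from `keep` are added and no rows are duplicated
elem_mid <- dstrata %>%
  semi_join(keep, by = "stype")

elem_mid %>%
  summarise(api00_mean = survey_mean(api00))
```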

I'm ambivalent about whether it's worth adding left_join() since I'm also nervous about altering users' survey designs in ways they might not expect or whose ramifications they won't appreciate. But I think it'd be totally fine if we went with @tzoltak's suggestion for left_join(), inner_join(), and full_join(), and throw informative errors when rows are added or duplicated.

gergness · 2021-05-12T02:02:37Z

Yeah, agree on filtering joins, I just don't see them as super useful without the other joins.

For mutating joins, I meant to investigate this comment from krivit, but never did (and likely won't have time for a while):

> the package already handles indexing that duplicates rows (e.g., x[rep(1:2,each=2),]) intelligently by treating them as a cluster sample; I would think that joins would work along similar lines.

If this is usually the right thing to do (my ignorance of the math behind surveys is really coming out here), I can imagine a warning instead of an error when a join creates duplicates. I think we also would need a warning for mutating joins when both x & y are surveys to let the user know that only the weights from x are kept.

Anyone have real world examples where they wanted to do this (preferably with sharable data so they're full reprexes, but I'm also just trying to wrap my head around it, so it's okay if not)?

tzoltak · 2021-05-12T08:21:00Z

Well, it may make sense to create clusters from duplicated rows, but whether it actually makes sense depends heavily on one's workflow; there's no way the package can check this. Personally, I'm rather devoted to the idea of being explicit about the survey design and not modifying it on the fly (in an operation that doesn't look like it modifies the design), but that's a matter of personal preference, and if srvyr already handles this when selecting rows, it makes sense for it to behave analogously when performing joins. Nevertheless, I think there should be a warning, or at least a note, in such a situation. In my experience, duplicated rows in joins often come from mistakenly assuming that (combinations of) values of the key variable(s) are unique, when they were in fact unintentionally duplicated at an earlier stage of a complex data transformation.

krivit · 2021-05-12T08:38:11Z

@gergness, the specific case I am dealing with is something called egocentric network data. For example, I might ask each survey respondent about their own demographic information (age, sex, race/ethnicity, etc.) and put it in Table x, then ask about the demographics of each of their close friends and put those in Table y. (Let's assume that no one is nominated twice.)

Since I selected my respondents using some kind of a sampling design, I might create a srvyr object out of x and set up that design. Now, suppose that I want to analyse the association between a person's demographics and those of their close friends. When I inner join x to y, I would expect the result to be a table with the same number of rows as y and with its design being a cluster sample within x's design, since that's what they become.

You can see those examples in the egor package. Right now, we use a kludge described here.

@tzoltak, my preference would be to emulate the behaviour of survey as much as possible:

```r
library(survey)
data(mtcars)
(carsvy <- svydesign(~1, data=mtcars))
#> Warning in svydesign.default(~1, data = mtcars): No weights or probabilities
#> supplied, assuming equal probability
#> Independent Sampling design (with replacement)
#> svydesign(~1, data = mtcars)
carsvy[rep(1:2, each=2)]
#> 1 - level Cluster Sampling design (with replacement)
#> With (2) clusters.
#> svydesign(~1, data = mtcars)
```

Created on 2021-05-12 by the reprex package (v2.0.0)

gergness · 2021-05-23T20:29:32Z

Oops, didn't mean to close. Filtering joins are available now though.

krivit changed the title ~~Implement slice() method for tbl_svy.~~ Implement slice(), *_join(), and other dplyr methods for tbl_svy. Jan 15, 2020

krivit mentioned this issue Sep 16, 2020

Modify all summary functions to take into account design information (i.e., weights). tilltnet/egor#3

Closed

bschneidr mentioned this issue May 12, 2021

Add filtering joins, with documentation and tests. #120

Merged

gergness closed this as completed in #120 May 23, 2021

gergness reopened this May 23, 2021
