Add trtAssignment as a "distribution" to defData to ease flow? #69

Closed
Tracked by #109
kgoldfeld opened this issue Oct 12, 2020 · 5 comments · Fixed by #114
Labels
feature feature request or enhancement

Comments

@kgoldfeld
Owner

Currently, the treatment assignment process using trtAssign breaks the flow of the creation of a data set. Usually there is an outcome variable that is a function of the treatment assignment - so that we need to add a column to the table after the treatment assignment is made.

# Data definitions (requires two definitions for the same data set)

d1 <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d1 <- defData(d1, varname = "x1", formula = "0;10", dist = "uniform")
d1 <- defData(d1, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")

d2 <- defDataAdd(varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")

# Data generation (requires three function calls)

dd <- genData(500, d1)
dd <- trtAssign(dd, strata = "g", grpName = "rx")
dd <- addColumns(d2, dd)

What if we added a "trtAssign" distribution to the data def table so that the treatment assignment can be part of a single data generation process? It would look like this:

# Only one data definition

d <- defData(varname = "g", formula = ".2;.4;.4", dist = "categorical")
d <- defData(d, varname = "x1", formula = "0;10", dist = "uniform")
d <- defData(d, varname = "x2", formula = "2+.5*2", variance = 3, dist = "normal")
d <- defData(d, varname = "rx", formula = "1;1", variance = "g", dist = "trtAssign")
d <- defData(d, varname = "y", formula = "2 + x2 * 3 + rx*2", variance = 5, dist = "normal")

# Only one function call to generate the data

dd <- genData(500, d)

For trtAssign, the formula represents the treatment assignment ratio. It defaults to "1;1" but could be of any length, so "1;1;1;2" would define four groups. The variance parameter represents the stratification. Multiple levels of stratification would be written as "a;b;c", where a, b, and c are variable names (these really need to be categorical variables or factors). The functionality is exactly the same as in the function trtAssign.
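To make the proposed encoding concrete, here is a minimal base R sketch of how the two fields might be unpacked. The helper name `parseTrtAssignDef` and its return structure are hypothetical and only illustrate the idea; this is not simstudy's implementation.

```r
# Hypothetical helper (not part of simstudy): unpack the proposed
# "trtAssign" definition fields into assignment arguments.
parseTrtAssignDef <- function(formula = "1;1", variance = NULL) {
  # The formula holds the allocation ratio, one entry per treatment group
  ratio <- as.numeric(strsplit(formula, ";", fixed = TRUE)[[1]])
  if (anyNA(ratio) || any(ratio <= 0)) {
    stop("formula must be positive numbers separated by ';'")
  }

  # The variance field holds zero or more stratification variable names
  strata <- NULL
  if (!is.null(variance)) {
    strata <- trimws(strsplit(variance, ";", fixed = TRUE)[[1]])
  }

  list(nTrt = length(ratio), ratio = ratio, strata = strata)
}

# Four groups with ratio 1:1:1:2, stratified by a, b, and c
args <- parseTrtAssignDef(formula = "1;1;1;2", variance = "a;b;c")
str(args)
```

The resulting list is meant to line up with the `nTrt`, `ratio`, and `strata` arguments of `trtAssign`, assuming the installed version accepts all three.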

@kgoldfeld kgoldfeld added the feature feature request or enhancement label Oct 12, 2020
@assignUser
Collaborator

I see what you mean, makes sense. I will likely have time to look it over at the end of the week.

@kgoldfeld
Owner Author

Should be able to use trtAssign code, or some of it. I guess we would keep trtAssign as well.

@assignUser
Collaborator

This just sparked an idea... we could possibly rework the data definitions / "dist" column to contain not only distributions but also "modifications" that are applied to the data: dists, trtAssign, addMissing, user-defined functions (#71), ...
That way the complete data workflow would be contained in the definition and clearly readable, which, as I understood, is a high priority for you.

This would definitely be quite some work but could be of value, and as we are considering breaking changes in several different places, it might be a good time to implement such sweeping changes (maybe as simstudy 1.0.0?).
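As a rough illustration of the "modifications" idea, the sketch below uses base R only. Everything in it is hypothetical: the `steps` registry, the `step` field, and `applyDefs` are made-up names, and the treatment step is a plain balanced 1:1 shuffle rather than simstudy's `trtAssign`.

```r
# Hypothetical registry: each entry is a function that takes the data so far
# plus one definition and returns the data with a column added.
steps <- list(
  normal = function(dt, def) {
    dt[[def$varname]] <- rnorm(nrow(dt), mean = def$mean, sd = def$sd)
    dt
  },
  trtAssign = function(dt, def) {
    # Balanced 1:1 assignment: alternate 0/1, then shuffle
    dt[[def$varname]] <- sample(rep_len(0:1, nrow(dt)))
    dt
  }
)

# One definition list describes the whole workflow, in order
defs <- list(
  list(varname = "x", step = "normal", mean = 0, sd = 1),
  list(varname = "rx", step = "trtAssign")
)

# Apply each definition in turn, dispatching on its step name
applyDefs <- function(n, defs, steps) {
  dt <- data.frame(id = seq_len(n))
  for (def in defs) {
    dt <- steps[[def$step]](dt, def)
  }
  dt
}

set.seed(1)
dd <- applyDefs(10, defs, steps)
```

Dispatching on a name keeps the definition readable top to bottom, and a user-defined function would simply be one more entry in the registry.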

@kgoldfeld
Owner Author

I see the appeal of that, though I do have to say I like the current flow of keeping the missing data process different from the underlying (true) data generation process. They are two different processes, so I think I would like to keep them separate. As you know, though, I am very keen on being able to define the randomized treatment assignment in the data definition - that to me is a key part of the underlying data generation process. And the truncation obviously.

Maybe excluding the missing data from this will simplify things, so that it is not as big a lift once the new dataDef arguments are in place.

@assignUser
Collaborator

#71 and #75 both seem relevant!
