In the documentation for bootstraps(), note that the bootstrap method assumes data come from independent simple random samples (potentially within strata), and provide a reference about when this assumption is (in)appropriate. Optionally, point users to other packages that implement bootstrap methods appropriate to their data.
Why this matters:
This feature is important because the 'tidymodels' suite of packages is sometimes the first (or one of the first) places where users learn about the bootstrap. That's why materials such as this vignette take the time to give a quick introduction to it. For users without much statistical training, the bootstrap can seem like a silver-bullet tool, but it's easy to forget--or never learn--that the basic bootstrap makes strong implicit assumptions about how your data were collected, and so it can easily be misused.
References on bootstrap methods for non-iid data:
For surveys:
Mashreghi, Z., Haziza, D., and Léger, C. (2016). "A survey of bootstrap methods in finite population sampling." Statistics Surveys, 10, 1-52. https://doi.org/10.1214/16-SS113
For time series:
This is an area where I don't have much experience, so I'm not sure of a good general reference paper to recommend on bootstraps for time series.
R Packages:
For surveys:
'svrep': This vignette discusses bootstrap methods for survey data and how to implement them using the package. The key functions are as_bootstrap_design() and as_gen_boot_design(), which implement bootstrap methods for a wide variety of complex sampling methods.
The basic bootstrap methods implemented in 'rsample' are based on an assumption of independent sampling (potentially within strata), which justifies the use of independent sampling with replacement as a method of forming bootstrap resamples. But this assumption is inappropriate for many datasets used in practice. Two prominent examples are complex survey data (such as the widely used American Community Survey) and cluster-randomized experiments, where the basic bootstrap method can produce drastic underestimates of sampling error.
There are many variations of the bootstrap that have been developed for handling data that aren't simple, independent random samples (the generalized bootstrap, the block bootstrap, the rescaled bootstrap, etc.). I'm not sure that 'rsample' is the right place to implement these various bootstraps. But I do think that users of 'rsample' would be well served by documentation that makes them aware of the limitations of the bootstrap method implemented in the package.
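To make the underestimation concrete, here is a minimal base R sketch (not 'rsample' code) comparing a row-level bootstrap with a cluster-level bootstrap on simulated clustered data. Everything in it is an illustrative assumption: the cluster counts, the variance components, and the use of plain `sample()` to stand in for independent resampling of rows. Resampling whole clusters is only one simple variation and is not meant as a recommendation for any particular design.

```r
set.seed(2023)

# Simulated (hypothetical) clustered data: 30 clusters of 20 units each.
# A shared cluster effect makes observations within a cluster correlated.
n_clusters <- 30
cluster_size <- 20
cluster_id <- rep(seq_len(n_clusters), each = cluster_size)
cluster_effect <- rnorm(n_clusters, mean = 0, sd = 2)
y <- 10 + cluster_effect[cluster_id] + rnorm(n_clusters * cluster_size, sd = 1)

n_boot <- 1000

# Row-level bootstrap: resample individual observations with replacement,
# ignoring the cluster structure entirely.
naive_means <- replicate(n_boot, mean(sample(y, replace = TRUE)))

# Cluster-level bootstrap: resample whole clusters with replacement and
# keep every observation from each selected cluster.
y_by_cluster <- split(y, cluster_id)
cluster_means <- replicate(n_boot, {
  picked <- sample(n_clusters, replace = TRUE)
  mean(unlist(y_by_cluster[picked]))
})

# Bootstrap standard errors of the sample mean under each scheme.
se_naive <- sd(naive_means)
se_cluster <- sd(cluster_means)

print(c(row_level = se_naive, cluster_level = se_cluster))
```

Under these assumptions, the row-level standard error comes out much smaller than the cluster-level one, because treating 600 correlated observations as independent overstates how much information the sample contains.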
Thank you for your thoughtful issue @bschneider! I think it makes sense to add a bit more to the documentation on where bootstrapping is appropriate, but a section on when it is not would likely always be incomplete (or completely take over the documentation). Do you have a favorite reference for when it is appropriate that you'd like to see included in the docs?
I think a good reference for the context of 'rsample' is the book "Computer Age Statistical Inference" (CASI) by Efron and Hastie. The book in general provides an excellent overview of resampling methods; section 10.3 "Resampling Plans" in particular provides a great introduction to thinking of resampling more generally than just a list of cases included in a resample.
So I'd recommend adding a reference to Chapter 10 of CASI for readers to learn about when the bootstrap method in 'rsample' is appropriate and to get intuition about why the bootstrap works. There are other good references that give alternative justifications for why the basic bootstrap works in IID settings and how it can be extended to non-IID settings (e.g., Fay (1984), Mashreghi et al. (2016), and Beaumont and Patak (2012)). But CASI is a good general, readable introduction with nice examples and visualizations.