Better documentation about when the 'rsample' bootstrap function is (in)appropriate #405

Open
bschneidr opened this issue Dec 21, 2022 · 2 comments

Comments

@bschneidr

Feature Request

In the documentation for bootstraps(), note that the bootstrap method assumes data come from independent simple random samples (potentially within strata), and provide a reference about when this assumption is (in)appropriate. Optionally, point users to other packages that implement bootstrap methods appropriate to their data.

Why this matters:

This feature matters because the 'tidymodels' suite of packages is sometimes the first (or one of the first) places where users learn about the bootstrap. That's why materials such as this vignette take the time to give a quick introduction to it. For users without much statistical training, the bootstrap can seem like a silver-bullet tool, but it's easy to forget, or never learn, that the basic bootstrap makes strong implicit assumptions about how your data were collected, so it can easily be misused.

References on bootstrap methods for non-iid data:

For surveys:

  • Zeinab Mashreghi, David Haziza, and Christian Léger. "A survey of bootstrap methods in finite population sampling." Statist. Surv. 10, 1-52, 2016. https://doi.org/10.1214/16-SS113

For time series:

This is an area where I don't have much experience, so I'm not sure of a good general reference paper to recommend on bootstraps for time series.

R Packages:

For surveys:

For other types of complex data

Background

The basic bootstrap methods implemented in 'rsample' assume independent sampling (potentially within strata), which is what justifies forming bootstrap resamples by independent sampling with replacement. But this assumption is inappropriate for many datasets used in practice. Two prominent examples are complex survey data (such as the widely used American Community Survey) and cluster-randomized experiments, where the basic bootstrap method can produce drastic underestimates of sampling error.
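To make the underestimation concrete, here is a minimal, language-agnostic sketch in Python (standard library only). It is not 'rsample' code: the data are simulated, and the function names (`simulate_clustered`, `naive_bootstrap_se`, `cluster_bootstrap_se`) are hypothetical names chosen for illustration. It compares a row-level bootstrap, which ignores clustering, with a bootstrap that resamples whole clusters.

```python
import random
import statistics


def simulate_clustered(n_clusters=30, per_cluster=20, seed=1):
    """Simulate clusters whose rows share a common random effect (made-up data)."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        effect = rng.gauss(0, 1)  # shared by every row in this cluster
        clusters.append([effect + rng.gauss(0, 1) for _ in range(per_cluster)])
    return clusters


def naive_bootstrap_se(clusters, reps=500, seed=2):
    """Resample individual rows with replacement, ignoring cluster membership."""
    rng = random.Random(seed)
    rows = [y for cluster in clusters for y in cluster]
    means = [
        statistics.fmean(rng.choices(rows, k=len(rows))) for _ in range(reps)
    ]
    return statistics.stdev(means)


def cluster_bootstrap_se(clusters, reps=500, seed=3):
    """Resample whole clusters with replacement, keeping their rows together."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sampled = rng.choices(clusters, k=len(clusters))
        means.append(statistics.fmean([y for cluster in sampled for y in cluster]))
    return statistics.stdev(means)


clusters = simulate_clustered()
naive_se = naive_bootstrap_se(clusters)
cluster_se = cluster_bootstrap_se(clusters)
print(f"row-level bootstrap SE of the mean: {naive_se:.3f}")
print(f"cluster bootstrap SE of the mean:   {cluster_se:.3f}")
```

Under these simulated settings, the row-level standard error should come out noticeably smaller than the cluster-level one, because rows within a cluster are correlated and carry less independent information than the row count suggests.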

There are many variations of the bootstrap that have been developed for handling data that aren't simple, independent random samples (the generalized bootstrap, the block bootstrap, the rescaled bootstrap, etc.). I don't know that 'rsample' makes sense as a place to implement these various bootstraps. But I do think that users of 'rsample' would be well-served by documentation that makes them aware of the limitations of the bootstrap method implemented in the package.

@hfrick
Member

hfrick commented Apr 27, 2023

Thank you for your thoughtful issue @bschneidr! I think it makes sense to add a bit more to the documentation on where bootstrapping is appropriate, but a section on when it is not would likely always be incomplete (or completely take over the documentation). Do you have a favorite reference for when it is appropriate that you'd like to see included in the docs?

@bschneidr
Author

bschneidr commented Sep 21, 2023

Thanks for considering, Hannah.

I think a good reference for the context of 'rsample' is the book "Computer Age Statistical Inference" (CASI) by Efron and Hastie. The book in general provides an excellent overview of resampling methods; section 10.3 "Resampling Plans" in particular provides a great introduction to thinking of resampling more generally than just a list of cases included in a resample.

CASI also has the big advantage of being made freely available by the authors: https://hastie.su.domains/CASI_files/PDF/casi.pdf.

So I'd recommend adding a reference to Chapter 10 of CASI for readers to learn about when the bootstrap method in 'rsample' is appropriate and to get intuition about why the bootstrap works. There are other good references that give alternative justifications for why the basic bootstrap works in IID settings and how it can be extended to non-IID settings (e.g., Fay (1984), Mashreghi et al. (2016), and Beaumont and Patak (2012)). But CASI is a good general, readable introduction with nice examples and visualizations.
