Contribution instructions should state dataset limitations (CRAN) #29

Open
tomjemmett opened this issue Oct 6, 2020 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@tomjemmett
Member

Currently the contribution section in Readme.md doesn't state any of the limitations imposed by CRAN, namely that the entire package must be under 5MB in size.

@chrismainey
Collaborator

Agreed. We should probably write guidance on other ways to submit larger datasets, with emphasis on this package being for training, not a general data warehouse. Hacktoberfest?

@tomjemmett
Member Author

Here are some of my (highly opinionated) thoughts on this. I think we should aim to follow the Tidyverse Style Guide where possible.

  • datasets should be designed for teaching how to do things in R, so they should be easy to understand and relevant to a general audience
  • datasets need to be relatively small in size; CRAN has a limit of 5MB for the entire package, so each dataset should be no more than 500KB in size. You can check with object.size()
  • datasets should not contain any sensitive or disclosive information; they are being released publicly. The data ideally should be from a published source, or synthetic/generated data
  • datasets that come from other sources must be licensed under a suitable license for reshaping, e.g. MIT, GPL, OGL, CC. Attribution to the source data must be included
  • datasets should be saved as a tibble - you can use as_tibble() to convert
  • datasets should be named using snake case (as the Tidyverse Style Guide recommends), as should all columns within the dataset
  • datasets should be documented with roxygen2; this documentation should be a high level overview
  • datasets should have a vignette that describes the data in more detail than the documentation does, and that contains at least one useful example (ideally several) of how to use the data with useful R functions
  • vignettes should use tidyverse functions and avoid base R and data.table, mainly because the introductory training NHS-R offers focuses on the tidyverse
  • vignettes should not require the use of too many extra packages. Any packages you use must be included in the Suggests section of DESCRIPTION
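
As a rough illustration of the size and tibble points in the list above, here is a minimal sketch. The dataset name `admissions`, its columns, and the `size_limit` variable are hypothetical and only used for illustration; the 500KB figure is the per-dataset guideline suggested above, not a CRAN rule. Note that object.size() reports the in-memory size, which may differ from the compressed size on disk.

```r
# Hypothetical toy dataset; the name and columns are illustrative only.
admissions <- data.frame(
  date  = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  site  = rep(c("A", "B"), times = 50),
  count = rep(c(12L, 7L, 9L, 15L), times = 25),
  stringsAsFactors = FALSE
)

# Convert the data frame to a tibble before saving it.
admissions <- tibble::as_tibble(admissions)

# object.size() returns the in-memory size of the object in bytes.
size_limit   <- 500 * 1024  # assumed 500KB per-dataset guideline
dataset_size <- as.numeric(object.size(admissions))

# Show the size in a human-readable form.
print(format(object.size(admissions), units = "Kb"))

# Flag datasets that exceed the guideline.
within_limit <- dataset_size <= size_limit
if (!within_limit) {
  warning("Dataset exceeds the suggested 500KB guideline")
}
```

In a package, an object that passes this check would then typically be saved with something like usethis::use_data(), although the exact workflow is up to the maintainers.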

Lextuga007 added the documentation label Oct 1, 2024
@Lextuga007
Member

Adding to this list:

  • each function should have an example that goes beyond glimpse(), which is now covered in the Get Started vignette. For instance, the example for ons_mortality shows how to view the data in wide form, with each date as a column.
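
To give a flavour of what such an example might look like, here is a small sketch of reshaping long data into wide form with one column per date. The `toy` data frame and its columns are made up for illustration and are not the real ons_mortality dataset; the sketch assumes tidyr is installed.

```r
library(tidyr)

# Hypothetical long-format toy data; NOT the real ons_mortality dataset.
toy <- data.frame(
  category = rep(c("Under 65", "65 and over"), each = 3),
  date     = rep(as.Date(c("2020-01-03", "2020-01-10", "2020-01-17")), times = 2),
  counts   = c(10L, 12L, 11L, 40L, 38L, 45L)
)

# One row per category, one column per date.
wide <- pivot_wider(toy, names_from = date, values_from = counts)
print(wide)
```

An example like this shows learners a reshaping step they are likely to need, which a call to glimpse() alone would not.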

3 participants