Lg update pr instructions #763

Merged 8 commits on Oct 30, 2024
98 changes: 57 additions & 41 deletions .github/pr_instructions.md
@@ -4,64 +4,80 @@ Thank you for helping us help learners!

There are 4 main steps to submit a dataset:

1. [Find a dataset.](#find-a-dataset)
2. [Prepare your repository.](#prepare-your-repository)
3. [Create a branch.](#create-a-branch)
4. [Prepare the dataset.](#prepare-the-dataset)

## Find a dataset

Find a dataset that would be good for TidyTuesday: either one that is already ready for analysis, or one that you can clean so that it meets the criteria. These are the requirements for a dataset:

- Data can be saved as one or more CSV files.

- The whole dataset (all files) is less than 20MB.

- You can describe each variable (either using an existing data dictionary or by creating your own dictionary).

- The data is publicly available and free for reuse, either with or without attribution.

You will also need:

- The source of the dataset

- An article about the dataset or that uses the dataset

- At least one image related to or using the dataset

## Prepare your repository

You'll need to perform this step the first time you submit a pull request to this repository.
A "pull request" is a submission of code to a git repository. If you have never worked with git before, that's fine! We'll help you get set up.

1. Set up Git, GitHub, and your IDE (such as RStudio). We have step-by-step [instructions for setting up things to work with the Data Science Learning Community](https://github.com/r4ds/bookclub-setup?tab=readme-ov-file#setting-up-for-data-science-learning-community-book-clubs).
2. Fork the tidytuesday repository. In R, you can use `usethis::create_from_github("rfordatascience/tidytuesday")` to create your personal fork on GitHub and copy it to your computer. Note: This requires about **8 GB** of space on disk.

## Create a branch

We use a fork/branch approach to pull requests, meaning you'll create a version of the repo specifically for your changes, and then ask us to merge those changes into the main tidytuesday repository.

1. If you are on anything other than the `master` branch of your local repository, switch back to master. In R, you can use `usethis::pr_pause()` (if your previous submission is still pending), or `usethis::pr_finish()` (if we've accepted your submission).

2. Pull the latest version of the repository to your computer. In R, use `usethis::pr_merge_main()`

3. Create a new branch, with a name similar to that of the dataset you're submitting. In R, you can create this branch using `usethis::pr_init(BRANCHNAME)`. For instance, for a dataset on American baseball, something like "american-baseball" works: `usethis::pr_init("american-baseball")`.

4. Navigate to the `data/curated` folder in your branch of the repository.

5. Make a copy of the `template` folder for your dataset, inside the `curated` folder. Name it something descriptive -- the same name as your branch would work, so "american-baseball" not "my_dataset".

6. Inside the folder you just created is where you're going to do your work.
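Taken together, steps 1-3 above can be run from the R console like this (the branch name is just an example; pick one that matches your dataset):

```r
# If a previous submission is still pending, pause it to return to master;
# if it was accepted, run usethis::pr_finish() instead.
usethis::pr_pause()

# Pull the latest version of the main tidytuesday repository
usethis::pr_merge_main()

# Start a new branch named for your dataset (example name)
usethis::pr_init("american-baseball")
```

These are interactive usethis helpers that talk to GitHub, so run them in a session where you've already completed the setup in "Prepare your repository".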

## Prepare the dataset

A copy of the following instructions is also available in the folder you've created, as `instructions.md`.
These instructions are for preparing a dataset using the R programming language, but we hope to provide instructions for other programming languages eventually.

1. `cleaning.R`: Modify the `cleaning.R` file to get and clean the data.
- Write the code to download and clean the data in `cleaning.R`.
- If you're getting the data from a github repo, remember to use the 'raw' version of the URL.
- This script should result in one or more data.frames, with descriptive variable names (eg `players` and `teams`, not `df1` and `df2`).

2. `saving.R`: Use `saving.R` to save your datasets. This process creates both the `.csv` file(s) and the data dictionary template file(s) for your datasets. **Don't save the CSV files using a separate process because we also need the data dictionaries.**
- Run the first line of `saving.R` to create the functions we'll use to save your dataset.
- Provide the name of your directory as `dir_name`.
- Use `ttsave()` for each dataset you created in `cleaning.R`, substituting the name of the dataset for `YOUR_DATASET_DF`.

3. `{dataset}.md`: Edit the `{dataset}.md` files to describe your datasets (where `{dataset}` is the name of the dataset). These files are created by `saving.R`. There should be one file for each of your datasets. You most likely only need to edit the "description" column to provide a description of each variable.

4. `intro.md`: Edit the `intro.md` file to describe your dataset. You don't need to add a `# Title` at the top; this is just a paragraph or two to introduce the week.

5. Find at least one image for your dataset. These often come from the article about your dataset. If you can't find an image, create an example data visualization, and save the images in your folder as `png` files.

6. `meta.yaml`: Edit `meta.yaml` to provide information about your dataset and how we can credit you. You can delete lines from the `credit` block that do not apply to you.
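As a minimal sketch of step 1, a `cleaning.R` script might look like the following. The data here is an inline stand-in so the example runs anywhere; in a real script you would download from your source, using the 'raw' URL for files hosted on GitHub. All names and values below are hypothetical:

```r
# cleaning.R: get and clean the data.
# Inline stand-in for a real download such as:
# players_raw <- read.csv("https://raw.githubusercontent.com/OWNER/REPO/main/players.csv")
raw_csv <- "Player,Team,HomeRuns\nAaron,ATL,44\nRuth,NYY,54\nUnknown,NYY,NA"
players_raw <- read.csv(text = raw_csv, stringsAsFactors = FALSE)

# End with one or more data.frames with descriptive names (players, not df1)
players <- players_raw[!is.na(players_raw$HomeRuns), ]
names(players) <- tolower(names(players))

# Later, in saving.R (after sourcing curation_scripts.R and setting dir_name):
# ttsave(players)
```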

### Submit your pull request with the data

1. Commit the changes in this folder to your branch. In RStudio, you can do this on the "Git" tab (the "Commit" button).

2. Submit a pull request to <https://github.com/rfordatascience/tidytuesday>. In R, you can do this with `usethis::pr_push()`, and then follow the instructions in your browser.
34 changes: 20 additions & 14 deletions data/curated/template/instructions.md
@@ -5,20 +5,26 @@ We hope to provide instructions for other programming languages eventually.

If you have not yet set up your computer for submitting a dataset, please see the full instructions at <https://github.com/rfordatascience/tidytuesday/blob/master/.github/pr_instructions.md>.

1. `cleaning.R`: Modify the `cleaning.R` file to get and clean the data.
- Write the code to download and clean the data in `cleaning.R`.
- If you're getting the data from a github repo, remember to use the 'raw' version of the URL.
- This script should result in one or more data.frames, with descriptive variable names (eg `players` and `teams`, not `df1` and `df2`).

2. `saving.R`: Use `saving.R` to save your datasets. This process creates both the `.csv` file(s) and the data dictionary template file(s) for your datasets. **Don't save the CSV files using a separate process because we also need the data dictionaries.**
- Run the first line of `saving.R` to create the functions we'll use to save your dataset.
- Provide the name of your directory as `dir_name`.
- Use `ttsave()` for each dataset you created in `cleaning.R`, substituting the name of the dataset for `YOUR_DATASET_DF`.

3. `{dataset}.md`: Edit the `{dataset}.md` files to describe your datasets (where `{dataset}` is the name of the dataset). These files are created by `saving.R`. There should be one file for each of your datasets. You most likely only need to edit the "description" column to provide a description of each variable.

4. `intro.md`: Edit the `intro.md` file to describe your dataset. You don't need to add a `# Title` at the top; this is just a paragraph or two to introduce the week.

5. Find at least one image for your dataset. These often come from the article about your dataset. If you can't find an image, create an example data visualization, and save the images in your folder as `png` files.

6. `meta.yaml`: Edit `meta.yaml` to provide information about your dataset and how we can credit you. You can delete lines from the `credit` block that do not apply to you.

### Submit your pull request with the data

1. Commit the changes in this folder to your branch. In RStudio, you can do this on the "Git" tab (the "Commit" button).

2. Submit a pull request to <https://github.com/rfordatascience/tidytuesday>. In R, you can do this with `usethis::pr_push()`, and then follow the instructions in your browser.
4 changes: 0 additions & 4 deletions data/curated/template/meta.yaml
@@ -7,10 +7,6 @@ data_source:
url: URL TO THAT SOURCE
images:
# Please include at least one image, and up to three images
- file: FILENAME.png
alt: >
ALT TEXT FOR THIS IMAGE. THIS TEXT SHOULD BE IN "Sentence case." AND SHOULD
SERVE AS A *REPLACEMENT* FOR THE IMAGE, NOT JUST *DESCRIBE* THE IMAGE
- file: FILENAME.png
alt: >
ALT TEXT FOR THIS IMAGE. THIS TEXT SHOULD BE IN "Sentence case." AND SHOULD
2 changes: 1 addition & 1 deletion data/curated/template/saving.R
@@ -1,7 +1,7 @@
# Run this
source("data/curated/curation_scripts.R")

# Fill in the name of the folder you created in "curated", then run this.
dir_name <- "name_of_your_dir"

# Run this for each of your datasets, replacing YOUR_DATASET_DF with the name of