Merge pull request #91 from tidy-survey-r/reproducible-data
Draft of Reproducible Research Chapter
ivelasq authored Jan 30, 2024
2 parents 5992ecf + 6953861 commit e3f00a0
Show file tree
Hide file tree
Showing 3 changed files with 138 additions and 73 deletions.
73 changes: 1 addition & 72 deletions 08-communicating-results.Rmd
@@ -70,7 +70,7 @@ The inclusion of this material is especially important if no methodology report

In addition to how the survey was conducted and how weights were calculated, it is also important to provide information about the data preparation, cleaning, and analyses used to obtain these results. For example, in Chapter \@ref(c06-statistical-testing), we compared the distributions of education from the survey to the ACS. To do this, we needed to collapse the education categories provided in the ANES data to match the ACS. Providing both the original question wording and response options, along with the steps taken to map to the ACS data, is important for ensuring transparency and a better understanding of the results.

This particular example may seem obvious (combining a Bachelor's Degree and a Graduate Degree into a single category). Still, there are cases where recoding or handling missing data is more important to disclose, as there may be multiple reasonable ways to handle the data, and the choice we made as researchers is just one of many. For example, many examples and exercises in this book remove missing data, as this is often the simplest approach. However, in some cases, missing data could be a substantively important piece of information, and removing it could bias the results (see Chapter \@ref(c11-missing-data)). Disclosing how the data were handled is crucial in helping the audience better understand the results.

### Results

@@ -507,75 +507,4 @@ pfull <-
pfull
```

## Reproducibility

Reproducibility is the ability to recreate or replicate the results of a data analysis. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. Reproducibility is a crucial aspect of survey research because it enables the verification of findings and ensures that the conclusions are not dependent on a particular person running the workflow. Others can review and rerun projects to build on existing work, reducing redundancy and errors.

Reproducibility requires that we consider several key components:

- **Code**: The source code used for data cleaning, analysis, modeling, and reporting must be available, discoverable, documented, and shared.
- **Data**: The raw data used in the workflow must be available, discoverable, documented, and shared. If the raw data are sensitive or proprietary, we should provide as much of the data as possible so that others can run our workflow, or direct them to where they can access a restricted use file (RUF).
- **Environment**: The environment of the project must be documented. Another analyst should be able to recreate the environment, including the R version, packages, operating system, and other dependencies used in the analysis.
- **Methodology**: The analysis methodology, including the rationale behind specific decisions, interpretations, and assumptions, must be documented. Others should be able to achieve the same analysis results based on the methodology report.

Many tools, practices, and project management techniques exist to make survey analysis projects easy to reproduce. For best results, they should be decided upon and applied at the beginning of a project. Below are our suggestions for a survey analysis data workflow. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow.

### Setting Random Number Seeds

Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility.

In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function:

```r
set.seed(999)

runif(5)
```

The `runif()` function generates five random numbers from a uniform distribution. Since the seed is set to `999`, rerunning this code will always produce the same sequence of numbers:

```
[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
```

It is important to note that `set.seed()` should be called *before* any random number generation, but it only needs to be called once per program to make the entire program reproducible. For example, we might set the seed at the top of a script, where packages are typically loaded.
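
For example, a minimal sketch of a script that sets the seed once at the top (the packages and objects shown are illustrative):

```r
# Load packages and set the seed once, near the top of the script
library(dplyr)
library(srvyr)

set.seed(999)

# Every random operation below (sampling, imputation, etc.) now produces
# the same results each time the script is run
training_rows <- sample(seq_len(100), size = 70)
```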

### Git

A survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version becomes challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work.

Version control systems like Git can help alleviate these pains. Git is a system that tracks changes in computer files. Survey analysts can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made to a script, revert them, and resolve conflicts between versions.

Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.

<!--TODO: Add image-->

In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place.

Using version control in data science projects makes collaboration and maintenance more manageable. One excellent resource is [Happy Git and GitHub for the useR](https://happygitwithr.com/) by Jenny Bryan and Jim Hester.
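
For R users, the {usethis} package offers a convenient way to put a project under version control and connect it to GitHub without leaving R. A minimal sketch, assuming {usethis} is installed:

```r
# Initialize version control for the active project with {usethis}
usethis::use_git()     # create a local Git repository and make an initial commit
usethis::use_github()  # optionally, create and connect a repository on GitHub
                       # (requires a configured GitHub token)
```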

### {renv}

The {renv} package is a popular option for managing dependencies and creating virtual environments in R. It creates isolated, project-specific environments that record the packages and their versions used in the code. When the project is loaded, {renv} checks whether the installed packages are consistent with the record; if they are not, it can restore the correct versions needed to run the project.

With {renv}, others can replicate the project's environment to rerun the code and obtain consistent results.
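
A minimal sketch of a typical {renv} workflow, run interactively at the R console:

```r
# Record and restore the project's package environment with {renv}
renv::init()      # set up a project-specific library and create renv.lock
# ...install and use packages as usual while developing...
renv::snapshot()  # record the packages and versions currently in use
renv::restore()   # on another machine, reinstall the recorded versions
```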

### Quarto/R Markdown

Quarto and R Markdown are powerful tools that allow us to create documents that combine code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output.

Rendering a Quarto or R Markdown document re-executes the underlying code, so another team member can recreate the report and obtain the same results.
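
As a minimal sketch, a small Quarto document might look like the following (the title and computation are illustrative); rendering it executes the R chunk and places the output directly in the finished report:

````
---
title: "Survey Results Summary"
format: html
---

The value below is computed when the document is rendered, so the text and
the result can never get out of sync:

```{r}
# Example computation executed at render time
mean(mtcars$mpg)
```
````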

#### Parameterization {-}

Parameterization in Quarto and R Markdown is an important tool for reproducible reporting. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, we can create a document that provides survey analysis results for Michigan. By defining a `state` parameter, we can rerun the same analysis for Wisconsin without editing code throughout the document.

We can define parameters in the YAML header or code chunks of our Quarto/R Markdown documents and easily modify and document their values, reducing errors that may occur when manually editing code throughout a script. Parameterization also gives others a flexible way to replicate the analysis and explore variations.
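
As a minimal sketch (the file name, data set, and parameter are hypothetical), we can declare a `state` parameter in the YAML header and reference it as `params$state` in the code:

````
---
title: "State Survey Results"
output: html_document
params:
  state: "Michigan"
---

```{r}
# Subset a hypothetical analysis data set to the state supplied as a parameter
state_data <- subset(analysis_data, state_name == params$state)
```
````

We can then render the same report for another state without editing the document itself:

```r
# Render the report for Wisconsin instead (file name is hypothetical)
rmarkdown::render("state-report.Rmd", params = list(state = "Wisconsin"))
```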

### The {targets} package

The {targets} package is a workflow manager that enables us to document, automate, and execute complex data workflows with multiple steps and dependencies. We define the steps of the workflow and the dependencies between them; when we change a script, only the affected code and its downstream targets are re-executed. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline.

This tool helps with reproducibility by tracking dependencies, inputs, and outputs of each step of our workflow.
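
As a minimal sketch, a `_targets.R` file might define a pipeline like the following (the file path, column name, and target names are hypothetical):

```r
# _targets.R: a minimal sketch of a {targets} pipeline
library(targets)

# Packages available to all targets
tar_option_set(packages = c("dplyr"))

list(
  tar_target(raw_file, "data/survey_raw.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(clean_data, dplyr::filter(raw_data, !is.na(weight))),
  tar_target(results, summary(clean_data))
)
```

Running `targets::tar_make()` executes the pipeline; after a change to one step, only that target and its downstream dependencies are rebuilt.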

As noted above, many tools, practices, and project management techniques exist for achieving reproducibility. Most critical is deciding with our team on reproducibility goals, and on the requirements to achieve them, before settling on a workflow and documentation.
