diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index 3e9b7a5b..fd905c4d 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -70,7 +70,7 @@ The inclusion of this material is especially important if no methodology report In addition to how the survey was conducted and how weights were calculated, providing information about what data prep, cleaning, and analyses were used to obtain these results is also important. For example, in Chapter \@ref(c06-statistical-testing), we compared the distributions of education from the survey to the ACS. To do this, we needed to collapse education categories provided in the ANES data to match the ACS. Providing both the original question wording and response options and the steps taken to map to the ACS data are important for the audience to know to ensure transparency and a better understanding of the results. -This particular example may seem obvious (combining a Bachelor's Degree and a Graduate Degree into a single category). Still, there are cases where re-coding or handling missing data is more important to disclose as there could be multiple ways to handle the data, and the choice we made as researchers was just one of many. For example, many examples and exercises in this book remove missing data, as this is often the easiest way to handle missing data. However, in some cases, missing data could be a substantively important piece of information, and removing it could bias results. Disclosing how data was handled is crucial in helping the audience better understand the results. +This particular example may seem obvious (combining a Bachelor's Degree and a Graduate Degree into a single category). Still, there are cases where re-coding or handling missing data is more important to disclose as there could be multiple ways to handle the data, and the choice we made as researchers was just one of many. 
For example, many examples and exercises in this book remove missing data, as this is often the easiest way to handle missing data. However, in some cases, missing data could be a substantively important piece of information, and removing it could bias results (see Chapter \@ref(c11-missing-data)). Disclosing how data was handled is crucial in helping the audience better understand the results. ### Results @@ -507,75 +507,4 @@ pfull <- pfull ``` -## Reproducibility -Reproducibility is the ability to recreate or replicate the results of a data analysis. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. Reproducibility is a crucial aspect of survey research because it enables the verification of findings and ensures that the conclusions are not dependent on a particular person running the workflow. Others can review and rerun projects to build on existing work, reducing redundancy and errors. - -Reproducibility requires that we consider several key components: - - - **Code**: The source code used for data cleaning, analysis, modeling, and reporting must be available, discoverable, documented, and shared. - - **Data**: The raw data used in the workflow must be available, discoverable, documented, and shared. If the raw data is sensitive or proprietary, we should strive to provide as much data as possible that would allow others to run our workflow or direct others to where they can access a restricted use file (RUF). - - **Environment**: The environment of the project must be documented. Another analyst should be able to recreate the environment, including the R version, packages, operating system, and other dependencies used in the analysis. - - **Methodology**: The analysis methodology, including the rationale behind specific decisions, interpretations, and assumptions, must be documented. 
Others should be able to achieve the same analysis results based on the methodology report. - -Many tools, practices, and project management techniques exist to make survey analysis projects easy to reproduce. For best results, they should be decided upon and applied at the beginning of a project. Below are our suggestions for a survey analysis data workflow. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. - -### Setting Random Number Seeds - -Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. - -In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: - -```r -set.seed(999) - -runif(5) -``` - -The `runif()` function generates five random numbers from a uniform distribution. Since the seed is set to `999`, running `runif()` multiple times will always produce the same sequence: - -``` -[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676 -``` - -It is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we might set the seed at the top of a program where libraries tend to be loaded. - -### Git - -A survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version becomes challenging. 
If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. - -Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysis can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve conflicts between versions. - -Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. - - - -In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. - -Using version control in data science projects makes collaboration and maintenance more manageable. One excellent resource is [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/). - -### {renv} - -The {renv} package is a popular option for managing dependencies and creating virtual environments in R. It creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated, {renv} checks whether the installed packages are consistent with the record. If not, it restores the correct versions for running the project. - -With {renv}, others can replicate the project's environment to rerun the code and obtain consistent results. 
- -### Quarto/R Markdown - -Quarto and R Markdown are powerful tools that allow us to create documents that combine code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. - -Rerunning a Quarto or R Markdown document re-executes the underlying code. Another team member can recreate the report and obtain the same results. - -#### Parameterization {-} - -Quarto and R Markdown's parameterization is an important aspect of reproducibility in reporting. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, we can create a document that provides survey analysis results for Michigan. By defining a `state` parameter, we can rerun the same analysis for Wisconsin without having to edit the code throughout the document. - -We can define parameterization in the header or code chunks of our Quarto/R Markdown documents. Again, we can easily modify and document the values of these parameters, reducing errors that may occur by manually editing code throughout the script. Parameterization is also a flexible way for others to replicate the analysis and explore variations. - -### The {targets} package - -The {targets} package is a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. We define the order of execution for our code. Only the affected code and its downstream targets are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. 
- -This tool helps with reproducibility by tracking dependencies, inputs, and outputs of each step of our workflow. - -As noted above, many tools, practices, and project management techniques exist for achieving reproducibility. Most critical is deciding on reproducibility goals with our team and the requirements to achieve them before deciding on workflow and documentation. diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index d330259f..85659210 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -1 +1,131 @@ -# Reproducible data {#c09-reprex-data} +# Reproducible Research {#c09-reprex-data} + +Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Enabling the verification of our analysis is another goal of reproducibility. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. + +Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. These journals now require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. + +Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. 
The four main components that we should consider are: + + - **Code**: source code used for data cleaning, analysis, modeling, and reporting + - **Data**: raw data used in the workflow, or if data is sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF)) + - **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis + - **Methodology**: analysis methodology, including rationale behind decisions, interpretations, and assumptions + +In Chapter \@ref(c08-communicating-results), we briefly mention that each of these is important to include in the methodology report and when communicating the findings and results of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, researchers will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or even understand what we did a few months ago. Therefore, it would benefit other researchers and potentially our future selves to better document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, researchers should decide which techniques and tools will be used before starting a project (or very early on). + +This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. 
+ +## Project-based workflows + +We recommend a project-based workflow for analysis projects as described in Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund's book, R for Data Science, found at [r4ds.hadley.nz](https://r4ds.hadley.nz/). A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results. + +The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a `.Rproj` file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this: + +``` +| anes_analysis/ + | anes_analysis.Rproj + | README.md + | codebooks + | codebook2020.pdf + | codebook2016.pdf + | rawdata + | anes2020_raw.csv + | anes2016_raw.csv + | scripts + | data-prep.R + | data + | anes2020_clean.csv + | anes2016_clean.csv + | report + | anes_report.Rmd + | anes_report.html + | anes_report.pdf +``` + +The {here} package enables easy file referencing. In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. We use the `here::here()` function to build the path when we load or save data. Below, we ask R to read the CSV file `anes2020_clean.csv` in the project directory's `data` folder: + +```{r} +#| eval: false +#| label: project-file-example +anes <- + read_csv(here::here("data", "anes2020_clean.csv")) +``` + +The combination of projects and the {here} package keeps all associated files in an organized manner. 
This workflow makes it more likely that our analyses can be reproduced by us or our colleagues. + +## Version Control: Git + +Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. + +Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysts can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). + +Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. + + + +In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. + +Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/) [@git-w-R]. + +## Package Management: {renv} + +Ensuring reproducibility involves not only using version control of code, but also managing the versions of packages. 
If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) when doing some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. + +One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded versions for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results. + +## Workflow Management: {targets} + +With complex studies involving multiple code files and dependencies, it is important to ensure each step is executed in the intended sequence. We can do this manually, e.g., numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow. + +The {targets} package is growing in popularity as a workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. 
One nice feature of {targets} is that if we change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. + +## Documentation: Quarto and R Markdown + +Tools like Quarto and R Markdown aid in reproducibility by creating documents that integrate code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. + +Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another team member can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify. + +### Parameterization + +Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states. In that case, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. + +Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and can be easily modified and documented. 
Thus, parameterization reduces errors that may occur when manually editing code throughout the script, and it gives others a flexible way to replicate the analysis and explore variations. + +## Other Tips for Reproducibility + +### Random Number Seeds + +Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. + +In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: + +```r +set.seed(999) + +runif(5) +``` + +The `runif(5)` call generates five random numbers from a uniform distribution. Since the seed is set to `999`, rerunning this code will always produce the same sequence: + +``` +[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676 +``` + +The choice of the seed number is up to the researcher. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the exact number is not important. However, it is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we could set the seed at the top of a program where libraries are loaded. + +### Descriptive Names and Labels + +Using descriptive variable names or labeling data also assists with reproducible research. 
For example, in the ANES data, the variable names in the raw data all start with `V20` followed by a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method is up to the researcher, but providing this information can help ensure our research is reproducible. + +### Databases + +Researchers may consider creating a database for projects with complex or large data structures to manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors are introduced. Additionally, a database may be easier to pass to other researchers if existing relationships between tables and types are complex to map. + +## Summary + +We can promote accuracy and verification of results by making our analysis reproducible. This chapter discussed different ways to make research reproducible. There are various tools and guides available to help you achieve reproducibility in your work. 
Here are additional resources to explore: + +* R for Data Science chapter on project-based workflows: [https://r4ds.hadley.nz/workflow-scripts.html#projects](https://r4ds.hadley.nz/workflow-scripts.html#projects) +* Building reproducible analytical pipelines with R by Bruno Rodrigues: [https://raps-with-r.dev/](https://raps-with-r.dev/) +* Posit Solutions Site page on reproducible environments: [https://solutions.posit.co/envs-pkgs/environments/](https://solutions.posit.co/envs-pkgs/environments/) + diff --git a/book.bib b/book.bib index 21aba0c4..0529d60b 100644 --- a/book.bib +++ b/book.bib @@ -439,4 +439,10 @@ @proceedings{Scott2007 pages = {3514-3518}, year = {2007}, howpublished = {\url{http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf}} +} + +@book{git-w-R, + title = {Happy Git and GitHub for the useR}, + author = {Jenny Bryan and Jim Hester}, + howpublished = {\url{https://happygitwithr.com/}} } \ No newline at end of file