From d1d8cac7e99c6286ce2308aa4c700282b1dc27c6 Mon Sep 17 00:00:00 2001 From: rpowell22 Date: Sun, 14 Jan 2024 11:27:42 -0500 Subject: [PATCH 1/5] Draft of Reproducible Research Chapter and moving text from communicating results ch to reprex ch. --- 08-communicating-results.Rmd | 73 +------------------------------- 09-reproducible-data.Rmd | 81 +++++++++++++++++++++++++++++++++++- book.bib | 6 +++ 3 files changed, 87 insertions(+), 73 deletions(-) diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index 84ab6009..873ee3df 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -69,7 +69,7 @@ The inclusion of this material is especially important if no methodology report In addition to how the survey was conducted and how weights were calculated, providing information about what data prep, cleaning, and analyses were used to obtain these results is also important. For example, in Chapter \@ref(c06-statistical-testing), we compared the distributions of education from the survey to the ACS. To do this, we needed to collapse education categories provided in the ANES data to match the ACS. Providing both the original question wording and response options and the steps taken to map to the ACS data are important for the audience to know to ensure transparency and a better understanding of the results. -This particular example may seem obvious (combining a Bachelor's Degree and a Graduate Degree into a single category). Still, there are cases where re-coding or handling missing data is more important to disclose as there could be multiple ways to handle the data, and the choice we made as researchers was just one of many. For example, many examples and exercises in this book remove missing data, as this is often the easiest way to handle missing data. However, in some cases, missing data could be a substantively important piece of information, and removing it could bias results. 
Disclosing how data was handled is crucial in helping the audience better understand the results. +This particular example may seem obvious (combining a Bachelor's Degree and a Graduate Degree into a single category). Still, there are cases where re-coding or handling missing data is more important to disclose as there could be multiple ways to handle the data, and the choice we made as researchers was just one of many. For example, many examples and exercises in this book remove missing data, as this is often the easiest way to handle missing data. However, in some cases, missing data could be a substantively important piece of information, and removing it could bias results (see Chapter \@ref(c11-missing-data)). Disclosing how data was handled is crucial in helping the audience better understand the results. ### Results @@ -407,75 +407,4 @@ pfull <- pfull ``` -## Reproducibility -Reproducibility is the ability to recreate or replicate the results of a data analysis. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. Reproducibility is a crucial aspect of survey research because it enables the verification of findings and ensures that the conclusions are not dependent on a particular person running the workflow. Others can review and rerun projects to build on existing work, reducing redundancy and errors. - -Reproducibility requires that we consider several key components: - - - **Code**: The source code used for data cleaning, analysis, modeling, and reporting must be available, discoverable, documented, and shared. - - **Data**: The raw data used in the workflow must be available, discoverable, documented, and shared. If the raw data is sensitive or proprietary, we should strive to provide as much data as possible that would allow others to run our workflow or direct others to where they can access a restricted use file (RUF). 
- - **Environment**: The environment of the project must be documented. Another analyst should be able to recreate the environment, including the R version, packages, operating system, and other dependencies used in the analysis. - - **Methodology**: The analysis methodology, including the rationale behind specific decisions, interpretations, and assumptions, must be documented. Others should be able to achieve the same analysis results based on the methodology report. - -Many tools, practices, and project management techniques exist to make survey analysis projects easy to reproduce. For best results, they should be decided upon and applied at the beginning of a project. Below are our suggestions for a survey analysis data workflow. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. - -### Setting Random Number Seeds - -Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. - -In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: - -```r -set.seed(999) - -runif(5) -``` - -The `runif()` function generates five random numbers from a uniform distribution. 
Since the seed is set to `999`, running `runif()` multiple times will always produce the same sequence: - -``` -[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676 -``` - -It is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we might set the seed at the top of a program where libraries tend to be loaded. - -### Git - -A survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version becomes challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. - -Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysis can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve conflicts between versions. - -Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. - - - -In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. - -Using version control in data science projects makes collaboration and maintenance more manageable. 
One excellent resource is [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/). - -### {renv} - -The {renv} package is a popular option for managing dependencies and creating virtual environments in R. It creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated, {renv} checks whether the installed packages are consistent with the record. If not, it restores the correct versions for running the project. - -With {renv}, others can replicate the project's environment to rerun the code and obtain consistent results. - -### Quarto/R Markdown - -Quarto and R Markdown are powerful tools that allow us to create documents that combine code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. - -Rerunning a Quarto or R Markdown document re-executes the underlying code. Another team member can recreate the report and obtain the same results. - -#### Parameterization {-} - -Quarto and R Markdown's parameterization is an important aspect of reproducibility in reporting. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, we can create a document that provides survey analysis results for Michigan. By defining a `state` parameter, we can rerun the same analysis for Wisconsin without having to edit the code throughout the document. - -We can define parameterization in the header or code chunks of our Quarto/R Markdown documents. Again, we can easily modify and document the values of these parameters, reducing errors that may occur by manually editing code throughout the script. 
Parameterization is also a flexible way for others to replicate the analysis and explore variations. - -### The {targets} package - -The {targets} package is a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. We define the order of execution for our code. Only the affected code and its downstream targets are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. - -This tool helps with reproducibility by tracking dependencies, inputs, and outputs of each step of our workflow. - -As noted above, many tools, practices, and project management techniques exist for achieving reproducibility. Most critical is deciding on reproducibility goals with our team and the requirements to achieve them before deciding on workflow and documentation. diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index d330259f..2becf97d 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -1 +1,80 @@ -# Reproducible data {#c09-reprex-data} +# Reproducible Research {#c09-reprex-data} + +The ability to recreate or replicate the results of a data analysis is a crucial aspect of any research. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. This idea of reproducibility enables the verification of findings and ensures that the conclusions are not dependent on a particular person running the code or workflow or on a particular day and environment. Thus, having reproducible research also can reduce errors in the code as researchers need to be able to replicate and verify the results. Not only is this idea ethical in ensuring that findings are accurate, but it is also becoming a requirement for publishing output in many scientific journals. 
These journals are now requiring that authors make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. + +To ensure the research is reproducible, we need to ensure that the key components of analysis are available, discoverable, documented, and shared with others. The main four components that we should consider are: + + - **Code**: source code used for data cleaning, analysis, modeling, and reporting + - **Data**: raw data used in the workflow, or if data is sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF)) + - **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis + - **Methodology**: analysis methodology, including rationale behind decisions, interpretations, and assumptions + +In Chapter \@ref(c08-communicating-results), we briefly mention how discussion on each of these in important to include in the methodology report and when communicating the findings and results of a study. However, for us to be transparent and effective researchers, we need to make sure we not only discuss these through text, but also provide files and additional information when requested. Often when starting a project, researchers will dive into the data and make decisions as they go without full documentation which can be challenging if we need to go back and make changes or understand even what we did a few months ago. Therefore, it would benefit not only other researchers, but potentially our future selves to better document everything from the start. The good news is that there are many tools, practices, and project management techniques to make survey analysis projects easy to reproduce. For best results, researchers should decide which techniques and tools will be used prior (or very early on) starting a project. 
This chapter covers some of our suggestions of tools and techniques that can be used in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. + +## Version Control: Git + +Often a survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version can become challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. + +Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysis can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). + +Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. + + + +In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. + +Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting git with R, we recommend the [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/) [@git-w-R]. 
+ +## Package Management: {renv} + +Along with version control of code, version control for packages is also important for reproducibility. If two people run the same code, but use different versions of a package, the outcomes could be different depending on the changes that occurred. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} has some bugs (errors) when doing some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. + +One way to easily handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and easily manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the correct versions and ensures that others can replicate the project's environment to rerun the code and obtain consistent results. + +## Workflow Management: {targets} + +With complex studies that involve multiple code files, steps, and dependencies it can be important to ensure that all pieces are implemented in the correct order. This could be done manually (e.g., numbering files to indicate the order or providing detailed documentation on the order) or we can automate the process to ensure there is no confusion on how things flow. Ensuring code is run in the correct order helps ensure that the research is reproducible and that anyone can easily pick up the code and know exactly how to run the code to get the same results. 
+ +To automate, we recommend using the {targets} package, which is a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is if you are only changing code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. + +## Documentation: Quarto/R Markdown + +Documenting and describing decisions help other researchers reproduce your work and findings. Methodology reports can be one way to do this, by providing a large document of all project decisions. Additionally, using tools like Quarto and R Markdown allow us to create documents that integrate code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. + +Quarto and R Markdown documents also allow users to re-execute the underlying code when needed, therefore, another team member can easily recreate the report and obtain the same results. If outputting the file as html, these tools can also allow for more powerful integration of interactive plots, images, videos, links, and anything else researchers might need to document for a project. + +### Parameterization + +Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. 
By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, if we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. + +Parameters can be defined in the header or code chunks of our Quarto/R Markdown documents and can easily be modified and documented. Thus, we are reducing errors that may occur by manually editing code throughout the script and it is a flexible way for others to replicate the analysis and explore variations. + +## Other Tips for Reproducibility + +### Random Number Seeds + +Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. + +In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: + +```r +set.seed(999) + +runif(5) +``` + +The `runif()` function generates five random numbers from a uniform distribution. Since the seed is set to `999`, running `runif()` multiple times will always produce the same sequence: + +``` +[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676 +``` + +The choice of the seed number is up to the researcher. 
For example, this could be the date (`20240102`) or time of day (`1056`) when analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number can be up to the researcher to decide. However, it is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we might set the seed at the top of a program where libraries tend to be loaded. + +### Descriptive Names and Labels + +Something else to assist with reproducible research, is using descriptive variable names or labeling data. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive to what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method for this is up to the researcher, but providing this information can help ensure your research is reproducible. + +### Databases + +For projects with complex or large data structures, researchers may consider creating a database to better manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors were introduced. Additionally, a database may be easier to pass to other researchers if there are existing relationships between tables and types that are complex to map. 
diff --git a/book.bib b/book.bib index 21aba0c4..0529d60b 100644 --- a/book.bib +++ b/book.bib @@ -439,4 +439,10 @@ @proceedings{Scott2007 pages = {3514-3518}, year = {2007}, howpublished = {\url{http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf}} +} + +@book{git-w-R, + title = {Happy Git and GitHub for the useR}, + author = {Jenny Bryan and Jim Hester}, + howpublished = {\url{https://happygitwithr.com/}} } \ No newline at end of file From 8a065d28f367221bd6a46f95182cf10866784375 Mon Sep 17 00:00:00 2001 From: Isabella Velasquez Date: Sun, 28 Jan 2024 11:34:25 -0800 Subject: [PATCH 2/5] Grammarly review --- 09-reproducible-data.Rmd | 37 +++++++++++++++++++------------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index 2becf97d..58678ee4 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -1,21 +1,23 @@ # Reproducible Research {#c09-reprex-data} -The ability to recreate or replicate the results of a data analysis is a crucial aspect of any research. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. This idea of reproducibility enables the verification of findings and ensures that the conclusions are not dependent on a particular person running the code or workflow or on a particular day and environment. Thus, having reproducible research also can reduce errors in the code as researchers need to be able to replicate and verify the results. Not only is this idea ethical in ensuring that findings are accurate, but it is also becoming a requirement for publishing output in many scientific journals. These journals are now requiring that authors make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. +Recreating or replicating a data analysis's results is a crucial aspect of any research. 
If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. This idea of reproducibility enables the verification of findings. It ensures that the conclusions are not dependent on a particular person running the code or workflow or on a particular day and environment. Reproducible research can also reduce errors in the code as researchers need to be able to replicate and verify the results. Not only is this idea ethical in ensuring that findings are accurate, but it is also becoming a requirement for publishing output in many scientific journals. These journals are now requiring that authors make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. -To ensure the research is reproducible, we need to ensure that the key components of analysis are available, discoverable, documented, and shared with others. The main four components that we should consider are: +Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. The four main components that we should consider are: - **Code**: source code used for data cleaning, analysis, modeling, and reporting - **Data**: raw data used in the workflow, or if data is sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF)) - **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis - **Methodology**: analysis methodology, including rationale behind decisions, interpretations, and assumptions -In Chapter \@ref(c08-communicating-results), we briefly mention how discussion on each of these in important to include in the methodology report and when communicating the findings and results of a study. 
However, for us to be transparent and effective researchers, we need to make sure we not only discuss these through text, but also provide files and additional information when requested. Often when starting a project, researchers will dive into the data and make decisions as they go without full documentation which can be challenging if we need to go back and make changes or understand even what we did a few months ago. Therefore, it would benefit not only other researchers, but potentially our future selves to better document everything from the start. The good news is that there are many tools, practices, and project management techniques to make survey analysis projects easy to reproduce. For best results, researchers should decide which techniques and tools will be used prior (or very early on) starting a project. This chapter covers some of our suggestions of tools and techniques that can be used in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. +In Chapter \@ref(c08-communicating-results), we briefly mention that each of these is important to include in the methodology report and when communicating the findings and results of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, researchers will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or even understand what we did a few months ago. Therefore, it would benefit other researchers and potentially our future selves to better document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. 
For best results, researchers should decide which techniques and tools will be used before starting a project (or very early on). + +This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. ## Version Control: Git -Often a survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version can become challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. +Often, a survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version can become challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. -Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysis can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). +Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysts can use Git to follow code evolution and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. 
For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. @@ -23,37 +25,36 @@ Services such as GitHub or GitLab provide hosting and sharing of files as well a In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. -Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting git with R, we recommend the [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/) [@git-w-R]. +Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend the [Happy Git and GitHub for the useR by Jenny Bryan and Jim Hester](https://happygitwithr.com/) [@git-w-R]. ## Package Management: {renv} -Along with version control of code, version control for packages is also important for reproducibility. If two people run the same code, but use different versions of a package, the outcomes could be different depending on the changes that occurred. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} has some bugs (errors) when doing some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. +Along with version control of code, version control for packages is also important for reproducibility. 
If two people run the same code but use different versions of a package, the outcomes could be different depending on the package updates. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} has some bugs (errors) when doing some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. One way to easily handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and easily manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the correct versions and ensures that others can replicate the project's environment to rerun the code and obtain consistent results. ## Workflow Management: {targets} -With complex studies that involve multiple code files, steps, and dependencies it can be important to ensure that all pieces are implemented in the correct order. This could be done manually (e.g., numbering files to indicate the order or providing detailed documentation on the order) or we can automate the process to ensure there is no confusion on how things flow. Ensuring code is run in the correct order helps ensure that the research is reproducible and that anyone can easily pick up the code and know exactly how to run the code to get the same results. +With complex studies involving multiple code files, steps, and dependencies it can be important to ensure that all pieces are implemented in the correct order. 
This could be done manually (e.g., numbering files to indicate the order or providing detailed documentation on the order) or we can automate the process to ensure the code flows sequentially. Ensuring that the code is run in the correct order helps ensure that the research is reproducible and that anyone can easily pick up the code and know exactly how to run the code to get the same results. -To automate, we recommend using the {targets} package, which is a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is if you are only changing code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. +To automate, we recommend using the {targets} package, a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. ## Documentation: Quarto/R Markdown -Documenting and describing decisions help other researchers reproduce your work and findings. 
Methodology reports can be one way to do this, by providing a large document of all project decisions. Additionally, using tools like Quarto and R Markdown allow us to create documents that integrate code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. - -Quarto and R Markdown documents also allow users to re-execute the underlying code when needed, therefore, another team member can easily recreate the report and obtain the same results. If outputting the file as html, these tools can also allow for more powerful integration of interactive plots, images, videos, links, and anything else researchers might need to document for a project. +Documenting and describing decisions help other researchers reproduce your work and findings. Methodology reports can be one way to do this, providing a significant document of all project decisions. Additionally, using tools like Quarto and R Markdown allow us to create documents that integrate code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. +Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Therefore, another team member can easily recreate the report and obtain the same results. If outputting the file as HTML, these tools can also allow for more robust integration of interactive plots, images, videos, links, and anything else researchers might need to document for a project. ### Parameterization -Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. 
Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, if we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. +Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states. In that case, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. -Parameters can be defined in the header or code chunks of our Quarto/R Markdown documents and can easily be modified and documented. Thus, we are reducing errors that may occur by manually editing code throughout the script and it is a flexible way for others to replicate the analysis and explore variations. +Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily be modified and documented. Thus, we are reducing errors that may occur by manually editing code throughout the script, and it is a flexible way for others to replicate the analysis and explore variations. ## Other Tips for Reproducibility ### Random Number Seeds -Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. 
By default, the random numbers generated by R will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. +Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers that R generates will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: @@ -69,12 +70,12 @@ The `runif()` function generates five random numbers from a uniform distribution [1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676 ``` -The choice of the seed number is up to the researcher. For example, this could be the date (`20240102`) or time of day (`1056`) when analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number can be up to the researcher to decide. However, it is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we might set the seed at the top of a program where libraries tend to be loaded. +The choice of the seed number is up to the researcher. 
For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number can be up to the researcher to decide. However, it is important to note that `set.seed()` should be used *before* random number generation but is only necessary once per program to make the entire program reproducible. For example, we could set the seed at the top of a program where libraries are loaded. ### Descriptive Names and Labels -Something else to assist with reproducible research, is using descriptive variable names or labeling data. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive to what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method for this is up to the researcher, but providing this information can help ensure your research is reproducible. +Something else to assist with reproducible research is using descriptive variable names or labeling data. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`). This can also be done with the data values themselves. 
One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method is up to the researcher, but providing this information can help ensure your research is reproducible. ### Databases -For projects with complex or large data structures, researchers may consider creating a database to better manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors were introduced. Additionally, a database may be easier to pass to other researchers if there are existing relationships between tables and types that are complex to map. +Researchers may consider creating a database for projects with complex or large data structures to manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors are introduced. Additionally, a database may be more accessible to pass to other researchers if existing relationships between tables and types are complex to map. From a9ebf1569ce3b1f420c67f8f3e9fe1ab0effafc1 Mon Sep 17 00:00:00 2001 From: Isabella Velasquez Date: Sun, 28 Jan 2024 17:37:53 -0800 Subject: [PATCH 3/5] Small edits to the chapter --- 09-reproducible-data.Rmd | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index 58678ee4..c50319ed 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -1,6 +1,8 @@ # Reproducible Research {#c09-reprex-data} -Recreating or replicating a data analysis's results is a crucial aspect of any research. 
If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. This idea of reproducibility enables the verification of findings. It ensures that the conclusions are not dependent on a particular person running the code or workflow or on a particular day and environment. Reproducible research can also reduce errors in the code as researchers need to be able to replicate and verify the results. Not only is this idea ethical in ensuring that findings are accurate, but it is also becoming a requirement for publishing output in many scientific journals. These journals are now requiring that authors make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. +Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Enabling the verification of our analysis is another goal of reproducibility. When someone else is able to run our code from top to bottom, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. + +Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. These journals now require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. 
The four main components that we should consider are: @@ -15,9 +17,9 @@ This chapter covers some of our suggestions for tools and techniques we can use ## Version Control: Git -Often, a survey analysis project produces a lot of code. As code evolves throughout a project, keeping track of the latest version can become challenging. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. +Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as code evolves throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. -Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysis can use Git to follow code evaluation and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). +Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysts can use Git to follow code evolution and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.
From 3b50903176e760e541f985ee1bdd8ec572139cf6 Mon Sep 17 00:00:00 2001 From: Isabella Velasquez Date: Sun, 28 Jan 2024 18:15:29 -0800 Subject: [PATCH 4/5] Make a few small edits --- 09-reproducible-data.Rmd | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index c50319ed..a47eaec7 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -1,6 +1,6 @@ # Reproducible Research {#c09-reprex-data} -Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Enabling the verification of our analysis is another goal of reproducibility. When someone else is able to run our code from top to bottom, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. +Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Enabling the verification of our analysis is another goal of reproducibility. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. 
Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. These journals now require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. @@ -17,7 +17,7 @@ This chapter covers some of our suggestions for tools and techniques we can use ## Version Control: Git -Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as code evolves throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. +Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work. Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Survey analysts can use Git to follow code evolution and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). @@ -31,24 +31,24 @@ Using version control in analysis projects makes collaboration and maintenance m ## Package Management: {renv} -Along with version control of code, version control for packages is also important for reproducibility. If two people run the same code but use different versions of a package, the outcomes could be different depending on the package updates. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} has some bugs (errors) when doing some calculations.
The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. +Ensuring reproducibility involves not only using version control of code but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the CRAN version of {srvyr} has some bugs (errors) in some calculations. The version on GitHub has corrected these errors, so we have asked users to install the GitHub version to obtain the same results. -One way to easily handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and easily manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the correct versions and ensures that others can replicate the project's environment to rerun the code and obtain consistent results. +One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded versions for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results.
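
As a rough illustration of the {renv} workflow described above, the sketch below wraps the three package calls most projects rely on. The wrapper function names are hypothetical and exist only to keep anything from running when the file is sourced; `renv::init()`, `renv::snapshot()`, and `renv::restore()` are the package's own functions, and the sketch assumes {renv} is installed and the working directory is the project root.

```r
# Sketch of a typical {renv} workflow. The wrapper function names are
# hypothetical; only the renv::*() calls come from the package.

# Run once by the project owner, from the project's root directory.
start_project_library <- function() {
  renv::init()      # create a project-specific library and an renv.lock file
}

# Run after installing or updating packages the analysis depends on.
record_package_versions <- function() {
  renv::snapshot()  # write the package versions currently in use to renv.lock
}

# Run by a collaborator after obtaining the project and its renv.lock file.
recreate_project_library <- function() {
  renv::restore()   # install the package versions recorded in renv.lock
}
```

Committing `renv.lock` alongside the code is what lets a collaborator call the restore step and end up with matching package versions.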
## Workflow Management: {targets} -With complex studies involving multiple code files, steps, and dependencies it can be important to ensure that all pieces are implemented in the correct order. This could be done manually (e.g., numbering files to indicate the order or providing detailed documentation on the order) or we can automate the process to ensure the code flows sequentially. Ensuring that the code is run in the correct order helps ensure that the research is reproducible and that anyone can easily pick up the code and know exactly how to run the code to get the same results. +With complex studies involving multiple code files and dependencies, it is important to ensure that each step is executed in the intended sequence. We can do this manually, for example, by numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow. -To automate, we recommend using the {targets} package, a workflow manager enabling us to document, automate, and execute complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. +The {targets} package is growing as a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies.
With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One nice feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. -## Documentation: Quarto/R Markdown +## Documentation: Quarto and R Markdown Documenting and describing decisions help other researchers reproduce your work and findings. Methodology reports can be one way to do this, providing a significant document of all project decisions. Additionally, using tools like Quarto and R Markdown allow us to create documents that integrate code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Therefore, another team member can easily recreate the report and obtain the same results. If outputting the file as HTML, these tools can also allow for more robust integration of interactive plots, images, videos, links, and anything else researchers might need to document for a project. ### Parameterization -Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. By parameterizing our code, we can define and modify these parameters to explore different scenarios or inputs. 
For example, suppose we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states. In that case, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. +Another great feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for Michigan but then later decide we want to look at multiple states. In that case, we can define a `state` parameter and rerun the same analysis for other states like Wisconsin without having to edit the code throughout the document. Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily be modified and documented. Thus, we are reducing errors that may occur by manually editing code throughout the script, and it is a flexible way for others to replicate the analysis and explore variations. @@ -56,7 +56,7 @@ Parameters can be defined in the header or code chunks of our Quarto or R Markdo ### Random Number Seeds -Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers that R generates will change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. 
+Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results, facilitating reproducibility. In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function: From 69538616a0cb21b9b7c93613e574a8764fa1464d Mon Sep 17 00:00:00 2001 From: Isabella Velasquez Date: Mon, 29 Jan 2024 20:46:54 -0800 Subject: [PATCH 5/5] Add a few more sections to chapter --- 09-reproducible-data.Rmd | 54 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 3 deletions(-) diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index a47eaec7..85659210 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -15,6 +15,44 @@ In Chapter \@ref(c08-communicating-results), we briefly mention each of these is This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for teams looking to create a reproducible workflow. +## Project-based workflows + +We recommend a project-based workflow for analysis projects as described in Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund's book, R for Data Science, found at [r4ds.hadley.nz](https://r4ds.hadley.nz/). A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize.
When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results. + +The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a `.Rproj` file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this: + +``` +| anes_analysis/ + | anes_analysis.Rproj + | README.md + | codebooks + | codebook2020.pdf + | codebook2016.pdf + | rawdata + | anes2020_raw.csv + | anes2016_raw.csv + | scripts + | data-prep.R + | data + | anes2020_clean.csv + | anes2016_clean.csv + | report + | anes_report.Rmd + | anes_report.html + | anes_report.pdf +``` + +The {here} package enables easy file referencing. In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. We can use the `here::here()` function to build the path when we load or save data. Below, we ask R to read the CSV file `anes2020_clean.csv` in the project directory's `data` folder: + +```{r} +#| eval: false +#| label: project-file-example +anes <- + read_csv(here::here("data", "anes2020_clean.csv")) +``` + +The combination of projects and the {here} package keeps all associated files organized. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues. + ## Version Control: Git Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or duplicative work.
@@ -43,8 +81,9 @@ The {targets} package is growing as a popular workflow manager that documents, a ## Documentation: Quarto and R Markdown -Documenting and describing decisions help other researchers reproduce your work and findings. Methodology reports can be one way to do this, providing a significant document of all project decisions. Additionally, using tools like Quarto and R Markdown allow us to create documents that integrate code and text. These documents present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. -Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Therefore, another team member can easily recreate the report and obtain the same results. If outputting the file as HTML, these tools can also allow for more robust integration of interactive plots, images, videos, links, and anything else researchers might need to document for a project. +Tools like Quarto and R Markdown aid in reproducibility by creating documents that integrate code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. + +Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another team member can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify. 
### Parameterization @@ -80,4 +119,13 @@ Something else to assist with reproducible research is using descriptive variabl ### Databases -Researchers may consider creating a database for projects with complex or large data structures to manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors are introduced. Additionally, a database may be more accessible to pass to other researchers if existing relationships between tables and types are complex to map. +Researchers may consider creating a database for projects with complex or large data structures to manage the data and any changes. Many databases will allow for a history of changes, which can be useful when recoding variables to ensure no inadvertent errors are introduced. Additionally, a database may be more accessible to pass to other researchers if existing relationships between tables and types are complex to map. + +## Summary + +We can promote accuracy and verification of results by making our analysis reproducible. This chapter discussed different ways to make research reproducible. There are various tools and guides available to help you achieve reproducibility in your work. Here are additional resources to explore: + +* R for Data Science chapter on project-based workflows: [https://r4ds.hadley.nz/workflow-scripts.html#projects](https://r4ds.hadley.nz/workflow-scripts.html#projects) +* Building reproducible analytical pipelines with R by Bruno Rodrigues: [https://raps-with-r.dev/](https://raps-with-r.dev/) +* Posit Solutions Site page on reproducible environments: [https://solutions.posit.co/envs-pkgs/environments/](https://solutions.posit.co/envs-pkgs/environments/) +