Indexing (#131)
* Making data plural

* Standardize A/C format

* Standardize cross-tab format

* change final section header in chapter 9 to not be "summary" to match all other chapters.

* Removing "you" language.

* Adjusting tense "we will.." to just "we..."

* Remove markdown comments

* Changing from target population to population of interest.

* Begin indexing

* Add package index tags

* Refine index of functions

* More index tags

* Updates to ch1 from one voice review

* Edits to ch02 from one-voice

* Ch03 edits from one-voice review

* Ch04 updates from one-voice

* Fix broken reference link in ch04.

* Ch05 edits from one-voice

* Ch06 edits from one-voice

* Ch07 edits from one-voice

* Ch08 edits from one-voice

* More indices

* Ch09 edits from one-voice

* Ch10 edits from one-voice

* Ch11 edits from one-voice

* Ch12 edits from one-voice

* Ch13 edits from one-voice

* Ch14 edits from one-voice

* Appendix A edits from one-voice

* Adding blank line, to add a comment.

* Fixing reference type for Scott2007 to have author show up in bibliography.

* Adding spaces at ends of lines to add comment.

* Fixing typo in formula in ch7.

* Adding space to end of line to add a comment.

* Adding space at end of line to add comment.

* Fix ref to C10

* Remove draft watermark

* SZ full book review (#129)

* Change interaction example (#130)

* Fix merge issues

* Batch 4? of indexing

* Add imputation for mc

* Add survey life cycle to index

* Add questionnaire to the index

* Add design object

* Add point/uncertainty estimates to index

* Edit A//C -> A/C

* Edit where we use airconditioning instead of A/C

* Edit where we use air conditioning instead of A/C

* Remove hyphen from air conditioning

* Edit quotation marks

* Capitalize Git

* Standardize language; add survey analysis process index; compulsively style code

* Add survey analysis process in chapter to index

* Edit design object index to be longer

* Small edit to sentence to not be redundant with the point estimate definition before

---------

Co-authored-by: rpowell22 <[email protected]>
Co-authored-by: Isabella Velásquez <[email protected]>
3 people authored Apr 30, 2024
1 parent b6e5ec4 commit b9f1b93
Showing 17 changed files with 700 additions and 349 deletions.
101 changes: 54 additions & 47 deletions 02-overview-surveys.Rmd

Large diffs are not rendered by default.

32 changes: 23 additions & 9 deletions 03-survey-data-documentation.Rmd

Large diffs are not rendered by default.

60 changes: 39 additions & 21 deletions 04-set-up.Rmd

Large diffs are not rendered by default.

199 changes: 125 additions & 74 deletions 05-descriptive-analysis.Rmd

Large diffs are not rendered by default.

151 changes: 111 additions & 40 deletions 06-statistical-testing.Rmd

Large diffs are not rendered by default.

107 changes: 71 additions & 36 deletions 07-modeling.Rmd

Large diffs are not rendered by default.

43 changes: 34 additions & 9 deletions 08-communicating-results.Rmd

Large diffs are not rendered by default.

21 changes: 18 additions & 3 deletions 09-reproducible-data.Rmd
@@ -26,6 +26,7 @@ This chapter covers some of our suggestions for tools and techniques we can use

## Project-based workflows

\index{R projects|(}
We recommend a project-based workflow for analysis projects as described by @wickham2023r4ds. A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.

The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates an `.Rproj` file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:
@@ -51,6 +52,7 @@ The RStudio IDE has built-in support for projects. When we create a project in R
| anes_report.pdf
```

\index{here package|(}
In a project-based workflow, all paths are relative and, by default, relative to the folder the `.Rproj` file is located in. By using relative paths, others can open and run our files even if their directory configuration differs from ours (e.g., Mac and Windows users have different directory path structures.) The {here} package enables easy file referencing, and we can start by using the `here::here()` function to build the path for loading or saving data [@R-here]. Below, we ask R to read the CSV file `anes_2020.csv` in the project directory's `data` folder:

```{r}
@@ -61,6 +63,7 @@ anes <-
```

The combination of projects and the {here} package keeps all associated files in an organized manner. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues.
\index{here package|)} \index{R projects|)}
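
As a small illustration of the path building described above, the sketch below assumes the {here} package is installed and reuses the `data` folder and `anes_2020.csv` file name from the text. The `csv_path` object is a name introduced only for this example.

```r
# Sketch only: assumes {here} is installed and that the project has a
# data/ folder holding anes_2020.csv, as described in the text.
csv_path <- here::here("data", "anes_2020.csv")

# here::here() builds the path from the project root, so the same call
# works for collaborators whose directories are laid out differently.
print(csv_path)

# Checking for the file first gives a clearer message than a failed read.
if (!file.exists(csv_path)) {
  message("File not found: ", csv_path)
}
```

Because the path is assembled from its components, the same line works on Mac and Windows without editing separators by hand.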

## Functions and packages

@@ -70,6 +73,7 @@ A package is made up of a collection of functions. If we find ourselves sharing

## Version control with Git

\index{Version control|(} \index{Git| see {Version control}}
Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or redundant work.

Version control systems like Git can help alleviate these pains. Git is a system that tracks changes in files. We can use Git to follow code evolution and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts.)
@@ -80,15 +84,22 @@ In addition to code scripts, platforms like GitHub can store data and documentat

Using version control in analysis projects makes collaboration and maintenance more manageable. To connect Git with R, we recommend referencing the book [Happy Git and GitHub for the useR](https://happygitwithr.com/) [@git-w-R].

\index{Version control|)}

## Package management with {renv}

\index{renv package|(} \index{Package management|see {renv package}}
Ensuring reproducibility involves not only using version control of code but also managing the versions of packages. If two people run the same code but use different package versions, the results might differ because of changes to those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results.

One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results [@R-renv].

\index{renv package|)}
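
The {renv} workflow described above usually comes down to three calls: `renv::init()`, `renv::snapshot()`, and `renv::restore()`. The sketch below assumes {renv} is installed. The `renv_step()` wrapper is a hypothetical helper, not part of {renv}, used here only so that nothing modifies the project until a step is requested.

```r
# Hypothetical helper (not part of {renv}): runs one step of a typical
# {renv} workflow. Assumes the {renv} package is installed.
renv_step <- function(step = c("init", "snapshot", "restore")) {
  step <- match.arg(step)
  switch(step,
    # Create a project-specific library and a lockfile (renv.lock)
    init = renv::init(),
    # Record the package versions currently used by the project
    snapshot = renv::snapshot(),
    # Install the package versions recorded in the lockfile
    restore = renv::restore()
  )
}
```

A project author would typically run the `"init"` step once and the `"snapshot"` step after changing packages, while a collaborator who receives the project would run `"restore"`.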

## R environments with Docker

\index{Environment management|(} \index{Docker|see {Environment management}}
Just as different versions of packages can introduce discrepancies or compatibility issues, the version of R can also prevent reproducibility. Tools such as Docker can help with this potential issue by creating isolated environments that define the version of R being used, along with other dependencies and configurations. The entire environment is bundled in a container. The container, defined by a Dockerfile, can be shared so anybody, regardless of their local setup, can run the R code in the same environment.
\index{Environment management|)}

## Workflow management with {targets}

@@ -98,6 +109,7 @@ The {targets} package is an increasingly popular workflow manager that documents

## Documentation with Quarto and R Markdown

\index{R Markdown|(} \index{Quarto|(}
Tools like Quarto and R Markdown aid in reproducibility by creating documents that weave together code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output.

Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another analyst can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify [@R-quarto; @rmarkdown2020].
@@ -108,15 +120,17 @@ Another useful feature of Quarto and R Markdown is the ability to reduce repetit

Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily modified and documented. By avoiding manual edits to code throughout the script, we reduce errors that may occur and offer a flexible way for others to replicate the analysis and explore variations.

\index{R Markdown|)} \index{Quarto|)}

## Other tips for reproducibility

### Random number seeds

Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.
Some tasks in survey analysis require randomness, such as imputation\index{Imputation}, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.

In R, we can use the `set.seed()` function to control the randomness in our code. We set a seed value by providing an integer in the function argument. The following code chunk sets a seed using `999`, then runs a random number function (`runif()`) to get five random numbers from a uniform distribution.

```{r, set.seed(999)}
```{r}
#| label: reprex-set-seed
set.seed(999)
runif(5)
@@ -126,7 +140,8 @@ Since the seed is set to `999`, running `runif(5)` multiple times always produce
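
One way to check this behavior is to reset the seed and compare two sets of draws. This is a minimal sketch using only base R. The object names `first_draw`, `second_draw`, and `third_draw` are introduced for illustration.

```r
# Draw five uniform random numbers after setting the seed
set.seed(999)
first_draw <- runif(5)

# Resetting the same seed restarts the random number stream
set.seed(999)
second_draw <- runif(5)

# The two sets of draws match exactly
identical(first_draw, second_draw)

# Without resetting the seed, the stream continues, so new draws differ
third_draw <- runif(5)
identical(first_draw, third_draw)
```

The seed therefore needs to be set immediately before the random step we want to reproduce, not just once at an arbitrary point in the script.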

### Descriptive names and labels

Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce in this book, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`.) This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data).) As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible.
\index{American National Election Studies (ANES)|(}
Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce in this book, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`.)\index{American National Election Studies (ANES)|)} This can also be done with the data values themselves. \index{Categorical data|(}\index{Factor|(}One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example.\index{Factor|)} There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data).) \index{Categorical data|)} As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible.
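
As a brief sketch of the factor approach, the code below uses made-up values and assumes a hypothetical coding of `1` for `Female` and `2` for `Male`. These are not the actual ANES codes.

```r
# Hypothetical raw codes (illustration only): 1 = Female, 2 = Male
sex_code <- c(1, 2, 2, 1, NA)

# Attach descriptive labels so the meaning of each code travels with the data
sex <- factor(sex_code, levels = c(1, 2), labels = c("Female", "Male"))

# The labels now appear in place of the numeric codes
levels(sex)
table(sex, useNA = "ifany")
```

Listing `levels` and `labels` explicitly documents the mapping in the code itself, so a reader does not need to consult the codebook to interpret the output.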

## Additional resources
