From 10a8830740b4d7fbe20d4cb8dc10bd0a5c685232 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Isabella=20Vel=C3=A1squez?= Date: Tue, 20 Aug 2024 19:21:55 -0500 Subject: [PATCH] Source/Question Text Edits (#164) * Edit Question Text * Italicize Source * Remove highlights with italics/bold * Add Source to tables in Chapter 08 * Add italics to gt tables and remove from ggplot2 * Add missing text to Question text --------- Co-authored-by: Stephanie Zimmer --- 01-introduction.Rmd | 2 +- 03-survey-data-documentation.Rmd | 10 +++++----- 05-descriptive-analysis.Rmd | 30 +++++++++++++++--------------- 07-modeling.Rmd | 8 ++++---- 08-communicating-results.Rmd | 8 ++++---- 09-reproducible-data.Rmd | 10 +++++----- 11-missing-data.Rmd | 14 +++++++------- 13-ncvs-vignette.Rmd | 18 +++++++++--------- 90-AppendixA.Rmd | 2 +- 91-AppendixB.Rmd | 2 +- 10 files changed, 52 insertions(+), 52 deletions(-) diff --git a/01-introduction.Rmd b/01-introduction.Rmd index ea5f0ee0..d0edfe7f 100644 --- a/01-introduction.Rmd +++ b/01-introduction.Rmd @@ -111,7 +111,7 @@ Throughout the book, we use the following typographical conventions: ## Getting help -We recommend first trying to resolve errors and issues independently using the tips provided in **Chapter \@ref(c12-recommendations)**. +We recommend first trying to resolve errors and issues independently using the tips provided in Chapter \@ref(c12-recommendations). There are several community forums for asking questions, including: diff --git a/03-survey-data-documentation.Rmd b/03-survey-data-documentation.Rmd index 16bed22d..5de8dbbd 100644 --- a/03-survey-data-documentation.Rmd +++ b/03-survey-data-documentation.Rmd @@ -21,10 +21,10 @@ Survey documentation can vary in organization, type, and ease of use. The inform The technical documentation, also known as user guides or methodology/analysis guides, highlights the variables necessary to specify the survey design. We recommend concentrating on these key sections: - * **Introduction:** The introduction orients us to the survey. This section provides the project's background, the study's purpose, and \index{Research topic|(}the main research questions.\index{Research topic|)} - * **Study design:** The study design section describes how researchers prepared and administered the survey. - * \index{Sampling error|(}\index{Sampling frame|(}\index{Sample|(}**Sample:** The sample section describes the sample frame, any known sampling errors, and limitations of the sample.\index{Sampling frame|)} \index{Weighting|(}This section can contain recommendations on how to use sampling weights. Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also, look for population sizes, finite population correction, or replicate weight scaling information. Additional detail on sample designs is available in Chapter \@ref(c10-sample-designs-replicate-weights).\index{Sampling error|)}\index{Sample|)}\index{Weighting|)} - * **Notes on fielding:** Any additional notes on fielding, such as response rates, may be found in the technical documentation. + * Introduction: The introduction orients us to the survey. This section provides the project's background, the study's purpose, and \index{Research topic|(}the main research questions.\index{Research topic|)} + * Study design: The study design section describes how researchers prepared and administered the survey. 
+ * \index{Sampling error|(}\index{Sampling frame|(}\index{Sample|(}Sample: The sample section describes the sample frame, any known sampling errors, and limitations of the sample.\index{Sampling frame|)} \index{Weighting|(}This section can contain recommendations on how to use sampling weights. Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also, look for population sizes, finite population correction, or replicate weight scaling information. Additional detail on sample designs is available in Chapter \@ref(c10-sample-designs-replicate-weights).\index{Sampling error|)}\index{Sample|)}\index{Weighting|)} + * Notes on fielding: Any additional notes on fielding, such as response rates, may be found in the technical documentation. The technical documentation may include other helpful resources. For example, some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, so we do not have to create this code from scratch. @@ -64,7 +64,7 @@ knitr::include_graphics(path = "images/questionnaire-example-2.jpg") ### Codebooks \index{Missing data|(} \index{Codebook|(} \index{Data dictionary|see {Codebook}} -While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term *data dictionary* is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. +While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term data dictionary is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. \index{Missing data|)} \index{American National Election Studies (ANES)|(} diff --git a/05-descriptive-analysis.Rmd b/05-descriptive-analysis.Rmd index f46badb6..dd3acb3a 100644 --- a/05-descriptive-analysis.Rmd +++ b/05-descriptive-analysis.Rmd @@ -62,7 +62,7 @@ recs_des <- recs_2020 %>% ## Introduction -\index{Point estimates|(}\index{Uncertainty estimates|(}Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. During descriptive analyses, we calculate *point estimates* of unknown population parameters, such as population mean, and *uncertainty estimates*, such as confidence intervals. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. 
For example, if only 10% of survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses provide summaries of distribution and other measures. These analyses lay the groundwork for the next steps of running statistical tests or developing models.\index{Point estimates|)}\index{Uncertainty estimates|)} +\index{Point estimates|(}\index{Uncertainty estimates|(}Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. During descriptive analyses, we calculate point estimates of unknown population parameters, such as population mean, and uncertainty estimates, such as confidence intervals. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses provide summaries of distribution and other measures. These analyses lay the groundwork for the next steps of running statistical tests or developing models.\index{Point estimates|)}\index{Uncertainty estimates|)} We discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types: @@ -71,11 +71,11 @@ We discuss many different types of descriptive analyses in this chapter. However * \index{Discrete data|(}Discrete data: variables that are counted or measured, such as number of children\index{Discrete data|)} * \index{Continuous data|(}Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as income\index{Continuous data|)} -This chapter discusses how to analyze *measures of distribution* (e.g., cross-tabulations), *central tendency* (e.g., means), *relationship* (e.g., ratios), and *dispersion* (e.g., standard deviation) using functions from the {srvyr} package [@R-srvyr]. +This chapter discusses how to analyze measures of distribution (e.g., cross-tabulations), central tendency (e.g., means), relationship (e.g., ratios), and dispersion (e.g., standard deviation) using functions from the {srvyr} package [@R-srvyr]. \index{Measures of distribution|(} -**Measures of distribution** describe how often an event or response occurs. These measures include counts and totals. We cover the following functions: +Measures of distribution describe how often an event or response occurs. These measures include counts and totals. We cover the following functions: * Count of observations (`survey_count()` and `survey_tally()`) * Summation of variables (`survey_total()`) @@ -84,7 +84,7 @@ This chapter discusses how to analyze *measures of distribution* (e.g., cross-ta \index{Central tendency|(} -**Measures of central tendency** find the central (or average) responses. These measures include means and medians. We cover the following functions: +Measures of central tendency find the central (or average) responses. These measures include means and medians. 
We cover the following functions: * Means and proportions (`survey_mean()` and `survey_prop()`) * Quantiles and medians (`survey_quantile()` and `survey_median()`) @@ -93,7 +93,7 @@ This chapter discusses how to analyze *measures of distribution* (e.g., cross-ta \index{Relationship|(} -**Measures of relationship** describe how variables relate to each other. These measures include correlations and ratios. We cover the following functions: +Measures of relationship describe how variables relate to each other. These measures include correlations and ratios. We cover the following functions: * Correlations (`survey_corr()`) * Ratios (`survey_ratio()`) @@ -102,7 +102,7 @@ This chapter discusses how to analyze *measures of distribution* (e.g., cross-ta \index{Measures of dispersion|(} -**Measures of dispersion** describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances. We cover the following functions: +Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances. We cover the following functions: * Variances and standard deviations (`survey_var()` and `survey_sd()`) @@ -121,7 +121,7 @@ To incorporate each of these survey functions, recall the general process for su This chapter walks through how to apply the survey functions in Step 4. Note that unless otherwise specified, our estimates are weighted as a result of setting up the survey design object. -To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only *after* creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases are not included in the survey design information and estimations of the variance, leading to inaccurate results. +To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only after creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases are not included in the survey design information and estimations of the variance, leading to inaccurate results. For the sake of simplicity, we've removed cases with missing values in the examples below. For a more detailed explanation of how to handle missing data, please refer to Chapter \@ref(c11-missing-data). @@ -473,7 +473,7 @@ recs_des %>% summarize(p = survey_prop()) ``` -When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with A/C *within* each region. For example, in the Northeast region, approximately `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "FALSE") %>% pull(p), accuracy = 0.1)` of housing units don't have A/C, while around `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "TRUE") %>% pull(p), accuracy = 0.1)` have A/C. +When specifying multiple variables, the proportions are conditional. 
In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with A/C within each region. For example, in the Northeast region, approximately `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "FALSE") %>% pull(p), accuracy = 0.1)` of housing units don't have A/C, while around `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "TRUE") %>% pull(p), accuracy = 0.1)` have A/C. #### Example 3: Joint proportions {.unnumbered} \index{Functions in srvyr!interact|(} \index{interact|see {Functions in srvyr}} @@ -896,7 +896,7 @@ The arguments are: #### Example 1: Overall correlation {.unnumbered} -We can calculate the correlation between the total square footage of homes (`TOTSQFT_EN`)^[Question text: "What is the square footage of your home?" [@recs-svy]] and electricity consumption (`BTUEL`)^[BTUEL is derived from the supplier side component of the survey where `BTUEL` represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year [@recs-svy].]. +We can calculate the correlation between the total square footage of homes (`TOTSQFT_EN`)^[Question text: "What is the square footage of your home?" [@recs-svy]] and electricity consumption (`BTUEL`)^[`BTUEL` is derived from the supplier side component of the survey where `BTUEL` represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year [@recs-svy].]. ```{r} #| label: desc-corr-1 @@ -987,7 +987,7 @@ recs_des %>% ### Unweighted analysis \index{Functions in srvyr!unweighted|(} \index{unweighted|see {Functions in srvyr}} -Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the `unweighted()` function in the `summarize()` function. The `unweighted()` function calculates unweighted summaries from a `tbl_svy` object, providing the summary among the *respondents* without extrapolating to a population estimate. The `unweighted()` function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost: \index{Functions in srvyr!survey\_mean|(} +Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the `unweighted()` function in the `summarize()` function. The `unweighted()` function calculates unweighted summaries from a `tbl_svy` object, providing the summary among the respondents without extrapolating to a population estimate. The `unweighted()` function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost: \index{Functions in srvyr!survey\_mean|(} ```{r} #| label: desc-mn-unwgt @@ -1022,7 +1022,7 @@ recs_des %>% ) ``` -It is estimated that American residential households spent an average of `r .elbill_mn_unwgt %>% pull(elec_bill)` on electricity in 2020, and the estimate has a standard error of `r .elbill_mn_unwgt %>% pull(elec_bill_se)`. The `unweighted()` function calculates the unweighted average and represents the average amount of money spent on electricity in 2020 by the *respondents*, which was `r .elbill_mn_unwgt %>% pull(elec_unweight)`. 
\index{Functions in srvyr!unweighted|)} +It is estimated that American residential households spent an average of `r .elbill_mn_unwgt %>% pull(elec_bill)` on electricity in 2020, and the estimate has a standard error of `r .elbill_mn_unwgt %>% pull(elec_bill_se)`. The `unweighted()` function calculates the unweighted average and represents the average amount of money spent on electricity in 2020 by the respondents, which was `r .elbill_mn_unwgt %>% pull(elec_unweight)`. \index{Functions in srvyr!unweighted|)} ### Subpopulation analysis \index{Functions in srvyr!filter|(} \index{filter|see {Functions in srvyr}} @@ -1052,7 +1052,7 @@ Based on this calculation, the estimated average amount spent on natural gas is ### Design effects {#desc-deff} -The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS). It is computed by taking the ratio of the estimate's variance under the design at hand to the estimate's variance under a simple random sample without replacement. \index{Stratified sampling|(}A design effect less than 1 indicates that the design is *more* statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s).\index{Stratified sampling|)} A design effect greater than 1 indicates that the design is *less* statistically efficient than an SRS design. From a design effect, we can calculate the effective sample size as follows: +The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS). It is computed by taking the ratio of the estimate's variance under the design at hand to the estimate's variance under a simple random sample without replacement. \index{Stratified sampling|(}A design effect less than 1 indicates that the design is more statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s).\index{Stratified sampling|)} A design effect greater than 1 indicates that the design is less statistically efficient than an SRS design. From a design effect, we can calculate the effective sample size as follows: $$n_{eff}=\frac{n}{D_{eff}} $$ @@ -1078,7 +1078,7 @@ For the values less than 1 (`BTUEL_deff` and `BTUFO_deff`), the results suggest \index{Functions in srvyr!cascade|(} \index{cascade|see {Functions in srvyr}} -When using `group_by()` in analysis, the results are returned with a row for each group or combination of groups. Often, we want both breakdowns by group and a summary row for the estimate representing the entire population. For example, we may want the average electricity consumption by region *and* nationally. The {srvyr} package has the convenient `cascade()` function, which adds summary rows for the total of a group. It is used instead of `summarize()` and has similar functionalities along with some additional features. +When using `group_by()` in analysis, the results are returned with a row for each group or combination of groups. Often, we want both breakdowns by group and a summary row for the estimate representing the entire population. 
For example, we may want the average electricity consumption by region and nationally. The {srvyr} package has the convenient `cascade()` function, which adds summary rows for the total of a group. It is used instead of `summarize()` and has similar functionalities along with some additional features. #### Syntax {.unnumbered} @@ -1238,7 +1238,7 @@ recs_des %>% We estimate `r scales::percent(recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) %>% filter(ACUsed == TRUE) %>% pull(p), accuracy = 0.1)` of households have A/C and `r scales::percent(recs_des %>% group_by(SpaceHeatingUsed) %>% summarize(p = survey_prop()) %>% filter(SpaceHeatingUsed == TRUE) %>% pull(p), accuracy = 0.1)` have heating. -If we are *only* interested in the `TRUE` outcomes, that is, the proportion of households that have A/C and the proportion that have heating, we can simplify the code. \index{Functions in srvyr!survey\_mean|(} Applying `survey_mean()` to a logical variable is the same as using `survey_prop()`, as shown below: +If we are only interested in the `TRUE` outcomes, that is, the proportion of households that have A/C and the proportion that have heating, we can simplify the code. \index{Functions in srvyr!survey\_mean|(} Applying `survey_mean()` to a logical variable is the same as using `survey_prop()`, as shown below: ```{r} #| label: desc-multip-2 @@ -1268,7 +1268,7 @@ cool_heat_tab %>% Loops are a common tool when dealing with repetitive calculations. The {purrr} package provides the `map()` functions, which, like a loop, allow us to perform the same task across different elements [@R-purrr]. In our case, we may want to calculate proportions from the same design multiple times. A straightforward approach is to design the calculation for one variable, build a function based on that, and then apply it iteratively for the rest of the variables. \index{American National Election Studies (ANES)|(} -Suppose we want to create a table that shows the proportion of people who express trust in their government (`TrustGovernment`)^[Question: "How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)" [@anes-svy].] as well as those that trust in people (`TrustPeople`)^[Question: "Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)" [@anes-svy].] using data from the 2020 ANES. +Suppose we want to create a table that shows the proportion of people who express trust in their government (`TrustGovernment`)^[Question text: "How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never)" [@anes-svy]] as well as those that trust in people (`TrustPeople`)^[Question text: "Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never)" [@anes-svy]] using data from the 2020 ANES. First, we create a table for a single variable. The table includes the variable name as a column, the response, and the corresponding percentage with its standard error. 
\index{Functions in srvyr!drop\_na|(} \index{drop\_na|see {Functions in srvyr}} diff --git a/07-modeling.Rmd b/07-modeling.Rmd index d77385fb..9164bde3 100644 --- a/07-modeling.Rmd +++ b/07-modeling.Rmd @@ -165,7 +165,7 @@ The arguments are: ### Example \index{Residential Energy Consumption Survey (RECS)|(} -Looking at an example helps us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the evening when using A/C during the summer^[Question text: "During the summer, what is your home’s typical indoor temperature inside your home at night?” [@recs-svy].]. To analyze these data, we filter the respondents to only those using A/C (`ACUsed`)^[Question text: "Is any air conditioning equipment used in your home?" [@recs-svy].]. Then, if we want to see if there are regional differences, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!unweighted|(} \index{Functions in srvyr!filter|(} \index{Functions in srvyr!summarize|(} +Looking at an example helps us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the evening when using A/C during the summer^[Question text: "During the summer, what is your home’s typical indoor temperature inside your home at night?" [@recs-svy]]. To analyze these data, we filter the respondents to only those using A/C (`ACUsed`)^[Question text: "Is any air conditioning equipment used in your home?" [@recs-svy]]. Then, if we want to see if there are regional differences, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!unweighted|(} \index{Functions in srvyr!filter|(} \index{Functions in srvyr!summarize|(} ```{r} #| label: model-anova-prep @@ -283,7 +283,7 @@ As discussed in Section \@ref(model-intro), the formula on the right-hand side c #### Example 1: Linear regression with a single variable {.unnumbered} -On RECS, we can obtain information on the square footage of homes^[Question text: "What is the square footage of your home?" [@recs-svy].] and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity). +On RECS, we can obtain information on the square footage of homes^[Question text: "What is the square footage of your home?" [@recs-svy]] and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity). 
```{r} #| label: model-plot-sf-elbill @@ -404,7 +404,7 @@ urb_reg_test This output indicates there is a significant interaction between urbanicity and region (p-value is `r pretty_p_value(urb_reg_test[["p"]])`). \index{p-value|)} -To examine the predictions, residuals, and more from the model, the `augment()` function from {broom} can be used. The `augment()` function returns a tibble with the independent and dependent variables and other fit statistics. The `augment()` function has not been specifically written for objects of class `svyglm`, and as such, a warning is displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using `augment()`. To obtain the standard error of the predicted values (`.se.fit`), we need to use the `attr()` function on the predicted values (`.fitted`) created by `augment()`. Additionally, the predicted values created are outputted with a type of `svrep`. If we want to plot the predicted values, we need to use `as.numeric()` to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed **after** the standard error adjustment. +To examine the predictions, residuals, and more from the model, the `augment()` function from {broom} can be used. The `augment()` function returns a tibble with the independent and dependent variables and other fit statistics. The `augment()` function has not been specifically written for objects of class `svyglm`, and as such, a warning is displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using `augment()`. To obtain the standard error of the predicted values (`.se.fit`), we need to use the `attr()` function on the predicted values (`.fitted`) created by `augment()`. Additionally, the predicted values created are outputted with a type of `svrep`. If we want to plot the predicted values, we need to use `as.numeric()` to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed after the standard error adjustment. ```{r} #| label: model-aug-examp-se @@ -528,7 +528,7 @@ Note `svyglm()` is the same function used in both ANOVA and normal linear regres #### Example 1: Logistic regression with single variable {.unnumbered} \index{American National Election Studies (ANES)|(} -In the following example, we use the ANES data to model whether someone usually has trust in the government^[Question text: "How often can you trust the federal government in Washington to do what is right?" [@anes-svy].] by whom someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. \index{Factor|(}We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single-factor level, and the other response options ("About half the time," "Some of the time," and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful, as it is all 0 and 1 outcomes; so instead, we plot a summary of the data. 
\index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!summarize|(} \index{Factor|)} +In the following example, we use the ANES data to model whether someone usually has trust in the government^[Question text: "How often can you trust the federal government in Washington to do what is right?" [@anes-svy]] by whom someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. \index{Factor|(}We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single-factor level, and the other response options ("About half the time," "Some of the time," and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful, as it is all 0 and 1 outcomes; so instead, we plot a summary of the data. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!summarize|(} \index{Factor|)} ```{r} #| label: model-logisticexamp-plot diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index 37dd0b99..36b7dec8 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -170,7 +170,7 @@ We can add a few more enhancements, such as a title (which is different from a c trust_gov_gt2 <- trust_gov_gt %>% tab_header("American voter's trust in the federal government, 2020") %>% - tab_source_note("American National Election Studies, 2020") %>% + tab_source_note(md("*Source*: American National Election Studies, 2020")) %>% tab_footnote( "Question text: How often can you trust the federal government in Washington to do what is right?" @@ -309,7 +309,7 @@ anes_des_gtsum3 <- anes_des %>% as_gt() %>% tab_header("American voter's trust in the federal government, 2020") %>% - tab_source_note("American National Election Studies, 2020") %>% + tab_source_note(md("*Source*: American National Election Studies, 2020")) %>% tab_footnote( "Question text: How often can you trust the federal government in Washington to do what is right?" @@ -356,7 +356,7 @@ anes_des_gtsum4 <- anes_des %>% as_gt() %>% tab_header( "American voter's trust in the federal government, 2020") %>% - tab_source_note("American National Election Studies, 2020") %>% + tab_source_note(md("*Source*: American National Election Studies, 2020")) %>% tab_footnote( "Question text: How often can you trust the federal government in Washington to do what is right?" @@ -408,7 +408,7 @@ anes_des_gtsum5 <- anes_des %>% in the federal government by whether they voted in the 2020 presidential election" ) %>% - tab_source_note("American National Election Studies, 2020") %>% + tab_source_note(md("*Source*: American National Election Studies, 2020")) %>% tab_footnote( "Question text: How often can you trust the federal government in Washington to do what is right?" diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index 425511d3..216c55ad 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -15,10 +15,10 @@ Not only is reproducibility a key component in ethical and accurate research, bu Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. 
The four main components that we should consider are: - - **Code**: source code used for data cleaning, analysis, modeling, and reporting - - **Data**: raw data used in the workflow, or if data are sensitive or proprietary, as much data as possible that would allow others to run our workflow or provide details on how to access the data (e.g., access to a restricted use file (RUF)) - - **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis - - **Methodology**: survey and analysis methodology, including rationale behind sample, questionnaire and analysis decisions, interpretations, and assumptions + - Code: source code used for data cleaning, analysis, modeling, and reporting + - Data: raw data used in the workflow, or if data are sensitive or proprietary, as much data as possible that would allow others to run our workflow or provide details on how to access the data (e.g., access to a restricted use file (RUF)) + - Environment: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis + - Methodology: survey and analysis methodology, including rationale behind sample, questionnaire and analysis decisions, interpretations, and assumptions In Chapter \@ref(c08-communicating-results), we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective analysts, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, we may be eager to jump into the data and make decisions as we go without full documentation. This can be challenging if we need to go back and make changes or understand even what we did a few months ago. It benefits other analysts and potentially our future selves to document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, we should decide which techniques and tools to use before starting a project (or very early on). @@ -136,7 +136,7 @@ set.seed(999) runif(5) ``` -Since the seed is set to `999`, running `runif(5)` multiple times always produces the same output. The choice of the seed number is up to the analyst. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that `set.seed()` should be used **before** random number generation. Run it once per program, and the seed is applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded. +Since the seed is set to `999`, running `runif(5)` multiple times always produces the same output. The choice of the seed number is up to the analyst. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that `set.seed()` should be used before random number generation. 
Run it once per program, and the seed is applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded. ### Descriptive names and labels diff --git a/11-missing-data.Rmd b/11-missing-data.Rmd index ca76a6bf..2a971978 100644 --- a/11-missing-data.Rmd +++ b/11-missing-data.Rmd @@ -71,15 +71,15 @@ Missing data in surveys refer to situations where participants do not provide co \index{Item nonresponse|(}There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data, on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data. -1. **Missing by design/questionnaire skip logic**: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them. +1. Missing by design/questionnaire skip logic: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them. -2. **Unintentional missing data**: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently [@mack; @Schafer2002]: +2. Unintentional missing data: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently [@mack; @Schafer2002]: - a. **Missing completely at random (MCAR)**: The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency. + a. Missing completely at random (MCAR): The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency. - b. 
**Missing at random (MAR)**: The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, we know the respondents' ages and older respondents choose not to answer specific questions but younger respondents do answer them. + b. Missing at random (MAR): The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, we know the respondents' ages and older respondents choose not to answer specific questions but younger respondents do answer them. - c. **Missing not at random (MNAR)**: The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity. + c. Missing not at random (MNAR): The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity. ## Assessing missing data @@ -163,7 +163,7 @@ In Figure \@ref(fig:missing-anes-ggmissfct), we can see that if respondents did \index{American National Election Studies (ANES)|)} \index{Residential Energy Consumption Survey (RECS)|(} -There are other visualizations that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the `bind_shadow()` function from the {naniar} package. This creates a **nabular** (combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific `NA` format. These `NA` columns are indicators of whether the value in the original data is missing or not. The example printed below shows how most levels of `HeatingBehavior` are not missing (`!NA`) in the NA variable of `HeatingBehavior_NA`, but those missing in `HeatingBehavior` are also missing in `HeatingBehavior_NA`. +There are other visualizations that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the `bind_shadow()` function from the {naniar} package. This creates a nabular (combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific `NA` format. These `NA` columns are indicators of whether the value in the original data is missing or not. The example printed below shows how most levels of `HeatingBehavior` are not missing (`!NA`) in the NA variable of `HeatingBehavior_NA`, but those missing in `HeatingBehavior` are also missing in `HeatingBehavior_NA`. ```{r} #| label: missing-recs-shadow @@ -298,7 +298,7 @@ pct_2 <- heat_cntl_2 %>% ``` -If we ran the first analysis, we would say that `r pct_1`% **of households with heat** use a programmable or smart thermostat for heating their home. If we used the results from the second analysis, we would say that `r pct_2`% **of households** use a programmable or smart thermostat for heating their home. The distinction between the two statements is made bold for emphasis. Skip patterns often change the universe we are talking about and need to be carefully examined. 
\index{Residential Energy Consumption Survey (RECS)|)} +If we ran the first analysis, we would say that `r pct_1`% of households with heat use a programmable or smart thermostat for heating their home. If we used the results from the second analysis, we would say that `r pct_2`% of households use a programmable or smart thermostat for heating their home. Note the difference in the universe between the two statements: the first includes only households with heat, while the second includes all households. Skip patterns often change the universe we are talking about and need to be carefully examined. \index{Residential Energy Consumption Survey (RECS)|)} \index{American National Election Studies (ANES)|(} Filtering to the correct universe is important when handling these types of missing data. The `nabular` we created above can also help with this. If we have `NA_skip` values in the shadow, we can make sure that we filter out all of these values and only include relevant missing values. To do this with survey data, we could first create the `nabular`, then create the \index{Functions in srvyr!as\_survey\_design|(} design object on that data, and then use the shadow variables to assist with filtering the data. Let's use the `nabular` we created above for ANES 2020 (`anes_2020_shadow`) to create the design object. diff --git a/13-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd index db282957..7ab0ec83 100644 --- a/13-ncvs-vignette.Rmd +++ b/13-ncvs-vignette.Rmd @@ -64,23 +64,23 @@ The NCVS User Guide [@ncvs_user_guide] uses the following notation: In this vignette, we discuss four estimates: -1. *Victimization totals* estimate the number of criminal victimizations with a given characteristic. As demonstrated below, these can be calculated from any of the data files. The estimated victimization total, $\hat{t}_D$ for domain $D$ is estimated as +1. Victimization totals estimate the number of criminal victimizations with a given characteristic. As demonstrated below, these can be calculated from any of the data files. The estimated victimization total, $\hat{t}_D$ for domain $D$ is estimated as $$ \hat{t}_D = \sum_{ijkl \in D} v_{ijkl}$$ where $v_{ijkl}$ is the series-adjusted victimization weight for household $i$, respondent $j$, reporting period $k$, and victimization $l$, represented in the data as `WGTVICCY`. -2. *Victimization proportions* estimate characteristics among victimizations or victims. Victimization proportions are calculated using the incident data file. The estimated victimization proportion for domain $D$ across level $a$ of covariate $A$, $\hat{p}_{A_a,D}$ is +2. Victimization proportions estimate characteristics among victimizations or victims. Victimization proportions are calculated using the incident data file. The estimated victimization proportion for domain $D$ across level $a$ of covariate $A$, $\hat{p}_{A_a,D}$ is $$ \hat{p}_{A_a,D} =\frac{\sum_{ijkl \in A_a, D} v_{ijkl}}{\sum_{ijkl \in D} v_{ijkl}}.$$ The numerator is the number of incidents with a particular characteristic in a domain, and the denominator is the number of incidents in a domain. -3. *Victimization rates* are estimates of the number of victimizations per 1,000 persons or households in the population^[BJS publishes victimization rates per 1,000, which are also presented in these examples.]. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime $C$ in domain $D$ is +3. 
Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population^[BJS publishes victimization rates per 1,000, which are also presented in these examples.]. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime $C$ in domain $D$ is $$\hat{VR}_{C,D}= \frac{\sum_{ijkl \in C,D} v_{ijkl}}{\sum_{ijk \in D} w_{ijk}}\times 1000$$ where $w_{ijk}$ is the person weight (`WGTPERCY`) for personal crimes or household weight (`WGTHHCY`) for household crimes. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. Notice that the weights in the numerator and denominator are different; this is important, and in the syntax and examples below, we discuss how to make an estimate that involves two weights. -4. *Prevalence rates* are estimates of the percentage of the population (persons or households) who are victims of a crime. These are estimated using the household or person-level data files. The estimated prevalence rate for crime $C$ in domain $D$ is +4. Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. These are estimated using the household or person-level data files. The estimated prevalence rate for crime $C$ in domain $D$ is $$ \hat{PR}_{C, D}= \frac{\sum_{ijk \in {C,D}} I_{ij}w_{ijk}}{\sum_{ijk \in D} w_{ijk}} \times 100$$ @@ -600,13 +600,13 @@ pers_des <- pers_vsum_slim %>% Now that we have prepared our data and created the design objects, we can calculate our estimates. As a reminder, those are: -1. *Victimization totals* estimate the number of criminal victimizations with a given characteristic. +1. Victimization totals estimate the number of criminal victimizations with a given characteristic. -2. *Victimization proportions* estimate characteristics among victimizations or victims. +2. Victimization proportions estimate characteristics among victimizations or victims. -3. *Victimization rates* are estimates of the number of victimizations per 1,000 persons or households in the population. +3. Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population. -4. *Prevalence rates* are estimates of the percentage of the population (persons or households) who are victims of a crime. +4. Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. ### Estimation 1: Victimization totals {#vic-tot} @@ -1004,7 +1004,7 @@ The output of the statistical test shown in Table \@ref(tab:ncvs-vign-prop-stat- ## Exercises -1. What proportion of completed motor vehicle thefts are **not** reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529). +1. What proportion of completed motor vehicle thefts are not reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529). 2. How many violent crimes occur in each region? 
diff --git a/90-AppendixA.Rmd b/90-AppendixA.Rmd index 806cdb94..8fedca22 100644 --- a/90-AppendixA.Rmd +++ b/90-AppendixA.Rmd @@ -132,7 +132,7 @@ make_section <- function(sec){ de <- vi %>% pull(Description) cat(str_c("Description: ", de, "\n\n")) qt <- vi %>% pull(Question) - if (!is.na(qt)) cat(str_c("Question: ", qt, "\n\n")) + if (!is.na(qt)) cat(str_c("Question text: ", qt, "\n\n")) vc <- vi %>% pull(VarClass) cat(str_c("Variable class: ", vc, "\n\n")) vt <- vi %>% pull(VarType) %>% unlist() diff --git a/91-AppendixB.Rmd b/91-AppendixB.Rmd index df818530..13b463ec 100644 --- a/91-AppendixB.Rmd +++ b/91-AppendixB.Rmd @@ -79,7 +79,7 @@ make_section <- function(sec){ de <- vi %>% pull(Description) cat(str_c("Description: ", de, "\n\n")) qt <- vi %>% pull(Question) - if (!is.na(qt)) cat(str_c("Question: ", qt, "\n\n")) + if (!is.na(qt)) cat(str_c("Question text: ", qt, "\n\n")) vt <- vi %>% pull(VarType) %>% unlist() if (any(c("factor", "character", "logical") %in% vt)){