When working with new survey data, analysts should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20
(DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent’s age is now in a variable called Age
, and gender is in a variable called Gender
. These descriptive variables are included in the {srvyrexploR} package, and Table 4.1 displays the list of these renamed variables. A complete overview of all variables can be found in Appendix A.
The RECS weights are in the variable NWEIGHT and will be described in more detail in Chapter 10. An overview of all variables can be found in Appendix B.
We will be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).
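The set-up code itself is not shown on this page; below is a minimal sketch of what it might look like, assuming the {srvyr} package is loaded and that the design variables follow the descriptions in Chapter 4 (the ANES weight, strata, and PSU variables Weight, Stratum, and VarUnit; the RECS replicate weights NWEIGHT1-NWEIGHT60; and a targetpop value used to rescale the ANES weight). Refer to Chapter 4 for the authoritative set-up code.

library(srvyr)

# Hypothetical target population size, for illustration only
targetpop <- 231592693

# ANES: rescale the weight to sum to the population, then declare the design
anes_des <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop) %>%
  as_survey_design(weights = Weight,
                   strata = Stratum,
                   ids = VarUnit,
                   nest = TRUE)

# RECS: jackknife replicate-weight design
recs_des <- recs_2020 %>%
  as_survey_rep(weights = NWEIGHT,
                repweights = NWEIGHT1:NWEIGHT60,
                type = "JK1",
                scale = 59 / 60,
                mse = TRUE)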
Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter 6, which looked at the relationships between just two variables. For example, in Example 3 in Section 6.3.2, we investigated if there is a relationship between the electrical bill cost and whether or not the household used air-conditioning. However, there are potentially other elements that could influence the cost of electric bills in a household (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.).
T-tests only allow us to investigate the relationship of one independent variable at a time, but using models we can look into multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter we will cover Analysis of Variance (ANOVA) and linear regression models following common normal (Gaussian) and logit distributions. Jonas Kristoffer Lindeløv has an interesting discussion of many statistical tests and models being equivalent to a linear model. For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels.
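As a quick illustration of that equivalence (a sketch using base R and the built-in mtcars data, not the survey data used elsewhere in this chapter), a two-sample t-test with equal variances and a linear model with a single binary predictor give the same p-value:

# Two-sample t-test (equal variances) on mpg by transmission type
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)$p.value

# The same comparison framed as a linear model with one binary predictor
summary(lm(mpg ~ am, data = mtcars))$coefficients["am", "Pr(>|t|)"]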
When modeling data, it is helpful to first create an equation that provides an overview as to what it is that we are modeling. The main structure of these models is as follows:
\[y_i=\beta_0 +\sum_{i=1}^p \beta_i x_i + \epsilon_i\]
where \(y_i\) is the outcome, \(\beta_0\) is an intercept, \(x_1, \cdots, x_p\) are the predictors with \(\beta_1, \cdots, \beta_p\) as the associated coefficients, and \(\epsilon_i\) is the error. Not all models will have all components. For example, some models may not include an intercept (\(\beta_0\)), may have interactions between different independent variables (\(x_i\)), or may have different underlying structures for the dependent variable (\(y_i\)). However, all linear models have the independent variables related to the dependent variable in a linear form.
To specify these models in R, the formulas are the same with both survey data and other data. The left side of the formula is the response/dependent variable, and the right side of the formula has the predictor/independent variable(s). There are many symbols used in R to specify the formula.
For example, a linear formula mathematically notated as
\[y_i=\beta_0+\beta_1 x_i+\epsilon_i\]
would be specified in R as y~x where the intercept is not explicitly included. To fit a model with no intercept, that is,
\[y_i=\beta_1 x_i+\epsilon_i\]
it can be specified in R as y~x-1. Formula notation details in R can be found in the help file for formula26. A quick overview of the common formula notation is in Table 7.1:
There are often multiple ways to specify the same formula. For example, consider the following equation using the mtcars dataset that is built into R:
\[mpg_i=\beta_0+\beta_1cyl_{i}+\beta_2disp_{i}+\beta_3hp_{i}+\beta_4cyl_{i}disp_{i}+\beta_5cyl_{i}hp_{i}+\beta_6disp_{i}hp_{i}+\epsilon_i\]
This could be specified in R code as any of the following:
mpg ~ (cyl + disp + hp)^2
mpg ~ cyl + disp + hp + cyl:disp + cyl:hp + disp:hp
mpg ~ cyl*disp + cyl*hp + disp*hp
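One way to convince yourself that these three specifications expand to the same model terms is to inspect their term labels (a sketch using base R):

f1 <- mpg ~ (cyl + disp + hp)^2
f2 <- mpg ~ cyl + disp + hp + cyl:disp + cyl:hp + disp:hp
f3 <- mpg ~ cyl * disp + cyl * hp + disp * hp

# All three return the same main effects and two-way interactions
attr(terms(f1), "term.labels")
attr(terms(f2), "term.labels")
attr(terms(f3), "term.labels")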
Note that the following two specifications are not the same:
The two notations are implemented differently: using : only includes the interactions and not the main effects, while using * includes the main effects and all possible interactions. Table 7.2 provides an overview of the syntax and the differences between the two notations.
TABLE 7.2: Syntax and differences between the : and * notations

Symbol | Syntax | Formula
---|---|---
: | mpg ~ cyl:disp:hp | \(mpg_i = \beta_0+\beta_4cyl_{i}disp_{i}+\beta_5cyl_{i}hp_{i}+\beta_6disp_{i}hp_{i}+\epsilon_i\)
* | mpg ~ cyl*disp*hp | \(mpg_i= \beta_0+\beta_1cyl_{i}+\beta_2disp_{i}+\beta_3hp_{i}+\beta_4cyl_{i}disp_{i}+\beta_5cyl_{i}hp_{i}+\beta_6disp_{i}hp_{i}+\beta_7cyl_{i}disp_{i}hp_{i}+\epsilon_i\)
When using non-survey data such as experimental or observational data, researchers will use the glm()
function for linear models. With survey data, however, we use svyglm()
from the {survey} package to ensure that we account for the survey design and weights in modeling27. This allows us to generalize a model to the target population and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter 6, modeling survey data cannot be directly done in {srvyr}, but can be done in the {survey} package (Lumley 2010, 2023). In this chapter, we will provide syntax and examples for linear models, including ANOVA, normal linear regression, and logistic regression. For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to Lumley (2010). Lumley (2010) also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book.
Using this framework, an ANOVA test is also a linear model; we can re-frame the problem as:
\[ y_i=\sum_{i=1}^k \mu_i x_i + \epsilon_i\]
where \(x_i\) is a group indicator for groups \(1, \cdots, k\).
Some assumptions when using ANOVA on survey data include:
The function svyglm() does not have the design as the first argument, so the dot (.) notation is used to pass it with a pipe (see Chapter 6 for more details). The default for missing data is na.omit, which means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data, and we recommend looking at the help documentation for na.omit (run help(na.omit) or ?na.omit) for more information on options to use for na.action. For a discussion of how to handle missing data see Chapter 11.
anova_out <- recs_des %>%
  svyglm(design = .,
         formula = SummerTempNight ~ Region,
         na.action = na.omit)

tidy(anova_out)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
In the output above, we can see the estimated coefficients (estimate
), estimated standard errors of the coefficients (std.error
), the t-statistic (statistic
), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, 1.34 (p-value<0.0001) degrees higher than in the Northeast during summer nights and each region sets their thermostats at significantly higher temperatures than the Northeast.
If we wanted to change the reference value, we would reorder the factor before modeling using the function relevel() from {stats} or using one of many factor ordering functions in {forcats} such as fct_relevel() or fct_infreq(). For example, if we wanted the reference level to be the Midwest region, we could use the following code. Note the usage of the gt() function on top of tidy() to print a nice-looking output table; we will go over more usage of this package in Chapter 8.
anova_out_relevel <- recs_des %>%
  mutate(Region = fct_relevel(Region, "Midwest", after = 0)) %>%
  svyglm(design = .,
         formula = SummerTempNight ~ Region)
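The table-printing code itself is not shown in this extract; a possible sketch is below, assuming the {broom} and {gt} packages are loaded and using pretty_p_value(), which is used later in this chapter to format p-values.

tidy(anova_out_relevel) %>%
  mutate(p.value = pretty_p_value(p.value)) %>%
  gt() %>%
  fmt_number()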
TABLE 7.3: ANOVA output for estimates of thermostat temperature setting at night by region with Midwest as the reference region, RECS 2020

term | estimate | std.error | statistic | p.value
---|---|---|---|---
(Intercept) | 71.04 | 0.09 | 791.83 | <0.0001
RegionNortheast | −1.34 | 0.14 | −9.68 | <0.0001
RegionSouth | 0.71 | 0.10 | 6.91 | <0.0001
RegionWest | 1.47 | 0.16 | 9.17 | <0.0001
This output now has the coefficients indicating the difference in temperature relative to the Midwest region. For example, in the Northeast, temperatures are set, on average, 1.34 (p-value<0.0001) degrees lower than in the Midwest during summer nights, and each region sets its thermostats at significantly lower temperatures than the Midwest. This is the reverse of what we saw in the prior model, as we are still comparing the same regions, just from a different reference point.
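A quick arithmetic check of how the two parameterizations line up, using the values in Table 7.3: the Midwest intercept plus the Northeast coefficient recovers the estimated Northeast mean.

# Estimated Northeast summer night temperature from the releveled model
71.04 + (-1.34)   # 69.7 degrees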
Normal linear regression is a more generalized method than ANOVA: we fit a model of a continuous outcome with any number of categorical or continuous predictors, whereas ANOVA only has categorical predictors. It is similarly specified as:
\[\begin{equation} y_i=\beta_0 +\sum_{i=1}^p \beta_i x_i + \epsilon_i \tag{7.1} \end{equation}\]
where \(y_i\) is the outcome, \(\beta_0\) is an intercept, \(x_1, \cdots, x_p\) are the predictors with \(\beta_1, \cdots, \beta_p\) as the associated coefficients, and \(\epsilon_i\) is the error.
Assumptions in normal linear regression using survey data include:
The syntax for this regression uses the same function as ANOVA, but can have more than one variable listed on the right-hand side of the formula:
des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL
  )
The arguments are:
- formula: Formula in the form of y~x
As discussed in Section 7.1, the formula on the right-hand side can be specified in many ways, whether interactions are desired or not, for example.
On RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure 7.1, each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases so does the amount of money spent on electricity).
recs_2020 %>%
  ggplot(aes(
    x = TOTSQFT_EN,
    y = DOLLAREL,
    weight = NWEIGHT / 1000000
  )) +
  geom_hex() +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Housing Units\n(Millions)",
    labels = scales::comma,
    colors = book_colors[c(3, 2, 1)]
  ) +
  xlab("Total square footage") + ylab("Amount spent on electricity") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_continuous(labels = scales::comma_format()) +
  theme_minimal()
Given that the plot shows a potential increasing relationship between square footage and electricity expenditure, fitting a model will allow us to determine if the relationship is statistically significant. The model is fit below with electricity expenditure as the outcome.
m_electric_sqft <- recs_des %>%
  svyglm(design = .,
         formula = DOLLAREL ~ TOTSQFT_EN,
         na.action = na.omit)
term | estimate | std.error | statistic | p.value
---|---|---|---|---
(Intercept) | 836.72 | 12.77 | 65.51 | <0.0001
TOTSQFT_EN | 0.30 | 0.01 | 41.67 | <0.0001
In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by 29.9 cents and that square footage is significantly associated with electricity expenditure (p-value<0.0001).
This is a very simple model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with one-variable models can help researchers understand what potential relationships there are between variables before fitting more complex models. Often researchers start with known relationships before building models to determine what impact additional variables have on the model.
In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether air-conditioning is used (logical/categorical), with all two-way interactions also included. In this example, we are choosing to fit this model without an intercept (using -1 in the formula). This will result in an intercept estimate for each region instead of a single intercept for all data.
m_electric_multi <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (Region + Urbanicity + TOTSQFT_EN + ACUsed)^2 - 1,
    na.action = na.omit
  )
term | estimate | std.error | statistic | p.value
---|---|---|---|---
RegionNortheast | 543.73 | 56.57 | 9.61 | <0.0001
RegionMidwest | 702.16 | 78.12 | 8.99 | <0.0001
RegionSouth | 938.74 | 46.99 | 19.98 | <0.0001
RegionWest | 603.27 | 36.31 | 16.61 | <0.0001
UrbanicityUrban Cluster | 73.03 | 81.50 | 0.90 | 0.3764
UrbanicityRural | 204.13 | 80.69 | 2.53 | 0.0161
TOTSQFT_EN | 0.24 | 0.03 | 8.65 | <0.0001
ACUsedTRUE | 252.06 | 54.05 | 4.66 | <0.0001
RegionMidwest:UrbanicityUrban Cluster | 183.06 | 82.38 | 2.22 | 0.0328
RegionSouth:UrbanicityUrban Cluster | 152.56 | 76.03 | 2.01 | 0.0526
RegionWest:UrbanicityUrban Cluster | 98.02 | 75.16 | 1.30 | 0.2007
RegionMidwest:UrbanicityRural | 312.83 | 50.88 | 6.15 | <0.0001
RegionSouth:UrbanicityRural | 220.00 | 55.00 | 4.00 | 0.0003
RegionWest:UrbanicityRural | 180.97 | 58.70 | 3.08 | 0.0040
RegionMidwest:TOTSQFT_EN | −0.05 | 0.02 | −2.09 | 0.0441
RegionSouth:TOTSQFT_EN | 0.00 | 0.03 | 0.11 | 0.9109
RegionWest:TOTSQFT_EN | −0.03 | 0.03 | −1.00 | 0.3254
RegionMidwest:ACUsedTRUE | −292.97 | 60.24 | −4.86 | <0.0001
RegionSouth:ACUsedTRUE | −294.07 | 57.44 | −5.12 | <0.0001
RegionWest:ACUsedTRUE | −77.68 | 47.05 | −1.65 | 0.1076
UrbanicityUrban Cluster:TOTSQFT_EN | −0.04 | 0.02 | −1.63 | 0.1112
UrbanicityRural:TOTSQFT_EN | −0.06 | 0.02 | −2.60 | 0.0137
UrbanicityUrban Cluster:ACUsedTRUE | −130.23 | 60.30 | −2.16 | 0.0377
UrbanicityRural:ACUsedTRUE | −33.80 | 59.30 | −0.57 | 0.5724
TOTSQFT_EN:ACUsedTRUE | 0.08 | 0.02 | 3.48 | 0.0014
As shown above, there are many terms in this model. To test whether coefficients for a term are different from zero, the function regTermTest()
can be used. For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows:
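The call producing the Wald test output below is not shown in this extract; a sketch of what it might look like, assuming the fitted model object m_electric_multi from above and the regTermTest() function from {survey}, is:

regTermTest(m_electric_multi, ~Urbanicity:Region)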
## Wald test for Urbanicity:Region
## in svyglm(design = ., formula = DOLLAREL ~ (Region + Urbanicity +
## TOTSQFT_EN + ACUsed)^2 - 1, na.action = na.omit)
## F = 6.851 on 6 and 35 df: p= 7.2e-05
This output indicates there is a significant interaction between urbanicity and region (p-value<0.0001).
To examine the predictions, residuals, and more from the model, the function augment() from {broom} can be used. The augment() function will return a tibble with the independent and dependent variables and other fit statistics. The augment() function has not been specifically written for objects of class svyglm, and as such, a warning is displayed indicating this. As it was not written exactly for this class of objects, a little tweaking needs to be done after using augment(). To obtain the standard error of the predicted values (.se.fit), we need to use the attr() function on the predicted values (.fitted) created by augment(). Additionally, the predicted values are output as a svrep type of object. If we want to plot the predicted values, we need to use as.numeric() to get them into a numeric format to work with. However, it is important to note that this adjustment must be completed after the standard error adjustment.
fitstats <-
  augment(m_electric_multi) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))

fitstats
## # A tibble: 18,496 × 13
## DOLLAREL Region Urbanicity TOTSQFT_EN ACUsed `(weights)` .fitted
## <dbl> <fct> <fct> <dbl> <lgl> <dbl> <dbl>
## 1 1955. West Urban Area 2100 TRUE 0.492 1397.
## 2 713. South Urban Area 590 TRUE 1.35 1090.
## 3 335. West Urban Area 900 TRUE 0.849 1043.
## 4 1425. South Urban Area 2100 TRUE 0.793 1584.
## 5 1087 Northeast Urban Area 800 TRUE 1.49 1055.
## 6 1896. South Urban Area 4520 TRUE 1.09 2375.
## 7 1418. South Urban Area 2100 TRUE 0.851 1584.
## 8 1237. South Urban Clust… 900 FALSE 1.45 1349.
## 9 538. South Urban Area 750 TRUE 0.185 1142.
These results can then be used in a variety of ways, including examining residual plots as illustrated in the code below and Figure 7.2. In the residual plot, we look for any patterns in the data. If we do see patterns, this may indicate a violation of the homoscedasticity assumption, and the standard errors of the coefficients may be incorrect. In Figure 7.2, we do not see a strong pattern, indicating that our assumption of homoscedasticity may hold.
fitstats %>%
  ggplot(aes(x = .fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  xlab("Fitted value of electricity cost") +
  ylab("Residual of model") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_continuous(labels = scales::dollar_format())
Additionally, augment() can be used to predict outcomes for data not used in modeling. Perhaps we would like to predict the energy expenditure for a home in an urban area in the South that uses air-conditioning and is 2,500 square feet. To do this, we first make a tibble including that additional data and then use the newdata argument in the augment() function. As before, to obtain the standard error of the predicted values, we need to use the attr() function.
add_data <- recs_2020 %>%
  select(DOEID, Region, Urbanicity,
         TOTSQFT_EN, ACUsed,
         DOLLAREL) %>%
  rbind(
    tibble(
      DOEID = NA,
      Region = "South",
      Urbanicity = "Urban Area",
      TOTSQFT_EN = 2500,
      ACUsed = TRUE,
      DOLLAREL = NA
    )
  ) %>%
  tail(1)

pred_data <- augment(m_electric_multi, newdata = add_data) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))

pred_data
## # A tibble: 1 × 8
## DOEID Region Urbanicity TOTSQFT_EN ACUsed DOLLAREL .fitted .se.fit
## <dbl> <fct> <fct> <dbl> <lgl> <dbl> <dbl> <dbl>
## 1 NA South Urban Area 2500 TRUE NA 1715. 22.6
In the above example, it is predicted that the energy expenditure would be $1,715.
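If an interval around this prediction is needed, one rough sketch (assuming an approximate normal distribution for the estimator; a t-based multiplier with the design degrees of freedom would be more careful) is to combine .fitted and .se.fit from pred_data:

pred_data %>%
  mutate(lower = .fitted - 1.96 * .se.fit,
         upper = .fitted + 1.96 * .se.fit) %>%
  select(.fitted, .se.fit, lower, upper)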
Logistic regression is used to model binary outcomes, such as whether or not someone voted. There are several instances where an outcome may not originally be binary but is collapsed into being binary. For example, given that gender is often asked in surveys with multiple response options and not on a binary scale, many researchers now code gender in logistic modeling as cis-male compared to not cis-male. We could also convert a 4-point Likert scale with levels of "Strongly Agree", "Agree", "Disagree", and "Strongly Disagree" into a binary variable by grouping the agreement levels into one category and the disagreement levels into a second category, as sketched below.
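A hypothetical sketch of that kind of collapsing with {forcats}; the vector likert below is illustrative and not a variable from the ANES or RECS data.

library(forcats)

likert <- factor(
  c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"),
  levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
)

# Collapse the four levels into a binary agree/disagree factor
fct_collapse(likert,
             Agree = c("Strongly Agree", "Agree"),
             Disagree = c("Disagree", "Strongly Disagree"))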
Logistic regression is a specific case of the generalized linear model (GLM). A GLM uses a link function to connect the response variable to the linear model. If we tried to use a normal linear regression with a binary outcome, many assumptions would not hold; namely, the response is not continuous. Logistic regression allows us to link a linear model of the covariates to the propensity of an outcome. In logistic regression, the link is the logit function. Specifically, the model is specified as follows:
\[ y_i \sim \text{Bernoulli}(\pi_i)\]
\[\begin{equation} \log \left(\frac{\pi_i}{1-\pi_i} \right)=\beta_0 +\sum_{i=1}^n \beta_i x_i \tag{7.2} \end{equation}\]
which can be re-expressed as
\[ \pi_i=\frac{\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}{1+\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}.\] where \(y_i\) is the outcome, \(\beta_0\) is an intercept, and \(x_1, \cdots, x_n\) are the predictors with \(\beta_1, \cdots, \beta_n\) as the associated coefficients.
The Bernoulli distribution is a distribution that has an outcome of 0 or 1 given some probability, \(\pi_i\) in this case, and we model \(\pi_i\) as a function of the covariates \(x_i\) using this logit link.
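A small worked example of moving between the log-odds scale and the probability scale (base R; the value 0.435 is only illustrative):

log_odds <- 0.435

# Inverse logit: probability implied by the log-odds
exp(log_odds) / (1 + exp(log_odds))   # same as plogis(log_odds)

# Logit: back from probability to log-odds
qlogis(plogis(log_odds))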
Assumptions in logistic regression using survey data include:
The syntax for logistic regression is as follows:
des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL,
    family = quasibinomial
  )
formula
: Formula in the form of y~x
design
: a tbl_svy
object created by as_survey
family
: the error distribution/link function to be used in the modelNote svyglm()
is the same function used in both ANOVA and linear regression. However, we’ve added the link function quasibinomial. While we can use the binomial link function, it is recommended to use the quasibinomial as our weights may not be integers, and the quasibinomial also allows for overdispersion. The quasibinomial family has a default logit link which is what is specified in the equations above. When specifying the outcome variable, it will likely be specified in one of two ways with survey data:
Note svyglm()
is the same function used in both ANOVA and normal linear regression. However, we’ve added the link function quasibinomial. While we can use the binomial link function, it is recommended to use the quasibinomial as our weights may not be integers, and the quasibinomial also allows for overdispersion (McCullagh and Nelder 1989; Lumley 2010; R Core Team 2023). The quasibinomial family has a default logit link which is what is specified in the equations above. When specifying the outcome variable, it will likely be specified in one of three ways with survey data:
In the following example, the ANES data is used, and we are modeling whether someone usually has trust in the government29 by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an “Other” category. We first create a binary outcome for trusting in the government and plot the data. A scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.
-anes_des_der <- anes_des %>%
- mutate(TrustGovernmentUsually = case_when(
- is.na(TrustGovernment) ~ NA,
- TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
- ))
-
-anes_des_der %>%
- group_by(VotedPres2020_selection) %>%
- summarize(pct_trust = survey_mean(TrustGovernmentUsually,
- na.rm = TRUE,
- proportion = TRUE,
- vartype = "ci"),
- .groups = "drop") %>%
- filter(complete.cases(.)) %>%
- ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
- fill = VotedPres2020_selection)) +
- geom_bar(stat = "identity") +
- geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
- width = .2) +
- scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
- xlab("Election choice (2022)") +
- ylab("Usually trust the government") +
- scale_y_continuous(labels = scales::percent) +
- guides(fill = "none") +
- theme_minimal()
In the following example, the ANES data is used, and we are modeling whether someone usually has trust in the government28 by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an “Other” category. We first create a binary outcome for trusting in the government by collapsing “Always” and “Most of the time” into a single factor level, and the other response options (“About half the time”, “Some of the time”, and “Never”) into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.
+anes_des_der <- anes_des %>%
+ mutate(TrustGovernmentUsually = case_when(
+ is.na(TrustGovernment) ~ NA,
+ TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
+ ))
+
+anes_des_der %>%
+ group_by(VotedPres2020_selection) %>%
+ summarize(pct_trust = survey_mean(TrustGovernmentUsually,
+ na.rm = TRUE,
+ proportion = TRUE,
+ vartype = "ci"),
+ .groups = "drop") %>%
+ filter(complete.cases(.)) %>%
+ ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
+ fill = VotedPres2020_selection)) +
+ geom_bar(stat = "identity") +
+ geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
+ width = .2) +
+ scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
+ xlab("Election choice (2020)") +
+ ylab("Usually trust the government") +
+ scale_y_continuous(labels = scales::percent) +
+ guides(fill = "none") +
+ theme_minimal()
By looking at Figure 7.3, it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and Other candidates. To determine if this insight is accurate, we next fit the model.
logistic_trust_vote <- anes_des_der %>%
  svyglm(design = .,
         formula = TrustGovernmentUsually ~ VotedPres2020_selection,
         family = quasibinomial)
tidy(logistic_trust_vote) %>%
  mutate(p.value = pretty_p_value(p.value)) %>%
  gt() %>%
  fmt_number()
term | estimate | std.error | statistic | p.value
---|---|---|---|---
(Intercept) | −1.96 | 0.07 | −27.45 | <0.0001
VotedPres2020_selectionTrump | 0.43 | 0.09 | 4.72 | <0.0001
VotedPres2020_selectionOther | −0.65 | 0.44 | −1.49 | 0.1429
In the output above, we can see the estimated coefficients (estimate
), estimated standard errors of the coefficients (std.error
), the t-statistic (statistic
), and the p-value for each coefficient. This output indicates that respondents who voted for Trump have a higher log-odds of usually trusting the government (a coefficient of 0.435) compared to those who voted for Biden (the reference level).
Sometimes it is easier to talk about the odds instead of the likelihood. To do this, we need to exponentiate the coefficients. We can use the same tidy()
function, but include the argument exponentiate = TRUE
to see the odds.
tidy(logistic_trust_vote, exponentiate = TRUE) %>%
  select(term, estimate) %>%
  gt() %>%
  fmt_number()
term | estimate
---|---
(Intercept) | 0.14
VotedPres2020_selectionTrump | 1.54
VotedPres2020_selectionOther | 0.52
We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump are 154% of the odds for a person who voted for Biden (the reference level). In comparison, a person who voted for neither Biden nor Trump has 52% of the odds of usually trusting the government compared to someone who voted for Biden.
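As a quick check of these numbers, the exponentiated estimates can be reproduced directly from the coefficients in the earlier table:

exp(-1.96)          # odds of usually trusting the government among Biden voters
exp(0.435)          # odds ratio for Trump voters relative to Biden voters
exp(-1.96 + 0.435)  # implied odds for Trump voters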
As with linear regression, the augment() function can be used to predict values. By default, the prediction is on the scale of the link function and not the probability. To predict the probability, add the argument type.predict = "response" as demonstrated below:
logistic_trust_vote %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(TrustGovernmentUsually,
         VotedPres2020_selection,
         .fitted,
         .se.fit)
## # A tibble: 6,212 × 4
## TrustGovernmentUsually VotedPres2020_selection .fitted .se.fit
## <lgl> <fct> <dbl> <dbl>
Example 2: Interaction Effects
Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden among all voters in 2020, we could include EarlyVote2020 and Gender in our model. First, we need to subset the data to 2020 voters and then create an indicator for whether the respondent voted for Biden.
anes_des_ind <- anes_des %>%
  filter(!is.na(VotedPres2020_selection)) %>%
  mutate(VoteBiden = case_when(VotedPres2020_selection == "Biden" ~ 1,
                               TRUE ~ 0))
Let's first look at the main effects of gender and early voting behavior.
log_biden_main <- anes_des_ind %>%
  mutate(EarlyVote2020 = fct_relevel(EarlyVote2020, "No", after = 0)) %>%
  svyglm(design = .,
         formula = VoteBiden ~ EarlyVote2020 + Gender,
         family = quasibinomial)
TABLE 7.8: Logistic regression output for predicting voting for Biden given early voting behavior and gender - main effects only, ANES 2020

term | estimate | std.error | statistic | p.value
---|---|---|---|---
(Intercept) | 0.06 | 0.05 | 1.13 | 0.2650
EarlyVote2020Yes | 0.56 | 0.17 | 3.35 | 0.0016
GenderFemale | 0.17 | 0.07 | 2.43 | 0.0187
This main effect model indicates that respondents who voted early in 2020 are more likely to vote for Biden (a coefficient of 0.559, p-value=0.0016) compared to respondents who did not vote early in the 2020 election (the reference level). We also see that gender is significant, with females more likely to vote for Biden compared to males (p-value=0.0187).
It is possible that there is an interaction between gender and early voting behavior. To determine this we can create a model that includes the interaction effects:
log_biden_int <- anes_des_ind %>%
  mutate(EarlyVote2020 = fct_relevel(EarlyVote2020, "No", after = 0)) %>%
  svyglm(design = .,
         formula = VoteBiden ~ (EarlyVote2020 + Gender)^2,
         family = quasibinomial)
TABLE 7.9: Logistic regression output for predicting voting for Biden given early voting behavior and gender - with interaction, ANES 2020

term | estimate | std.error | statistic | p.value
---|---|---|---|---
(Intercept) | 0.08 | 0.05 | 1.51 | 0.1386
EarlyVote2020Yes | 0.10 | 0.25 | 0.41 | 0.6837
GenderFemale | 0.14 | 0.07 | 1.86 | 0.0695
EarlyVote2020Yes:GenderFemale | 0.90 | 0.30 | 3.04 | 0.0038
The results from the interaction model show that the interaction between early voting behavior and gender is significant. To better understand what this interaction means, we will want to plot the predicted probabilities with an interaction plot. Let's first obtain the predicted probabilities for each possible combination of variables using the augment() function.
log_biden_pred <- log_biden_int %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(VoteBiden, EarlyVote2020, Gender, .fitted, .se.fit)
To create an interaction plot, the y-axis will be the predicted probabilities, and one of our x-variables will be on the x-axis and the other will be represented by multiple lines. Figure 7.4 shows the interaction plot with gender on the x-axis and early voting behavior represented by the lines.
log_biden_pred %>%
  filter(VoteBiden == 1) %>%
  distinct() %>%
  arrange(Gender, EarlyVote2020) %>%
  mutate(EarlyVote2020 = fct_reorder2(EarlyVote2020, Gender, .fitted)) %>%
  ggplot(aes(x = Gender, y = .fitted, group = EarlyVote2020,
             color = EarlyVote2020, linetype = EarlyVote2020)) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(values = book_colors[c(2, 4)]) +
  ylab("Predicted Probability of Voting for Biden") +
  labs(color = "Voted Early",
       linetype = "Voted Early") +
  coord_cartesian(ylim = c(0, 1)) +
  guides(fill = "none") +
  theme_minimal()
From this plot, we can see that respondents who indicated a male gender had roughly the same probability of voting for Biden whether or not they voted early. However, females who voted early were more likely to vote for Biden than females who did not vote early.
Interactions in models can be difficult to understand from the coefficients alone. Using these interaction plots can help others understand the nuances of the results, and they often become even more helpful with more than two levels in a given factor (e.g., education or race/ethnicity).
7.5 Exercises
- Using the RECS data, is there a relationship between the housing unit type (HousingUnitType) and total energy expenditure (TOTALDOL)? First, find the average energy expenditure by housing unit type as a descriptive analysis and then do the test. The reference level in the comparison should be the housing unit type that is most common.
recs_des %>%
  group_by(HousingUnitType) %>%
  summarize(Expense = survey_mean(TOTALDOL, na.rm = TRUE),
            HUs = survey_total()) %>%
  arrange(desc(HUs))
## # A tibble: 5 × 5
## HousingUnitType Expense Expense_se HUs HUs_se
## <fct> <dbl> <dbl> <dbl> <dbl>
exp_unit_out <- recs_des %>%
  mutate(HousingUnitType = fct_infreq(HousingUnitType, NWEIGHT)) %>%
  svyglm(
    design = .,
    formula = TOTALDOL ~ HousingUnitType,
    na.action = na.omit
  )

tidy(exp_unit_out)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
# Single-family detached units are most common
# There is a significant relationship between energy expenditure and housing unit type
- Using the RECS data, does temperature play a role in energy expenditure? Cooling degree days are a measure of how hot a place is. The variable CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0, while a day that averages 85°F would have CDD65=20 because it is 20 degrees warmer. For each day in the year, this is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F (18.3°C)29. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions.
temps_sqft_exp <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (TOTSQFT_EN + CDD65 + HDD65)^2,
    na.action = na.omit
  )

tidy(temps_sqft_exp)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
- Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures.
temps_sqft_exp_fit <- temps_sqft_exp %>%
  augment() %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         # extract the variance of the fitted value
         .fitted = as.numeric(.fitted))

temps_sqft_exp_fit %>%
  ggplot(aes(x = DOLLAREL, y = .fitted)) +
  geom_point() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red") +
  xlab("Actual expenditures") +
  ylab("Predicted expenditures") +
  theme_minimal()
temps_sqft_exp_fit %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  xlab("Predicted expenditure") +
  ylab("Residual value of expenditure") +
  theme_minimal()
7.5 Exercises
-- Early voting expanded in 202031. Build a logistic model predicting early voting in 2020 (
EarlyVote2020
) using age (Age
), education (Education
), and party identification (PartyID
). Include two-way interactions.
+- Early voting expanded in 202030. Using the ANES data, build a logistic model predicting early voting in 2020 (
EarlyVote2020
) using age (Age
), education (Education
), and party identification (PartyID
). Include two-way interactions.
-earlyvote_mod <- anes_des %>%
- filter(!is.na(EarlyVote2020)) %>%
- svyglm(
- design = .,
- formula = EarlyVote2020 ~ (Age + Education + PartyID) ^ 2 ,
- family = quasibinomial
- )
-
-tidy(earlyvote_mod) %>% arrange(p.value)
+earlyvote_mod <- anes_des %>%
+ filter(!is.na(EarlyVote2020)) %>%
+ svyglm(
+ design = .,
+ formula = EarlyVote2020 ~ (Age + Education + PartyID) ^ 2 ,
+ family = quasibinomial
+ )
+
+tidy(earlyvote_mod) %>% arrange(p.value)
## # A tibble: 46 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
@@ -1165,23 +4594,23 @@ 7.5 Exercises
-- Continuing from Exercise 1, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican.
+- Continuing from Exercise 4, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican.
-add_vote_dat <- anes_2020 %>%
- select(EarlyVote2020, Age, Education, PartyID) %>%
- rbind(tibble(
- EarlyVote2020 = NA,
- Age = 28,
- Education = "Graduate",
- PartyID = c("Strong democrat", "Strong republican")
- )) %>%
- tail(2)
-
-log_ex_2_out <- earlyvote_mod %>%
- augment(newdata = add_vote_dat, type.predict = "response") %>%
- mutate(.se.fit = sqrt(attr(.fitted, "var")),
- # extract the variance of the fitted value
- .fitted = as.numeric(.fitted))
+add_vote_dat <- anes_2020 %>%
+ select(EarlyVote2020, Age, Education, PartyID) %>%
+ rbind(tibble(
+ EarlyVote2020 = NA,
+ Age = 28,
+ Education = "Graduate",
+ PartyID = c("Strong democrat", "Strong republican")
+ )) %>%
+ tail(2)
+
+log_ex_2_out <- earlyvote_mod %>%
+ augment(newdata = add_vote_dat, type.predict = "response") %>%
+ mutate(.se.fit = sqrt(attr(.fitted, "var")),
+ # extract the variance of the fitted value
+ .fitted = as.numeric(.fitted))
@@ -1202,16 +4631,21 @@ References
———. 2023. Survey: Analysis of Complex Survey Samples. http://r-survey.r-forge.r-project.org/survey/.
+
+McCullagh, Peter, and John Ashworth Nelder. 1989. “Binary Data.” In Generalized Linear Models, 98–148. Springer.
+
+
+R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
+
Use help(formula)
or ?formula
in R or find the documentation online at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html↩︎
There is some debate about whether weights should be used in regression (Gelman 2007; Bollen et al. 2016). However, for the purposes of providing complete information on how to analyze complex survey data, this chapter will include weights.↩︎
-To change the reference level, reorder the factor before modeling using the function relevel()
from {stats} or using one of many factor ordering functions in {forcats} such as fct_relevel()
or fct_infreq()
↩︎
-Question: How often can you trust the federal government in Washington to do what is right?↩︎
-https://www.eia.gov/energyexplained/units-and-calculators/degree-days.php↩︎
-https://www.npr.org/2020/10/26/927803214/62-million-and-counting-americans-are-breaking-early-voting-records↩︎
+Question: How often can you trust the federal government in Washington to do what is right?↩︎
+https://www.eia.gov/energyexplained/units-and-calculators/degree-days.php↩︎
+https://www.npr.org/2020/10/26/927803214/62-million-and-counting-americans-are-breaking-early-voting-records↩︎
diff --git a/c08-communicating-results.html b/c08-communicating-results.html
index ee294e4a..80a146c9 100644
--- a/c08-communicating-results.html
+++ b/c08-communicating-results.html
@@ -23,7 +23,7 @@
-
+
@@ -300,13 +300,13 @@
- 7 Modeling
- Prerequisites
-- 7.1 Introduction
+- 7.1 Introduction
- 7.2 Analysis of Variance (ANOVA)
-- 7.3 Gaussian Linear Regression
+
- 7.3 Normal Linear Regression
- 7.3.1 Syntax
- 7.3.2 Examples
@@ -322,7 +322,7 @@
- 8 Communicating results
- Prerequisites
-- 8.1 Introduction
+- 8.1 Introduction
- 8.2 Describing results through text
- 8.2.1 Methodology
@@ -337,7 +337,7 @@
- 9 Reproducible research
-- 9.1 Introduction
+- 9.1 Introduction
- 9.2 Project-based workflows
- 9.3 Functions and packages
- 9.4 Version control with Git
@@ -359,7 +359,7 @@
- 10 Specifying sample designs and replicate weights in {srvyr}
- Prerequisites
-- 10.1 Introduction
+- 10.1 Introduction
- 10.2 Common sampling designs
- 10.2.1 Simple random sample without replacement
@@ -383,7 +383,7 @@
- 11 Missing data
- Prerequisites
-- 11.1 Introduction
+- 11.1 Introduction
- 11.2 Missing data mechanisms
- 11.3 Assessing missing data
@@ -401,7 +401,7 @@
- 13 National Crime Victimization Survey Vignette
- Prerequisites
-- 13.1 Introduction
+- 13.1 Introduction
- 13.2 Data Structure
- 13.3 Survey Notation
- 13.4 Data File Preparation
@@ -423,7 +423,7 @@
- 14 AmericasBarometer Vignette
- Prerequisites
-- 14.1 Introduction
+- 14.1 Introduction
- 14.2 Data structure
- 14.3 Preparing files
- 14.4 Survey design objects
@@ -497,29 +497,29 @@ Prerequisites
For this chapter, load the following packages:
-library(tidyverse)
-library(survey)
-library(srvyr)
-library(srvyrexploR)
-library(gt)
-library(gtsummary)
+library(tidyverse)
+library(survey)
+library(srvyr)
+library(srvyrexploR)
+library(gt)
+library(gtsummary)
We will be using data from ANES as described in Chapter 4. As a reminder, here is the code to create the design object to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).
-targetpop <- 231592693
-data(anes_2020)
-
-anes_adjwgt <- anes_2020 %>%
- mutate(Weight = Weight / sum(Weight) * targetpop)
-
-anes_des <- anes_adjwgt %>%
- as_survey_design(
- weights = Weight,
- strata = Stratum,
- ids = VarUnit,
- nest = TRUE
- )
+targetpop <- 231592693
+data(anes_2020)
+
+anes_adjwgt <- anes_2020 %>%
+ mutate(Weight = Weight / sum(Weight) * targetpop)
+
+anes_des <- anes_adjwgt %>%
+ as_survey_design(
+ weights = Weight,
+ strata = Stratum,
+ ids = VarUnit,
+ nest = TRUE
+ )
-
-8.1 Introduction
+
+8.1 Introduction
After finishing the analysis and modeling, we proceed to the important task of communicating the survey results. Our audience may range from seasoned researchers familiar with our survey data to newcomers encountering the information for the first time. We should aim to explain the methodology and analysis while presenting findings in an accessible way, and it is our responsibility to report information with care.
Before beginning any dissemination of results, consider questions such as:
@@ -572,12 +572,12 @@ 8.3.1 Tables
8.3.1.1 Transitioning {srvyr} output to a {gt} table
Let’s start by using some of the data we calculated earlier in this book. In Chapter 6, we looked at data on trust in government with the proportions calculated below:
-trust_gov <- anes_des %>%
- drop_na(TrustGovernment) %>%
- group_by(TrustGovernment) %>%
- summarize(trust_gov_p = survey_prop())
-
-trust_gov
+trust_gov <- anes_des %>%
+ drop_na(TrustGovernment) %>%
+ group_by(TrustGovernment) %>%
+ summarize(trust_gov_p = survey_prop())
+
+trust_gov
## # A tibble: 5 × 3
## TrustGovernment trust_gov_p trust_gov_p_se
## <fct> <dbl> <dbl>
@@ -589,33 +589,33 @@ 8.3.1.1 Transitioning {srvyr} out
The default output generated by R may work for initial viewing inside RStudio or when creating basic output in an R Markdown or Quarto document. However, when presenting these results in other publications, such as the print version of this book or with other formal dissemination modes, modifying the display can improve our reader’s experience.
Looking at the output from trust_gov
, a couple of improvements are obvious: (1) switching to percentages instead of proportions and (2) using the variable names as column headers. The {gt} package is a good tool for implementing better labeling and creating publishable tables. Let’s walk through some code as we make a few changes to improve the table’s usefulness.
First, we initiate the table with the gt()
function. Next, we use the rowname_col argument
to designate the TrustGovernment
column as the labels for each row (called the table “stub”). We apply the cols_label()
function to create informative column labels instead of variable names, and then the tab_spanner()
function to add a label across multiple columns. In this case, we label all columns except the stub with “Trust in Government, 2020”. We then format the proportions into percentages with the fmt_percent()
function and reduce the number of decimals shown with decimals = 1
. Finally, the tab_caption()
function adds a table title for the HTML version of the book. We can use the caption for cross-referencing in R Markdown, Quarto, and bookdown, as well as adding it to the list of tables in the book.
-trust_gov_gt <- trust_gov %>%
- gt(rowname_col = "TrustGovernment") %>%
- cols_label(trust_gov_p = "%",
- trust_gov_p_se = "s.e. (%)") %>%
- tab_spanner(label = "Trust in Government, 2020",
- columns = c(trust_gov_p, trust_gov_p_se)) %>%
- fmt_percent(decimals = 1)
-
-
-
-
@@ -1076,35 +1076,35 @@ 8.3.1.1 Transitioning {srvyr} out
We can add a few more enhancements, such as a title, a data source note, and a footnote with the question information, using the functions tab_header()
, tab_source_note()
, and tab_footnote()
. If having the percentage sign in both the header and the cells seems redundant, we can opt for fmt_number()
instead of fmt_percent()
and scale the number by 100 with scale_by = 100
.
-trust_gov_gt2 <- trust_gov_gt %>%
- tab_header("American voter's trust
- in the federal government, 2020") %>%
- tab_source_note("American National Election Studies, 2020") %>%
- tab_footnote(
- "Question text: How often can you trust the federal government
- in Washington to do what is right?"
- ) %>%
- fmt_number(scale_by = 100,
- decimals = 1)
-
-
-
-
@@ -1605,28 +1605,28 @@ Expanding tables using {gtsummary}anes_des_gtsum <- anes_des %>%
- tbl_svysummary(include = TrustGovernment,
- statistic = list(all_categorical() ~ "{p} ({p.std.error})"))
-
+anes_des_gtsum <- anes_des %>%
+ tbl_svysummary(include = TrustGovernment,
+ statistic = list(all_categorical() ~ "{p} ({p.std.error})"))
+
-
-
@@ -2084,33 +2084,33 @@ Expanding tables using {gtsummary}anes_des_gtsum2 <- anes_des %>%
- tbl_svysummary(
- include = TrustGovernment,
- statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
- missing = "no",
- digits = list(TrustGovernment ~ style_percent),
- label = list(TrustGovernment ~ "Trust in Government, 2020")
- )
-
-
-
-
@@ -2566,45 +2566,45 @@ Expanding tables using {gtsummary}8.3.1.1.
-anes_des_gtsum3 <- anes_des %>%
- tbl_svysummary(
- include = TrustGovernment,
- statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
- missing = "no",
- digits = list(TrustGovernment ~ style_percent),
- label = list(TrustGovernment ~ "Trust in Government, 2020")
- ) %>%
- modify_footnote(update = everything() ~ NA) %>%
- modify_header(label = " ",
- stat_0 = "% (s.e.)") %>%
- as_gt() %>%
- tab_header("American voter's trust
- in the federal government, 2020") %>%
- tab_source_note("American National Election Studies, 2020") %>%
- tab_footnote(
- "Question text: How often can you trust
- the federal government in Washington
- to do what is right?"
- )
-
-
-
-
@@ -3070,49 +3070,49 @@ Expanding tables using {gtsummary}anes_des_gtsum4 <- anes_des %>%
- tbl_svysummary(
- include = c(TrustGovernment, Age),
- statistic = list(
- all_categorical() ~ "{p} ({p.std.error})",
- all_continuous() ~ "{mean} ({mean.std.error})"
- ),
- missing = "no",
- digits = list(TrustGovernment ~ style_percent,
- Age ~ c(1, 2)),
- label = list(TrustGovernment ~ "Trust in Government, 2020")
- ) %>%
- modify_footnote(update = everything() ~ NA) %>%
- modify_header(label = " ",
- stat_0 = "% (s.e.)") %>%
- as_gt() %>%
- tab_header("American voter's trust in the federal government, 2020") %>%
- tab_source_note("American National Election Studies, 2020") %>%
- tab_footnote(
- "Question text: How often can you trust the federal government
- in Washington to do what is right?"
- ) %>%
- tab_caption("Example of gtsummary table with trust in government
- estimates and average age")
-
-
-
-
@@ -3578,50 +3578,50 @@ Expanding tables using {gtsummary}anes_des_gtsum5 <- anes_des %>%
- drop_na(VotedPres2020) %>%
- tbl_svysummary(
- include = TrustGovernment,
- statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
- missing = "no",
- digits = list(TrustGovernment ~ style_percent),
- label = list(TrustGovernment ~ "Trust in Government, 2020"),
- by = VotedPres2020
- ) %>%
- modify_footnote(update = everything() ~ NA) %>%
- modify_header(label = " ",
- stat_1 = "Voted",
- stat_2 = "Didn't vote") %>%
- as_gt() %>%
- tab_header(
- "American voter's trust
- in the federal government by whether they voted
- in the 2020 presidential election"
- ) %>%
- tab_source_note("American National Election Studies, 2020") %>%
- tab_footnote(
- "Question text: How often can you trust the federal government
- in Washington to do what is right?"
- )
-
-
-
-
@@ -4100,24 +4100,24 @@ 8.3.2 Charts and plotsanes_des_der <- anes_des %>%
- mutate(TrustGovernmentUsually = case_when(
- is.na(TrustGovernment) ~ NA,
- TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
- )) %>%
- drop_na(VotedPres2020_selection) %>%
- group_by(VotedPres2020_selection) %>%
- summarize(
- pct_trust = survey_mean(
- TrustGovernmentUsually,
- na.rm = TRUE,
- proportion = TRUE,
- vartype = "ci"
- ),
- .groups = "drop"
- )
-
-anes_des_der
+anes_des_der <- anes_des %>%
+ mutate(TrustGovernmentUsually = case_when(
+ is.na(TrustGovernment) ~ NA,
+ TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
+ )) %>%
+ drop_na(VotedPres2020_selection) %>%
+ group_by(VotedPres2020_selection) %>%
+ summarize(
+ pct_trust = survey_mean(
+ TrustGovernmentUsually,
+ na.rm = TRUE,
+ proportion = TRUE,
+ vartype = "ci"
+ ),
+ .groups = "drop"
+ )
+
+anes_des_der
## # A tibble: 3 × 4
## VotedPres2020_selection pct_trust pct_trust_low pct_trust_upp
## <fct> <dbl> <dbl> <dbl>
@@ -4125,12 +4125,12 @@ 8.3.2 Charts and plots8.1.
-p <- anes_des_der %>%
- ggplot(aes(x = VotedPres2020_selection,
- y = pct_trust)) +
- geom_bar(stat = "identity")
-
-p
+p <- anes_des_der %>%
+ ggplot(aes(x = VotedPres2020_selection,
+ y = pct_trust)) +
+ geom_bar(stat = "identity")
+
+p
8.3.2 Charts and plotspcolor <- anes_des_der %>%
- ggplot(aes(x = VotedPres2020_selection,
- y = pct_trust,
- fill = VotedPres2020_selection)) +
- geom_bar(stat = "identity")
-
-pcolor
+pcolor <- anes_des_der %>%
+ ggplot(aes(x = VotedPres2020_selection,
+ y = pct_trust,
+ fill = VotedPres2020_selection)) +
+ geom_bar(stat = "identity")
+
+pcolor
8.3.2 Charts and plotspcol_error <- anes_des_der %>%
- ggplot(aes(x = VotedPres2020_selection,
- y = pct_trust,
- fill = VotedPres2020_selection)) +
- geom_bar(stat = "identity") +
- geom_errorbar(aes(ymin = pct_trust_low,
- ymax = pct_trust_upp),
- width = .2)
-
-pcol_error
+pcol_error <- anes_des_der %>%
+ ggplot(aes(x = VotedPres2020_selection,
+ y = pct_trust,
+ fill = VotedPres2020_selection)) +
+ geom_bar(stat = "identity") +
+ geom_errorbar(aes(ymin = pct_trust_low,
+ ymax = pct_trust_upp),
+ width = .2)
+
+pcol_error
8.3.2 Charts and plots. We can specify specific colors for fill
using scale_fill_manual()
. Inside the function, we provide a vector of values corresponding to the colors in our plot. These values are hexadecimal (hex) color codes, denoted by a leading pound sign #
followed by six letters or numbers. The hex code #0b3954
used below is a dark blue. There are many tools online that help pick hex codes, such as htmlcolorcodes.com/.
-pfull <-
- anes_des_der %>%
- ggplot(aes(x = VotedPres2020_selection,
- y = pct_trust,
- fill = VotedPres2020_selection)) +
- geom_bar(stat = "identity") +
- geom_errorbar(aes(ymin = pct_trust_low,
- ymax = pct_trust_upp),
- width = .2) +
- scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
- xlab("Election choice (2020)") +
- ylab("Usually trust the government") +
- scale_y_continuous(labels = scales::percent) +
- guides(fill = "none") +
- labs(title = "Percent of voters who usually trust the government
- by chosen 2020 presidential candidate",
- caption = "Source: American National Election Studies, 2020")
-
-pfull
+pfull <-
+ anes_des_der %>%
+ ggplot(aes(x = VotedPres2020_selection,
+ y = pct_trust,
+ fill = VotedPres2020_selection)) +
+ geom_bar(stat = "identity") +
+ geom_errorbar(aes(ymin = pct_trust_low,
+ ymax = pct_trust_upp),
+ width = .2) +
+ scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
+ xlab("Election choice (2020)") +
+ ylab("Usually trust the government") +
+ scale_y_continuous(labels = scales::percent) +
+ guides(fill = "none") +
+ labs(title = "Percent of voters who usually trust the government
+ by chosen 2020 presidential candidate",
+ caption = "Source: American National Election Studies, 2020")
+
+pfull
@@ -492,8 +492,8 @@
Chapter 9 Reproducible research
-
-9.1 Introduction
+
+9.1 Introduction
Reproducing a data analysis’s results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment.
Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. For example, the Journal of Survey Statistics and Methodology (JSSAM) and Public Opinion Quarterly (POQ) require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work.
Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. The four main components that we should consider are:
@@ -529,8 +529,8 @@ 9.2 Project-based workflows
In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. The {here} package enables easy file referencing, and we can start by using the here::here()
function to build the path for loading or saving data. Below, we ask R to read the CSV file anes_2020.csv
in the project directory’s data
folder:
-
+
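The code chunk was stripped from this diff; a minimal sketch of the call described above, assuming the object is named anes and {readr}'s read_csv() is used, would be:
anes <- read_csv(here::here("data", "anes_2020.csv"))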
The combination of projects and the {here} package keeps all associated files in an organized manner. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues.
@@ -576,9 +576,9 @@ 9.9 Other tips for reproducibilit
9.9.1 Random number seeds
Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By “setting the seed,” we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.
In R, we can use the set.seed()
function to control the randomness in our code. Set a seed value by providing an integer to the function:
-
+
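The chunk itself is stripped here; a minimal sketch, using the seed value 999 and the five uniform draws referenced below, is:
set.seed(999)
runif(n = 5)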
The runif()
function generates five random numbers from a uniform distribution. Since the seed is set to 999
, rerunning the code from the set.seed() call onward will always produce the same sequence:
[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
The choice of the seed number is up to the analyst. For example, this could be the date (20240102
) or time of day (1056
) when the analysis was first conducted, a phone number (8675309
), or the first few numbers that come to mind (369
). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that set.seed()
should be used before random number generation. Run it once per program, and the seed will be applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded.
diff --git a/c10-specifying-sample-designs.html b/c10-specifying-sample-designs.html
index 38841895..aa174785 100644
--- a/c10-specifying-sample-designs.html
+++ b/c10-specifying-sample-designs.html
@@ -23,7 +23,7 @@
-
+
@@ -497,19 +497,19 @@ Prerequisites
For this chapter, load the following packages:
-
+
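The package-loading chunk was stripped from this diff. At a minimum, the examples in this chapter call functions from the following packages, so a plausible reconstruction is:
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)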
To help explain the different types of sample designs, this chapter will use the api
and scd
data that are included in the {survey} package:
-
+
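The chunk is not shown; since both example datasets ship with {survey}, a likely equivalent is:
data(api)
data(scd)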
This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data from the {srvyrexploR} package:
-
+
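This chunk is also stripped; assuming the datasets carry the names used later in the chapter (recs_2015 and recs_2020), the load step is likely:
data(recs_2015)
data(recs_2020)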
-
-10.1 Introduction
+
+10.1 Introduction
The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights when producing estimates. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated.
In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to Penn State (2019), Särndal, Swensson, and Wretman (2003), Wolter (2007), or Fuller (2011) (these are listed in order of increasing statistical rigorousness).
The general process for estimation in the {srvyr} package is to:
@@ -556,27 +556,27 @@ The math
The syntax
If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, in R, specify this design as:
-
+
where dat
is a tibble or data.frame with the survey data, and fpcvar
is a variable in the data indicating the sampling frame’s size (this variable will have the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and specify the design as:
-
+
If some post-survey adjustments were implemented and the weights are not all equal, specify the design as:
-
+
where wtvar
is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size.
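The three code blocks above were dropped from this diff. Sketches of the likely calls, using the placeholder names dat, fpcvar, and wtvar defined in the surrounding text, are:
# SRS with a known frame size (FPC)
srs_des <- dat %>%
  as_survey_design(fpc = fpcvar)
# SRS from a very large frame, where the FPC is not needed
srs_des <- dat %>%
  as_survey_design()
# SRS with post-survey weight adjustments
srs_des <- dat %>%
  as_survey_design(weights = wtvar, fpc = fpcvar)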
Example
The {survey} package in R provides some example datasets that we will use throughout this chapter. The documentation provides detailed information about the variables. One of the example datasets we will use is from the Academic Performance Index (API). The API was a program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we will use the apisrs
dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called apisrs_slim
, which sorts the data by the school district and school ID and subsets the data to only a few columns. The SRS sample data is illustrated below:
-apisrs_slim <-
- apisrs %>%
- as_tibble() %>%
- arrange(dnum, snum) %>%
- select(cds, dnum, snum, dname, sname, fpc, pw)
-
-apisrs_slim
+apisrs_slim <-
+ apisrs %>%
+ as_tibble() %>%
+ arrange(dnum, snum) %>%
+ select(cds, dnum, snum, dname, sname, fpc, pw)
+
+apisrs_slim
## # A tibble: 200 × 7
## cds dnum snum dname sname fpc pw
## <chr> <int> <dbl> <chr> <chr> <dbl> <dbl>
@@ -632,11 +632,11 @@ Exampleapisrs_des <- apisrs_slim %>%
- as_survey_design(weights = pw,
- fpc = fpc)
-
-apisrs_des
+
## Independent Sampling design
## Called via srvyr
## Sampling variables:
@@ -647,7 +647,7 @@ Example10.2.4), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object, and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset.
-
+
## Independent Sampling design
## Called via srvyr
## Probabilities:
@@ -688,28 +688,28 @@ The math
The syntax
If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we should specify this design as:
-
+
where dat
is a tibble or data.frame containing our survey data. This syntax is the same as an SRS design, except that a finite population correction (FPC) is not included. This is because when a sample is drawn with replacement, the pool of units available for selection is no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.
If some post-survey adjustments were implemented and the weights are not all equal, specify the design as:
-
+
where wtvar
is the variable for the weight on the data.
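Sketches of the two dropped chunks, again using the dat and wtvar placeholders from the text:
# SRSWR with no weighting adjustments (no FPC for with-replacement sampling)
srswr_des <- dat %>%
  as_survey_design()
# SRSWR with post-survey weight adjustments
srswr_des <- dat %>%
  as_survey_design(weights = wtvar)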
Example
The {survey} package does not include an example of SRSWR, so to illustrate this design we need to create an example. We use the api population data provided by the {survey} package apipop
and select a sample of 200 cases using the slice_sample()
function from the tidyverse. One of the arguments in the slice_sample()
function is replace
. If replace=TRUE
, then we are conducting an SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset apisrswr
.
-set.seed(409963)
-
-apisrswr <- apipop %>%
- as_tibble() %>%
- slice_sample(n = 200, replace = TRUE) %>%
- select(cds, dnum, snum, dname, sname) %>%
- mutate(
- weight = nrow(apipop)/200
- )
-
-head(apisrswr)
+set.seed(409963)
+
+apisrswr <- apipop %>%
+ as_tibble() %>%
+ slice_sample(n = 200, replace = TRUE) %>%
+ select(cds, dnum, snum, dname, sname) %>%
+ mutate(
+ weight = nrow(apipop)/200
+ )
+
+head(apisrswr)
## # A tibble: 6 × 6
## cds dnum snum dname sname weight
## <chr> <int> <dbl> <chr> <chr> <dbl>
@@ -720,10 +720,10 @@ Exampleapisrswr %>%
- group_by(cds) %>%
- filter(n()>1) %>%
- arrange(cds)
+
## # A tibble: 4 × 6
## # Groups: cds [2]
## cds dnum snum dname sname weight
@@ -733,10 +733,10 @@ Exampleapisrswr_des <- apisrswr %>%
- as_survey_design(weights = weight)
-
-apisrswr_des
+
## Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
@@ -745,7 +745,7 @@ Examplesummary(apisrswr_des)
+
## Independent Sampling design (with replacement)
## Called via srvyr
## Probabilities:
@@ -791,25 +791,25 @@ The math
The syntax
In addition to the fpc
and weights
arguments discussed in the types above, stratified designs require the addition of the strata
argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, specify the design as:
-
+
where fpcvar
is a variable on our data that indicates \(N_h\) for each row, and stratavar
is a variable indicating the stratum for each row. You can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable, if present, where wtvar
is a variable on our data with a numeric weight.
-
+
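Likely forms of the two dropped chunks, built from the fpcvar, stratavar, and wtvar placeholders described above:
# Stratified SRS with known stratum population sizes
stsrs_des <- dat %>%
  as_survey_design(fpc = fpcvar, strata = stratavar)
# Stratified design with weights (the FPC can be omitted if not applicable)
stsrs_des <- dat %>%
  as_survey_design(weights = wtvar, fpc = fpcvar, strata = stratavar)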
Example
In the example API data, apistrat
is a stratified random sample, stratified by school type (stype
) with three levels: E
for elementary school, M
for middle school, and H
for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:
-apistrat_slim <-
- apistrat %>%
- as_tibble() %>%
- arrange(dnum, snum) %>%
- select(cds, dnum, snum, dname, sname, stype, fpc, pw)
-
-apistrat_slim %>%
- count(stype, fpc)
+apistrat_slim <-
+ apistrat %>%
+ as_tibble() %>%
+ arrange(dnum, snum) %>%
+ select(cds, dnum, snum, dname, sname, stype, fpc, pw)
+
+apistrat_slim %>%
+ count(stype, fpc)
## # A tibble: 3 × 3
## stype fpc n
## <fct> <dbl> <int>
@@ -817,12 +817,12 @@ Exampleapistrat_des <- apistrat_slim %>%
- as_survey_design(strata = stype,
- weights = pw,
- fpc = fpc)
-
-apistrat_des
+apistrat_des <- apistrat_slim %>%
+ as_survey_design(strata = stype,
+ weights = pw,
+ fpc = fpc)
+
+apistrat_des
## Stratified Independent Sampling design
## Called via srvyr
## Sampling variables:
@@ -833,7 +833,7 @@ Examplesummary(apistrat_des)
+
## Stratified Independent Sampling design
## Called via srvyr
## Probabilities:
@@ -886,31 +886,31 @@ The math
The syntax
Clustered sampling designs require the addition of the ids
argument which specifies what variables are the cluster levels. To specify a two-stage clustered design without replacement, use the following syntax:
-
+
where PSU
and SSU
are the variables indicating the PSU and SSU identifiers, and A
and B
are the variables indicating the population sizes for each level (i.e., A
is the number of clusters, and B
is the number of units within each cluster). Note that A
will be the same for all records (within a stratum), and B
will be the same for all records within the same cluster.
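A sketch of the dropped chunk for a two-stage clustered design without replacement, using the wtvar, PSU, SSU, A, and B placeholders defined above:
clus2_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = c(PSU, SSU),
                   fpc = c(A, B))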
If clusters were sampled with replacement or from a very large population, a FPC is unnecessary. Additionally, only the first stage of selection is necessary regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see Särndal, Swensson, and Wretman (2003) or Wolter (2007) for a more in-depth discussion). Therefore, the syntax below will yield the same estimates in the end:
-clus2wra_des <- dat %>%
- as_survey_design(weights = wtvar,
- ids = c(PSU, SSU))
-
-clus2wrb_des <- dat %>%
- as_survey_design(weights = wtvar,
- ids = PSU)
+clus2wra_des <- dat %>%
+ as_survey_design(weights = wtvar,
+ ids = c(PSU, SSU))
+
+clus2wrb_des <- dat %>%
+ as_survey_design(weights = wtvar,
+ ids = PSU)
Note that there is one additional argument that is sometimes necessary which is nest = TRUE
. This option relabels cluster IDs to enforce nesting within strata. Sometimes, as an example, there may be a cluster 1
and a cluster 2
within each stratum but these are actually different clusters. This option indicates that the repeated use of numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error will be generated.
Example
The survey
package includes a two-stage cluster sample dataset, apiclus2
, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by dnum
, and schools are identified by snum
. The variable fpc1
indicates how many districts there are in California (A
), and fpc2
indicates how many schools were in a given district with at least 100 students (B
). The data has a row for each school. In the data printed below, there are 757 school districts, as indicated by fpc1
, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on as indicated by fpc2
. For illustration purposes, the object apiclus2_slim
has been created from apiclus2
, which subsets the data to only the necessary columns and sorts it.
-apiclus2_slim <-
- apiclus2 %>%
- as_tibble() %>%
- arrange(desc(dnum), snum) %>%
- select(cds, dnum, snum, fpc1, fpc2, pw)
-
-apiclus2_slim
+apiclus2_slim <-
+ apiclus2 %>%
+ as_tibble() %>%
+ arrange(desc(dnum), snum) %>%
+ select(cds, dnum, snum, fpc1, fpc2, pw)
+
+apiclus2_slim
## # A tibble: 126 × 6
## cds dnum snum fpc1 fpc2 pw
## <chr> <int> <dbl> <dbl> <int[1d]> <dbl>
@@ -926,12 +926,12 @@ Exampleapiclus2_des <- apiclus2_slim %>%
- as_survey_design(ids = c(dnum, snum),
- fpc = c(fpc1, fpc2),
- weights = pw)
-
-apiclus2_des
+apiclus2_des <- apiclus2_slim %>%
+ as_survey_design(ids = c(dnum, snum),
+ fpc = c(fpc1, fpc2),
+ weights = pw)
+
+apiclus2_des
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
@@ -942,7 +942,7 @@ Examplesummary(apiclus2_des)
+
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
@@ -962,16 +962,16 @@ 10.3 Combining sampling methodsA common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation will highlight the variables necessary to specify the design. This is often found in User’s Guides, methodology, analysis guides, or technical documentation (see Chapter 3 for more details).
Example
-For example, the 2017-2019 National Survey of Family Growth (NSFG)32 had a stratified multi-stage area probability sample:
+
For example, the 2017-2019 National Survey of Family Growth (NSFG)31 had a stratified multi-stage area probability sample:
1. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS.
2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection.
3. In the third stage, housing units were selected within the sampled neighborhoods.
4. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person’s age and sex.
The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is SEST
, the cluster variable is SECU
, and the weight variable is WGT2017_2019
. Thus, to specify this design in R, use the following syntax:
-
+
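A sketch of the dropped chunk, where nsfg_data is a hypothetical object holding the public use file and the variable names come from the documentation cited above:
nsfg_des <- nsfg_data %>%
  as_survey_design(ids = SECU,
                   strata = SEST,
                   weights = WGT2017_2019)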
@@ -1002,57 +1002,57 @@ The math
The syntax
-Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection33 functions in R.
+Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection32 functions in R.
To specify a BRR design, we need to specify the weight variable (weights
), the replicate weight variables (repweights
), the type of replicate weights as BRR (type = BRR
), and whether the mean squared error should be used (mse = TRUE
) or not (mse = FALSE
). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated WT1, WT2, …, WT20, we can use the following syntax (both are equivalent):
-brr_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights = all_of(str_c("WT", 1:20)),
- type = "BRR",
- mse = TRUE)
-
-brr_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights = num_range("WT", 1:20),
- type = "BRR",
- mse = TRUE)
-If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, …, REPWT20, the following syntax could be used (both are equivalent):
-brr_des <- dat %>%
- as_survey_rep(weights = WT,
- repweights = all_of(str_c("REPWT", 1:20)),
- type = "BRR",
- mse = TRUE)
-
-brr_des <- dat %>%
- as_survey_rep(weights = WT,
- repweights = starts_with("REPWT"),
- type = "BRR",
- mse = TRUE)
-If the replicate weight variables are in the file consecutively, the following syntax can also be used:
brr_des <- dat %>%
- as_survey_rep(weights = WT,
- repweights = REPWT1:REPWT20,
+ as_survey_rep(weights = WT0,
+ repweights = all_of(str_c("WT", 1:20)),
type = "BRR",
- mse = TRUE)
-Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter 3 for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option combined_weights = FALSE
should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:
+ mse = TRUE)
+
+brr_des <- dat %>%
+ as_survey_rep(weights = WT0,
+ repweights = num_range("WT", 1:20),
+ type = "BRR",
+ mse = TRUE)
+If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, …, REPWT20, the following syntax could be used (both are equivalent):
brr_des <- dat %>%
as_survey_rep(weights = WT,
- repweights = starts_with("REPWT"),
+ repweights = all_of(str_c("REPWT", 1:20)),
type = "BRR",
- combined_weights = FALSE,
- mse = TRUE)
+ mse = TRUE)
+
+brr_des <- dat %>%
+ as_survey_rep(weights = WT,
+ repweights = starts_with("REPWT"),
+ type = "BRR",
+ mse = TRUE)
+If the replicate weight variables are in the file consecutively, the following syntax can also be used:
+brr_des <- dat %>%
+ as_survey_rep(weights = WT,
+ repweights = REPWT1:REPWT20,
+ type = "BRR",
+ mse = TRUE)
+Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter 3 for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option combined_weights = FALSE
should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:
+brr_des <- dat %>%
+ as_survey_rep(weights = WT,
+ repweights = starts_with("REPWT"),
+ type = "BRR",
+ combined_weights = FALSE,
+ mse = TRUE)
Example
The {survey} package includes a data example from Section 12.2 of Levy and Lemeshow (2013). In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs), thus BRR weights are appropriate with 2 PSUs (stations) sampled in each stratum (ESA). In the code below, BRR weights are created as was done by Levy and Lemeshow (2013).
-scdbrr <- scd %>%
- as_tibble() %>%
- mutate(wt = 5 / 2,
- rep1 = 2 * c(1, 0, 1, 0, 1, 0),
- rep2 = 2 * c(1, 0, 0, 1, 0, 1),
- rep3 = 2 * c(0, 1, 1, 0, 0, 1),
- rep4 = 2 * c(0, 1, 0, 1, 1, 0))
-
-scdbrr
+scdbrr <- scd %>%
+ as_tibble() %>%
+ mutate(wt = 5 / 2,
+ rep1 = 2 * c(1, 0, 1, 0, 1, 0),
+ rep2 = 2 * c(1, 0, 0, 1, 0, 1),
+ rep3 = 2 * c(0, 1, 1, 0, 0, 1),
+ rep4 = 2 * c(0, 1, 0, 1, 1, 0))
+
+scdbrr
## # A tibble: 6 × 9
## ESA ambulance arrests alive wt rep1 rep2 rep3 rep4
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
@@ -1063,13 +1063,13 @@ Examplescdbrr_des <- scdbrr %>%
- as_survey_rep(type = "BRR",
- repweights = starts_with("rep"),
- combined_weights = FALSE,
- weight = wt)
-
-scdbrr_des
+scdbrr_des <- scdbrr %>%
+ as_survey_rep(type = "BRR",
+ repweights = starts_with("rep"),
+ combined_weights = FALSE,
+ weight = wt)
+
+scdbrr_des
## Call: Called via srvyr
## Balanced Repeated Replicates with 4 replicates.
## Sampling variables:
@@ -1078,7 +1078,7 @@ Examplesummary(scdbrr_des)
+
## Call: Called via srvyr
## Balanced Repeated Replicates with 4 replicates.
## Sampling variables:
@@ -1104,25 +1104,25 @@ The math
The syntax
The syntax is very similar for BRR and Fay’s BRR. To specify a Fay’s BRR design, we need to specify the weight variable (weights
), the replicate weight variables (repweights
), the type of replicate weights as Fay’s BRR (type = Fay
), whether the mean squared error should be used (mse = TRUE
) or not (mse = FALSE
), and Fay’s multiplier (rho
). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, …, WT20, and Fay’s multiplier is 0.3, use the following syntax:
-fay_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights = num_range("WT", 1:20),
- type = "Fay",
- mse = TRUE,
- rho = 0.3)
+fay_des <- dat %>%
+ as_survey_rep(weights = WT0,
+ repweights = num_range("WT", 1:20),
+ type = "Fay",
+ mse = TRUE,
+ rho = 0.3)
Example
The 2015 RECS (U.S. Energy Information Administration 2017) uses Fay’s BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay’s multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already pulled in the 2015 RECS data from the {srvyrexploR} package that provides data for this book. To specify the design for the recs_2015
data, use the following syntax:
-recs_2015_des <- recs_2015 %>%
- as_survey_rep(weights = NWEIGHT,
- repweights = BRRWT1:BRRWT96,
- type = "Fay",
- rho = 0.5,
- mse = TRUE,
- variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC))
-
-recs_2015_des
+recs_2015_des <- recs_2015 %>%
+ as_survey_rep(weights = NWEIGHT,
+ repweights = BRRWT1:BRRWT96,
+ type = "Fay",
+ rho = 0.5,
+ mse = TRUE,
+ variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC))
+
+recs_2015_des
## Call: Called via srvyr
## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances.
## Sampling variables:
@@ -1144,7 +1144,7 @@ Examplesummary(recs_2015_des)
+
## Call: Called via srvyr
## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances.
## Sampling variables:
@@ -1185,35 +1185,35 @@ The math
The syntax
To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax we need to specify the weight variable (weights
), the replicate weight variables (repweights
), the type of replicate weights as jackknife 1 (type = "JK1"
), n (type = "JKN"
), or 2 (type = "JK2"
), whether the mean squared error should be used (mse = TRUE
) or not (mse = FALSE
), and the multiplier (scale
). For example, if the survey is a jackknife 1 method with a multiplier of \(\alpha_r=(R-1)/R=19/20=0.95\), the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, …, WT20, use the following syntax:
-jk1_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights= num_range("WT", 1:20),
- type="JK1",
- mse=TRUE,
- scale=0.95)
+jk1_des <- dat %>%
+ as_survey_rep(weights = WT0,
+ repweights= num_range("WT", 1:20),
+ type="JK1",
+ mse=TRUE,
+ scale=0.95)
For a jackknife n method, we need to specify the multiplier for all replicates. In this case we use the rscales
argument to specify each one. The documentation will provide details on what the multipliers (\(\alpha_r\)) are, and they may be the same for all replicates. For example, consider a case where \(\alpha_r=0.1\) for all replicates and the dataset had WT0 for the main weight and had 20 replicate weights indicated as WT1, WT2, …, WT20. We specify the type as type = "JKN"
, and the multiplier as rscales=rep(0.1,20)
:
-jkn_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights= num_range("WT", 1:20),
- type="JKN",
- mse=TRUE,
- rscales=rep(0.1, 20))
+jkn_des <- dat %>%
+ as_survey_rep(weights = WT0,
+ repweights= num_range("WT", 1:20),
+ type="JKN",
+ mse=TRUE,
+ rscales=rep(0.1, 20))
Example
The 2020 RECS (U.S. Energy Information Administration 2023b) uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of \((R-1)/R=59/60\). On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already read in the RECS data and created a dataset called recs_2020
above in the prerequisites.
To specify this design, use the following syntax:
-recs_des <- recs_2020 %>%
- as_survey_rep(
- weights = NWEIGHT,
- repweights = NWEIGHT1:NWEIGHT60,
- type = "JK1",
- scale = 59/60,
- mse = TRUE,
- variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
- )
-
-recs_des
+recs_des <- recs_2020 %>%
+ as_survey_rep(
+ weights = NWEIGHT,
+ repweights = NWEIGHT1:NWEIGHT60,
+ type = "JK1",
+ scale = 59/60,
+ mse = TRUE,
+ variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
+ )
+
+recs_des
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
@@ -1232,7 +1232,7 @@ Examplesummary(recs_des)
+
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
@@ -1268,40 +1268,40 @@ The math
The syntax
To specify a bootstrap method, we need to specify the weight variable (weights
), the replicate weight variables (repweights
), the type of replicate weights as bootstrap (type = "bootstrap"
), whether the mean squared error should be used (mse = TRUE
) or not (mse = FALSE
), and the multiplier (scale
). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated WT1, WT2, …, WT20, and a multiplier of \(\alpha=.02\), use the following syntax:
-bs_des <- dat %>%
- as_survey_rep(weights = WT0,
- repweights= num_range("WT", 1:20),
- type="bootstrap",
- mse=TRUE,
- scale=.02)
+bs_des <- dat %>%
+ as_survey_rep(weights = WT0,
+ repweights= num_range("WT", 1:20),
+ type="bootstrap",
+ mse=TRUE,
+ scale=.02)
Example
-Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.34
-apiclus1_slim <-
- apiclus1 %>%
- as_tibble() %>%
- arrange(dnum) %>%
- select(cds, dnum, fpc, pw)
-
-set.seed(662152)
-apibw <-
- bootweights(psu = apiclus1_slim$dnum,
- strata = rep(1, nrow(apiclus1_slim)),
- fpc = apiclus1_slim$fpc,
- replicates = 50)
-
-bwmata <-
- apibw$repweights$weights[apibw$repweights$index,] * apiclus1_slim$pw
-
-apiclus1_slim <- bwmata %>%
- as.data.frame() %>%
- set_names(str_c("pw", 1:50)) %>%
- cbind(apiclus1_slim) %>%
- as_tibble() %>%
- select(cds, dnum, fpc, pw, everything())
-
-apiclus1_slim
+Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.33
+apiclus1_slim <-
+ apiclus1 %>%
+ as_tibble() %>%
+ arrange(dnum) %>%
+ select(cds, dnum, fpc, pw)
+
+set.seed(662152)
+apibw <-
+ bootweights(psu = apiclus1_slim$dnum,
+ strata = rep(1, nrow(apiclus1_slim)),
+ fpc = apiclus1_slim$fpc,
+ replicates = 50)
+
+bwmata <-
+ apibw$repweights$weights[apibw$repweights$index,] * apiclus1_slim$pw
+
+apiclus1_slim <- bwmata %>%
+ as.data.frame() %>%
+ set_names(str_c("pw", 1:50)) %>%
+ cbind(apiclus1_slim) %>%
+ as_tibble() %>%
+ select(cds, dnum, fpc, pw, everything())
+
+apiclus1_slim
## # A tibble: 183 × 54
## cds dnum fpc pw pw1 pw2 pw3 pw4 pw5 pw6 pw7
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
@@ -1325,14 +1325,14 @@ Example10.1), but now additionally includes bootstrap weights pw1
, …, pw50
. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale (\(\alpha\)). For this example we created,
\[\alpha=\frac{M}{(M-1)(R-1)}=\frac{15}{(15-1)*(50-1)}=0.02186589\]
where \(M\) is the average number of PSUs per stratum and \(R\) is the number of replicates. There is only 1 stratum and the number of clusters/PSUs is 15, so \(M=15\).
-api1_bs_des <- apiclus1_slim %>%
- as_survey_rep(weights = pw,
- repweights = pw1:pw50,
- type = "bootstrap",
- scale = 0.02186589,
- mse = TRUE)
-
-api1_bs_des
+api1_bs_des <- apiclus1_slim %>%
+ as_survey_rep(weights = pw,
+ repweights = pw1:pw50,
+ type = "bootstrap",
+ scale = 0.02186589,
+ mse = TRUE)
+
+api1_bs_des
## Call: Called via srvyr
## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables:
@@ -1354,7 +1354,7 @@ Examplesummary(api1_bs_des)
+
## Call: Called via srvyr
## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables:
@@ -1391,20 +1391,20 @@ Example10.5 Exercises
-- The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description35. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). How would you specify the design using {srvyr} using either
as_survey_design
or as_survey_rep()
?
+- The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description34. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). How would you specify the design using {srvyr} using either
as_survey_design
or as_survey_rep()
?
-nhis_adult_des <- nhis_adult_data %>%
- as_survey_design(ids=PPSU,
- strata=PSTRAT,
- nest=TRUE,
- weights=WTFA_A)
+nhis_adult_des <- nhis_adult_data %>%
+ as_survey_design(ids=PPSU,
+ strata=PSTRAT,
+ nest=TRUE,
+ weights=WTFA_A)
-- The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook36 provides examples of setting up syntax in SAS and Stata but not R. How would you specify the design in R?
+- The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook35 provides examples of setting up syntax in SAS and Stata but not R. How would you specify the design in R?
-
+
@@ -1449,12 +1449,12 @@ References
-
-2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf↩︎
-dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html↩︎
-We provide the code here for you to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book. We recommend you reference Wolter (2007) for more information on creating bootstrap weights.↩︎
-2022 National Health Interview Survey (NHIS) Survey Description: https://www.cdc.gov/nchs/nhis/2022nhis.htm↩︎
-2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf↩︎
+
+2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf↩︎
+dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html↩︎
+We provide the code here for you to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book. We recommend you reference Wolter (2007) for more information on creating bootstrap weights.↩︎
+2022 National Health Interview Survey (NHIS) Survey Description: https://www.cdc.gov/nchs/nhis/2022nhis.htm↩︎
+2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf↩︎
diff --git a/c11-missing-data.html b/c11-missing-data.html
index 19b2c167..8e226f70 100644
--- a/c11-missing-data.html
+++ b/c11-missing-data.html
@@ -23,7 +23,7 @@
-
+
@@ -300,13 +300,13 @@
- 7 Modeling
- Prerequisites
-- 7.1 Introduction
+- 7.1 Introduction
- 7.2 Analysis of Variance (ANOVA)
-- 7.3 Gaussian Linear Regression
+- 7.3 Normal Linear Regression
- 7.3.1 Syntax
- 7.3.2 Examples
@@ -322,7 +322,7 @@
- 8 Communicating results
- Prerequisites
-- 8.1 Introduction
+- 8.1 Introduction
- 8.2 Describing results through text
- 8.2.1 Methodology
@@ -337,7 +337,7 @@
- 9 Reproducible research
-- 9.1 Introduction
+- 9.1 Introduction
- 9.2 Project-based workflows
- 9.3 Functions and packages
- 9.4 Version control with Git
@@ -359,7 +359,7 @@
- 10 Specifying sample designs and replicate weights in {srvyr}
- Prerequisites
-- 10.1 Introduction
+- 10.1 Introduction
- 10.2 Common sampling designs
- 10.2.1 Simple random sample without replacement
@@ -383,7 +383,7 @@
- 11 Missing data
- Prerequisites
-- 11.1 Introduction
+- 11.1 Introduction
- 11.2 Missing data mechanisms
- 11.3 Assessing missing data
@@ -401,7 +401,7 @@
- 13 National Crime Victimization Survey Vignette
- Prerequisites
-- 13.1 Introduction
+- 13.1 Introduction
- 13.2 Data Structure
- 13.3 Survey Notation
- 13.4 Data File Preparation
@@ -423,7 +423,7 @@
- 14 AmericasBarometer Vignette
- Prerequisites
-- 14.1 Introduction
+- 14.1 Introduction
- 14.2 Data structure
- 14.3 Preparing files
- 14.4 Survey design objects
@@ -497,41 +497,41 @@ Prerequisites
For this chapter, load the following packages:
-library(tidyverse)
-library(survey)
-library(srvyr)
-library(srvyrexploR)
-library(naniar)
-library(haven)
-library(gt)
+library(tidyverse)
+library(survey)
+library(srvyr)
+library(srvyrexploR)
+library(naniar)
+library(haven)
+library(gt)
We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 3 for more information).
-targetpop <- 231592693
-data(anes_2020)
-
-anes_adjwgt <- anes_2020 %>%
- mutate(Weight = Weight / sum(Weight) * targetpop)
-
-anes_des <- anes_adjwgt %>%
- as_survey_design(
- weights = Weight,
- strata = Stratum,
- ids = VarUnit,
- nest = TRUE
- )
+targetpop <- 231592693
+data(anes_2020)
+
+anes_adjwgt <- anes_2020 %>%
+ mutate(Weight = Weight / sum(Weight) * targetpop)
+
+anes_des <- anes_adjwgt %>%
+ as_survey_design(
+ weights = Weight,
+ strata = Stratum,
+ ids = VarUnit,
+ nest = TRUE
+ )
For RECS, details are included in the RECS documentation and Chapter 10.
-
+
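As a rough sketch of what that design object might look like (the replicate-weight variable names, jackknife type, and scale shown here are assumptions drawn from the RECS documentation; recs_des is the object name used later in this chapter):
data(recs_2020)

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,               # final analytic weight
    repweights = NWEIGHT1:NWEIGHT60, # jackknife replicate weights (assumed names)
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )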
-
-11.1 Introduction
+
+11.1 Introduction
Missing data in surveys refers to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data is important to consider and account for, as it can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data is present. Understanding this complex topic can help ensure accurate reporting of survey results and can provide insight into potential changes to the survey design for the future.
@@ -553,9 +553,9 @@ 11.3 Assessing missing data
11.3.1 Summarize data
A very rudimentary first exploration is to use the summary()
function to summarize the data, which will illuminate NA
values in the data. Let’s look at a few analytic variables on the ANES 2020 data using summary()
:
-
+
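A minimal sketch of the kind of call that produces output like the one below; the exact set of variables selected is an assumption:
anes_2020 %>%
  select(V202051, V202066, V202072, VotedPres2020,
         VotedPres2020_selection, EarlyVote2020) %>%  # assumed variable selection
  summary()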
## V202051 V202066 V202072 VotedPres2020
## Min. :-9.000 Min. :-9.0 Min. :-9.000 Yes :5952
## 1st Qu.:-1.000 1st Qu.: 4.0 1st Qu.: 1.000 No : 77
@@ -578,8 +578,8 @@ 11.3.1 Summarize data
-anes_2020 %>%
-  count(VotedPres2020,V202072)
+
## # A tibble: 5 × 3
## VotedPres2020 V202072 n
## <fct> <dbl+lbl> <int>
@@ -593,14 +593,14 @@ 11.3.1 Summarize data
11.3.2 Visualization of missing data
It can be challenging to look at tables for every variable, and it may be more efficient instead to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful in exploring missing data visually. It provides quick graphics to explore the missingness patterns in the data. We can use the vis_miss()
function available in both {visdat} and {naniar} packages to view the amount of missing data by variable.
-anes_2020_derived<-anes_2020 %>%
- select(!starts_with("V2"),-CaseID,-InterviewMode,-Weight,-Stratum,-VarUnit)
-
-anes_2020_derived %>%
- vis_miss(cluster= TRUE, show_perc = FALSE) +
- scale_fill_manual(values = book_colors[c(3,1)],
- labels = c("Present","Missing"),
- name = "")
+anes_2020_derived<-anes_2020 %>%
+ select(!starts_with("V2"),-CaseID,-InterviewMode,-Weight,-Stratum,-VarUnit)
+
+anes_2020_derived %>%
+ vis_miss(cluster= TRUE, show_perc = FALSE) +
+ scale_fill_manual(values = book_colors[c(3,1)],
+ labels = c("Present","Missing"),
+ name = "")
From this visualization, we can start to get a picture of which questions may be related to each other in terms of missing data. Even if we did not have the informative variable names, we would be able to deduce that VotedPres2020
, VotedPres2020_selection
, and EarlyVote2020
are likely related since their missing data patterns are similar.
Additionally, we can look at VotedPres2016_selection
and see that there is a lot of missing data in that variable. Most likely this is due to a skip pattern, and we can look at further graphics to see how it might be related to other variables. The {naniar} package has multiple visualization functions that can help dive deeper such as the gg_miss_fct()
function which looks at missing data for all variables by levels of another variable.
-anes_2020_derived %>%
- gg_miss_fct(VotedPres2016) +
- scale_fill_gradientn(
- guide = "colorbar",
- name = "% Miss",
- colors = book_colors[c(3, 2, 1)]
- ) +
- ylab("Variable") +
- xlab("Voted for President in 2016")
+anes_2020_derived %>%
+ gg_miss_fct(VotedPres2016) +
+ scale_fill_gradientn(
+ guide = "colorbar",
+ name = "% Miss",
+ colors = book_colors[c(3, 2, 1)]
+ ) +
+ ylab("Variable") +
+ xlab("Voted for President in 2016")
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
-
-
+
+
In this case, we can see that if they did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%). Additionally, we can see with this graphic that there is more missing data across all questions if they did not provide an answer to VotedPres2016
.
There are other graphics that work well with numeric data. For example, in the RECS 2020 data we can plot two continuous variables and the missing data associated with them to see if there are any patterns to the missingness. To do this, we can use the bind_shadow() function from the {naniar} package. This creates a nabular (a combination of “na” and “tabular”), which features the original columns followed by the same number of columns with a specific NA format. These NA columns indicate whether the value in the original data is missing or not. The example printed below shows that most levels of HeatingBehavior are marked as not missing (!NA) in the shadow variable HeatingBehavior_NA, while the values missing in HeatingBehavior are also marked as missing (NA) in HeatingBehavior_NA.
-
+
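A sketch of how the nabular and the counts shown below might be produced with {naniar}; the dataset name recs_2020 is an assumption, while recs_2020_shadow matches the object used later in this section:
# Create the nabular: original columns plus one shadow column per variable
recs_2020_shadow <- recs_2020 %>%
  bind_shadow()

ncol(recs_2020)         # columns in the original data
ncol(recs_2020_shadow)  # twice as many once the shadow columns are added

# Compare a variable with its shadow column
recs_2020_shadow %>%
  count(HeatingBehavior, HeatingBehavior_NA)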
## [1] 118
-
+
## [1] 236
-
+
## # A tibble: 7 × 3
## HeatingBehavior HeatingBehavior_NA n
## <fct> <fct> <int>
@@ -648,17 +648,17 @@ 11.3.2 Visualization of missing d
## 6 Other !NA 46
## 7 <NA> NA 751
We can then use these new variables to plot the missing data alongside the actual data. For example, let’s plot a histogram of the total electric bill, grouped by whether heating behavior is missing or present.
-recs_2020_shadow %>%
- filter(TOTALDOL < 5000) %>%
- ggplot(aes(x=TOTALDOL,fill=HeatingBehavior_NA)) +
- geom_histogram() +
- scale_fill_manual(values = book_colors[c(3, 1)],
- labels = c("Present", "Missing"),
- name = "Heating Behavior") +
- theme_minimal() +
- xlab("Total Energy Cost (Truncated at $5000)") +
- ylab("Number of Households") +
- labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data")
+recs_2020_shadow %>%
+ filter(TOTALDOL < 5000) %>%
+ ggplot(aes(x=TOTALDOL,fill=HeatingBehavior_NA)) +
+ geom_histogram() +
+ scale_fill_manual(values = book_colors[c(3, 1)],
+ labels = c("Present", "Missing"),
+ name = "Heating Behavior") +
+ theme_minimal() +
+ xlab("Total Energy Cost (Truncated at $5000)") +
+ ylab("Number of Households") +
+ labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
@@ -687,15 +687,15 @@ 11.4.1 Recoding missing data
When we created the derived variables for use in this book, we coded all negative values as NA
and proceeded to analyze the data. In most cases this is an appropriate approach as long as you filter the data appropriately to account for skip patterns (see next section). However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two NA
values, one that indicated the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the nabular
format to incorporate these with the recode_shadow()
function.
-anes_2020_shadow<-anes_2020 %>%
- select(starts_with("V2")) %>%
- mutate(across(everything(),~case_when(.x < -1 ~ NA,
- TRUE~.x))) %>%
- bind_shadow() %>%
- recode_shadow(V201103 = .where(V201103==-1~"skip"))
-
-anes_2020_shadow %>%
- count(V201103,V201103_NA)
+anes_2020_shadow<-anes_2020 %>%
+ select(starts_with("V2")) %>%
+ mutate(across(everything(),~case_when(.x < -1 ~ NA,
+ TRUE~.x))) %>%
+ bind_shadow() %>%
+ recode_shadow(V201103 = .where(V201103==-1~"skip"))
+
+anes_2020_shadow %>%
+ count(V201103,V201103_NA)
## # A tibble: 5 × 3
## V201103 V201103_NA n
## <dbl+lbl> <fct> <int>
@@ -710,14 +710,14 @@ 11.4.1 Recoding missing data
11.4.2 Accounting for skip patterns
When questions are skipped by design in a survey, it is meaningful that the data is later missing. For example, the RECS survey asks people how they control the heat in their home in the winter (HeatingBehavior). This question is only asked of those who have heat in their home (SpaceHeatingUsed). If no heating equipment is used, the value of HeatingBehavior is missing. One has several choices when analyzing these data, which include 1) only including those with a valid value of HeatingBehavior and specifying the universe as those with heat, or 2) including those who do not have heat. It is important to specify what population an analysis generalizes to.
Here is example code where we only include those with a valid value of HeatingBehavior
(choice 1). Note that we use the design object (recs_des
) then filter to those that are not missing on HeatingBehavior
.
-heat_cntl_1 <- recs_des %>%
- filter(!is.na(HeatingBehavior)) %>%
- group_by(HeatingBehavior) %>%
- summarize(
- p=survey_prop()
- )
-
-heat_cntl_1
+heat_cntl_1 <- recs_des %>%
+ filter(!is.na(HeatingBehavior)) %>%
+ group_by(HeatingBehavior) %>%
+ summarize(
+ p=survey_prop()
+ )
+
+heat_cntl_1
## # A tibble: 6 × 3
## HeatingBehavior p p_se
## <fct> <dbl> <dbl>
@@ -728,13 +728,13 @@ 11.4.2 Accounting for skip patter
## 5 No control 0.0333 1.70e-3
## 6 Other 0.00208 3.59e-4
Here is example code where we include those that do not have heat (choice 2). To help understand what we are looking at we have included the output to show both variables SpaceHeatingUsed
and HeatingBehavior
.
-heat_cntl_2 <- recs_des %>%
- group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>%
- summarize(
- p=survey_prop()
- )
-
-heat_cntl_2
+heat_cntl_2 <- recs_des %>%
+ group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>%
+ summarize(
+ p=survey_prop()
+ )
+
+heat_cntl_2
## # A tibble: 7 × 4
## SpaceHeatingUsed HeatingBehavior p p_se
## <lgl> <fct> <dbl> <dbl>
@@ -747,24 +747,24 @@ 11.4.2 Accounting for skip patter
## 7 TRUE Other 0.00198 3.41e-4
If we ran the first analysis, we would say that 16.8% of households with heat use a programmable or smart thermostat for the heating of their home, while if we used the results from the second analysis, we would say that 16% of all households use a programmable or smart thermostat for the heating of their home. The distinction between the two statements lies in the universe: households with heat versus all households. Skip patterns often change the universe that we are talking about and need to be carefully examined.
Filtering to the correct universe is important when handling these types of missing data. The nabular
we created above can also help with this. If we have NA_skip
values in the shadow, we can make sure that we filter out all of these values and only include the relevant missing values. To do this with survey data, we could first create the nabular
, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let’s use the nabular
we created above for ANES 2020 (anes_2020_shadow
) to create the design object.
-anes_adjwgt_shadow <- anes_2020_shadow %>%
- mutate(V200010b = V200010b/sum(V200010b)*targetpop)
-
-anes_des_shadow <- anes_adjwgt_shadow %>%
- as_survey_design(
- weights = V200010b,
- strata = V200010d,
- ids = V200010c,
- nest = TRUE
- )
+anes_adjwgt_shadow <- anes_2020_shadow %>%
+ mutate(V200010b = V200010b/sum(V200010b)*targetpop)
+
+anes_des_shadow <- anes_adjwgt_shadow %>%
+ as_survey_design(
+ weights = V200010b,
+ strata = V200010d,
+ ids = V200010c,
+ nest = TRUE
+ )
Then we can use this design object to look at the percent of the population that voted for each candidate in 2016 (V201103
). First, let’s look at the percentages without removing any cases:
-pres16_select1<-anes_des_shadow %>%
- group_by(V201103) %>%
- summarize(
- All_Missing=survey_prop()
- )
-
-pres16_select1
+pres16_select1<-anes_des_shadow %>%
+ group_by(V201103) %>%
+ summarize(
+ All_Missing=survey_prop()
+ )
+
+pres16_select1
## # A tibble: 5 × 3
## V201103 All_Missing All_Missing_se
## <dbl+lbl> <dbl> <dbl>
@@ -774,14 +774,14 @@ 11.4.2 Accounting for skip patter
## 4 5 [5. Other {SPECIFY}] 0.0409 0.00230
## 5 NA 0.00627 0.00121
Next, we will look at the percentages removing only those that were missing due to skip patterns (i.e., they did not receive this question).
-pres16_select2<-anes_des_shadow %>%
- filter(V201103_NA!="NA_skip") %>%
- group_by(V201103) %>%
- summarize(
- No_Skip_Missing=survey_prop()
- )
-
-pres16_select2
+pres16_select2<-anes_des_shadow %>%
+ filter(V201103_NA!="NA_skip") %>%
+ group_by(V201103) %>%
+ summarize(
+ No_Skip_Missing=survey_prop()
+ )
+
+pres16_select2
## # A tibble: 4 × 3
## V201103 No_Skip_Missing No_Skip_Missing_se
## <dbl+lbl> <dbl> <dbl>
@@ -790,14 +790,14 @@ 11.4.2 Accounting for skip patter
## 3 5 [5. Other {SPECIFY}] 0.0606 0.00330
## 4 NA 0.00928 0.00178
Finally, we will look at the percentages removing all missing values both due to skip patterns and due to those who refused to answer the question.
-pres16_select3<-anes_des_shadow %>%
- filter(V201103_NA=="!NA") %>%
- group_by(V201103) %>%
- summarize(
- No_Missing=survey_prop()
- )
-
-pres16_select3
+pres16_select3<-anes_des_shadow %>%
+ filter(V201103_NA=="!NA") %>%
+ group_by(V201103) %>%
+ summarize(
+ No_Missing=survey_prop()
+ )
+
+pres16_select3
## # A tibble: 3 × 3
## V201103 No_Missing No_Missing_se
## <dbl+lbl> <dbl> <dbl>
diff --git a/c12-pitfalls.html b/c12-pitfalls.html
index b4fa1be0..427ccd25 100644
--- a/c12-pitfalls.html
+++ b/c12-pitfalls.html
@@ -23,7 +23,7 @@
-
+
diff --git a/c13-ncvs-vignette.html b/c13-ncvs-vignette.html
index ef52ec52..d9a19cff 100644
--- a/c13-ncvs-vignette.html
+++ b/c13-ncvs-vignette.html
@@ -23,7 +23,7 @@
-
+
@@ -497,21 +497,21 @@ Prerequisites
For this chapter, load the following packages:
-
+
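A sketch of the package load, assuming the vignette relies on the packages that appear in the code below (the exact list is an assumption):
library(tidyverse)    # dplyr, purrr, forcats, ggplot2, etc.
library(survey)
library(srvyr)
library(srvyrexploR)  # provides the NCVS example data
library(gt)           # used for the publication-ready table at the end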
We will use data from the United States National Crime Victimization Survey (NCVS). Here is the code to read in the three datasets from the {srvyrexploR} package:
-
+
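A sketch of loading the three NCVS files; the dataset names match those used in the preparation code below, and loading them with data() from {srvyrexploR} is an assumption:
data(ncvs_2021_household)
data(ncvs_2021_person)
data(ncvs_2021_incident)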
-
-13.1 Introduction
+
+13.1 Introduction
The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters.
The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 (Bureau of Justice Statistics 2017). The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move.
-NCVS data is publicly available and distributed by Inter-university Consortium for Political and Social Research (ICPSR)37, with data going back to 1992. The vignette in this book will include data from 2021 (United States. Bureau of Justice Statistics 2022). The NCVS data structure is complicated, and the User’s Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). This vignette will adapt those examples for R.
+NCVS data is publicly available and distributed by the Inter-university Consortium for Political and Social Research (ICPSR)36, with data going back to 1992. The vignette in this book will include data from 2021 (United States. Bureau of Justice Statistics 2022). The NCVS data structure is complicated, and the User’s Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). This vignette will adapt those examples for R.
13.2 Data Structure
@@ -523,7 +523,7 @@ 13.2 Data Structure
+
We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in the {srvyrexploR} package with a subset of columns, but you can download the complete files at ICPSR37.
13.3 Survey Notation
@@ -549,7 +549,7 @@ 13.3 Survey Notation\[ \hat{p}_{A_a,D} =\frac{\sum_{ijkl \in A_a, D} v_{ijkl}}{\sum_{ijkl \in D} v_{ijkl}}.\]
The numerator is the number of incidents with a particular characteristic in a domain, and the denominator is the number of incidents in a domain.
-- Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population39. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime \(C\) in domain \(D\) is
+- Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population38. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime \(C\) in domain \(D\) is
\[\hat{VR}_{C,D}= \frac{\sum_{ijkl \in C,D} v_{ijkl}}{\sum_{ijk \in D} w_{ijk}}\times 1000\]
where \(w_{ijk}\) is the person weight (WGTPERCY
) or household weight (WGTHHCY
) for personal and household crimes, respectively. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. Notice that the weights in the numerator and denominator are different - this is important, and in the syntax and examples below, we will discuss how to make an estimate that involves two weights.
@@ -659,21 +659,21 @@ 13.4.1 Preparing Files for Estima
We want to create four variables to indicate if an incident is a series crime. First, we create a variable called series using V4017
, V4018
, and V4019
where an incident is considered a series crime if there are 6 or more incidents (V4017
), the incidents are similar in detail (V4018
), or there is not enough detail to distinguish the incidents (V4019
). Next, we top-code the number of incidents (V4016
) by creating a variable n10v4016
which is set to 10 if V4016 > 10
. Finally, we create the series weight using our new top-coded variable and the existing weight.
-inc_series <- ncvs_2021_incident %>%
- mutate(
- series = case_when(V4017 %in% c(1, 8) ~ 1,
- V4018 %in% c(2, 8) ~ 1,
- V4019 %in% c(1, 8) ~ 1,
- TRUE ~ 2
- ),
- n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_,
- V4016 > 10 ~ 10,
- TRUE ~ V4016),
- serieswgt = case_when(series == 2 & is.na(n10v4016) ~ 6,
- series == 2 ~ n10v4016,
- TRUE ~ 1),
- NEWWGT = WGTVICCY * serieswgt
- )
+inc_series <- ncvs_2021_incident %>%
+ mutate(
+ series = case_when(V4017 %in% c(1, 8) ~ 1,
+ V4018 %in% c(2, 8) ~ 1,
+ V4019 %in% c(1, 8) ~ 1,
+ TRUE ~ 2
+ ),
+ n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_,
+ V4016 > 10 ~ 10,
+ TRUE ~ V4016),
+ serieswgt = case_when(series == 2 & is.na(n10v4016) ~ 6,
+ series == 2 ~ n10v4016,
+ TRUE ~ 1),
+ NEWWGT = WGTVICCY * serieswgt
+ )
The next step in preparing the files for estimation is to create indicators on the victimization file for characteristics of interest. Almost all BJS publications limit the analysis to records where the victimization occurred in the United States, where V4022
is not equal to 1, and we will do this for all estimates as well. A brief codebook of variables for this task is located in Table 13.2
TABLE 13.2: Codebook for incident variables - crime type indicators and characteristics
@@ -1067,32 +1067,32 @@ 13.4.1 Preparing Files for Estima
- Variable: AAST_Other
-inc_ind <- inc_series %>%
- filter(V4022 != 1) %>%
- mutate(
- WeapCat = case_when(
- is.na(V4049) ~ NA_character_,
- V4049 == 2 ~ "NoWeap",
- V4049 == 3 ~ "UnkWeapUse",
- V4050 == 3 ~ "Other",
- V4051 == 1 | V4052 == 1 | V4050 == 7 ~ "Firearm",
- V4053 == 1 | V4054 == 1 ~ "Knife",
- TRUE ~ "Other"
- ),
- V4529_num = parse_number(as.character(V4529)),
- ReportPolice = V4399 == 1,
- Property = V4529_num >= 31,
- Violent = V4529_num <= 20,
- Property_ReportPolice = Property & ReportPolice,
- Violent_ReportPolice = Violent & ReportPolice,
- AAST = V4529_num %in% 11:13,
- AAST_NoWeap = AAST & WeapCat == "NoWeap",
- AAST_Firearm = AAST & WeapCat == "Firearm",
- AAST_Knife = AAST & WeapCat == "Knife",
- AAST_Other = AAST & WeapCat == "Other"
- )
+inc_ind <- inc_series %>%
+ filter(V4022 != 1) %>%
+ mutate(
+ WeapCat = case_when(
+ is.na(V4049) ~ NA_character_,
+ V4049 == 2 ~ "NoWeap",
+ V4049 == 3 ~ "UnkWeapUse",
+ V4050 == 3 ~ "Other",
+ V4051 == 1 | V4052 == 1 | V4050 == 7 ~ "Firearm",
+ V4053 == 1 | V4054 == 1 ~ "Knife",
+ TRUE ~ "Other"
+ ),
+ V4529_num = parse_number(as.character(V4529)),
+ ReportPolice = V4399 == 1,
+ Property = V4529_num >= 31,
+ Violent = V4529_num <= 20,
+ Property_ReportPolice = Property & ReportPolice,
+ Violent_ReportPolice = Violent & ReportPolice,
+ AAST = V4529_num %in% 11:13,
+ AAST_NoWeap = AAST & WeapCat == "NoWeap",
+ AAST_Firearm = AAST & WeapCat == "Firearm",
+ AAST_Knife = AAST & WeapCat == "Knife",
+ AAST_Other = AAST & WeapCat == "Other"
+ )
This is a good point to pause to look at the output of crosswalks between an original variable and a derived one to check that the logic was programmed correctly and that everything ends up in the expected category.
-
+
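A sketch of the kind of crosswalk checks that produce the output below; the exact calls are assumptions:
# Before and after the filter on V4022 (victimizations outside the United States)
inc_series %>%
  count(V4022)

inc_ind %>%
  count(V4022)

# Crosswalk of the derived weapon category against the original variables
inc_ind %>%
  count(WeapCat, V4049, V4050, V4051, V4052, V4053, V4054)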
## # A tibble: 6 × 2
## V4022 n
## <fct> <int>
@@ -1102,7 +1102,7 @@ 13.4.1 Preparing Files for Estima
## 4 4 1143
## 5 5 39
## 6 8 4
-
+
## # A tibble: 5 × 2
## V4022 n
## <fct> <int>
@@ -1111,8 +1111,8 @@ 13.4.1 Preparing Files for Estima
## 3 4 1143
## 4 5 39
## 5 8 4
-
+
## # A tibble: 13 × 8
## WeapCat V4049 V4050 V4051 V4052 V4053 V4054 n
## <chr> <fct> <fct> <fct> <fct> <fct> <fct> <int>
@@ -1129,7 +1129,7 @@ 13.4.1 Preparing Files for Estima
## 11 Other 1 3 0 0 0 0 26
## 12 UnkWeapUse 3 <NA> <NA> <NA> <NA> <NA> 519
## 13 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 6228
-
+
## # A tibble: 34 × 5
## V4529 Property Violent AAST n
## <fct> <lgl> <lgl> <lgl> <int>
@@ -1167,7 +1167,7 @@ 13.4.1 Preparing Files for Estima
## 32 57 TRUE FALSE FALSE 1420
## 33 58 TRUE FALSE FALSE 798
## 34 59 TRUE FALSE FALSE 395
-
+
## # A tibble: 4 × 3
## ReportPolice V4399 n
## <lgl> <fct> <int>
@@ -1175,13 +1175,13 @@ 13.4.1 Preparing Files for Estima
## 2 FALSE 3 103
## 3 FALSE 8 12
## 4 TRUE 1 3163
-
+
## # A tibble: 11 × 7
## AAST WeapCat AAST_NoWeap AAST_Firearm AAST_Knife AAST_Other n
## <lgl> <chr> <lgl> <lgl> <lgl> <lgl> <int>
@@ -1197,43 +1197,43 @@ 13.4.1 Preparing Files for Estima
## 10 TRUE Other FALSE FALSE FALSE TRUE 146
## 11 TRUE UnkWeapUse FALSE FALSE FALSE FALSE 3
After creating indicators of victimization types and characteristics, the file is summarized, and crimes are summed across persons or households by YEARQ.
Property crimes (i.e., crimes committed against households, such as household burglary or motor vehicle theft) are summed across households, and personal crimes (i.e., crimes committed against an individual, such as assault, robbery, and personal theft) are summed across persons. The indicators are summed using the serieswgt
, and the variable WGTVICCY
needs to be retained for later analysis.
-inc_hh_sums <-
- inc_ind %>%
- filter(V4529_num > 23) %>% # restrict to household crimes
- group_by(YEARQ, IDHH) %>%
- summarize(WGTVICCY = WGTVICCY[1],
- across(starts_with("Property"),
- ~ sum(. * serieswgt),
- .names = "{.col}"),
- .groups = "drop")
-
-inc_pers_sums <-
- inc_ind %>%
- filter(V4529_num <= 23) %>% # restrict to person crimes
- group_by(YEARQ, IDHH, IDPER) %>%
- summarize(WGTVICCY = WGTVICCY[1],
- across(c(starts_with("Violent"), starts_with("AAST")),
- ~ sum(. * serieswgt),
- .names = "{.col}"),
- .groups = "drop")
+inc_hh_sums <-
+ inc_ind %>%
+ filter(V4529_num > 23) %>% # restrict to household crimes
+ group_by(YEARQ, IDHH) %>%
+ summarize(WGTVICCY = WGTVICCY[1],
+ across(starts_with("Property"),
+ ~ sum(. * serieswgt),
+ .names = "{.col}"),
+ .groups = "drop")
+
+inc_pers_sums <-
+ inc_ind %>%
+ filter(V4529_num <= 23) %>% # restrict to person crimes
+ group_by(YEARQ, IDHH, IDPER) %>%
+ summarize(WGTVICCY = WGTVICCY[1],
+ across(c(starts_with("Violent"), starts_with("AAST")),
+ ~ sum(. * serieswgt),
+ .names = "{.col}"),
+ .groups = "drop")
Now, we merge the victimization summary files into the appropriate files. For any record on the household or person file that is not on the victimization file, the victimization counts are set to 0 after merging. In this step, we will also create the victimization adjustment factor. See 2.2.4 in the User’s Guide for details of why this adjustment is created (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus (2015)). It is calculated as follows:
\[ A_{ijk}=\frac{v_{ijk}}{w_{ijk}}\]
where \(w_{ijk}\) is the person weight (WGTPERCY
) for personal crimes or the household weight (WGTHHCY
) for household crimes, and \(v_{ijk}\) is the victimization weight (WGTVICCY
) for household \(i\), respondent \(j\), in reporting period \(k\). The adjustment factor is set to 0 if no incidents are reported.
-# Set up a list of 0s for each crime type/characteristic to replace NA's
-hh_z_list <- rep(0, ncol(inc_hh_sums) - 3) %>% as.list() %>%
- setNames(names(inc_hh_sums)[-(1:3)])
-pers_z_list <- rep(0, ncol(inc_pers_sums) - 4) %>% as.list() %>%
- setNames(names(inc_pers_sums)[-(1:4)])
-
-hh_vsum <- ncvs_2021_household %>%
- full_join(inc_hh_sums, by = c("YEARQ", "IDHH")) %>%
- replace_na(hh_z_list) %>%
- mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTHHCY))
-
-pers_vsum <- ncvs_2021_person %>%
- full_join(inc_pers_sums, by = c("YEARQ", "IDHH", "IDPER")) %>%
- replace_na(pers_z_list) %>%
- mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTPERCY))
+# Set up a list of 0s for each crime type/characteristic to replace NA's
+hh_z_list <- rep(0, ncol(inc_hh_sums) - 3) %>% as.list() %>%
+ setNames(names(inc_hh_sums)[-(1:3)])
+pers_z_list <- rep(0, ncol(inc_pers_sums) - 4) %>% as.list() %>%
+ setNames(names(inc_pers_sums)[-(1:4)])
+
+hh_vsum <- ncvs_2021_household %>%
+ full_join(inc_hh_sums, by = c("YEARQ", "IDHH")) %>%
+ replace_na(hh_z_list) %>%
+ mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTHHCY))
+
+pers_vsum <- ncvs_2021_person %>%
+ full_join(inc_pers_sums, by = c("YEARQ", "IDHH", "IDPER")) %>%
+ replace_na(pers_z_list) %>%
+ mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTPERCY))
-hh_vsum_der <- hh_vsum %>%
- mutate(
- Tenure = factor(case_when(V2015 == 1 ~ "Owned",
- !is.na(V2015) ~ "Rented"),
- levels = c("Owned", "Rented")),
- Urbanicity = factor(case_when(V2143 == 1 ~ "Urban",
- V2143 == 2 ~ "Suburban",
- V2143 == 3 ~ "Rural"),
- levels = c("Urban", "Suburban", "Rural")),
- SC214A_num = as.numeric(as.character(SC214A)),
- Income = case_when(SC214A_num <= 8 ~ "Less than $25,000",
- SC214A_num <= 12 ~ "$25,000-49,999",
- SC214A_num <= 15 ~ "$50,000-99,999",
- SC214A_num <= 17 ~ "$100,000-199,999",
- SC214A_num <= 18 ~ "$200,000 or more"),
- Income = fct_reorder(Income, SC214A_num, .na_rm = FALSE),
- PlaceSize = case_match(as.numeric(as.character(V2126B)),
- 0 ~ "Not in a place",
- 13 ~ "Under 10,000",
- 16 ~ "10,000-49,999",
- 17 ~ "50,000-99,999",
- 18 ~ "100,000-249,999",
- 19 ~ "250,000-499,999",
- 20 ~ "500,000-999,999",
- c(21, 22, 23) ~ "1,000,000 or more"),
- PlaceSize = fct_reorder(PlaceSize, as.numeric(V2126B)),
- Region = case_match(as.numeric(V2127B),
- 1 ~ "Northeast",
- 2 ~ "Midwest",
- 3 ~ "South",
- 4 ~ "West"),
- Region = fct_reorder(Region, as.numeric(V2127B))
- )
+hh_vsum_der <- hh_vsum %>%
+ mutate(
+ Tenure = factor(case_when(V2015 == 1 ~ "Owned",
+ !is.na(V2015) ~ "Rented"),
+ levels = c("Owned", "Rented")),
+ Urbanicity = factor(case_when(V2143 == 1 ~ "Urban",
+ V2143 == 2 ~ "Suburban",
+ V2143 == 3 ~ "Rural"),
+ levels = c("Urban", "Suburban", "Rural")),
+ SC214A_num = as.numeric(as.character(SC214A)),
+ Income = case_when(SC214A_num <= 8 ~ "Less than $25,000",
+ SC214A_num <= 12 ~ "$25,000-49,999",
+ SC214A_num <= 15 ~ "$50,000-99,999",
+ SC214A_num <= 17 ~ "$100,000-199,999",
+ SC214A_num <= 18 ~ "$200,000 or more"),
+ Income = fct_reorder(Income, SC214A_num, .na_rm = FALSE),
+ PlaceSize = case_match(as.numeric(as.character(V2126B)),
+ 0 ~ "Not in a place",
+ 13 ~ "Under 10,000",
+ 16 ~ "10,000-49,999",
+ 17 ~ "50,000-99,999",
+ 18 ~ "100,000-249,999",
+ 19 ~ "250,000-499,999",
+ 20 ~ "500,000-999,999",
+ c(21, 22, 23) ~ "1,000,000 or more"),
+ PlaceSize = fct_reorder(PlaceSize, as.numeric(V2126B)),
+ Region = case_match(as.numeric(V2127B),
+ 1 ~ "Northeast",
+ 2 ~ "Midwest",
+ 3 ~ "South",
+ 4 ~ "West"),
+ Region = fct_reorder(Region, as.numeric(V2127B))
+ )
As before, we want to check to make sure the recoded variables we create match the existing data as expected.
-
+
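For example, the checks producing the output below could look like this (a sketch; the exact calls are assumptions):
hh_vsum_der %>%
  count(Tenure, V2015)

hh_vsum_der %>%
  count(Urbanicity, V2143)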
## # A tibble: 4 × 3
## Tenure V2015 n
## <fct> <fct> <int>
@@ -1518,14 +1518,14 @@ 13.4.2.1 Household Variables
-
+
## # A tibble: 3 × 3
## Urbanicity V2143 n
## <fct> <fct> <int>
## 1 Urban 1 26878
## 2 Suburban 2 173491
## 3 Rural 3 56091
-
+
## # A tibble: 18 × 3
## Income SC214A n
## <fct> <fct> <int>
@@ -1547,7 +1547,7 @@ 13.4.2.1 Household Variables
-
+
## # A tibble: 10 × 3
## PlaceSize V2126B n
## <fct> <fct> <int>
@@ -1561,7 +1561,7 @@ 13.4.2.1 Household Variables
-
+
## # A tibble: 4 × 3
## Region V2127B n
## <fct> <fct> <int>
@@ -1766,49 +1766,49 @@ 13.4.2.2 Person Variables
-# Set label for usage later
-NHOPI <- "Native Hawaiian or Other Pacific Islander"
-
-pers_vsum_der <- pers_vsum %>%
- mutate(
- Sex = factor(case_when(V3018 == 1 ~ "Male",
- V3018 == 2 ~ "Female")),
- RaceHispOrigin = factor(case_when(V3024 == 1 ~ "Hispanic",
- V3023A == 1 ~ "White",
- V3023A == 2 ~ "Black",
- V3023A == 4 ~ "Asian",
- V3023A == 5 ~ NHOPI,
- TRUE ~ "Other"),
- levels = c("White", "Black", "Hispanic",
- "Asian", NHOPI, "Other")),
- V3014_num = as.numeric(as.character(V3014)),
- AgeGroup = case_when(V3014_num <= 17 ~ "12-17",
- V3014_num <= 24 ~ "18-24",
- V3014_num <= 34 ~ "25-34",
- V3014_num <= 49 ~ "35-49",
- V3014_num <= 64 ~ "50-64",
- V3014_num <= 90 ~ "65 or older"),
- AgeGroup = fct_reorder(AgeGroup, V3014_num),
- MaritalStatus = factor(case_when(V3015 == 1 ~ "Married",
- V3015 == 2 ~ "Widowed",
- V3015 == 3 ~ "Divorced",
- V3015 == 4 ~ "Separated",
- V3015 == 5 ~ "Never married"),
- levels = c("Never married", "Married",
- "Widowed","Divorced",
- "Separated"))
- ) %>%
- left_join(hh_vsum_der %>% select(YEARQ, IDHH,
- V2117, V2118, Tenure:Region),
- by = c("YEARQ", "IDHH"))
+# Set label for usage later
+NHOPI <- "Native Hawaiian or Other Pacific Islander"
+
+pers_vsum_der <- pers_vsum %>%
+ mutate(
+ Sex = factor(case_when(V3018 == 1 ~ "Male",
+ V3018 == 2 ~ "Female")),
+ RaceHispOrigin = factor(case_when(V3024 == 1 ~ "Hispanic",
+ V3023A == 1 ~ "White",
+ V3023A == 2 ~ "Black",
+ V3023A == 4 ~ "Asian",
+ V3023A == 5 ~ NHOPI,
+ TRUE ~ "Other"),
+ levels = c("White", "Black", "Hispanic",
+ "Asian", NHOPI, "Other")),
+ V3014_num = as.numeric(as.character(V3014)),
+ AgeGroup = case_when(V3014_num <= 17 ~ "12-17",
+ V3014_num <= 24 ~ "18-24",
+ V3014_num <= 34 ~ "25-34",
+ V3014_num <= 49 ~ "35-49",
+ V3014_num <= 64 ~ "50-64",
+ V3014_num <= 90 ~ "65 or older"),
+ AgeGroup = fct_reorder(AgeGroup, V3014_num),
+ MaritalStatus = factor(case_when(V3015 == 1 ~ "Married",
+ V3015 == 2 ~ "Widowed",
+ V3015 == 3 ~ "Divorced",
+ V3015 == 4 ~ "Separated",
+ V3015 == 5 ~ "Never married"),
+ levels = c("Never married", "Married",
+ "Widowed","Divorced",
+ "Separated"))
+ ) %>%
+ left_join(hh_vsum_der %>% select(YEARQ, IDHH,
+ V2117, V2118, Tenure:Region),
+ by = c("YEARQ", "IDHH"))
As before, we want to check to make sure the recoded variables we create match the existing data as expected.
-
+
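A sketch of one such check, matching the output below (the exact call is an assumption):
pers_vsum_der %>%
  count(Sex, V3018)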
## # A tibble: 2 × 3
## Sex V3018 n
## <fct> <fct> <int>
## 1 Female 2 150956
## 2 Male 1 140922
-
+
## # A tibble: 11 × 3
## RaceHispOrigin V3024 n
## <fct> <fct> <int>
@@ -1823,10 +1823,10 @@ 13.4.2.2 Person Variables
-pers_vsum_der %>%
- filter(RaceHispOrigin != "Hispanic" |
- is.na(RaceHispOrigin)) %>%
- count(RaceHispOrigin, V3023A)
+pers_vsum_der %>%
+ filter(RaceHispOrigin != "Hispanic" |
+ is.na(RaceHispOrigin)) %>%
+ count(RaceHispOrigin, V3023A)
## # A tibble: 20 × 3
## RaceHispOrigin V3023A n
## <fct> <fct> <int>
@@ -1850,10 +1850,10 @@ 13.4.2.2 Person Variables
-pers_vsum_der %>% group_by(AgeGroup) %>%
- summarize(minAge = min(V3014),
- maxAge = max(V3014),
- .groups = "drop")
+pers_vsum_der %>% group_by(AgeGroup) %>%
+ summarize(minAge = min(V3014),
+ maxAge = max(V3014),
+ .groups = "drop")
## # A tibble: 6 × 3
## AgeGroup minAge maxAge
## <fct> <dbl> <dbl>
@@ -1863,7 +1863,7 @@ 13.4.2.2 Person Variables
-
+
## # A tibble: 6 × 3
## MaritalStatus V3015 n
## <fct> <fct> <int>
@@ -1874,36 +1874,36 @@ 13.4.2.2 Person Variables
We then create tibbles that contain only the variables we need, which makes the analyses easier.
-hh_vsum_slim <- hh_vsum_der %>%
- select(YEARQ:V2118,
- WGTVICCY:ADJINC_WT,
- Tenure,
- Urbanicity,
- Income,
- PlaceSize,
- Region)
-
-pers_vsum_slim <- pers_vsum_der %>%
- select(YEARQ:WGTPERCY, WGTVICCY:ADJINC_WT, Sex:Region)
+hh_vsum_slim <- hh_vsum_der %>%
+ select(YEARQ:V2118,
+ WGTVICCY:ADJINC_WT,
+ Tenure,
+ Urbanicity,
+ Income,
+ PlaceSize,
+ Region)
+
+pers_vsum_slim <- pers_vsum_der %>%
+ select(YEARQ:WGTPERCY, WGTVICCY:ADJINC_WT, Sex:Region)
To calculate estimates about types of crime, such as what percentage of violent crimes are reported to the police, we must use the incident file. The incident file is not guaranteed to have every pseudostratum and half-sample code, so dummy records are created and appended before estimation. Finally, we merge demographic variables onto the incident tibble.
-dummy_records <- hh_vsum_slim %>%
- distinct(V2117, V2118) %>%
- mutate(Dummy = 1,
- WGTVICCY = 1,
- NEWWGT = 1)
-
-inc_analysis <- inc_ind %>%
- mutate(Dummy = 0) %>%
- left_join(select(pers_vsum_slim, YEARQ, IDHH, IDPER, Sex:Region),
- by = c("YEARQ", "IDHH", "IDPER")) %>%
- bind_rows(dummy_records) %>%
- select(YEARQ:IDPER,
- WGTVICCY,
- NEWWGT,
- V4529,
- WeapCat,
- ReportPolice,
- Property:Region)
+dummy_records <- hh_vsum_slim %>%
+ distinct(V2117, V2118) %>%
+ mutate(Dummy = 1,
+ WGTVICCY = 1,
+ NEWWGT = 1)
+
+inc_analysis <- inc_ind %>%
+ mutate(Dummy = 0) %>%
+ left_join(select(pers_vsum_slim, YEARQ, IDHH, IDPER, Sex:Region),
+ by = c("YEARQ", "IDHH", "IDPER")) %>%
+ bind_rows(dummy_records) %>%
+ select(YEARQ:IDPER,
+ WGTVICCY,
+ NEWWGT,
+ V4529,
+ WeapCat,
+ ReportPolice,
+ Property:Region)
The tibbles hh_vsum_slim
, pers_vsum_slim
, and inc_analysis
can now be used to create design objects and calculate crime rate estimates.
@@ -1911,29 +1911,29 @@ 13.4.2.2 Person Variables
13.5 Survey Design Objects
All the data prep above is necessary to prepare the data for survey analysis. At this point, we can create the design objects and finally begin analysis. We will create three design objects for different types of analysis as they depend on which type of estimate we are creating. For the incident data, the weight of analysis is NEWWGT
, which we constructed previously. The household and person-level data use WGTHHCY
and WGTPERCY
, respectively. For all analyses, V2117
is the strata variable, and V2118
is the cluster/PSU variable for analysis.
-inc_des <- inc_analysis %>%
- as_survey(
- weight = NEWWGT,
- strata = V2117,
- ids = V2118,
- nest = TRUE
- )
-
-hh_des <- hh_vsum_slim %>%
- as_survey(
- weight = WGTHHCY,
- strata = V2117,
- ids = V2118,
- nest = TRUE
- )
-
-pers_des <- pers_vsum_slim %>%
- as_survey(
- weight = WGTPERCY,
- strata = V2117,
- ids = V2118,
- nest = TRUE
- )
+inc_des <- inc_analysis %>%
+ as_survey(
+ weight = NEWWGT,
+ strata = V2117,
+ ids = V2118,
+ nest = TRUE
+ )
+
+hh_des <- hh_vsum_slim %>%
+ as_survey(
+ weight = WGTHHCY,
+ strata = V2117,
+ ids = V2118,
+ nest = TRUE
+ )
+
+pers_des <- pers_vsum_slim %>%
+ as_survey(
+ weight = WGTPERCY,
+ strata = V2117,
+ ids = V2118,
+ nest = TRUE
+ )
13.6 Calculating Estimates
@@ -1947,29 +1947,29 @@ 13.6 Calculating Estimates
13.6.1 Estimation 1: Victimization Totals
There are two ways to calculate victimization totals. Using the incident design object (inc_des
) is the most straightforward method, but the person (pers_des
) and household (hh_des
) design objects can be used as well if the adjustment factor (ADJINC_WT
) is incorporated. In the example below, the total number of property and violent victimizations is first calculated using the incident file and then using the household and person design objects. The incident file is smaller, and thus, estimation is faster using that file, but the estimates will be the same as illustrated below:
-vt1 <- inc_des %>%
- summarize(Property_Vzn = survey_total(Property, na.rm = TRUE),
- Violent_Vzn = survey_total(Violent, na.rm = TRUE))
-
-vt2a <- hh_des %>%
- summarize(Property_Vzn = survey_total(Property * ADJINC_WT,
- na.rm = TRUE))
-
-vt2b <- pers_des %>%
- summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT,
- na.rm = TRUE))
-
-vt1
+vt1 <- inc_des %>%
+ summarize(Property_Vzn = survey_total(Property, na.rm = TRUE),
+ Violent_Vzn = survey_total(Violent, na.rm = TRUE))
+
+vt2a <- hh_des %>%
+ summarize(Property_Vzn = survey_total(Property * ADJINC_WT,
+ na.rm = TRUE))
+
+vt2b <- pers_des %>%
+ summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT,
+ na.rm = TRUE))
+
+vt1
## # A tibble: 1 × 4
## Property_Vzn Property_Vzn_se Violent_Vzn Violent_Vzn_se
## <dbl> <dbl> <dbl> <dbl>
## 1 11682056. 263844. 4598306. 198115.
-
+
## # A tibble: 1 × 2
## Property_Vzn Property_Vzn_se
## <dbl> <dbl>
## 1 11682056. 263844.
-
+
## # A tibble: 1 × 2
## Violent_Vzn Violent_Vzn_se
## <dbl> <dbl>
@@ -1980,21 +1980,21 @@ 13.6.1 Estimation 1: Victimizatio
13.6.2 Estimation 2: Victimization Proportions
Victimization proportions are proportions describing features of a victimization. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (inc_des
).
For example, we could be interested in the percentage of property victimizations reported to the police as shown in the following code with an estimate, the standard error, and 95% confidence interval:
-prop1 <- inc_des %>%
- filter(Property) %>%
- summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100)
-
-prop1
+prop1 <- inc_des %>%
+ filter(Property) %>%
+ summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100)
+
+prop1
## # A tibble: 1 × 4
## Pct Pct_se Pct_low Pct_upp
## <dbl> <dbl> <dbl> <dbl>
## 1 30.8 0.798 29.2 32.4
Or, the percentage of violent victimizations that are in urban areas:
-prop2 <- inc_des %>%
- filter(Violent) %>%
- summarize(Pct = survey_mean(Urbanicity=="Urban", na.rm = TRUE) * 100)
-
-prop2
+prop2 <- inc_des %>%
+ filter(Violent) %>%
+ summarize(Pct = survey_mean(Urbanicity=="Urban", na.rm = TRUE) * 100)
+
+prop2
## # A tibble: 1 × 2
## Pct Pct_se
## <dbl> <dbl>
@@ -2005,34 +2005,34 @@ 13.6.2 Estimation 2: Victimizatio
13.6.3 Estimation 3: Victimization Rates
Victimization rates measure the number of victimizations per population. They are not an estimate of the proportion of households or persons who are victimized, which is a prevalence rate described in section 13.6.4. Victimization rates are estimated using the household (hh_des
) or person (pers_des
) design objects depending on the type of crime, and the adjustment factor (ADJINC_WT
) must be incorporated. We return to the example of property and violent victimizations used in the example for victimization totals (section 13.6.1). In the following example, the property victimization totals are calculated as above, as well as the property victimization rate (using survey_mean()
) and the population size using survey_total()
.
As mentioned in the introduction, victimization rates use the incident weight in the numerator and the person or household weight in the denominator. This is accomplished by calculating the rates with the weight adjustment (ADJINC_WT
) multiplied by the estimate of interest. Let’s look at an example of property victimization.
-vr_prop <- hh_des %>%
- summarize(
- Property_Vzn = survey_total(Property * ADJINC_WT,
- na.rm = TRUE),
- Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
- na.rm = TRUE),
- PopSize = survey_total(1, vartype = NULL)
- )
-
-vr_prop
+vr_prop <- hh_des %>%
+ summarize(
+ Property_Vzn = survey_total(Property * ADJINC_WT,
+ na.rm = TRUE),
+ Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
+ na.rm = TRUE),
+ PopSize = survey_total(1, vartype = NULL)
+ )
+
+vr_prop
## # A tibble: 1 × 5
## Property_Vzn Property_Vzn_se Property_Rate Property_Rate_se PopSize
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11682056. 263844. 90.3 1.95 129319232.
In the output above, we see the estimate for the property victimization rate in 2021 was 90.3 per 1,000 households, which is consistent with calculating it directly as the number of victimizations per 1,000 households, as demonstrated in the next chunk:
-
+
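A sketch of that manual check; dropping the standard-error columns with select() is an assumption made to match the four columns printed below:
vr_prop %>%
  select(-ends_with("_se")) %>%  # drop the standard-error columns (assumed step)
  mutate(Property_Rate_manual = Property_Vzn / PopSize * 1000)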
## # A tibble: 1 × 4
## Property_Vzn Property_Rate PopSize Property_Rate_manual
## <dbl> <dbl> <dbl> <dbl>
## 1 11682056. 90.3 129319232. 90.3
Victimization rates can also be calculated for particular characteristics of the victimization. In the following example, we calculate the rate of aggravated assault with no weapon, with a firearm, with a knife, and with another weapon.
-pers_des %>%
- summarize(across(
- starts_with("AAST_"),
- ~ survey_mean(. * ADJINC_WT * 1000, na.rm = TRUE)
- ))
+pers_des %>%
+ summarize(across(
+ starts_with("AAST_"),
+ ~ survey_mean(. * ADJINC_WT * 1000, na.rm = TRUE)
+ ))
## # A tibble: 1 × 8
## AAST_NoWeap AAST_NoWeap_se AAST_Firearm AAST_Firearm_se AAST_Knife
## <dbl> <dbl> <dbl> <dbl> <dbl>
@@ -2040,85 +2040,85 @@ 13.6.3 Estimation 3: Victimizatio
## # ℹ 3 more variables: AAST_Knife_se <dbl>, AAST_Other <dbl>,
## # AAST_Other_se <dbl>
A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a group_by()
statement for each categorization separately. Thus, we make a function to do this and then use map()
from the {purrr} package (part of the tidyverse) to loop through the variables. This function takes a demographic variable as its input (byvar
) and calculates the violent and aggravated assault victimization rate for each level. It then creates some columns with the variable, the level of each variable, and a numeric version of the variable (LevelNum
) for sorting later. The function is run across multiple variables using map()
and then stacks the results into a single output using bind_rows()
.
-pers_est_by <- function(byvar) {
- pers_des %>%
- rename(Level := {{byvar}}) %>%
- filter(!is.na(Level)) %>%
- group_by(Level) %>%
- summarize(
- Violent = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE),
- AAST = survey_mean(AAST * ADJINC_WT * 1000, na.rm = TRUE)
- ) %>%
- mutate(
- Variable = byvar,
- LevelNum = as.numeric(Level),
- Level = as.character(Level)
- ) %>%
- select(Variable, Level, LevelNum, everything())
-}
-
-pers_est_df <-
- c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>%
- map(pers_est_by) %>%
- bind_rows()
+pers_est_by <- function(byvar) {
+ pers_des %>%
+ rename(Level := {{byvar}}) %>%
+ filter(!is.na(Level)) %>%
+ group_by(Level) %>%
+ summarize(
+ Violent = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE),
+ AAST = survey_mean(AAST * ADJINC_WT * 1000, na.rm = TRUE)
+ ) %>%
+ mutate(
+ Variable = byvar,
+ LevelNum = as.numeric(Level),
+ Level = as.character(Level)
+ ) %>%
+ select(Variable, Level, LevelNum, everything())
+}
+
+pers_est_df <-
+ c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>%
+ map(pers_est_by) %>%
+ bind_rows()
The output from all the estimates is cleaned to create better labels, such as going from “RaceHispOrigin” to “Race/Hispanic Origin”. Finally, the {gt} package is used to make a publishable table (Table 13.5). Using the functions from the {gt} package, column labels and footnotes are added and estimates are presented to the first decimal place.
-vr_gt<-pers_est_df %>%
- mutate(
- Variable = case_when(
- Variable == "RaceHispOrigin" ~ "Race/Hispanic origin",
- Variable == "MaritalStatus" ~ "Marital status",
- Variable == "AgeGroup" ~ "Age",
- TRUE ~ Variable
- )
- ) %>%
- select(-LevelNum) %>%
- group_by(Variable) %>%
- gt(rowname_col = "Level") %>%
- tab_spanner(
- label = "Violent crime",
- id = "viol_span",
- columns = c("Violent", "Violent_se")
- ) %>%
- tab_spanner(label = "Aggravated assault",
- columns = c("AAST", "AAST_se")) %>%
- cols_label(
- Violent = "Rate",
- Violent_se = "SE",
- AAST = "Rate",
- AAST_se = "SE",
- ) %>%
- fmt_number(
- columns = c("Violent", "Violent_se", "AAST", "AAST_se"),
- decimals = 1
- ) %>%
- tab_footnote(
- footnote = "Includes rape or sexual assault, robbery,
- aggravated assault, and simple assault.",
- locations = cells_column_spanners(spanners = "viol_span")
- ) %>%
- tab_footnote(
- footnote = "Excludes persons of Hispanic origin",
- locations =
- cells_stub(rows = Level %in%
- c("White", "Black", "Asian", NHOPI, "Other"))) %>%
- tab_footnote(
- footnote = "Includes persons who identified as
- Native Hawaiian or Other Pacific Islander only.",
- locations = cells_stub(rows = Level == NHOPI)
- ) %>%
- tab_footnote(
- footnote = "Includes persons who identified as American Indian or
- Alaska Native only or as two or more races.",
- locations = cells_stub(rows = Level == "Other")
- ) %>%
- tab_source_note(
- source_note = "Note: Rates per 1,000 persons age 12 or older.") %>%
- tab_source_note(source_note = "Source: Bureau of Justice Statistics,
- National Crime Victimization Survey, 2021.") %>%
- tab_stubhead(label = "Victim demographic") %>%
- tab_caption("Rate and standard error of violent victimization,
- by type of crime and demographic characteristics, 2021")
-
+vr_gt<-pers_est_df %>%
+ mutate(
+ Variable = case_when(
+ Variable == "RaceHispOrigin" ~ "Race/Hispanic origin",
+ Variable == "MaritalStatus" ~ "Marital status",
+ Variable == "AgeGroup" ~ "Age",
+ TRUE ~ Variable
+ )
+ ) %>%
+ select(-LevelNum) %>%
+ group_by(Variable) %>%
+ gt(rowname_col = "Level") %>%
+ tab_spanner(
+ label = "Violent crime",
+ id = "viol_span",
+ columns = c("Violent", "Violent_se")
+ ) %>%
+ tab_spanner(label = "Aggravated assault",
+ columns = c("AAST", "AAST_se")) %>%
+ cols_label(
+ Violent = "Rate",
+ Violent_se = "SE",
+ AAST = "Rate",
+ AAST_se = "SE",
+ ) %>%
+ fmt_number(
+ columns = c("Violent", "Violent_se", "AAST", "AAST_se"),
+ decimals = 1
+ ) %>%
+ tab_footnote(
+ footnote = "Includes rape or sexual assault, robbery,
+ aggravated assault, and simple assault.",
+ locations = cells_column_spanners(spanners = "viol_span")
+ ) %>%
+ tab_footnote(
+ footnote = "Excludes persons of Hispanic origin",
+ locations =
+ cells_stub(rows = Level %in%
+ c("White", "Black", "Asian", NHOPI, "Other"))) %>%
+ tab_footnote(
+ footnote = "Includes persons who identified as
+ Native Hawaiian or Other Pacific Islander only.",
+ locations = cells_stub(rows = Level == NHOPI)
+ ) %>%
+ tab_footnote(
+ footnote = "Includes persons who identified as American Indian or
+ Alaska Native only or as two or more races.",
+ locations = cells_stub(rows = Level == "Other")
+ ) %>%
+ tab_source_note(
+ source_note = "Note: Rates per 1,000 persons age 12 or older.") %>%
+ tab_source_note(source_note = "Source: Bureau of Justice Statistics,
+ National Crime Victimization Survey, 2021.") %>%
+ tab_stubhead(label = "Victim demographic") %>%
+ tab_caption("Rate and standard error of violent victimization,
+ by type of crime and demographic characteristics, 2021")
+