Merge branch 'main' into add-styler

tidy-survey-r · Mar 11, 2024 · a1baef9
2 parents c83a075 + 02bcccb
commit a1baef9

Showing 18 changed files with 553 additions and 568 deletions.
diff --git a/04-set-up.Rmd b/04-set-up.Rmd
diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd
diff --git a/10-specifying-sample-designs.Rmd b/10-specifying-sample-designs.Rmd
diff --git a/13-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd
@@ -26,7 +26,7 @@ library(srvyrexploR)
 library(gt)
 ```
 
-We will use data from NCVS. Here is the code to read in the three datasets from the {srvyrexploR} package:
+We will use data from the United States National Crime Victimization Survey (NCVS). Here is the code to read in the three datasets from the {srvyrexploR} package:
 ```{r}
 #| label: ncvs-data
 #| cache: TRUE
@@ -38,7 +38,7 @@ data(ncvs_2021_person)
 
 ## Introduction
 
-The United States National Crime Victimization Survey (NCVS) is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters.
+The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters.
 
 The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 [@ncvs_tech_2016]. The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move. 
 
@@ -54,7 +54,7 @@ The data from ICPSR is distributed with five files, each having its unique ident
   - Incident Record - `YEARQ`, `IDHH`, `IDPER`
   - 2021 Collection Year Incident - `YEARQ`, `IDHH`, `IDPER`
 
-We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in our OSF repository, but you can download the complete files at ICPSR^[https://www.icpsr.umich.edu/web/NACJD/studies/38429].
+We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included this subset in the {srvyrexploR} package, but you can download the complete files at ICPSR^[https://www.icpsr.umich.edu/web/NACJD/studies/38429].
 
 ## Survey Notation
 
@@ -102,7 +102,7 @@ For victimization rates, we need to know the victimization status for both victi
 
 Each record on the incident file represents one victimization, which is not the same as one incident. Some victimizations have several instances that make it difficult for the victim to differentiate the details of these incidents, labeled as "series crimes". Appendix A of the User's Guide indicates how to calculate the series weight in other statistical languages.
 
-Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident).
+Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is the number of actual victimizations, top-coded at 10. That is, even if the crime occurred more than 10 times, it is counted as 10 to reduce the influence of extreme outliers. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident).
 
 Table: (\#tab:cb-incident) Codebook for incident variables - related to series weight
 
@@ -121,7 +121,7 @@ Table: (\#tab:cb-incident) Codebook for incident variables - related to series w
 |  |  | 8 | Residue (invalid data) |
 | WGTVICCY | Adjusted victimization weight |  | Numeric |
 
-We want to create four variables to indicate if an incident is a series crime.  First, we create a variable called series using `V4017`, `V4018`, and `V4019`.  Next, we top code the number of incidents (`V4016`).  Finally, we create the series weight using our new top-coded variable and the existing weight.
+We want to create four variables to indicate if an incident is a series crime.  First, we create a variable called `series` using `V4017`, `V4018`, and `V4019`, where an incident is considered a series crime if there are 6 or more incidents (`V4017`), the incidents are similar in detail (`V4018`), and there is not enough detail to distinguish the incidents (`V4019`).  Next, we top-code the number of incidents (`V4016`) by creating a variable `n10v4016`, which is set to 10 if `V4016 > 10`.  Finally, we create the series weight using our new top-coded variable and the existing weight.
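
The series-weight logic described above can be sketched on a few made-up records. This is a minimal illustration, not the chapter's code: the `toy` data and its values are hypothetical, and the names `serieswgt` and `NEWWGT` are assumptions chosen for this sketch. The codes used to flag a series crime mirror the `case_when()` conditions in the chunk that follows.

```r
library(dplyr)

# Hypothetical toy records (not NCVS data): one non-series incident and
# three series crimes with 7, 25, and an unknown (997) number of incidents
toy <- tibble(
  V4017 = c(1, 2, 2, 2),
  V4018 = c(1, 1, 1, 1),
  V4019 = c(1, 2, 2, 2),
  V4016 = c(1, 7, 25, 997),
  WGTVICCY = c(100, 100, 100, 100)
)

toy_series <- toy %>%
  mutate(
    # 1 = not a series crime, 2 = series crime
    series = case_when(
      V4017 %in% c(1, 8) ~ 1,
      V4018 %in% c(2, 8) ~ 1,
      V4019 %in% c(1, 8) ~ 1,
      TRUE ~ 2
    ),
    # Top-code the number of incidents at 10; 997/998 are treated as unknown
    n10v4016 = case_when(
      V4016 %in% c(997, 998) ~ NA_real_,
      V4016 > 10 ~ 10,
      TRUE ~ V4016
    ),
    # Series weight: 6 if the count is unknown, otherwise the top-coded count
    serieswgt = case_when(
      series == 2 & is.na(n10v4016) ~ 6,
      series == 2 ~ n10v4016,
      TRUE ~ 1
    ),
    # Assumed name for the final weight: existing weight times series weight
    NEWWGT = WGTVICCY * serieswgt
  )

toy_series
```

In this sketch, the record with 25 incidents contributes as if it were 10 victimizations, and the record with an unknown count contributes as 6.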
 
 ```{r}
 #| label: ncvs-vign-incfile
@@ -132,7 +132,7 @@ inc_series <- ncvs_2021_incident %>%
     series = case_when(V4017 %in% c(1, 8) ~ 1,
                        V4018 %in% c(2, 8) ~ 1,
                        V4019 %in% c(1, 8) ~ 1,
-                       TRUE ~ 2 # series
+                       TRUE ~ 2
     ),
     n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_,
                          V4016 > 10 ~ 10,
@@ -635,19 +635,19 @@ vt2a
 vt2b
 ```
 
-The number of victimizations estimated using the incident file is equivalent to the person and household file method.  There are `r vt1$Property_Vzn` property incidents and `r vt1$Violent_Vzn` violent incidents in a six-month period.
+The number of victimizations estimated using the incident file is equivalent to the person and household file method.  There are `r prettyNum(vt1$Property_Vzn, big.mark=",")` property incidents and `r prettyNum(vt1$Violent_Vzn, big.mark=",")` violent incidents in a six-month period.
 
 ### Estimation 2: Victimization Proportions {#vic-prop}
 
 Victimization proportions are proportions describing features of a victimization. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (`inc_des`). 
 
-For example, we could be interested in the percentage of property victimizations reported to the police:
+For example, we could be interested in the percentage of property victimizations reported to the police. The following code produces the estimate along with its standard error and 95% confidence interval:
 
 ```{r}
 #| label: ncvs-vign-vic-prop-police
 prop1 <- inc_des %>%
   filter(Property) %>%
-  summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE) * 100)
+  summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100)
 
 prop1
 ```
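
To see what `proportion = TRUE` and `vartype = c("se", "ci")` return without the NCVS design objects, here is a small self-contained sketch. It uses the `apistrat` example data shipped with the {survey} package rather than NCVS data, and the names `api_des`, `met_target`, and `api_prop` are made up for this illustration.

```r
library(dplyr)
library(srvyr)

# Example data from the {survey} package (California school data, not NCVS)
data(api, package = "survey")

api_des <- apistrat %>%
  # 0/1 indicator: did the school meet its school-wide growth target?
  mutate(met_target = as.numeric(sch.wide == "Yes")) %>%
  as_survey_design(strata = stype, weights = pw)

api_prop <- api_des %>%
  summarize(
    p = survey_mean(
      met_target,
      na.rm = TRUE,
      proportion = TRUE,       # use a proportion-specific interval method
      vartype = c("se", "ci")  # request both standard error and CI
    )
  )

# Columns: p (estimate), p_se, p_low, p_upp
api_prop
```

With `proportion = TRUE`, the confidence interval is computed with a method designed for proportions, which keeps the limits between 0 and 1.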
@@ -706,7 +706,7 @@ pers_des %>%
   ))
 ```
 
-A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a `group_by()` statement for each categorization separately. Thus, we make a function to do this and then use `map_df()` from the {purrr} package (part of the tidyverse) to loop through the variables. Finally, the {gt} package is used to make a publishable table shown in Table \@ref(tab:ncvs-vign-rates-demo-tab).
+A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a separate `group_by()` statement for each categorization. Thus, we make a function to do this and then use `map()` from the {purrr} package (part of the tidyverse) to loop through the variables. This function takes a demographic variable as its input (`byvar`) and calculates the violent and aggravated assault victimization rates for each level. It then creates some columns with the variable, the level of each variable, and a numeric version of the variable (`LevelNum`) for sorting later. The function is run across multiple variables using `map()`, and the results are then stacked into a single output using `bind_rows()`.
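
The pattern of writing one grouping function and stacking its results can be sketched on a plain data frame. This is an illustration only: `toy`, `est_by()`, and the unweighted `mean()` are stand-ins invented here, whereas the chapter's function works on the survey design object with `survey_mean()`.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data (unweighted, not NCVS)
toy <- tibble(
  Sex = c("F", "M", "F", "M"),
  AgeGroup = c("12-17", "18+", "18+", "18+"),
  Victim = c(1, 0, 0, 1)
)

# Takes a variable name as a string and returns one row per level
est_by <- function(byvar) {
  toy %>%
    group_by(Level = .data[[byvar]]) %>%
    summarize(Rate = mean(Victim) * 1000, .groups = "drop") %>%
    mutate(
      Variable = byvar,         # which characteristic this row describes
      LevelNum = row_number(),  # numeric position of the level, for sorting
      .before = Level
    )
}

# Run the function once per variable, then stack the results
toy_est <- c("Sex", "AgeGroup") %>%
  map(est_by) %>%
  bind_rows()

toy_est
```

Because every call returns the same columns (`Variable`, `LevelNum`, `Level`, `Rate`), `bind_rows()` can stack the pieces into a single table.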
 
 ```{r}
 #| label: ncvs-vign-rates-demo
@@ -729,7 +729,14 @@ pers_est_by <- function(byvar) {
 
 pers_est_df <-
   c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>%
-  map_df(pers_est_by)
+  map(pers_est_by) %>%
+  bind_rows()
+```
+
+The output from all the estimates is cleaned to create better labels, such as going from "RaceHispOrigin" to "Race/Hispanic Origin". Finally, the {gt} package is used to make a publishable table (Table \@ref(tab:ncvs-vign-rates-demo-tab)). Using the functions from the {gt} package, column labels and footnotes are added, and estimates are presented to the first decimal place.
+
+```{r}
+#| label: ncvs-vgn-rates-demo-gt-create
 
 vr_gt<-pers_est_df %>%
   mutate(
@@ -789,6 +796,8 @@ vr_gt<-pers_est_df %>%
              by type of crime and demographic characteristics, 2021")
 ```
 
+
+
 ```{r}
 #| label: ncvs-vign-rates-demo-noeval
 #| eval: false
@@ -835,6 +844,40 @@ pers_prev_ests
 
 In the example above, the indicator is multiplied by 100 to return a percentage rather than a proportion. In 2021, we estimate that `r formatC(pers_prev_ests$Violent_Prev, digits=2, format="f")`% of people aged 12 and older were a victim of violent crime in the United States, and `r formatC(pers_prev_ests$AAST_Prev, digits=2, format="f")`% were victims of aggravated assault.
 
+## Statistical Testing
+
+For any of the types of estimates discussed, we can also perform statistical testing. For example, we could test whether property victimization rates are different between properties that are owned versus rented. First, we calculate the point estimates.
+
+```{r}
+prop_tenure <- hh_des %>%
+  group_by(Tenure) %>%
+  summarize(
+    Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
+                                na.rm = TRUE, vartype="ci"),
+  )
+
+prop_tenure  
+```
+
+The property victimization rate for rented households is `r prop_tenure %>% filter(Tenure=="Rented") %>% pull(Property_Rate) %>% round(1)` per 1,000 households, while the property victimization rate for owned households is `r prop_tenure %>% filter(Tenure=="Owned") %>% pull(Property_Rate) %>% round(1)`. These rates seem very different, especially given the non-overlapping confidence intervals. However, survey data is inherently non-independent, so statistical testing cannot be done by comparing confidence intervals. To conduct the statistical test, we first need to create the variable to compare, which incorporates the adjusted incident weight (`ADJINC_WT`). The test can then be conducted as discussed in Chapter \@ref(c06-statistical-testing).
+
+```{r}
+prop_tenure_test <- hh_des %>%
+  mutate(
+    Prop_Adj=Property * ADJINC_WT * 1000
+  ) %>%
+  svyttest(
+    formula = Prop_Adj ~ Tenure,
+    design = .,
+    na.rm = TRUE
+  ) %>%
+  broom::tidy()
+
+prop_tenure_test
+```
+
+The output of the statistical test shows the same difference of `r prop_tenure_test$estimate %>% round(1)` between the property victimization rates of renters and owners, and the test is highly significant with a p-value of `r prettyunits::pretty_p_value(prop_tenure_test$p.value)`.
+
 ## Exercises
 
 1. What proportion of completed motor vehicle thefts are not reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529).
@@ -865,3 +908,25 @@ hh_des %>%
   summarize(Property_Rate = survey_mean(Property * ADJINC_WT * 1000, 
                                         na.rm = TRUE))
 ```
+
+4. What is the difference in the violent victimization rate between males and females? Is the difference statistically significant?
+
+```{r}
+pers_des %>%
+  group_by(Sex) %>%
+  summarize(
+    Violent_rate=survey_mean(Violent * ADJINC_WT * 1000, na.rm=TRUE)
+  )
+
+pers_des %>%
+  mutate(
+    Violent_Adj=Violent * ADJINC_WT * 1000
+  ) %>%
+  svyttest(
+    formula = Violent_Adj ~ Sex,
+    design = .,
+    na.rm = TRUE
+  ) %>%
+  broom::tidy()
+```
+