diff --git a/07-modeling.Rmd b/07-modeling.Rmd index a3a9c4a..8e972af 100644 --- a/07-modeling.Rmd +++ b/07-modeling.Rmd @@ -26,7 +26,7 @@ library(gt) library(prettyunits) ``` -We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) +We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information). ```{r} #| label: model-anes-des @@ -63,7 +63,7 @@ recs_des <- recs_2020 %>% ## Introduction {#model-intro} -Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter \@ref(c06-statistical-testing), which looked at the relationships between just two variables. For example, in Example 3 in Section \@ref(stattest-ttest-examples), we investigated if there is a relationship between the electrical bill cost and whether or not the household used air conditioning (A/C). However, there are potentially other elements that could go into what the cost of electrical bills are in a household (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.) +Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter \@ref(c06-statistical-testing), which looked at the relationships between just two variables. For example, in Example 3 in Section \@ref(stattest-ttest-examples), we investigated if there is a relationship between the electrical bill cost and whether or not the household used air conditioning (A/C). However, there are potentially other elements that could go into what the cost of electrical bills are in a household (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.). \index{Analysis of variance (ANOVA)|(} T-tests only allow us to investigate the relationship of one independent variable at a time, but using models, we can look into multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter, we cover Analysis of Variance (ANOVA) and linear regression models following common normal (Gaussian) and logit models. Jonas Kristoffer Lindeløv has an interesting [discussion](https://lindeloev.github.io/tests-as-linear/) of many statistical tests and models being equivalent to a linear model. For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels. @@ -106,7 +106,7 @@ This could be specified in R code as any of the following: - `mpg ~ cyl + disp + hp + cyl:disp + cyl:hp + disp:hp` - `mpg ~ cyl*disp + cyl*hp + disp*hp` -In the above options, the way the `:` and `*` notation are implemented are different. Using `:` only includes the interactions and not the main effects, while using `*` includes the main effects and all possible interactions. Table \@ref(tab:notation-diffs) provides an overview of the syntax and differences between the two notations. +In the above options, the ways the `:` and `*` notations are implemented are different. Using `:` only includes the interactions and not the main effects, while using `*` includes the main effects and all possible interactions. Table \@ref(tab:notation-diffs) provides an overview of the syntax and differences between the two notations. Table: (\#tab:notation-diffs) Differences in formulas for `:` and `*` code syntax @@ -117,9 +117,9 @@ Table: (\#tab:notation-diffs) Differences in formulas for `:` and `*` code synta \index{Formula notation|)} -When using non-survey data, such as experimental or observational data, researchers use the `glm()` function for linear models. With survey data, however, \index{Functions in survey!svyglm|(}we use `svyglm()` from the {survey} package to ensure that we account for the survey design and weights in modeling^[There is some debate about whether weights should be used in regression [@bollen2016weightsreg; @gelman2007weights]. However, for the purposes of providing complete information on how to analyze complex survey data, this chapter includes weights.]. This allows us to generalize a model to the population of interest and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter \@ref(c06-statistical-testing), modeling survey data cannot be directly done in {srvyr}, but can be done in the {survey} package [@lumley2010complex]. In this chapter, we provide syntax and examples for linear models, including ANOVA, normal linear regression, and logistic regression. For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to @lumley2010complex. @lumley2010complex also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book.\index{svyglm|see {Functions in survey}} +When using non-survey data, such as experimental or observational data, researchers use the `glm()` function for linear models. With survey data, however, \index{Functions in survey!svyglm|(}we use `svyglm()` from the {survey} package to ensure that we account for the survey design and weights in modeling^[There is some debate about whether weights should be used in regression [@bollen2016weightsreg; @gelman2007weights]. However, for the purposes of providing complete information on how to analyze complex survey data, this chapter includes weights.]. This allows us to generalize a model to the population of interest and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter \@ref(c06-statistical-testing), modeling survey data cannot be directly done in {srvyr}, but can be done in the {survey} package [@lumley2010complex]. In this chapter, we provide syntax and examples for linear models, including ANOVA, normal linear regression, and logistic regression. For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to @lumley2010complex. @lumley2010complex also discusses custom models such as a negative binomial or Poisson model in appendix E of his book.\index{svyglm|see {Functions in survey}} -## Analysis of variance (ANOVA) +## Analysis of variance \index{Analysis of variance (ANOVA)|(} In ANOVA, we are testing whether the mean of an outcome is the same across two or more groups. Statistically, we set up this as follows: @@ -135,9 +135,9 @@ where $x_i$ is a group indicator for groups $1, \cdots, k$. Some assumptions when using ANOVA on survey data include: - - The outcome variable is normally distributed within each group - - The variances of the outcome variable between each group are approximately equal - - We do NOT assume independence between the groups as with ANOVA on non-survey data. The covariance is accounted for in the survey design + - The outcome variable is normally distributed within each group. + - The variances of the outcome variable between each group are approximately equal. + - We do NOT assume independence between the groups as with ANOVA on non-survey data. The covariance is accounted for in the survey design. ### Syntax @@ -158,14 +158,14 @@ The arguments are: * `formula`: formula in the form of `outcome~group`. The group variable must be a factor or character. * `design`: a `tbl_svy` object created by `as_survey` * `na.action`: handling of missing data -* \index{Degrees of freedom|(}`df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-(g-1)` where $g$ is the number of groups\index{Degrees of freedom|)} +* \index{Degrees of freedom|(}`df.resid`: degrees of freedom for Wald tests (optional); defaults to using `degf(design)-(g-1)` where $g$ is the number of groups\index{Degrees of freedom|)} -\index{Dot notation|(}The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details.)\index{Dot notation|)} The default for missing data is `na.omit`. This means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data, and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion on how to handle missing data, see Chapter \@ref(c11-missing-data). +\index{Dot notation|(}The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details).\index{Dot notation|)} The default for missing data is `na.omit`. This means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data, and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion on how to handle missing data, see Chapter \@ref(c11-missing-data). ### Example \index{Residential Energy Consumption Survey (RECS)|(} -Looking at an example helps us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the day and evening when using A/C during the summer. To analyze these data, we filter the respondents to only those using A/C (`ACUsed`.) Then, if we want to see if there are regional differences, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!unweighted|(} \index{Functions in srvyr!filter|(} \index{Functions in srvyr!summarize|(} +Looking at an example helps us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the evening when using A/C during the summer^[Question text: "During the summer, what is your home’s typical indoor temperature inside your home at night?” [@recs-svy].]. To analyze these data, we filter the respondents to only those using A/C (`ACUsed`)^[Question text: "Is any air conditioning equipment used in your home?" [@recs-svy].]. Then, if we want to see if there are regional differences, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!unweighted|(} \index{Functions in srvyr!filter|(} \index{Functions in srvyr!summarize|(} ```{r} #| label: model-anova-prep @@ -191,10 +191,10 @@ anova_out <- recs_des %>% tidy(anova_out) ``` -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, `r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(p.value) %>% pretty_p_value()`) degrees higher than in the Northeast during summer nights, and each region sets their thermostats at significantly higher temperatures than the Northeast. +In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, `r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(estimate) %>% signif(3)` (p-value is `r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(p.value) %>% pretty_p_value()`) degrees higher than in the Northeast during summer nights, and each region sets its thermostats at significantly higher temperatures than the Northeast. \index{Factor|(} -If we wanted to change the reference value, we would reorder the factor before modeling using the `relevel()` function from {stats} or using one of many factor ordering functions in {forcats} such as `fct_relevel()` or `fct_infreq()`. For example, if we wanted the reference level to be the Midwest region, we could use the following code. \index{gt package|(}Note the usage of the `gt()` function on top of `tidy()` to print a nice-looking output table [@R-gt; @R-broom] (see Chapter \@ref(c08-communicating-results) for more information on the {gt} package.) +If we wanted to change the reference value, we would reorder the factor before modeling using the `relevel()` function from {stats} or using one of many factor ordering functions in {forcats} such as `fct_relevel()` or `fct_infreq()`. For example, if we wanted the reference level to be the Midwest region, we could use the following code with the results in Table \@ref(tab:model-anova-ex-tab). \index{gt package|(}Note the usage of the `gt()` function on top of `tidy()` to print a nice-looking output table [@R-gt; @R-broom] (see Chapter \@ref(c08-communicating-results) for more information on the {gt} package). ```{r} #| label: model-anova-ex-relevel @@ -228,7 +228,7 @@ tidy(anova_out_relevel) %>% ``` \index{p-value|(} -This output now has the coefficients indicating the difference in temperature relative to the Midwest region. For example, in the Northeast, temperatures are set, on average, `r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(p.value) %>% pretty_p_value()`) degrees lower than in the Midwest during summer nights, and each region sets their thermostats at significantly lower temperatures than the Midwest. This is the reverse of what we saw in the prior model, as we are still comparing the same two regions, just from different reference points. \index{Analysis of variance (ANOVA)|)} \index{Factor|)} \index{gt package|)} +This output now has the coefficients indicating the difference in temperature relative to the Midwest region. For example, in the Northeast, temperatures are set, on average, `r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(estimate) %>% abs() %>% signif(3)` (p-value is `r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(p.value) %>% pretty_p_value()`) degrees lower than in the Midwest during summer nights, and each region sets its thermostats at significantly lower temperatures than the Midwest. This is the reverse of what we saw in the prior model, as we are still comparing the same two regions, just from different reference points. \index{Analysis of variance (ANOVA)|)} \index{Factor|)} \index{gt package|)} \index{p-value|)} ## Normal linear regression @@ -273,7 +273,7 @@ The arguments are: * `formula`: formula in the form of `y~x` * `design`: a `tbl_svy` object created by `as_survey` * `na.action`: handling of missing data -* \index{Degrees of freedom|(}`df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-p` where $p$ is the rank of the design matrix\index{Degrees of freedom|)} +* \index{Degrees of freedom|(}`df.resid`: degrees of freedom for Wald tests (optional); defaults to using `degf(design)-p` where $p$ is the rank of the design matrix\index{Degrees of freedom|)} \index{Formula notation|(} As discussed in Section \@ref(model-intro), the formula on the right-hand side can be specified in many ways, for example, denoting whether or not interactions are desired. @@ -283,7 +283,7 @@ As discussed in Section \@ref(model-intro), the formula on the right-hand side c #### Example 1: Linear regression with a single variable {.unnumbered} -On RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity.) +On RECS, we can obtain information on the square footage of homes^[Question text: "What is the square footage of your home?" [@recs-svy].] and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity). ```{r} #| label: model-plot-sf-elbill @@ -345,14 +345,14 @@ tidy(m_electric_sqft) %>% -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by `r (tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(estimate) %>% signif(2))*100` cents, and that square footage is significantly associated with electricity expenditure (p-value`r tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(p.value) %>% pretty_p_value()`.) +In Table \@ref(tab:model-slr-examp-tab), we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by `r (tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(estimate) %>% signif(2))*100` cents, and that square footage is significantly associated with electricity expenditure (p-value is `r tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(p.value) %>% pretty_p_value()`). This is a straightforward model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with one-variable models can help analysts understand what potential relationships there are between variables before fitting more complex models. Often, we start with known relationships before building models to determine what impact additional variables have on the model. #### Example 2: Linear regression with multiple variables and interactions {.unnumbered} \index{Formula notation|(} -In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether A/C is used (logical/categorical) with all two-way interactions also included. In this example, we are choosing to fit this model without an intercept (using `-1` in the formula.) This results in an intercept estimate for each region instead of a single intercept for all data. +In the following example, a model is fit to predict electricity expenditure, including census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether A/C is used (logical/categorical) with all two-way interactions also included. In this example, we are choosing to fit this model without an intercept (using `-1` in the formula). This results in an intercept estimate for each region instead of a single intercept for all data. ```{r} #| label: model-lmr-examp @@ -391,7 +391,7 @@ tidy(m_electric_multi) %>% ``` -As shown above, there are many terms in this model. To test whether coefficients for a term are different from zero, the `regTermTest()` function can be used. For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows: +As shown in Table \@ref(tab:model-lmr-examp-tab), there are many terms in this model. To test whether coefficients for a term are different from zero, the `regTermTest()` function can be used. For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows: ```{r} #| label: model-lmr-test-term @@ -401,7 +401,7 @@ urb_reg_test ``` \index{p-value|(} -This output indicates there is a significant interaction between urbanicity and region (p-value=`r pretty_p_value(urb_reg_test[["p"]])`.) +This output indicates there is a significant interaction between urbanicity and region (p-value is `r pretty_p_value(urb_reg_test[["p"]])`.) \index{p-value|)} To examine the predictions, residuals, and more from the model, the `augment()` function from {broom} can be used. The `augment()` function returns a tibble with the independent and dependent variables and other fit statistics. The `augment()` function has not been specifically written for objects of class `svyglm`, and as such, a warning is displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using `augment()`. To obtain the standard error of the predicted values (`.se.fit`), we need to use the `attr()` function on the predicted values (`.fitted`) created by `augment()`. Additionally, the predicted values created are outputted with a type of `svrep`. If we want to plot the predicted values, we need to use `as.numeric()` to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed **after** the standard error adjustment. @@ -421,7 +421,7 @@ These results can then be used in a variety of ways, including examining residua ```{r} #| label: model-aug-examp-plot -#| fig.cap: Residual plot of electric cost model with covariates Region, Urbanicity, TOTSQFT\_EN, and ACUsed +#| fig.cap: 'Residual plot of electric cost model with the following covariates: Region, Urbanicity, TOTSQFT\_EN, and ACUsed' #| fig.alt: Residual scatter plot with a x-axis of 'Fitted value of electricity cost' ranging between approximately $0 and $4,000 and a y-axis with the 'Residual of model' ranging from approximately -$3,000 to $5,000. The points create a slight megaphone shape with largest residuals in the middle of the x-range. A red line is drawn horizontally at y=0. fitstats %>% ggplot(aes(x = .fitted, .resid)) + @@ -468,9 +468,9 @@ In the above example, it is predicted that the energy expenditure would be \$`r ## Logistic regression \index{Categorical data|(} \index{Logistic regression|(} \index{logit models|see {Logistic regression}} -Logistic regression is used to model binary outcomes, such as whether or not someone voted. There are several instances where an outcome may not be originally binary but is collapsed into being binary. For example, given that gender is often asked in surveys with multiple response options and not a binary scale, many researchers now code gender in logistic modeling as cis-male compared to not cis-male. We could also convert a 4-point Likert scale that has levels of "Strongly Agree", "Agree", "Disagree", and "Strongly Disagree" to group the agreement levels into one group and disagreement levels into a second group. +Logistic regression is used to model binary outcomes, such as whether or not someone voted. There are several instances where an outcome may not be originally binary but is collapsed into being binary. For example, given that gender is often asked in surveys with multiple response options and not a binary scale, many researchers now code gender in logistic modeling as "cis-male" compared to not "cis-male." We could also convert a 4-point Likert scale that has levels of "Strongly Agree," "Agree," "Disagree," and "Strongly Disagree" to group the agreement levels into one group and disagreement levels into a second group. -Logistic regression is a specific case of the generalized linear model (GLM.) A GLM uses a link function to link the response variable to the linear model. If we tried to use a normal linear regression with a binary outcome, many assumptions would not hold, namely, the response would not be continuous. Logistic regression allows us to link a linear model between the covariates and the propensity of an outcome. In logistic regression, the link model is the logit function. Specifically, the model is specified as follows: +Logistic regression is a specific case of the generalized linear model (GLM). A GLM uses a link function to link the response variable to the linear model. If we tried to use a normal linear regression with a binary outcome, many assumptions would not hold, namely, the response would not be continuous. Logistic regression allows us to link a linear model between the covariates and the propensity of an outcome. In logistic regression, the link model is the logit function. Specifically, the model is specified as follows: $$ y_i \sim \text{Bernoulli}(\pi_i)$$ @@ -481,9 +481,9 @@ $$ y_i \sim \text{Bernoulli}(\pi_i)$$ ``` which can be re-expressed as -$$ \pi_i=\frac{\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}{1+\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}.$$ where $y_i$ is the outcome, $\beta_0$ is an intercept, and $x_1, \cdots, x_n$ are the predictors with $\beta_1, \cdots, \beta_n$ as the associated coefficients. +$$ \pi_i=\frac{\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}{1+\exp \left(\beta_0 +\sum_{i=1}^n \beta_i x_i \right)}$$ where $y_i$ is the outcome, $\beta_0$ is an intercept, and $x_1, \cdots, x_n$ are the predictors with $\beta_1, \cdots, \beta_n$ as the associated coefficients. -The Bernoulli distribution is a distribution which has an outcome of 0 or 1 given some probability ($\pi_i$) in this case and we model $\pi_i$ as a function of the covariates $x_i$ using this logit link. +The Bernoulli distribution is a distribution which has an outcome of 0 or 1 given some probability ($\pi_i$) in this case, and we model $\pi_i$ as a function of the covariates $x_i$ using this logit link. \index{Categorical data|)} Assumptions in logistic regression using survey data include: @@ -514,12 +514,12 @@ The arguments are: * `formula`: Formula in the form of `y~x` * \index{Degrees of freedom|(}`design`: a `tbl_svy` object created by `as_survey`\index{Degrees of freedom|)} * `na.action`: handling of missing data -* `df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-p` where $p$ is the rank of the design matrix +* `df.resid`: degrees of freedom for Wald tests (optional); defaults to using `degf(design)-p` where $p$ is the rank of the design matrix * `family`: the error distribution/link function to be used in the model Note `svyglm()` is the same function used in both ANOVA and normal linear regression. However, we've added the link function quasibinomial. While we can use the binomial link function, it is recommended to use the quasibinomial as our weights may not be integers, and the quasibinomial also allows for overdispersion [@lumley2010complex; @mccullagh1989binary; @R-base]. The quasibinomial family has a default logit link, which is specified in the equations above. When specifying the outcome variable, it is likely specified in one of three ways with survey data: - - \index{Factor|(}A two-level factor variable where the first level of the factor indicates a "failure" and the second level indicates a "success"\index{Factor|)} + - \index{Factor|(}A two-level factor variable where the first level of the factor indicates a "failure," and the second level indicates a "success"\index{Factor|)} - A numeric variable which is 1 or 0 where 1 indicates a success - A logical variable where TRUE indicates a success @@ -528,7 +528,7 @@ Note `svyglm()` is the same function used in both ANOVA and normal linear regres #### Example 1: Logistic regression with single variable {.unnumbered} \index{American National Election Studies (ANES)|(} -In the following example, we use the ANES data to model whether someone usually has trust in the government^[Question: How often can you trust the federal government in Washington to do what is right?] by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. \index{Factor|(}We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single-factor level, and the other response options ("About half the time", "Some of the time", and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!summarize|(} \index{Factor|)} +In the following example, we use the ANES data to model whether someone usually has trust in the government^[Question text: "How often can you trust the federal government in Washington to do what is right?" [@anes-svy].] by whom someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. \index{Factor|(}We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single-factor level, and the other response options ("About half the time," "Some of the time," and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful, as it is all 0 and 1 outcomes; so instead, we plot a summary of the data. \index{Functions in srvyr!survey\_mean|(} \index{Functions in srvyr!summarize|(} \index{Factor|)} ```{r} #| label: model-logisticexamp-plot @@ -563,7 +563,7 @@ anes_des_der %>% ``` \index{Functions in srvyr!summarize|)} \index{Functions in srvyr!survey\_mean|)} -Looking at Figure \@ref(fig:model-logisticexamp-plot), it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and Other candidates. To determine if this insight is accurate, we next fit the model. +Looking at Figure \@ref(fig:model-logisticexamp-plot), it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and other candidates. To determine if this insight is accurate, we next fit the model. ```{r} #| label: model-logisticexamp-model @@ -596,7 +596,7 @@ tidy(logistic_trust_vote) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. This output indicates that respondents who voted for Trump are more likely to usually have trust in the government compared to those who voted for Biden (the reference level.) The coefficient of `r signif(logistic_trust_vote$coefficients[2],3)` represents the increase in the log odds of usually trusting the government. +In Table \@ref(tab:model-logisticexamp-tab), we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. This output indicates that respondents who voted for Trump are more likely to usually have trust in the government compared to those who voted for Biden (the reference level). The coefficient of `r signif(logistic_trust_vote$coefficients[2],3)` represents the increase in the log odds of usually trusting the government. In most cases, it is easier to talk about the odds instead of the log odds. To do this, we need to exponentiate the coefficients. We can use the same `tidy()` function but include the argument `exponentiate = TRUE` to see the odds. @@ -636,7 +636,7 @@ or_other <- pull(estimate) ``` -We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump is `r signif(or_trump*100, 3)`% as likely to trust the government compared to a person who voted for Biden (the reference level.) In comparison, a person who voted for neither Biden nor Trump is `r signif(or_other*100, 3)`% as likely to trust the government as someone who voted for Biden. +Using the output in Table \@ref(tab:model-logisticexamp-model-odds-tab), we can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump is `r signif(or_trump*100, 3)`% as likely to trust the government compared to a person who voted for Biden (the reference level). In comparison, a person who voted for neither Biden nor Trump is `r signif(or_other*100, 3)`% as likely to trust the government as someone who voted for Biden. As with linear regression, the `augment()` can be used to predict values. By default, the prediction is the link function, not the probability model. To predict the probability, add an argument of `type.predict="response"` as demonstrated below: @@ -654,13 +654,13 @@ logistic_trust_vote %>% ``` -#### Example 2: Interaction Effects {.unnumbered} +#### Example 2: Interaction effects {.unnumbered} \index{Interaction effects|(} -Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden among all voters in 2020, we could include the indicator of if respondents voted early (`EarlyVote2020`) and their income group (`Income7`) in our model. +Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden among all voters in 2020, we could include the indicator of whether respondents voted early (`EarlyVote2020`) and their income group (`Income7`) in our model. -First, we need to subset the data to 2020 voters and then create an indicator for voted for Biden. \index{Functions in srvyr!filter|(} +First, we need to subset the data to 2020 voters and then create an indicator for who voted for Biden. \index{Functions in srvyr!filter|(} ```{r} #| label: model-logisticexamp-biden-ind @@ -693,7 +693,7 @@ tidy(log_biden_main) %>% fmt_number() ``` -(ref:model-logisticexamp-biden-main-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income - main effects only, ANES 2020 +(ref:model-logisticexamp-biden-main-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income; main effects only, ANES 2020 ```{r} #| label: model-logisticexamp-biden-main-tab @@ -708,7 +708,7 @@ tidy(log_biden_main) %>% ``` \index{p-value|(} -This main effect model (see Table \@ref(tab:model-logisticexamp-biden-main-tab)) indicates that people with incomes of \$125,000 or more have a significant negative coefficient `r signif(log_biden_main$coefficients[8],3)` (p-value=`r tidy(log_biden_main) %>% slice(8) %>% pull(p.value) %>% pretty_p_value()`). This indicates that people with incomes of \$125,000 or more were less likely to vote for Biden in the 2020 election compared to people with incomes of \$20,000 or less (reference level). +This main effect model (see Table \@ref(tab:model-logisticexamp-biden-main-tab)) indicates that people with incomes of \$125,000 or more have a significant negative coefficient -`r signif(log_biden_main$coefficients[8],3)` (p-value is `r tidy(log_biden_main) %>% slice(8) %>% pull(p.value) %>% pretty_p_value()`). This indicates that people with incomes of \$125,000 or more were less likely to vote for Biden in the 2020 election compared to people with incomes of \$20,000 or less (reference level). \index{p-value|)} Although early voting behavior was not significant, there may be an interaction between income and early voting behavior. To determine this, we can create a model that includes the interaction effects: @@ -731,7 +731,7 @@ tidy(log_biden_int) %>% fmt_number() ``` -(ref:model-logisticexamp-biden-int-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income - with interaction, ANES 2020 +(ref:model-logisticexamp-biden-int-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income; with interaction, ANES 2020 ```{r} #| label: model-logisticexamp-biden-int-tab @@ -761,7 +761,7 @@ The y-axis is the predicted probabilities, one of our x-variables is on the x-ax ```{r} #| label: model-logisticexamp-biden-plot -#| fig.cap: Interaction Plot of Early Voting and Income Predicting the Probability of Voting for Biden +#| fig.cap: Interaction plot of early voting and income predicting the probability of voting for Biden #| fig.alt: "Line plot with x-axis as indicator for voted early, with did not early vote on the left and did early vote on the right, and y-axis as 'Predicted Probability of Voting for Biden'. There are seven lines for income groups with lines being from top to bottom: Under $20k, $80k to less than $100k, $40k to less than $60k, $100k to less than $125k, $20k to less than 40k, $125k or more, and $60k to less than $80k. The lines for $40k to less than $60k, $60k to less than $80k, and $125k or more are all relatively flat with the probabilities for did not early vote and did early vote being equivalent. The lines for $20k to less than $40k and $100k to less than $125k have a slight positive slope. The line for less than $20k has a slight negative slope and has overall the highest probability for both levels of early voting. The line for $80k to less than $100k has a large positive slope. This line shows the lowest probability for those who did not early vote, and the second highest probability for those who did early vote." log_biden_pred %>% @@ -786,7 +786,7 @@ log_biden_pred %>% theme_minimal() ``` -From Figure \@ref(fig:model-logisticexamp-biden-plot), we can see that people who have incomes in most groups (e.g., \$40k to <60k) have roughly the same probability of voting for Biden regardless of whether they voted early or not. However, those with income in the \$100k to < 125k group were more likely to vote for Biden if they voted early than if they did not vote early. +From Figure \@ref(fig:model-logisticexamp-biden-plot), we can see that people who have incomes in most groups (e.g., \$40,000 to less than \$60,000) have roughly the same probability of voting for Biden regardless of whether they voted early or not. However, those with income in the \$100,000 to less than \$125,000 group were more likely to vote for Biden if they voted early than if they did not vote early. Interactions in models can be difficult to understand from the coefficients alone. Using these interaction plots can help others understand the nuances of the results.\index{Functions in survey!svyglm|)} \index{American National Election Studies (ANES)|)} \index{Logistic regression|)} \index{Interaction effects|)} @@ -796,8 +796,8 @@ Interactions in models can be difficult to understand from the coefficients alon 2. Does temperature play a role in electricity expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0, while a day that averages 85°F (29.4°C) would have CDD65=20 because it is 20 degrees Fahrenheit warmer [@eia-cdd]. Each day in the year is summed up to indicate how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions. -3. Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures. +3. Continuing with our results from Exercise 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures. 4. Early voting expanded in 2020 [@npr-voting-trend]. Build a logistic model predicting early voting in 2020 (`EarlyVote2020`) using age (`Age`), education (`Education`), and party identification (`PartyID`). Include two-way interactions. -5. Continuing from Exercise 4, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican. \ No newline at end of file +5. Continuing from Exercise 4, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree; however, one person is a strong Democrat, and the other is a strong Republican. diff --git a/93-AppendixD.Rmd b/93-AppendixD.Rmd index fa81b78..3581026 100644 --- a/93-AppendixD.Rmd +++ b/93-AppendixD.Rmd @@ -561,7 +561,7 @@ tidy(temps_sqft_exp) %>% Answer: There is a significant interaction between square footage and cooling degree days in the model and the square footage is a significant predictor of eletricity expenditure. -3. Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures. +3. Continuing with our results from Exercise 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures. Answer: ```{r} @@ -615,7 +615,7 @@ earlyvote_mod <- anes_des %>% tidy(earlyvote_mod) %>% print(n=50) ``` -5. Continuing from Exercise 4, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican. +5. Continuing from Exercise 4, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree; however, one person is a strong Democrat, and the other is a strong Republican. ```{r} #| label: model-ex-solution7