Note: This is not an
r
package but just a set ofr
functions aiming at one purpose.
R
codes to select the best models fitting three-way tables.
dplyr
broom
plyr
packages would be used. Install these before use this set.
- define
find_xname()
: used infitted_val.R
,goodness_fit.R
- modify
broom::tidy.anova()
function
function model_loglin()
- fit
glm
for given data - for every pair of independence model
function good_loglin()
- for fitted
model_loglin
, - compute goodness-of-fit
- implement with
purrr::map()
anddplyr::bind_rows()
function fit_loglin()
- for fitted
model_loglin
, - compute fitted values for every pair of independence model
- implement with
purrr::map()
andplyr::join_all()
Frow now on, this document will show how these functions can be applied.
- call
tidyverse
andbroom
libraries, and the other functions withr/_common.R
- fit with
model_loglin()
- goodness-of-fit with
good_loglin()
- choosing the best model based on the goodness-of-fit statistic
- compute fitted values with
fitted_val
# install.packages("httr")
library(httr)
git_api <- GET("https://api.github.com/repos/ygeunkim/loglinear3/git/trees/master?recursive=1")
stop_for_status(git_api)
repo_list <- unlist(lapply(content(git_api)$tree, "[", "path"), use.names = FALSE)
# R files in r folder -----------------------------
repo_list <- stringr::str_subset(repo_list, pattern = "^r/")
repo_list <- stringr::str_c("https://raw.githubusercontent.com/ygeunkim/loglinear3/master/", repo_list)
# source every file
sapply(repo_list, source)
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/_common.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/fitted_val.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/goodness_fit.R
#> value ?
#> visible FALSE
#> https://raw.githubusercontent.com/ygeunkim/loglinear3/master/r/model_threeway.R
#> value ?
#> visible FALSE
tidyverse
and broom
are loaded. Here we should modify tidy
function of tidymodels (2018). tidy.anova()
is not defined for
anova.glm
, so we should add more elements in renamers
in the
function. Using the original tidy
gives warning message
.
renamers <- c(
...
"Ref.df" = "ref.df",
"Deviance" = "deviance", # glm object
"Resid. Df" = "resid.df", # glm object
"Resid. Dev" = "resid.dev" # glm object
)
In addition, find_xname()
function would be used later to construct
the various independence models.
Data from Agresti (2012) would be used.
refers to a survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year of high school in a nonurban area near Dayton, Ohio, whether they had ever used alcohol, cigarettes, or maijuana.
(substance <-
read_delim("data/Substance_use.dat", delim = " ") %>%
mutate(alcohol = str_trim(alcohol)) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.factor, fct_rev))
#> # A tibble: 8 x 4
#> alcohol cigarettes marijuana count
#> <fct> <fct> <fct> <dbl>
#> 1 yes yes yes 911
#> 2 yes yes no 538
#> 3 yes no yes 44
#> 4 yes no no 456
#> 5 no yes yes 3
#> 6 no yes no 43
#> 7 no no yes 2
#> 8 no no no 279
Long data format as above is easy to fit loglinear model. Against wide
format, i.e. contingency table, try tidyr::gather()
.
For example, look at the below table.
(aids <- read_table("data/AIDS.dat"))
#> # A tibble: 4 x 4
#> race azt yes no
#> <chr> <chr> <dbl> <dbl>
#> 1 white yes 14 93
#> 2 white no 32 81
#> 3 black yes 11 52
#> 4 black no 12 43
aids %>%
gather(yes, no, key = symptom, value = count)
#> # A tibble: 8 x 4
#> race azt symptom count
#> <chr> <chr> <chr> <dbl>
#> 1 white yes yes 14
#> 2 white no yes 32
#> 3 black yes yes 11
#> 4 black no yes 12
#> 5 white yes no 93
#> 6 white no no 81
#> 7 black yes no 52
#> 8 black no no 43
Finally, we get the long data.
We can write every formula in hand. However, it is annoying.
(subs_hierarchy <-
substance %>%
do(
indep = glm(count ~ alcohol + cigarettes + marijuana, data = ., family = poisson()),
ac_m = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes,
data = ., family = poisson()),
amcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acamcm = glm(count ~ alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana,
data = ., family = poisson()),
acm = glm(count ~ alcohol * cigarettes * marijuana,
data = ., family = poisson())
))
#> # A tibble: 1 x 5
#> indep ac_m amcm acamcm acm
#> <list> <list> <list> <list> <list>
#> 1 <glm> <glm> <glm> <glm> <glm>
Function can be defined for more general usage. See r/model_threeway.R
model_loglin()
fits loglinear model for every pair of independence
model.
.data
: datayname
: character, response variableglm_fam
: random component of glm, option forfamily
ofglm()
. Since we are fitting loglinear model,poisson()
is set to be default.
It returns tibble
with list element.
(subs_hierarchy <- model_loglin(substance, yname = "count", glm_fam = poisson()))
#> # A tibble: 1 x 9
#> indep joint1 joint2 joint3 conditional1 conditional2 conditional3 homogen
#> <lis> <list> <list> <list> <list> <list> <list> <list>
#> 1 <glm> <glm> <glm> <glm> <glm> <glm> <glm> <glm>
#> # … with 1 more variable: threefac <list>
Each list has glm
object with [[1]]
.
subs_hierarchy$indep[[1]]
#>
#> Call: glm(formula = mutual, family = glm_fam, data = .)
#>
#> Coefficients:
#> (Intercept) alcoholno cigarettesno marijuanano
#> 6.292 -1.785 -0.649 0.315
#>
#> Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
#> Null Deviance: 2850
#> Residual Deviance: 1290 AIC: 1340
with residual df = the number of cell count - the number of non-redundant parameters
good_loglin()
calculates the above statistic for given option test
.
test = "LRT"
option of anova.glm()
produces
. test = Chisq
gives . As
noticed, this function is designed to be applied with purrr::map()
and dplyr::bind_rows()
.
(subs_good <-
subs_hierarchy %>%
map(good_loglin, test = "LRT") %>%
bind_rows()) %>%
pander::pander()
model | df | resid.df | resid.dev |
---|---|---|---|
alcohol + cigarettes + marijuana | 1 | 4 | 1286 |
alcohol + cigarettes + marijuana + cigarettes:marijuana | 1 | 3 | 534.2 |
alcohol + cigarettes + marijuana + alcohol:marijuana | 1 | 3 | 939.6 |
alcohol + cigarettes + marijuana + alcohol:cigarettes | 1 | 3 | 843.8 |
alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana | 1 | 2 | 187.8 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | 1 | 2 | 92.02 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana | 1 | 2 | 497.4 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | 1 | 1 | 0.374 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana | 1 | 0 | 0 |
From , we compare reduced model to complex model
subs_good %>%
rename(alternative = model) %>%
group_by(resid.df) %>%
slice(which.min(resid.dev)) %>%
arrange(resid.df) %>%
mutate(
goodness = c(resid.dev[1], -diff(resid.dev)),
df_good = c(resid.df[1], -diff(resid.df))
) %>%
mutate(p_value = pchisq(goodness, df = df_good, lower.tail = FALSE)) %>%
pander::pander()
alternative | df | resid.df | resid.dev | goodness |
---|---|---|---|---|
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana | 1 | 0 | 0 | 0 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | 1 | 1 | 0.374 | 0.374 |
alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | 1 | 2 | 92.02 | 92.02 |
alcohol + cigarettes + marijuana + cigarettes:marijuana | 1 | 3 | 534.2 | 534.2 |
alcohol + cigarettes + marijuana | 1 | 4 | 1286 | 1286 |
Table continues below
df_good | p_value |
---|---|
0 | 1 |
1 | 0.541 |
2 | 0 |
3 | 0 |
4 | 0 |
(AC, AM, CM)
vs saturated model(ACM)
: cannot reject with p-value0.5408
, so we choose(AC, AM, CM)
(A, CM)
vs(AC, AM, CM)
: reject
Thus, we use the model (AC, AM, CM)
fit_loglin()
computes fitted values for each independent model. Use
this function with purrr::map()
and plyr::join_all()
. In
plyr::join_all()
, every variable name including response is written.
subs_hierarchy %>%
map(fit_loglin) %>%
plyr::join_all(by = c("count", "alcohol", "cigarettes", "marijuana")) %>%
pander::pander()
count | alcohol | cigarettes | marijuana | alcohol + cigarettes + marijuana |
---|---|---|---|---|
911 | yes | yes | yes | 540 |
538 | yes | yes | no | 740.2 |
44 | yes | no | yes | 282.1 |
456 | yes | no | no | 386.7 |
3 | no | yes | yes | 90.6 |
43 | no | yes | no | 124.2 |
2 | no | no | yes | 47.33 |
279 | no | no | no | 64.88 |
Table continues below
alcohol + cigarettes + marijuana + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:marijuana |
---|---|
782.7 | 627.3 |
497.5 | 652.9 |
39.39 | 327.7 |
629.4 | 341.1 |
131.3 | 3.284 |
83.47 | 211.5 |
6.609 | 1.716 |
105.6 | 110.5 |
Table continues below
alcohol + cigarettes + marijuana + alcohol:cigarettes | alcohol + cigarettes + marijuana + alcohol:marijuana + cigarettes:marijuana |
---|---|
611.2 | 909.2 |
837.8 | 438.8 |
210.9 | 45.76 |
289.1 | 555.2 |
19.4 | 4.76 |
26.6 | 142.2 |
118.5 | 0.24 |
162.5 | 179.8 |
Table continues below
alcohol + cigarettes + marijuana + alcohol:cigarettes + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana |
---|---|
885.9 | 710 |
563.1 | 739 |
29.45 | 245 |
470.6 | 255 |
28.12 | 0.703 |
17.88 | 45.3 |
16.55 | 4.297 |
264.4 | 276.7 |
Table continues below
alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana | alcohol + cigarettes + marijuana + alcohol:cigarettes + alcohol:marijuana + cigarettes:marijuana + alcohol:cigarettes:marijuana |
---|---|
910.4 | 911 |
538.6 | 538 |
44.62 | 44 |
455.4 | 456 |
3.617 | 3 |
42.38 | 43 |
1.383 | 2 |
279.6 | 279 |
Agresti, Alan. 2012. Categorical Data Analysis. 3rd ed. Wiley.
tidymodels. 2018. “Stats-Anova-tidiers.R.” https://github.com/tidymodels/broom/blob/master/R/stats-anova-tidiers.R.