forked from SurgicalInformatics/healthyr_book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08_working_categorical.Rmd
599 lines (478 loc) · 23.5 KB
/
08_working_categorical.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
# Working with categorical outcome variables {#chap08-h1}
\index{categorical data@\textbf{categorical data}}
> Suddenly Christopher Robin began to tell Pooh about some of the things: People called Kings and Queens and something called Factors ... and Pooh said "Oh!" and thought how wonderful it would be to have a Real Brain which could tell you things.
> A.A. Milne, *The House at Pooh Corner* (1928)
## Factors
\index{factors}
```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
mykable = function(x, caption = "CAPTION", ...){
kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = caption,
linesep = c("", "", "\\addlinespace"), ...) %>%
kable_styling(latex_options = c("scale_down", "hold_position"))
}
```
We said earlier that continuous data can be measured and categorical data can be counted, which is useful to remember.
Categorical data can be a:
* Factor
+ a fixed set of names/strings or numbers
+ these may have an inherent order (1st, 2nd 3rd) - ordinal factor
+ or may not (female, male)
* Character
+ sequences of letters, numbers, or symbols
* Logical
+ containing only TRUE or FALSE
Health data is awash with factors.
Whether it is outcomes like death, recurrence, or readmission.
Or predictors like cancer stage, deprivation quintile, or smoking status.
It is essential therefore to be comfortable manipulating factors and dealing with outcomes which are categorical.
## The Question
We will use the classic “Survival from Malignant Melanoma” dataset which is included in the **boot** package.
The data consist of measurements made on patients with malignant melanoma, a type of skin cancer.
Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark, between 1962 and 1977.
For the purposes of this discussion, we are interested in the association between tumour ulceration and death from melanoma.
## Get the data
The Help page (F1 on `boot::melanoma`) gives us its data dictionary including the definition of each variable and the coding used.
```{r, message = F}
meldata <- boot::melanoma
```
## Check the data
As always, check any new dataset carefully before you start analysis.
```{r message=FALSE}
library(tidyverse)
library(finalfit)
theme_set(theme_bw())
meldata %>% glimpse()
meldata %>% ff_glimpse()
```
As can be seen, all of the variables are currently coded as continuous/numeric.
The `<dbl>` stands for 'double', meaning numeric which comes from 'double-precision floating point', an awkward computer science term.
## Recode the data {#chap08-recode}
It is really important that variables are correctly coded for all plotting and analysis functions.
Using the data dictionary, we will convert the categorical variables to factors.
In the section below, we convert the continuous variables to `factors` (e.g., `sex %>% factor() %>% `), then use the **forcats** package to recode the factor levels.
Modern databases (such as REDCap) can give you an R script to recode your specific dataset.
This means you don't always have to recode your factors from numbers to names manually.
But you will always be recoding variables during the exploration and analysis stages too, so it is important to follow what is happening here.
```{r}
meldata <- meldata %>%
mutate(sex.factor = # Make new variable
sex %>% # from existing variable
factor() %>% # convert to factor
fct_recode( # forcats function
"Female" = "0", # new on left, old on right
"Male" = "1") %>%
ff_label("Sex"), # Optional label for finalfit
# same thing but more condensed code:
ulcer.factor = factor(ulcer) %>%
fct_recode("Present" = "1",
"Absent" = "0") %>%
ff_label("Ulcerated tumour"),
status.factor = factor(status) %>%
fct_recode("Died melanoma" = "1",
"Alive" = "2",
"Died - other causes" = "3") %>%
ff_label("Status"))
```
We have formatted the recode of the `sex` variables to be on multiple lines - to make it easier for you to see the exact steps included.
We have condensed for the other recodes (e.g., `ulcer.factor = factor(ulcer) %>%`), but it does the exact same thing as the first one.
\index{functions@\textbf{functions}!factor}
\index{functions@\textbf{functions}!fct\_recode}
\index{functions@\textbf{functions}!ff\_label}
## Should I convert a continuous variable to a categorical variable?
\index{categorical data@\textbf{categorical data}!convert from continuous}
This is a common question and something which is frequently done.
Take for instance the variable age.
Is it better to leave it as a continuous variable, or to chop it into categories, e.g., 30 to 39 etc.?
The clear disadvantage in doing this is that information is being thrown away.
Which feels like a bad thing to be doing.
This is particularly important if the categories being created are large.
For instance, if age was dichotomised to "young" and "old" at say 42 years (the current median age in Europe), then it is likely that relevant information to a number of analyses has been discarded.
Secondly, it is unforgivable practice to repeatedly try different cuts of a continuous variable to obtain a statistically significant result.
This is most commonly done in tests of diagnostic accuracy, where a threshold for considering a continuous test result positive is chosen *post hoc* to maximise sensitivity/specificity, but not then validated in an independent cohort.
But there are also advantages to converting a continuous variable to categorical.
Say the relationship between age and an outcome is not linear, but rather u-shaped, then fitting a regression line is more difficult.
If age is cut into 10-year bands and entered into a regression as a factor, then this non-linearity is already accounted for.
Secondly, when communicating the results of an analysis to a lay audience, it may be easier to use a categorical representation.
For instance, an odds of death 1.8 times greater in 70-year-olds compared with 40-year-olds may be easier to grasp than a 1.02 times increase per year.
So what is the answer?
Do not do it unless you have to.
Plot and understand the continuous variable first.
If you do it, try not to throw away too much information.
Repeat your analyses both with the continuous data and categorical data to ensure there is no difference in the conclusion (often called a sensitivity analysis).
```{r fig.height=3, fig.width=4, warning=FALSE}
# Summary of age
meldata$age %>%
summary()
meldata %>%
ggplot(aes(x = age)) +
geom_histogram()
```
There are different ways in which a continuous variable can be converted to a factor.
You may wish to create a number of intervals of equal length.
The `cut()` function can be used for this.
Figure \@ref(fig:chap08-fig-cut) illustrates different options for this.
We suggest not using the `label` option of the `cut()` function to avoid errors, should the underlying data change or when the code is copied and reused.
A better practice is to recode the levels using `fct_recode` as above.
The intervals in the output are standard mathematical notation.
A square bracket indicates the value is included in the interval and a round bracket that the value is excluded.
Note the requirement for `include.lowest = TRUE` when you specify breaks yourself and the lowest cut-point is also the lowest data value.
This should be clear in Figure \@ref(fig:chap08-fig-cut).
```{r chap08-fig-cut, echo = FALSE, fig.cap="`Cut` a continuous variable into a categorical variable."}
knitr::include_graphics("images/chapter08/1_cut.png", auto_pdf = TRUE)
```
### Equal intervals vs quantiles
\index{categorical data@\textbf{categorical data}!quantiles}
Be clear in your head whether you wish to cut the data so the intervals are of equal length.
Or whether you wish to cut the data so there are equal proportions of cases (patients) in each level.
<!-- Option for below -->
<!-- # meldata$age.factor %>% -->
<!-- # fct_count() -->
Equal intervals:
```{r}
meldata <- meldata %>%
mutate(
age.factor =
age %>%
cut(4)
)
meldata$age.factor %>%
summary()
```
\index{functions@\textbf{functions}!cut}
Quantiles:
```{r}
meldata <- meldata %>%
mutate(
age.factor =
age %>%
Hmisc::cut2(g=4) # Note, cut2 comes from the Hmisc package
)
meldata$age.factor %>%
summary()
```
\index{functions@\textbf{functions}!quantile}
Using the cut function, a continuous variable can be converted to a categorical one:
```{r}
meldata <- meldata %>%
mutate(
age.factor =
age %>%
cut(breaks = c(4,20,40,60,95), include.lowest = TRUE) %>%
fct_recode(
"≤20" = "[4,20]",
"21 to 40" = "(20,40]",
"41 to 60" = "(40,60]",
">60" = "(60,95]"
) %>%
ff_label("Age (years)")
)
head(meldata$age.factor)
```
## Plot the data
We are interested in the association between tumour ulceration and death from melanoma.
To start then, we simply count the number of patients with ulcerated tumours who died.
It is useful to plot this as counts but also as proportions.
It is proportions you are comparing, but you really want to know the absolute numbers as well.
```{r, fig.width=7, fig.height=3, fig.cap="Bar chart: Outcome after surgery for patients with ulcerated melanoma."}
p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar() +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p2
```
It should be obvious that more died from melanoma in the ulcerated tumour group compared with the non-ulcerated tumour group.
The stacking is orders from top to bottom by default.
This can be easily adjusted by changing the order of the levels within the factor (see re-levelling below).
This default order works well for binary variables - the "yes" or "1" is lowest and can be easily compared.
This ordering of this particular variable is unusual - it would be more common to have for instance `alive = 0`, `died = 1`.
One quick option is to just reverse the order of the levels in the plot.
```{r fig.width=7, fig.height=3, fig.cap="Bar chart: Outcome after surgery for patients with ulcerated melanoma, reversed levels."}
p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = position_stack(reverse = TRUE)) +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill = status.factor)) +
geom_bar(position = position_fill(reverse = TRUE)) +
ylab("proportion")
library(patchwork)
p1 + p2
```
Just from the plot then, death from melanoma in the ulcerated tumour group is around 40% and in the non-ulcerated group around 13%.
The number of patients included in the study is not huge, however, this still looks like a real difference given its effect size.
We may also be interested in exploring potential effect modification, interactions and confounders.
Again, we urge you to first visualise these, rather than going straight to a model.
```{r, fig.width=6, fig.height=6, fig.cap="Facetted bar plot: Outcome after surgery for patients with ulcerated melanoma aggregated by sex and age."}
p1 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill=status.factor)) +
geom_bar(position = position_stack(reverse = TRUE)) +
facet_grid(sex.factor ~ age.factor) +
theme(legend.position = "none")
p2 <- meldata %>%
ggplot(aes(x = ulcer.factor, fill=status.factor)) +
geom_bar(position = position_fill(reverse = TRUE)) +
facet_grid(sex.factor ~ age.factor)+
theme(legend.position = "bottom")
p1 / p2
```
## Group factor levels together - `fct_collapse()`
\index{functions@\textbf{functions}!fct\_collapse}
Our question relates to the association between tumour ulceration and death from melanoma.
The outcome measure has three levels as can be seen.
For our purposes here, we will generate a disease-specific mortality variable (`status_dss`), by combining "Died - other causes" and "Alive".
```{r}
meldata <- meldata %>%
mutate(
status_dss = fct_collapse(
status.factor,
"Alive" = c("Alive", "Died - other causes"))
)
```
## Change the order of values within a factor - `fct_relevel()` {#chap08-h2-fct-relevel}
\index{functions@\textbf{functions}!fct\_relevel}
The default order for levels with `factor()` is alphabetical.
We often want to reorder the levels in a factor when plotting, or when performing a regression analysis and we want to specify the reference level.
The order can be checked using `levels()`.
```{r}
# dss - disease specific survival
meldata$status_dss %>% levels()
```
The reason "Alive" is second, rather than alphabetical, is it was recoded from "2" and that order was retained.
If, however, we want to make comparisons relative to "Alive", we need to move it to the front by using `fct_relevel()`.
```{r}
meldata <- meldata %>%
mutate(status_dss = status_dss %>%
fct_relevel("Alive")
)
```
Any number of factor levels can be specified in `fct_relevel()`.
## Summarising factors with `finalfit`
\index{categorical data@\textbf{categorical data}!finalfit}
\index{categorical data@\textbf{categorical data}!summarising}
Our own **finalfit** package provides convenient functions to summarise and compare factors, producing final tables for publication.
```{r eval=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor")
```
```{r chap08-tab-2x2, echo=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor") %>%
mykable(caption = "Two-by-two table with finalfit: Died with melanoma by tumour ulceration status.")
```
`finalfit` is useful for summarising multiple variables.
We often want to summarise more than one factor or continuous variable against our `dependent` variable of interest.
Think of Table 1 in a journal article.
Any number of continuous or categorical explanatory variables can be added.
```{r eval=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness")
)
```
```{r echo=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness")
) %>%
mykable(caption = "Multiple variables by outcome: Outcome after surgery for melanoma by patient and disease factors.")
```
## Pearson's chi-squared and Fisher's exact tests
\index{categorical data@\textbf{categorical data}!chi-squared test}
\index{categorical data@\textbf{categorical data}!Fisher's exact test}
\index{chi-squared test}
\index{Fisher's exact test}
Pearson's chi-squared ($\chi^2$) test of independence is used to determine whether two categorical variables are independent in a given population.
Independence here means that the relative frequencies of one variable are the same over all levels of another variable.
A common setting for this is the classic 2x2 table.
This refers to two categorical variables with exactly two levels each, such as is show in Table \@ref(tab:chap08-tab-2x2) above.
The null hypothesis of independence for this particular question is no difference in the proportion of patients with ulcerated tumours who die (45.6%) compared with non-ulcerated tumours (13.9%).
From the raw frequencies, there seems to be a large difference, as we noted in the plot we made above.
### Base R
Base R has reliable functions for all common statistical tests, but they are sometimes a little inconvenient to extract results from.
A table of counts can be constructed, either using the `$` to identify columns, or using the `with()` function.
```{r eval=FALSE}
table(meldata$ulcer.factor, meldata$status_dss) # both give same result
with(meldata, table(ulcer.factor, status_dss))
```
```{r echo=FALSE}
table(meldata$ulcer.factor, meldata$status_dss)
```
When working with older R functions, a useful shortcut is the `exposition pipe-operator` (`%$%`) from the **magrittr** package, home of the standard forward pipe-operator (`%>%`).
The exposition pipe-operator exposes data frame/tibble columns on the left to the function which follows on the right.
It's easier to see in action by making a table of counts.
```{r message=FALSE}
library(magrittr)
meldata %$% # note $ sign here
table(ulcer.factor, status_dss)
```
The counts table can be passed to `prop.table()` for proportions.
```{r}
meldata %$%
table(ulcer.factor, status_dss) %>%
prop.table(margin = 1) # 1: row, 2: column etc.
```
Similarly, the counts table can be passed to `chisq.test()` to perform the chi-squared test.
```{r}
meldata %$% # note $ sign here
table(ulcer.factor, status_dss) %>%
chisq.test()
```
\index{functions@\textbf{functions}!chisq.test}
The result can be extracted into a tibble using the `tidy()` function from the **broom** package.
```{r}
library(broom)
meldata %$% # note $ sign here
table(ulcer.factor, status_dss) %>%
chisq.test() %>%
tidy()
```
The `chisq.test()` function applies the Yates' continuity correction by default.
The standard interpretation assumes that the discrete probability of observed counts in the table can be approximated by the continuous chi-squared distribution.
This introduces some error.
The correction involves subtracting 0.5 from the absolute difference between each observed and expected value.
This is particularly helpful when counts are low, but can be removed if desired by `chisq.test(..., correct = FALSE)`.
## Fisher's exact test
A commonly stated assumption of the chi-squared test is the requirement to have an expected count of at least 5 in each cell of the 2x2 table.
For larger tables, all expected counts should be $>1$ and no more than 20% of all cells should have expected counts $<5$.
If this assumption is not fulfilled, an alternative test is Fisher's exact test.
For instance, if we are testing across a 2x4 table created from our `age.factor` variable and `status_dss`, then we receive a warning.
```{r}
meldata %$% # note $ sign here
table(age.factor, status_dss) %>%
chisq.test()
```
Switch to Fisher's exact test
```{r}
meldata %$% # note $ sign here
table(age.factor, status_dss) %>%
fisher.test()
```
\index{functions@\textbf{functions}!fisher.test}
## Chi-squared / Fisher's exact test using finalfit
It is easier using the `summary_factorlist()` function from the **finalfit** package.
Including `p = TRUE` in `summary_factorlist()` adds a hypothesis test to each included comparison.
This defaults to chi-squared tests with a continuity correction for categorical variables.
```{r eval=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor",
p = TRUE)
```
\index{functions@\textbf{functions}!summary\_factorlist}
```{r echo=FALSE}
library(finalfit)
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory = "ulcer.factor",
p = TRUE) %>%
mykable(caption = "Two-by-two table with chi-squared test using final fit: Outcome after surgery for melanoma by tumour ulceration status.")
```
Adding further variables:
```{r eval=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE)
```
```{r echo=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE) %>%
mykable(caption = "Multiple variables by outcome with hypothesis tests: Outcome after surgery for melanoma by patient and disease factors (chi-squared test).")
```
Note that for continuous expanatory variables, an F-test (ANOVA) is performed by default.
If variables are considered non-parametric (`cont = "mean"`), then a Kruskal-Wallis test is used.
Switch to Fisher's exact test:
```{r eval=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher")
```
```{r echo=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher") %>%
mykable(caption = "Multiple variables by outcome with hypothesis tests: Outcome after surgery for melanoma by patient and disease factors (Fisher's exact test).")
```
Further options can be included:
```{r eval=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher",
digits =
c(1, 1, 4, 2), #1: mean/median, 2: SD/IQR
# 3: p-value, 4: count percentage
na_include = TRUE, # include missing in results/test
add_dependent_label = TRUE
)
```
```{r echo=FALSE}
meldata %>%
summary_factorlist(dependent = "status_dss",
explanatory =
c("ulcer.factor", "age.factor",
"sex.factor", "thickness"),
p = TRUE,
p_cat = "fisher",
digits =
c(1, 1, 4, 2), #1: mean/median, 2: SD/IQR
#3: p-value, 4: count percentage
na_include = TRUE, # include missing in table and tests
add_dependent_label = TRUE
) %>%
mykable(caption = "Multiple variables by outcome with hypothesis tests: Options including missing data, rounding, and labels.")
```
## Exercises
### Exercise {#chap08-ex1}
Using `finalfit`, create a summary table with "status.factor" as the dependent variable and the following as explanatory variables: `sex.factor, ulcer.factor, age.factor, thickness`.
Change the continuous variable summary statistic to `median` and `interquartile range` instead of `mean` and `sd`.
### Exercise {#chap08-ex2}
By changing one and only one line in the following block create firstly a new table showing the breakdown of `status.factor` by age and secondly the breakdown of `status.factor` by sex:
```{r, eval=FALSE}
meldata %>%
count(ulcer.factor, status.factor) %>%
group_by(status.factor) %>%
mutate(total = sum(n)) %>%
mutate(percentage = round(100*n/total, 1)) %>%
mutate(count_perc = paste0(n, " (", percentage, ")")) %>%
select(-total, -n, -percentage) %>%
spread(status.factor, count_perc)
```
### Exercise {#chap08-ex3}
Now produce these tables using the `summary_factorlist()` function from the **finalfit** package.