08_working_categorical.Rmd

# Working with categorical outcome variables {#chap08-h1}
\index{categorical data@\textbf{categorical data}}

> Suddenly Christopher Robin began to tell Pooh about some of the things: People called Kings and Queens and something called Factors ... and Pooh said "Oh!" and thought how wonderful it would be to have a Real Brain which could tell you things.  
> A.A. Milne, *The House at Pooh Corner* (1928)

## Factors
\index{factors}

```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
mykable = function(x, caption = "CAPTION", ...){
  kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"), 
        booktabs = TRUE, caption = caption, 
        linesep = c("", "", "\\addlinespace"), ...) %>%
    kable_styling(latex_options = c("scale_down", "hold_position"))
}
```

We said earlier that continuous data can be measured and categorical data can be counted, which is useful to remember. 
Categorical data can be a:

* Factor
  + a fixed set of names/strings or numbers
  + these may have an inherent order (1st, 2nd 3rd) - ordinal factor
  + or may not (female, male) 
* Character
  + sequences of letters, numbers, or symbols
* Logical
  + containing only TRUE or FALSE

Health data is awash with factors. 
Whether it is outcomes like death, recurrence, or readmission. 
Or predictors like cancer stage, deprivation quintile, or smoking status.
It is essential therefore to be comfortable manipulating factors and dealing with outcomes which are categorical.

## The Question

We will use the classic “Survival from Malignant Melanoma” dataset which is included in the **boot** package. 
The data consist of measurements made on patients with malignant melanoma, a type of skin cancer. 
Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark, between 1962 and 1977.

For the purposes of this discussion, we are interested in the association between tumour ulceration and death from melanoma. 

## Get the data

The Help page (F1 on `boot::melanoma`) gives us its data dictionary including the definition of each variable and the coding used.

```{r, message = F}
meldata <- boot::melanoma
```

## Check the data
As always, check any new dataset carefully before you start analysis.  

```{r message=FALSE}
library(tidyverse)
library(finalfit)
theme_set(theme_bw())
meldata %>% glimpse()
meldata %>% ff_glimpse()
```

As can be seen, all of the variables are currently coded as continuous/numeric.
The `<dbl>` stands for 'double', meaning numeric which comes from 'double-precision floating point', an awkward computer science term. 

## Recode the data {#chap08-recode}
It is really important that variables are correctly coded for all plotting and analysis functions. 
Using the data dictionary, we will convert the categorical variables to factors. 

In the section below, we convert the continuous variables to `factors` (e.g., `sex %>% factor() %>% `), then use the **forcats** package to recode the factor levels. 
Modern databases (such as REDCap) can give you an R script to recode your specific dataset.
This means you don't always have to recode your factors from numbers to names manually.
But you will always be recoding variables during the exploration and analysis stages too, so it is important to follow what is happening here.

```{r}
meldata <- meldata %>% 
  mutate(sex.factor =             # Make new variable  
           sex %>%                # from existing variable
           factor() %>%           # convert to factor
           fct_recode(            # forcats function
             "Female" = "0",      # new on left, old on right
             "Male"   = "1") %>% 
           ff_label("Sex"),       # Optional label for finalfit
         
         # same thing but more condensed code:
         ulcer.factor = factor(ulcer) %>% 
           fct_recode("Present" = "1",
                      "Absent"  = "0") %>% 
           ff_label("Ulcerated tumour"),
         
         status.factor = factor(status) %>% 
           fct_recode("Died melanoma"       = "1",
                      "Alive"               = "2",
                      "Died - other causes" = "3") %>% 
           ff_label("Status"))
```

We have formatted the recode of the `sex` variables to be on multiple lines - to make it easier for you to see the exact steps included.
We have condensed for the other recodes (e.g., `ulcer.factor = factor(ulcer) %>%`), but it does the exact same thing as the first one.

\index{functions@\textbf{functions}!factor}
\index{functions@\textbf{functions}!fct\_recode}
\index{functions@\textbf{functions}!ff\_label}

## Should I convert a continuous variable to a categorical variable?
\index{categorical data@\textbf{categorical data}!convert from continuous}

This is a common question and something which is frequently done. 
Take for instance the variable age. 
Is it better to leave it as a continuous variable, or to chop it into categories, e.g., 30 to 39 etc.?

The clear disadvantage in doing this is that information is being thrown away. 
Which feels like a bad thing to be doing. 
This is particularly important if the categories being created are large. 

For instance, if age was dichotomised to "young" and "old" at say 42 years (the current median age in Europe), then it is likely that relevant information to a number of analyses has been discarded. 

Secondly, it is unforgivable practice to repeatedly try different cuts of a continuous variable to obtain a statistically significant result. 
This is most commonly done in tests of diagnostic accuracy, where a threshold for considering a continuous test result positive is chosen *post hoc* to maximise sensitivity/specificity, but not then validated in an independent cohort.

But there are also advantages to converting a continuous variable to categorical. 
Say the relationship between age and an outcome is not linear, but rather u-shaped, then fitting a regression line is more difficult.
If age is cut into 10-year bands and entered into a regression as a factor, then this non-linearity is already accounted for. 

Secondly, when communicating the results of an analysis to a lay audience, it may be easier to use a categorical representation. 
For instance, an odds of death 1.8 times greater in 70-year-olds compared with 40-year-olds may be easier to grasp than a 1.02 times increase per year. 

So what is the answer? 
Do not do it unless you have to. 
Plot and understand the continuous variable first. 
If you do it, try not to throw away too much information. 
Repeat your analyses both with the continuous data and categorical data to ensure there is no difference in the conclusion (often called a sensitivity analysis). 

```{r fig.height=3, fig.width=4, warning=FALSE}
# Summary of age
meldata$age %>% 
  summary()

meldata %>% 
  ggplot(aes(x = age)) + 
  geom_histogram()
```

There are different ways in which a continuous variable can be converted to a factor. 
You may wish to create a number of intervals of equal length. 
The `cut()` function can be used for this. 

Figure \@ref(fig:chap08-fig-cut) illustrates different options for this. 
We suggest not using the `label` option of the `cut()` function to avoid errors, should the underlying data change or when the code is copied and reused. 
A better practice is to recode the levels using `fct_recode` as above. 

The intervals in the output are standard mathematical notation. 
A square bracket indicates the value is included in the interval and a round bracket that the value is excluded.

Note the requirement for `include.lowest = TRUE` when you specify breaks yourself and the lowest cut-point is also the lowest data value. 
This should be clear in Figure \@ref(fig:chap08-fig-cut).

```{r chap08-fig-cut, echo = FALSE, fig.cap="`Cut` a continuous variable into a categorical variable."}
knitr::include_graphics("images/chapter08/1_cut.png", auto_pdf = TRUE)
```

### Equal intervals vs quantiles
\index{categorical data@\textbf{categorical data}!quantiles}

Be clear in your head whether you wish to cut the data so the intervals are of equal length. 
Or whether you wish to cut the data so there are equal proportions of cases (patients) in each level. 

<!-- Option for below -->
<!-- # meldata$age.factor %>%  -->
<!-- #  fct_count() -->

Equal intervals:
```{r}
meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      cut(4)
  )
meldata$age.factor %>%
  summary()
```

\index{functions@\textbf{functions}!cut}

Quantiles:
```{r}
meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      Hmisc::cut2(g=4) # Note, cut2 comes from the Hmisc package
  )
meldata$age.factor %>% 
  summary()
```

\index{functions@\textbf{functions}!quantile}

Using the cut function, a continuous variable can be converted to a categorical one:

```{r}
meldata <- meldata %>% 
  mutate(
    age.factor = 
      age %>%
      cut(breaks = c(4,20,40,60,95), include.lowest = TRUE) %>% 
      fct_recode(
        "≤20"      =  "[4,20]",
        "21 to 40" = "(20,40]",
        "41 to 60" = "(40,60]",
        ">60"      = "(60,95]"
      ) %>% 
      ff_label("Age (years)")
  )
head(meldata$age.factor)
```

## Plot the data

We are interested in the association between tumour ulceration and death from melanoma. 
To start then, we simply count the number of patients with ulcerated tumours who died. 
It is useful to plot this as counts but also as proportions. 
It is proportions you are comparing, but you really want to know the absolute numbers as well. 

```{r, fig.width=7, fig.height=3, fig.cap="Bar chart: Outcome after surgery for patients with ulcerated melanoma."}
p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar() + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")

library(patchwork)
p1 + p2
```

It should be obvious that more died from melanoma in the ulcerated tumour group compared with the non-ulcerated tumour group.
The stacking is orders from top to bottom by default. 
This can be easily adjusted by changing the order of the levels within the factor (see re-levelling below).
This default order works well for binary variables - the "yes" or "1" is lowest and can be easily compared. 
This ordering of this particular variable is unusual - it would be more common to have for instance `alive = 0`, `died = 1`. 
One quick option is to just reverse the order of the levels in the plot. 

```{r fig.width=7, fig.height=3, fig.cap="Bar chart: Outcome after surgery for patients with ulcerated melanoma, reversed levels."}
p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = position_stack(reverse = TRUE)) + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill = status.factor)) + 
  geom_bar(position = position_fill(reverse = TRUE)) + 
  ylab("proportion")

library(patchwork)
p1 + p2
```

Just from the plot then, death from melanoma in the ulcerated tumour group is around 40% and in the non-ulcerated group around 13%. 
The number of patients included in the study is not huge, however, this still looks like a real difference given its effect size.

We may also be interested in exploring potential effect modification, interactions and confounders.
Again, we urge you to first visualise these, rather than going straight to a model. 

```{r, fig.width=6, fig.height=6, fig.cap="Facetted bar plot: Outcome after surgery for patients with ulcerated melanoma aggregated by sex and age."}
p1 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill=status.factor)) + 
  geom_bar(position = position_stack(reverse = TRUE)) +
  facet_grid(sex.factor ~ age.factor) + 
  theme(legend.position = "none")

p2 <- meldata %>% 
  ggplot(aes(x = ulcer.factor, fill=status.factor)) + 
  geom_bar(position = position_fill(reverse = TRUE)) +
  facet_grid(sex.factor ~ age.factor)+ 
  theme(legend.position = "bottom")

p1 / p2
```

## Group factor levels together - `fct_collapse()`
\index{functions@\textbf{functions}!fct\_collapse}

Our question relates to the association between tumour ulceration and death from melanoma. 
The outcome measure has three levels as can be seen. 
For our purposes here, we will generate a disease-specific mortality variable (`status_dss`), by combining "Died - other causes" and "Alive". 

```{r}
meldata <- meldata %>%
  mutate(
    status_dss = fct_collapse(
      status.factor,
      "Alive" = c("Alive", "Died - other causes"))
  )
```

## Change the order of values within a factor - `fct_relevel()` {#chap08-h2-fct-relevel}
\index{functions@\textbf{functions}!fct\_relevel}

The default order for levels with `factor()` is alphabetical. 
We often want to reorder the levels in a factor when plotting, or when performing a regression analysis and we want to specify the reference level.  

The order can be checked using `levels()`.

```{r}
# dss - disease specific survival
meldata$status_dss %>% levels()
```

The reason "Alive" is second, rather than alphabetical, is it was recoded from "2" and that order was retained.
If, however, we want to make comparisons relative to "Alive", we need to move it to the front by using `fct_relevel()`.

```{r}
meldata <- meldata %>% 
  mutate(status_dss = status_dss %>%
           fct_relevel("Alive")
         )
```

Any number of factor levels can be specified in `fct_relevel()`. 

## Summarising factors with `finalfit`
\index{categorical data@\textbf{categorical data}!finalfit}
\index{categorical data@\textbf{categorical data}!summarising}

Our own **finalfit** package provides convenient functions to summarise and compare factors, producing final tables for publication. 

```{r eval=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor")
```

```{r chap08-tab-2x2, echo=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor") %>% 
  mykable(caption = "Two-by-two table with finalfit: Died with melanoma by tumour ulceration status.")
```

`finalfit` is useful for summarising multiple variables. 
We often want to summarise more than one factor or continuous variable against our `dependent` variable of interest. 
Think of Table 1 in a journal article. 

Any number of continuous or categorical explanatory variables can be added. 

```{r eval=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness")
  )
```

```{r echo=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness")
  ) %>% 
  mykable(caption = "Multiple variables by outcome: Outcome after surgery for melanoma by patient and disease factors.")
```

## Pearson's chi-squared and Fisher's exact tests
\index{categorical data@\textbf{categorical data}!chi-squared test}
\index{categorical data@\textbf{categorical data}!Fisher's exact test}
\index{chi-squared test}
\index{Fisher's exact test}

Pearson's chi-squared ($\chi^2$) test of independence is used to determine whether two categorical variables are independent in a given population.
Independence here means that the relative frequencies of one variable are the same over all levels of another variable.

A common setting for this is the classic 2x2 table. 
This refers to two categorical variables with exactly two levels each, such as is show in Table \@ref(tab:chap08-tab-2x2) above. 
The null hypothesis of independence for this particular question is no difference in the proportion of patients with ulcerated tumours who die (45.6%) compared with non-ulcerated tumours (13.9%).
From the raw frequencies, there seems to be a large difference, as we noted in the plot we made above. 

### Base R

Base R has reliable functions for all common statistical tests, but they are sometimes a little inconvenient to extract results from.

A table of counts can be constructed, either using the `$` to identify columns, or using the `with()` function. 
```{r eval=FALSE}
table(meldata$ulcer.factor, meldata$status_dss) # both give same result
with(meldata, table(ulcer.factor, status_dss))
```

```{r echo=FALSE}
table(meldata$ulcer.factor, meldata$status_dss)
```

When working with older R functions, a useful shortcut is the `exposition pipe-operator` (`%$%`) from the **magrittr** package, home of the standard forward pipe-operator (`%>%`). 
The exposition pipe-operator exposes data frame/tibble columns on the left to the function which follows on the right.
It's easier to see in action by making a table of counts. 

```{r message=FALSE}
library(magrittr)
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss)
```

The counts table can be passed to `prop.table()` for proportions. 

```{r}
meldata %$%
  table(ulcer.factor, status_dss) %>% 
  prop.table(margin = 1)     # 1: row, 2: column etc.
```

Similarly, the counts table can be passed to `chisq.test()` to perform the chi-squared test. 
```{r}
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test()
```

\index{functions@\textbf{functions}!chisq.test}

The result can be extracted into a tibble using the `tidy()` function from the **broom** package. 
```{r}
library(broom)
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test() %>% 
  tidy()
```

The `chisq.test()` function applies the Yates' continuity correction by default. 
The standard interpretation assumes that the discrete probability of observed counts in the table can be approximated by the continuous chi-squared distribution. 
This introduces some error. 
The correction involves subtracting 0.5 from the absolute difference between each observed and expected value. 
This is particularly helpful when counts are low, but can be removed if desired by `chisq.test(..., correct = FALSE)`.

## Fisher's exact test

A commonly stated assumption of the chi-squared test is the requirement to have an expected count of at least 5 in each cell of the 2x2 table. 
For larger tables, all expected counts should be $>1$ and no more than 20% of all cells should have expected counts $<5$.
If this assumption is not fulfilled, an alternative test is Fisher's exact test. 
For instance, if we are testing across a 2x4 table created from our `age.factor` variable and `status_dss`, then we receive a warning.  

```{r}
meldata %$%        # note $ sign here
  table(age.factor, status_dss) %>% 
  chisq.test()
```

Switch to Fisher's exact test
```{r}
meldata %$%        # note $ sign here
  table(age.factor, status_dss) %>% 
  fisher.test()
```

\index{functions@\textbf{functions}!fisher.test}

## Chi-squared / Fisher's exact test using finalfit

It is easier using the `summary_factorlist()` function from the **finalfit** package. 
Including `p = TRUE` in `summary_factorlist()` adds a hypothesis test to each included comparison. 
This defaults to chi-squared tests with a continuity correction for categorical variables. 

```{r eval=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor",
                     p = TRUE)
```

\index{functions@\textbf{functions}!summary\_factorlist}

```{r echo=FALSE}
library(finalfit)
meldata %>% 
  summary_factorlist(dependent   = "status_dss", 
                     explanatory = "ulcer.factor",
                     p = TRUE) %>% 
  mykable(caption = "Two-by-two table with chi-squared test using final fit: Outcome after surgery for melanoma by tumour ulceration status.")
```

Adding further variables:

```{r eval=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE)
```

```{r echo=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE) %>% 
  mykable(caption = "Multiple variables by outcome with hypothesis tests: Outcome after surgery for melanoma by patient and disease factors (chi-squared test).")
```

Note that for continuous expanatory variables, an F-test (ANOVA) is performed by default. 
If variables are considered non-parametric (`cont = "mean"`), then a Kruskal-Wallis test is used. 

Switch to Fisher's exact test:

```{r eval=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE,
                     p_cat = "fisher")
```

```{r echo=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE,
                     p_cat = "fisher") %>% 
  mykable(caption = "Multiple variables by outcome with hypothesis tests: Outcome after surgery for melanoma by patient and disease factors (Fisher's exact test).")
```

Further options can be included:

```{r eval=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE,
                     p_cat = "fisher",
                     digits = 
                       c(1, 1, 4, 2), #1: mean/median, 2: SD/IQR 
                                      # 3: p-value, 4: count percentage
                     na_include = TRUE, # include missing in results/test
                     add_dependent_label = TRUE
  )
```

```{r echo=FALSE}
meldata %>% 
  summary_factorlist(dependent = "status_dss", 
                     explanatory = 
                       c("ulcer.factor", "age.factor", 
                         "sex.factor", "thickness"),
                     p = TRUE,
                     p_cat = "fisher",
                     digits = 
                       c(1, 1, 4, 2), #1: mean/median, 2: SD/IQR 
                                      #3: p-value, 4: count percentage
                     na_include = TRUE, # include missing in table and tests
                     add_dependent_label = TRUE
  ) %>% 
  mykable(caption = "Multiple variables by outcome with hypothesis tests: Options including missing data, rounding, and labels.")
```

## Exercises

### Exercise {#chap08-ex1}

Using `finalfit`, create a summary table with "status.factor" as the dependent variable and the following as explanatory variables: `sex.factor, ulcer.factor, age.factor, thickness`.

Change the continuous variable summary statistic to `median` and `interquartile range` instead of `mean` and `sd`.  

### Exercise {#chap08-ex2}

By changing one and only one line in the following block create firstly a new table showing the breakdown of `status.factor` by age and secondly the breakdown of `status.factor` by sex:

```{r, eval=FALSE}
meldata %>%
  count(ulcer.factor, status.factor) %>%
  group_by(status.factor) %>%
  mutate(total = sum(n)) %>%
  mutate(percentage = round(100*n/total, 1)) %>% 
  mutate(count_perc = paste0(n, " (", percentage, ")")) %>% 
  select(-total, -n, -percentage) %>% 
  spread(status.factor, count_perc)
```

### Exercise {#chap08-ex3}
Now produce these tables using the `summary_factorlist()` function from the **finalfit** package.