aw_MTDT.Rmd

---
title: "Mutil-tiered decision tree (MTDT)"
author: "Andy Wang"
date: "`r Sys.Date()`"
output:
  html_document:
    number_sections: yes
    theme: default
    toc: yes
    toc_depth: 4
    toc_float: yes
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.path='./Figs_MTDT/')

seed = 123
set.seed(seed)

# Load libraries
library(MultiAssayExperiment)
library(tidyverse)
library(magrittr)
library(rms)
library(Hmisc)
library(skimr)
library(knitr)
library(caret)
library(grid)
library(gridExtra)
library(tictoc)

tictoc::tic()

# Load data
# setwd("/Users/andywang/Dropbox (Sydney Uni)/aw2021PhD/melanoma/")
# load("./melanomaAssays.RData")
load("./melanomaAssaysNorm.RData")
source("./MTDT-0-fxn-SSER.R")
source("./MTDT-0-fxn-MTblock.R")
source("./MTDT-0-fxn-others.R")
source("./MTDT-0-fxn-MTDT.R")
source("./MTDT-1-datachecking.R")

```


# Data

These were characteristics of patients with cutaneous melanoma.  

Data consisted of 3 different types  
(1) Clinical  
(2) Histopathology  
(3) Nanostring  

The outcome variable was 1 for "good" prognosis and 0 otherwise.  

***
<br>

## Clinical data

There were 59 patients. The variables included  
- Age in years  
- Female gender 1 or 0  
- Primary site "extremity" or "axis"  
- Primary sun exposure site either "nil or intermittent" or "Chronic"  

The clinical prediction model was  
$$
\begin{align*}
  y_i & \sim Binomial(n, p_i) \\
  logit(p_i) & = \beta_0 + \beta_1Age_i+\beta_2Female_i+
                 \beta_3PrimarySite_i+\beta_4PrimarySunExpSite_i \\
\end{align*}
$$

```{r tierClin}

# setting data
dat = dat %>% 
  dplyr::filter(!is.na(class2vs2years)) %>% 
  dplyr::mutate(
    y.good = ifelse(class2vs2years=="Good", 1, 0),
    Female = ifelse(Person_Sex=="Female", 1, 0))

# Clinical
tierClin.d = dat %>% 
  dplyr::select(Person_ID, y.good, Age, Female, 
                Prim_SiteBinary, Prim_SunExpSite) %>% 
  as_tibble()
tierClin.formula = (y.good ~ Age+Female+Prim_SiteBinary+Prim_SunExpSite)


```

## Histopathology data

Similarly, there were 59 patients. The variables included  
- Stage I, II or III/IV  
- Nodular 1/0  
- BRAFmut Yes/No  
- NRASmut Yes/No  
- LargeCellSize Yes/No  
- Pigmented 1/0  

The histopathology prediction model was  

$$
\begin{align*}
  y_i & \sim Binomial(n, p_i) \\
  logit(p_i) & = \beta_0 + \beta_1Stage_i + \beta_2Nodularity_i + 
                 \beta_3BRAFmut_i + \beta_4NRASmut_i + 
                 \beta_5LargeCellSize_i + \beta_6Pigment_i  \\
\end{align*}
$$


```{r tierHisto}

# Histo
tierHisto.d = dat %>% 
  dplyr::select(Person_ID, y.good, 
                Prim_Stage, Prim_Nodular, 
                Tum_BRAFmut, Tum_NRASmut, 
                Tum_LargeCellSize, Tum_Pigment) %>% 
  as_tibble()
tierHisto.formula = (y.good ~ 
                    Prim_Stage + Prim_Nodular + Tum_BRAFmut + Tum_NRASmut +
                    Tum_LargeCellSize + Tum_Pigment)

```


## Nanostring data

There are 105 patients and 192 genes.  
Note, of the 105 patients, 64 didn't have a corresponding Person_ID hence there was no corresponding  outcome label, i.e. actual effective sample size was 41 rather than 105.  

The expression density for each patient is plotted below.  


```{r tierNano}

# Nano
tierNano.d <- t(melanomaAssaysNorm@ExperimentList@listData$NanoString) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var="Tumour_ID") %>% 
  dplyr::left_join(dat %>% dplyr::select(Person_ID, y.good, Tumour_ID),
                   by = "Tumour_ID") %>% 
  dplyr::select(Person_ID, y.good, everything(.), -Tumour_ID)

tierNano.d %>%
  tidyr::pivot_longer(c(-Person_ID, -y.good), names_to = "gene", values_to = "expr") %>%
  dplyr::mutate(y.good=factor(y.good, labels=c("No", "Yes"))) %>%
  dplyr::filter(!is.na(y.good)) %>%
  ggplot(aes(x = expr,
             group = Person_ID %>% as.factor,
             colour = y.good)) +
  stat_density(position = "identity", geom = "line") +
  labs(title = "Expression density plot overlaid by samples") +
  facet_grid(y.good~.) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))


```

<br>

We will use diagonal linear discriminant analysis to predict outcome class with nanostring data.  


***
<br>

# Multi-tier decision tree

Here we have 3 input models. Each model comprised of different sources of predictor variables. There were differences in terms of how these predictors variables could be obtained, how invasive they could be acquired, and how much they cost. The goal here is to find an optimal testing sequence with the different input models that we have, balancing it with the tier model error rates (TSER), the difficulty and differences in their unit cost.  

The idea here is that we would predict patients' outcome with the 3 input models in sequence/tiers. Once a prediction has been made for a particular patient/sample with an input model, we would determine if the error rate for that patient using that particular model is satisfactory by setting a pre-determined error rate cutoff. If the patient/sample specific error rate (SSER) is too high, this particular patient/sample would progress to the next tier model, until either the SSER is satisfactory, or we determined the patient/sample could not be satisfactorily classified with the input models we have.  


<br>

***

SSER = sample specific error rate
unit cost = cost per sample in performing this model

Tier clin = Clinical model prediction, sser cutoff = 0.2, unit cost = $100  
Tier histo = Histopathology model prediction, sser cutoff = 0.2, unit cost = $500    
Tier nano = Nano-string dlda model prediction, sser cutoff = 0.2, unit cost = $1000  

***

## Summary of results

```{r MTDTsummary}

A.d = tierClin.d %>% 
  dplyr::mutate(Prim_SiteBinary=factor(Prim_SiteBinary))
B.d = tierHisto.d
C.d = tierNano.d

A.f = tierClin.formula
# A.f = y.good ~ 0 + Age + Female + Prim_SiteBinary + Prim_SunExpSite
B.f = tierHisto.formula

tierA = "Clin"
tierB = "Histo"
tierC = "Nano"

A.m = rms::lrm(A.f, data=A.d, x=T, y=T, se.fit=T)
# A.m = glm(A.f, data=A.d, family=binomial(logit))
B.m = rms::lrm(B.f, data=B.d, x=T, y=T, se.fit=T)
# B.m = glm(B.f, data=B.d, family=binomial(link="logit"))
y=C.d[,2] %>% as.factor; x=C.d[, -c(1,2)]
# C.m = sparsediscrim::dlda(y=y, x=x)
C.m = sparsediscrim::lda_diag(y=y, x=x)
rm(list=c("x", "y"))


dataList = list(A.d, B.d, C.d)
modelList = list(A.m, B.m, C.m)
rsmpList = list("rcv", "rcv", "boot")
tierList = list(tierA, tierB, tierC)
# ssercutoffList = c(0.2, 0.2, 0.2)
# tierUnitCosts = c(0, 500, 1000)
ssercutoffList = c(0.2, 0.2, 0.2)
tierUnitCosts = c(100, 500, 1000)


MTDTresults = MTDT.algm(dataList,
                        modelList,
                        rsmpList,
                        tierList,
                        ssercutoffList = ssercutoffList,
                        k=5, times=100, rsmp=100,
                        seed=1, verbose=F)

MTDTsummary = MTDT.cost.summary(MTDTresults, tierUnitCosts)

MTDTsummary$TSER.summary %>% 
  DT::datatable(option=list(
    columnDefs=list(list(className="dt-center", targets=c(1, 8))),
    pageLength=20)) %>% 
  DT::formatRound(columns=c(2, 3, 7, 9), digits=2)


```


## Visual summary

```{r MTDTplots, include=T, fig.width=16, fig.height=8}

nperms = dim(MTDTresults$myperms)[1]
  
for (nperm in 1:nperms){
  p1 = MTDTsummary$Plots.strat.prop[[nperm]] +
    theme(plot.title=element_text(face="bold"))
  p2 = MTDTsummary$Plots.TSERcutoff[[nperm]][[1]]
  p3 = MTDTsummary$Plots.TSERcutoff[[nperm]][[2]] + ylab("")
  p4 = MTDTsummary$Plots.TSERcutoff[[nperm]][[3]] + ylab("")
  p5 = MTDTsummary$Plots.TSER[[nperm]] + 
    ggtitle("Overall") + ylab("") +
    theme(plot.title=element_text(hjust=0.5, face="bold"))
  
  gridExtra::grid.arrange(
    grobs = list(p1, p2, p3, p4, p5),
    layout_matrix = rbind(c(1,1,1,1),
                          c(1,1,1,1),
                          c(2,3,4,5),
                          c(2,3,4,5),
                          c(2,3,4,5))
  )
}
# for (nperm in 1:nperms){
#   p1 = MTDTsummary$Plots.stratification[[nperm]] +
#     theme(plot.title=element_text(face="bold"))
#   p2 = MTDTsummary$Plots.TSERcutoff[[nperm]][[1]]
#   p3 = MTDTsummary$Plots.TSERcutoff[[nperm]][[2]] + ylab("")
#   p4 = MTDTsummary$Plots.TSERcutoff[[nperm]][[3]] + ylab("")
#   p5 = MTDTsummary$Plots.TSER[[nperm]] + 
#     ggtitle("Overall") + ylab("") +
#     theme(plot.title=element_text(hjust=0.5, face="bold"))
#   
#   gridExtra::grid.arrange(
#     grobs = list(p1, p2, p3, p4, p5),
#     layout_matrix = rbind(c(1,1,1,1),
#                           c(2,3,4,5),
#                           c(2,3,4,5),
#                           c(2,3,4,5))
#     )
# }
  

```


***

<br><br>


# Session Info

```{r sessionInf, include=T}

date()
sessionInfo()
tictoc::toc()

```