-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathaw_MTDT.Rmd
329 lines (238 loc) · 8.63 KB
/
aw_MTDT.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
---
title: "Mutil-tiered decision tree (MTDT)"
author: "Andy Wang"
date: "`r Sys.Date()`"
output:
html_document:
number_sections: yes
theme: default
toc: yes
toc_depth: 4
toc_float: yes
code_folding: hide
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.path='./Figs_MTDT/')
seed = 123
set.seed(seed)
# Load libraries
library(MultiAssayExperiment)
library(tidyverse)
library(magrittr)
library(rms)
library(Hmisc)
library(skimr)
library(knitr)
library(caret)
library(grid)
library(gridExtra)
library(tictoc)
tictoc::tic()
# Load data
# setwd("/Users/andywang/Dropbox (Sydney Uni)/aw2021PhD/melanoma/")
# load("./melanomaAssays.RData")
load("./melanomaAssaysNorm.RData")
source("./MTDT-0-fxn-SSER.R")
source("./MTDT-0-fxn-MTblock.R")
source("./MTDT-0-fxn-others.R")
source("./MTDT-0-fxn-MTDT.R")
source("./MTDT-1-datachecking.R")
```
# Data
These were characteristics of patients with cutaneous melanoma.
Data consisted of 3 different types
(1) Clinical
(2) Histopathology
(3) Nanostring
The outcome variable was 1 for "good" prognosis and 0 otherwise.
***
<br>
## Clinical data
There were 59 patients. The variables included
- Age in years
- Female gender 1 or 0
- Primary site "extremity" or "axis"
- Primary sun exposure site either "nil or intermittent" or "Chronic"
The clinical prediction model was
$$
\begin{align*}
y_i & \sim Binomial(n, p_i) \\
logit(p_i) & = \beta_0 + \beta_1Age_i+\beta_2Female_i+
\beta_3PrimarySite_i+\beta_4PrimarySunExpSite_i \\
\end{align*}
$$
```{r tierClin}
# setting data
dat = dat %>%
dplyr::filter(!is.na(class2vs2years)) %>%
dplyr::mutate(
y.good = ifelse(class2vs2years=="Good", 1, 0),
Female = ifelse(Person_Sex=="Female", 1, 0))
# Clinical
tierClin.d = dat %>%
dplyr::select(Person_ID, y.good, Age, Female,
Prim_SiteBinary, Prim_SunExpSite) %>%
as_tibble()
tierClin.formula = (y.good ~ Age+Female+Prim_SiteBinary+Prim_SunExpSite)
```
## Histopathology data
Similarly, there were 59 patients. The variables included
- Stage I, II or III/IV
- Nodular 1/0
- BRAFmut Yes/No
- NRASmut Yes/No
- LargeCellSize Yes/No
- Pigmented 1/0
The histopathology prediction model was
$$
\begin{align*}
y_i & \sim Binomial(n, p_i) \\
logit(p_i) & = \beta_0 + \beta_1Stage_i + \beta_2Nodularity_i +
\beta_3BRAFmut_i + \beta_4NRASmut_i +
\beta_5LargeCellSize_i + \beta_6Pigment_i \\
\end{align*}
$$
```{r tierHisto}
# Histo
tierHisto.d = dat %>%
dplyr::select(Person_ID, y.good,
Prim_Stage, Prim_Nodular,
Tum_BRAFmut, Tum_NRASmut,
Tum_LargeCellSize, Tum_Pigment) %>%
as_tibble()
tierHisto.formula = (y.good ~
Prim_Stage + Prim_Nodular + Tum_BRAFmut + Tum_NRASmut +
Tum_LargeCellSize + Tum_Pigment)
```
## Nanostring data
There are 105 patients and 192 genes.
Note, of the 105 patients, 64 didn't have a corresponding Person_ID hence there was no corresponding outcome label, i.e. actual effective sample size was 41 rather than 105.
The expression density for each patient is plotted below.
```{r tierNano}
# Nano
tierNano.d <- t(melanomaAssaysNorm@ExperimentList@listData$NanoString) %>%
as.data.frame() %>%
tibble::rownames_to_column(var="Tumour_ID") %>%
dplyr::left_join(dat %>% dplyr::select(Person_ID, y.good, Tumour_ID),
by = "Tumour_ID") %>%
dplyr::select(Person_ID, y.good, everything(.), -Tumour_ID)
tierNano.d %>%
tidyr::pivot_longer(c(-Person_ID, -y.good), names_to = "gene", values_to = "expr") %>%
dplyr::mutate(y.good=factor(y.good, labels=c("No", "Yes"))) %>%
dplyr::filter(!is.na(y.good)) %>%
ggplot(aes(x = expr,
group = Person_ID %>% as.factor,
colour = y.good)) +
stat_density(position = "identity", geom = "line") +
labs(title = "Expression density plot overlaid by samples") +
facet_grid(y.good~.) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
```
<br>
We will use diagonal linear discriminant analysis to predict outcome class with nanostring data.
***
<br>
# Multi-tier decision tree
Here we have 3 input models. Each model comprised of different sources of predictor variables. There were differences in terms of how these predictors variables could be obtained, how invasive they could be acquired, and how much they cost. The goal here is to find an optimal testing sequence with the different input models that we have, balancing it with the tier model error rates (TSER), the difficulty and differences in their unit cost.
The idea here is that we would predict patients' outcome with the 3 input models in sequence/tiers. Once a prediction has been made for a particular patient/sample with an input model, we would determine if the error rate for that patient using that particular model is satisfactory by setting a pre-determined error rate cutoff. If the patient/sample specific error rate (SSER) is too high, this particular patient/sample would progress to the next tier model, until either the SSER is satisfactory, or we determined the patient/sample could not be satisfactorily classified with the input models we have.
<br>
***
SSER = sample specific error rate
unit cost = cost per sample in performing this model
Tier clin = Clinical model prediction, sser cutoff = 0.2, unit cost = $100
Tier histo = Histopathology model prediction, sser cutoff = 0.2, unit cost = $500
Tier nano = Nano-string dlda model prediction, sser cutoff = 0.2, unit cost = $1000
***
## Summary of results
```{r MTDTsummary}
A.d = tierClin.d %>%
dplyr::mutate(Prim_SiteBinary=factor(Prim_SiteBinary))
B.d = tierHisto.d
C.d = tierNano.d
A.f = tierClin.formula
# A.f = y.good ~ 0 + Age + Female + Prim_SiteBinary + Prim_SunExpSite
B.f = tierHisto.formula
tierA = "Clin"
tierB = "Histo"
tierC = "Nano"
A.m = rms::lrm(A.f, data=A.d, x=T, y=T, se.fit=T)
# A.m = glm(A.f, data=A.d, family=binomial(logit))
B.m = rms::lrm(B.f, data=B.d, x=T, y=T, se.fit=T)
# B.m = glm(B.f, data=B.d, family=binomial(link="logit"))
y=C.d[,2] %>% as.factor; x=C.d[, -c(1,2)]
# C.m = sparsediscrim::dlda(y=y, x=x)
C.m = sparsediscrim::lda_diag(y=y, x=x)
rm(list=c("x", "y"))
dataList = list(A.d, B.d, C.d)
modelList = list(A.m, B.m, C.m)
rsmpList = list("rcv", "rcv", "boot")
tierList = list(tierA, tierB, tierC)
# ssercutoffList = c(0.2, 0.2, 0.2)
# tierUnitCosts = c(0, 500, 1000)
ssercutoffList = c(0.2, 0.2, 0.2)
tierUnitCosts = c(100, 500, 1000)
MTDTresults = MTDT.algm(dataList,
modelList,
rsmpList,
tierList,
ssercutoffList = ssercutoffList,
k=5, times=100, rsmp=100,
seed=1, verbose=F)
MTDTsummary = MTDT.cost.summary(MTDTresults, tierUnitCosts)
MTDTsummary$TSER.summary %>%
DT::datatable(option=list(
columnDefs=list(list(className="dt-center", targets=c(1, 8))),
pageLength=20)) %>%
DT::formatRound(columns=c(2, 3, 7, 9), digits=2)
```
## Visual summary
```{r MTDTplots, include=T, fig.width=16, fig.height=8}
nperms = dim(MTDTresults$myperms)[1]
for (nperm in 1:nperms){
p1 = MTDTsummary$Plots.strat.prop[[nperm]] +
theme(plot.title=element_text(face="bold"))
p2 = MTDTsummary$Plots.TSERcutoff[[nperm]][[1]]
p3 = MTDTsummary$Plots.TSERcutoff[[nperm]][[2]] + ylab("")
p4 = MTDTsummary$Plots.TSERcutoff[[nperm]][[3]] + ylab("")
p5 = MTDTsummary$Plots.TSER[[nperm]] +
ggtitle("Overall") + ylab("") +
theme(plot.title=element_text(hjust=0.5, face="bold"))
gridExtra::grid.arrange(
grobs = list(p1, p2, p3, p4, p5),
layout_matrix = rbind(c(1,1,1,1),
c(1,1,1,1),
c(2,3,4,5),
c(2,3,4,5),
c(2,3,4,5))
)
}
# for (nperm in 1:nperms){
# p1 = MTDTsummary$Plots.stratification[[nperm]] +
# theme(plot.title=element_text(face="bold"))
# p2 = MTDTsummary$Plots.TSERcutoff[[nperm]][[1]]
# p3 = MTDTsummary$Plots.TSERcutoff[[nperm]][[2]] + ylab("")
# p4 = MTDTsummary$Plots.TSERcutoff[[nperm]][[3]] + ylab("")
# p5 = MTDTsummary$Plots.TSER[[nperm]] +
# ggtitle("Overall") + ylab("") +
# theme(plot.title=element_text(hjust=0.5, face="bold"))
#
# gridExtra::grid.arrange(
# grobs = list(p1, p2, p3, p4, p5),
# layout_matrix = rbind(c(1,1,1,1),
# c(2,3,4,5),
# c(2,3,4,5),
# c(2,3,4,5))
# )
# }
```
***
<br><br>
# Session Info
```{r sessionInf, include=T}
date()
sessionInfo()
tictoc::toc()
```