---
title: "Decision Trees"
author: "Vincent La"
date: "May 30, 2017"
output: pdf_document
bibliography: bibliography.bibtex
---
```{r include=FALSE, eval=FALSE}
# Only run this chunk if knitting this document by itself
source('functions.R')
source('data.R')
source('roc_plot.R')
```
```{r "Logistic Regression Model"}
worthless_multi.glm <- dep_glm(
worthless_once ~ age + black + asian + hispanic + other_race
+ fam_income35_50k + fam_income50_75k + fam_income75_100k + fam_income101k
+ looking_for_work + married + male
+ english + english*factor(srvy_yr)
+ neck_pain + back_pain + back_pain*male
+ alc12mwk2 + I(alc12mwk2^2)
+ family_size
+ factor(srvy_yr))
```
# Decision Trees and Random Forests
Alongside logistic regression, I also fit decision trees to my data. While the logic of decision trees is different from that of logistic regression, they have properties which make them an attractive model for detecting symptoms of depression. Like logistic regression, they are readily interpretable--a useful quality, especially when it comes to explaining illnesses. Furthermore, the branching structure of decision trees appears, at first glance, to be a decent approximation of how mental illness operates: each node gives new probabilities of having a symptom of mental distress, conditional on its own split and the splits of its parent nodes.
Random forests are an extension of decision trees. Because individual trees can vary randomly in which nodes appear, random forests attempt to reduce this variance by training a large number of decision trees, each on a bootstrap resample of the data, and then classifying each observation by majority vote.
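To make the voting idea concrete, here is a minimal sketch of bagging with the `tree` package used below. The reduced predictor set and the names `n_trees`, `votes`, and `majority` are illustrative, and note that `randomForest()` additionally subsamples predictors at each split, which this sketch omits.
```{r eval=FALSE}
# Minimal bagging sketch: grow several trees on bootstrap resamples
# of the training set and classify by majority vote.
library(tree)

n_trees <- 25
votes <- replicate(n_trees, {
  boot <- dep_tree[sample(nrow(dep_tree), replace=TRUE), ]  # bootstrap resample
  fit <- tree(factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
                age + married + male + back_pain,           # illustrative subset
              data=boot)
  predict(fit, newdata=dep_tree, type="class") == "Yes"     # one vote per tree
})

# Each row of `votes` holds one observation's votes across all trees
majority <- ifelse(rowMeans(votes) > 0.5, "Yes", "No")
```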
## Research Questions
In particular, I had some questions about how decision trees would fit the data compared to logistic regression.
1. What is the effect of class imbalance? How can it be handled?
1. How sensitive are decision trees to sampling method?
## Class Imbalance and Downsampling
For decision trees, I focused my research mainly on feelings of worthlessness--which only 10.5\% of the sample reported having at any time. As a result, both decision trees and random forests simply classified every individual as not having any symptoms of depression. This was unsurprising, as a "model" which blindly classifies every individual as being perfectly fine would achieve an 89.5\% accuracy rate. Likewise, it is no surprise that random forests were no different from individual decision trees: if a single decision tree is highly likely to classify every observation as "not depressed," then so is a multitude of them.
To deal with this issue, I performed downsampling on my original training set. Specifically, I created a new training set composed of all observations in the minority class (all individuals who reported feeling "worthless") plus an equally numerous random sample of observations from the majority class. As a result, the new training set contained equal numbers of both classes.
```{r}
nrow.minority <- nrow(dep %>% filter(worthless_once == 1 & test == 0))
nrow.majority <- nrow(dep %>% filter(worthless_once == 0 & test == 0))
# ==== Training Set 1 ====
set.seed(420)
# Row ids of the majority-class observations to keep
sample.majority <- sample(x=1:nrow.majority, size=nrow.minority)
# ==== Training Set 2 ====
set.seed(1)
# Row ids of the majority-class observations to keep
sample.majority2 <- sample(x=1:nrow.majority, size=nrow.minority)
# Downsampled training sets: all minority-class rows plus an
# equal number of randomly sampled majority-class rows
dep_tree <- rbind(
  dep %>% filter(worthless_once == 1 & test == 0),
  (dep %>% filter(worthless_once == 0 & test == 0))[sample.majority, ]
)
dep_tree2 <- rbind(
dep %>% filter(worthless_once == 1 & test == 0),
(dep %>% filter(worthless_once == 0 & test == 0))[sample.majority2, ]
)
```
## Decision Tree Training
To answer my research questions, I trained four decision trees, forming two pairs that share a minimum deviance parameter: either 0.001 or 0.0005. Within each pair, each tree was trained on one of two training sets. Both training sets were created using the same downsampling procedure, but with a different random seed. This was done to test the trees' sensitivity to different training data.
As the plots below show, there is some deviation between the pairs of decision trees grown from the different training sets. Lastly, I also trained a random forest with 100 trees using a subset of the predictors from the decision trees above. The classification metrics comparing decision trees, random forests, and logistic regression can be found below.
### Decision Tree 1: Training Set 1 -- Mindev 0.001
```{r fig.height=9}
worthless.tree <- tree(factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
age + looking_for_work + race + fam_income + married + male + neck_pain + back_pain
+ alc12mwk2 + factor(srvy_yr) + family_size,
control=tree.control(nobs=nrow(dep_tree), mincut=1, minsize=2, mindev=0.001),
data=dep_tree)
worthless.tree2 <- tree(factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
age + looking_for_work + race + fam_income + married + male + neck_pain + back_pain
+ alc12mwk2 + factor(srvy_yr) + family_size,
control=tree.control(nobs=nrow(dep_tree), mincut=1, minsize=2, mindev=0.0005),
data=dep_tree)
plot(worthless.tree)
text(worthless.tree)
```
### Decision Tree 2: Training Set 1 -- Mindev 0.0005
```{r fig.height=9}
plot(worthless.tree2)
text(worthless.tree2)
```
### Decision Tree 3: Training Set 2 -- Mindev 0.001
Although this tree was grown with the same parameters as Decision Tree 1, and its training set was sampled using the same sampling method, the random seed was different. The different configuration of the resulting tree shows that decision trees are sensitive to the particular sample drawn, even when the sampling method is identical.
```{r "Sensitive To Sampling Method Test", fig.height=9}
worthless.tree_sample2 <- tree(factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
age + looking_for_work + race + fam_income + married + male + neck_pain + back_pain
+ alc12mwk2 + factor(srvy_yr) + family_size,
control=tree.control(nobs=nrow(dep_tree2), mincut=1, minsize=2, mindev=0.001),
data=dep_tree2)
plot(worthless.tree_sample2)
text(worthless.tree_sample2)
```
### Decision Tree 4: Training Set 2 -- Mindev 0.0005
Likewise, this decision tree differs from Decision Tree 2--which was grown with the same minimum deviance parameter but trained on a different training set.
```{r fig.height=9}
worthless.tree2_sample2 <- tree(factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
age + looking_for_work + race + fam_income + married + male + neck_pain + back_pain
+ alc12mwk2 + factor(srvy_yr) + family_size,
control=tree.control(nobs=nrow(dep_tree2), mincut=1, minsize=2, mindev=0.0005),
data=dep_tree2)
plot(worthless.tree2_sample2)
text(worthless.tree2_sample2)
```
```{r}
worthless.forest <- randomForest(
factor(ifelse(worthless_once == 1, yes="Yes", no="No")) ~
looking_for_work + race + fam_income + married + male + neck_pain + back_pain
+ alc12mwk2,
data=dep_tree, do.trace=10, ntree=100)
```
# Logistic Regression vs. Trees and Forests: Measures of Accuracy
Finally, after training all of my models, I used various classification metrics to compare them.
```{r "Training Set"}
roc.tree <- ROCCurve(actual=dep_tree$worthless_once,
predicted=as.data.frame(predict(worthless.tree))$Yes)
# Trees trained on the second training set are scored against that set's labels
roc.tree_sample2 <- ROCCurve(actual=dep_tree2$worthless_once,
                             predicted=as.data.frame(predict(worthless.tree_sample2))$Yes)
roc.tree2 <- ROCCurve(actual=dep_tree$worthless_once,
                      predicted=as.data.frame(predict(worthless.tree2))$Yes)
roc.tree2_sample2 <- ROCCurve(actual=dep_tree2$worthless_once,
                              predicted=as.data.frame(predict(worthless.tree2_sample2))$Yes)
# type="prob" returns out-of-bag class probabilities; the default returns
# hard class labels, which have no "Yes" column
roc.forest <- ROCCurve(actual=dep_tree$worthless_once,
                       predicted=as.data.frame(predict(worthless.forest, type="prob"))$Yes)
roc.glm <- ROCCurve(unlist(dep %>%
filter(test == 0) %>%
select(worthless_once)),
worthless_multi.glm$fitted.values)
```
```{r "Testing Set"}
roc.tree.test_data <- as.data.frame(predict(worthless.tree,
    newdata=as.data.frame(dep %>% filter(test == 1))))$Yes
roc.tree_sample2.test_data <- as.data.frame(predict(worthless.tree_sample2,
    newdata=as.data.frame(dep %>% filter(test == 1))))$Yes
roc.tree2.test_data <- as.data.frame(predict(worthless.tree2,
    newdata=as.data.frame(dep %>% filter(test == 1))))$Yes
roc.tree2_sample2.test_data <- as.data.frame(predict(worthless.tree2_sample2,
    newdata=as.data.frame(dep %>% filter(test == 1))))$Yes
# type="prob" gives class probabilities; hard 0/1 labels would produce a
# degenerate (two-point) ROC curve
roc.forest.test_data <- as.data.frame(predict(worthless.forest,
    newdata=as.data.frame(dep %>% filter(test == 1)),
    type="prob"))$Yes
# predict.glm with type="response" already returns a plain numeric vector
roc.glm.test_data <- predict(worthless_multi.glm,
    type="response",
    newdata=as.data.frame(dep %>% filter(test == 1)))
roc.tree_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.tree.test_data)
roc.tree_sample2_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.tree_sample2.test_data)
roc.tree2_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.tree2.test_data)
roc.tree2_sample2_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.tree2_sample2.test_data)
roc.forest_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.forest.test_data)
roc.glm_test <- ROCCurve(actual=dep.test_labels.worthless,
predicted=roc.glm.test_data)
```
## ROC Curves
While the logistic regression produced the best-fitting curve on both the training and testing sets, this is hard to tell unless you look closely. The most noticeable feature of the ROC graphs is the drop-off in performance of the random forest on the validation set. Here, a multitude of trees appears to do worse than the individual decision trees.
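For context, an ROC curve traces the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. A minimal sketch of that computation, assuming `ROCCurve` (from `roc_plot.R`) does something along these lines (the name `roc_points` is illustrative):
```{r eval=FALSE}
# Illustrative sketch: TPR and FPR at each probability threshold.
# `actual` is a 0/1 vector; `predicted` is a vector of probabilities.
roc_points <- function(actual, predicted, thresholds=seq(0, 1, by=0.01)) {
  do.call(rbind, lapply(thresholds, function(t) {
    pred_pos <- predicted >= t
    data.frame(p_thresh = t,
               tpr = sum(pred_pos & actual == 1) / sum(actual == 1),
               fpr = sum(pred_pos & actual == 0) / sum(actual == 0))
  }))
}
```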
```{r "Curve ggplots"}
# ROC Curve
roc_plot(roc.tree, roc.tree_sample2, roc.tree2, roc.tree2_sample2, roc.forest, roc.glm,
title="ROC Curve (Training Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
roc_plot(roc.tree_test, roc.tree_sample2_test, roc.tree2_test, roc.tree2_sample2_test, roc.forest_test, roc.glm_test,
title="ROC (Testing Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
```
## Precision-Recall Curve
The precision-recall curve is another classification metric, which measures the trade-off between the proportion of retrieved (positively classified) cases that are truly positive (precision) and the proportion of all positive cases that the model retrieves (recall). Unlike the ROC graphs above, this pair of graphs shows a stunning difference. While the decision trees fit the training data very well, their performance on the testing set is no better than that of the logistic regression. On the other hand, the logistic regression is the only model with similar performance in both sets--suggesting that its coefficient estimates are very stable while the decision trees overfit the data.
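In terms of confusion-matrix counts (the same `tp`, `fp`, and `fn` used in the commented-out accuracy code at the end of this document), a minimal sketch of the two quantities:
```{r eval=FALSE}
# Precision: share of positively classified cases that are truly positive.
# Recall: share of all truly positive cases that the model retrieves.
precision <- function(tp, fp) tp / (tp + fp)
recall    <- function(tp, fn) tp / (tp + fn)
```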
```{r}
pr_plot(roc.tree, roc.tree_sample2, roc.tree2, roc.tree2_sample2, roc.forest, roc.glm,
title="Precision-Recall Curve (Training Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
pr_plot(roc.tree_test, roc.tree_sample2_test, roc.tree2_test, roc.tree2_sample2_test, roc.forest_test, roc.glm_test,
title="Precision-Recall Curve (Testing Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
```
## F1 Score
Lastly, I also considered the F1 score, since it is a useful alternative to the ROC curve in cases of severe class imbalance. These graphs tell a similar story to the precision-recall curves: while the logistic regression initially looks worse on the training set, it is the only model with consistent performance across the training and testing sets.
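The F1 score is the harmonic mean of precision and recall, so it only rewards models that do well on both; a minimal sketch:
```{r eval=FALSE}
# F1: harmonic mean of precision and recall at a fixed threshold.
f1_score <- function(tp, fp, fn) {
  p <- tp / (tp + fp)   # precision
  r <- tp / (tp + fn)   # recall
  2 * p * r / (p + r)
}
```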
```{r}
f1_plot(roc.tree, roc.tree_sample2, roc.tree2, roc.tree2_sample2, roc.forest, roc.glm,
title="F1 Score (Training Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
f1_plot(roc.tree_test, roc.tree_sample2_test, roc.tree2_test, roc.tree2_sample2_test, roc.forest_test, roc.glm_test,
title="F1 Score (Testing Set)",
names=c("Decision Tree 1", "Decision Tree 1 (Sample 2)",
"Decision Tree 2", "Decision Tree 2 (Sample 2)",
"Random Forest", "Logistic Regression"))
```
```{r}
# # Accuracy
# roc.tree$data_frame <- roc.tree$data_frame %>%
# mutate(acc = (tp + tn)/(tp + tn + fn + fp))
#
# roc.tree2$data_frame <- roc.tree2$data_frame %>%
# mutate(acc = (tp + tn)/(tp + tn + fn + fp))
#
# roc.glm$data_frame <- roc.glm$data_frame %>%
# mutate(acc = (tp + tn)/(tp + tn + fn + fp))
#
# ggplot(
# rbind(roc.tree$data_frame,
# roc.tree2$data_frame,
# roc.glm$data_frame),
# aes(p_thresh, acc, color=model)) +
# scale_color_brewer(type="div", palette="Spectral") +
# geom_line(size=1.5, linetype="dashed") +
# labs(title="Accuracy (Testing Set)",
# x="Probability Threshold",
# y="Accuracy")
#
# # Weighted Accuracy
# roc.tree$data_frame <- roc.tree$data_frame %>%
# mutate(wacc = (tp + tn)/(tp + tn + 10*fn + 10*fp))
#
# roc.tree2$data_frame <- roc.tree2$data_frame %>%
# mutate(wacc = (tp + tn)/(tp + tn + 10*fn + 10*fp))
#
# roc.glm$data_frame <- roc.glm$data_frame %>%
# mutate(wacc = (tp + tn)/(tp + tn + 10*fn + 10*fp))
#
# ggplot(
# rbind(roc.tree$data_frame,
# roc.tree2$data_frame,
# roc.glm$data_frame),
# aes(p_thresh, wacc, color=model)) +
# scale_color_brewer(type="div", palette="Spectral") +
# geom_line(size=1.5, linetype="dashed") +
# labs(title="Weighted Accuracy (Testing Set)",
# x="Probability Threshold",
# y="Weighted Accuracy")
```
```{r "Calculate Lift"}
# # Lift
# roc.tree.lift <- prediction(roc.tree.test_data,
# as.data.frame(dep %>% filter(test == 1))$worthless_once)
#
# roc.tree2.lift <- prediction(roc.tree2.test_data,
# as.data.frame(dep %>% filter(test == 1))$worthless_once)
#
# roc.glm.lift <- prediction(roc.glm.test_data,
# as.data.frame(dep %>% filter(test == 1))$worthless_once)
#
#
# roc.tree.lift <- performance(roc.tree.lift, x.measure="rpp", measure="lift")
# roc.tree2.lift <- performance(roc.tree2.lift, x.measure="rpp", measure="lift")
# roc.glm.lift <- performance(roc.glm.lift, x.measure="rpp", measure="lift")
#
# ggplot(
# rbind(data.frame(
# x=unlist(roc.tree.lift@x.values), y=unlist(roc.tree.lift@y.values)) %>%
# mutate(model="Tree"),
# data.frame(
# x=unlist(roc.tree2.lift@x.values), y=unlist(roc.tree2.lift@y.values)) %>%
# mutate(model="Tree 2"),
# data.frame(
# x=unlist(roc.glm.lift@x.values), y=unlist(roc.glm.lift@y.values)) %>%
# mutate(model="Logistic Regression")),
# aes(x, y, color=model)) +
# scale_color_brewer(type="div", palette="Spectral") +
# geom_line(size=1.5, linetype="dashed") +
# labs(title="Lift",
# x="RPP",
# y="Lift")
```