---
title: "Chapter 3: Logistic regression"
date: "*2020-11-12*"
output:
  html_document:
    toc: true
    toc_float:
      smooth_scroll: true
      collapsed: false
    toc_depth: 2
    number_sections: true
---
```{r, echo=FALSE, results=FALSE}
#Let's make our reports reproducible
set.seed(134856791)
```
***
# Description of data
The data set used in this exercise has been previously parsed from [this](https://archive.ics.uci.edu/ml/datasets/Student+Performance) source using an R script available [here](https://github.com/MB-Finski/IODS-project/data).
The original purpose of the study was to find factors associated with the students' school performance. The parsed data consists of students in two Portuguese schools. For the purposes of the study, the background and alcohol consumption of the students were recorded through questionnaires, while the students' performance in mathematics and Portuguese language was collected from school reports. The records for the two subjects were combined in the above-mentioned data handling script. More exhaustive metadata are available [here](https://archive.ics.uci.edu/ml/datasets/Student+Performance).
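Roughly, the wrangling amounts to joining the two course files on the shared background variables and combining the course-specific columns. Below is a minimal sketch of such a merge (file and column names are assumed from the UCI archive rather than taken from the linked script, hence the chunk is not evaluated):
```{r eval=FALSE}
#Sketch only: join the math and Portuguese course files on shared background columns
math <- read.csv("student-mat.csv", sep = ";")
por <- read.csv("student-por.csv", sep = ";")
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
joined <- merge(math, por, by = join_cols, suffixes = c(".math", ".por"))
#Combine the subject-specific grades, e.g. by averaging the final grade
joined$G3 <- (joined$G3.math + joined$G3.por) / 2
```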
***
## Variable decoding
For an explanation of the individual variables, please refer to the following:
***
Variable name | Explanation
--- | -------------
*school* | Student's school (either GP or MS)
*sex* | Gender (binary: F or M)
*age* | Age of the student
*address* | Student's residence (binary: urban (U) versus rural (R))
*famsize* | Family size (binary: less than or equal to three (LE3) versus greater than three (GT3))
*Pstatus* | Parents' cohabitation status: together (T) versus apart (A)
*Medu* | Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
*Fedu* | Father's education (numeric: categories identical to above)
*Mjob* | Mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
*Fjob* | Father's job (nominal: see categories above)
*reason* | Reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
*guardian* | Student's guardian (nominal: 'mother', 'father' or 'other')
*traveltime* | Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
*studytime* | Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
*failures* | Number of past class failures (numeric: n if 1<=n<3, else 4)
*schoolsup* | Extra educational support (binary: yes or no)
*famsup* | Family educational support (binary: yes or no)
*paid* | Extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
*activities* | Extra-curricular activities (binary: yes or no)
*nursery* | Attended nursery school (binary: yes or no)
*higher* | Wants to take higher education (binary: yes or no)
*internet* | Internet access at home (binary: yes or no)
*romantic* | In a romantic relationship (binary: yes or no)
*famrel* | Quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
*freetime* | Free time after school (numeric: from 1 - very low to 5 - very high)
*goout* | Going out with friends (numeric: from 1 - very low to 5 - very high)
*Dalc* | Workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
*Walc* | Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
*health* | Current health status (numeric: from 1 - very bad to 5 - very good)
*absences* | Number of school absences (numeric: from 0 to 93)
*G1* | First period combined grade (numeric: from 0 to 20)
*G2* | Second period grade (numeric: from 0 to 20)
*G3* | Final grade (numeric: from 0 to 20, output target)
*alc_use* | Average of Dalc and Walc
*high_use* | Binary: TRUE if alc_use > 2, FALSE otherwise
***
```{r out.width="100%"}
#Read the alc_data table
alc_data <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", stringsAsFactors = TRUE, sep = ",", header = TRUE)
#... and print out a summary of the variable names as requested, although
#the text output is quite ugly:
str(alc_data)
```
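The two derived columns at the bottom of the variable table can be recomputed from *Dalc* and *Walc*, which doubles as a quick sanity check on the parsed data:
```{r}
#Sanity check: recompute the derived columns defined in the variable table
all(alc_data$alc_use == (alc_data$Dalc + alc_data$Walc) / 2)
all(alc_data$high_use == (alc_data$alc_use > 2))
```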
***
## Summary table and data exploration
Let's scope out the variables, using "high_use" as the contrast, to identify variables that
may be associated with alcohol use.
Please note that I have elected to treat the Likert-scale variables
as continuous variables here, so I won't factorize them.
```{r out.width="100%", warning=FALSE, message=FALSE}
library(gtsummary)
continuous_vars <- list(c(age, Medu, Fedu, failures, famrel, freetime, goout, Dalc, Walc,
                          health, absences, G1, G2, G3, alc_use) ~ "continuous")
#Print a summary table of the data with high_use as the contrast. Also add an "overall" column
my_contrast_table <- tbl_summary(alc_data,
                                 by = high_use,
                                 digits = all_continuous() ~ 2,
                                 statistic = list(all_continuous() ~ "{mean} ({sd})",
                                                  all_categorical() ~ "{n} / {N} ({p}%)"),
                                 type = continuous_vars)
#Add some missing elements to the table and print it out
my_contrast_table %>% add_overall() %>% bold_labels() %>% add_p()
```
***
### Interpretation
***
First, it should be emphasized that while I did print out the p-values for the explorative statistical analyses,
the p-values should not be mistaken for effect sizes! I.e. direct comparisons of the variables' performance as regressors for "high_use"
should not be made based on the p-values alone.
However, odds ratios are reversible for outcome and potential exposure (i.e. the OR for an outcome given an
exposure is equal to the OR for an exposure given an outcome). This means that the strategy of finding a "reverse association" between our response and potential regressors
is useful for identifying good regressors.
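To see why, consider a small numeric sketch (hypothetical counts, not from our data): both conditional odds ratios reduce to the same cross-product ratio of the 2x2 table.
```{r}
#Hypothetical 2x2 table of exposure (rows) vs outcome (columns)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome = c("yes", "no")))
#OR of the outcome given the exposure...
(tab["yes", "yes"] / tab["yes", "no"]) / (tab["no", "yes"] / tab["no", "no"])
#...equals the OR of the exposure given the outcome (both are 30*40 / (20*10) = 6)
(tab["yes", "yes"] / tab["no", "yes"]) / (tab["yes", "no"] / tab["no", "no"])
```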
As a side note, and beyond the scope of this chapter, we may be missing potential interactions in the data
with this type of modeling strategy. For example, the data come from two different schools, which suggests that we would be better off using a hierarchical
modeling strategy. Yet, as mentioned, this is beyond the scope of the assignment.
Finally, the type of *a priori* exploration we used for identifying the potential regressors increases the risk of finding spurious associations in our
data. Therefore, we should clearly state that the findings of any subsequent analyses are explorative in nature.
***
# Choosing the potential regressors
***
Based on the explorative analyses above, I chose the following variables as regressors:
1. sex
2. absences
3. goout
4. studytime
***
A viable hypothesis for how each of these variables might be associated with alcohol use can be formed:
1. Males have traditionally been viewed as more prone to drinking alcohol.
2. Absences may be the result of or the enabler of excessive drinking.
3. Going out with friends and socializing may be linked with increased alcohol use.
4. Students who study hard don't have time to drink.
***
## (Graphical) Inspection of regressors
***
Let's print a more compact table than the one above, with _cross tabulations_ of the categorical regressors:
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#dplyr is needed for select() and the pipe
library(dplyr)
#Create a subset table of the variables of interest
alc_subset <- alc_data %>% select(sex, high_use, absences, goout, studytime)
continuous_vars <- list(c(absences) ~ "continuous")
#Print a summary table of the data with high_use as the contrast. Also add an "overall" column
my_contrast_table <- tbl_summary(alc_subset,
                                 by = high_use,
                                 digits = all_continuous() ~ 2,
                                 statistic = list(all_continuous() ~ "{mean} ({sd})",
                                                  all_categorical() ~ "{n} / {N} ({p}%)"),
                                 type = continuous_vars)
#Add some missing elements to the table and print it out
my_contrast_table %>% add_overall() %>% bold_labels()
```
***
### Boxplot of absences
***
Print a boxplot of absences:
```{r out.width="100%", warning=FALSE, message=FALSE}
library(ggplot2)
#Create the boxplot (fill is mapped to high_use so the fill scale below takes effect)
my_boxplot <- ggplot(alc_data, aes(x = high_use, y = absences, fill = high_use)) + theme(legend.position = "none")
my_boxplot <- my_boxplot + geom_boxplot(outlier.colour = "black", outlier.shape = 16, outlier.size = 2, notch = FALSE)
#Adjust outline color
my_boxplot <- my_boxplot + scale_color_manual(values = c("black", "black"))
#Adjust fill color
my_boxplot <- my_boxplot + scale_fill_manual(values = c("#b4b4b4", "#4e4e4e"))
#Set proper labels for both axes
my_boxplot <- my_boxplot + labs(x = "High alcohol use", y = "Absences", fill = "")
#Set the graph title
my_boxplot <- my_boxplot + ggtitle("Absences vs high alcohol use")
#Print the plot
my_boxplot
```
***
### Interpretation of cross tabulations and boxplot
***
Our hypothesis forming was exploratory (see above), so we have by and large already covered this topic.
The cross tabulations and the boxplot support the presented hypotheses for all our variables: there is a clear association between high_use and each of the chosen regressors.
The distribution of absences is understandably quite skewed, reminiscent of a chi-square distribution. This is usually not an issue with parametric statistical methods. There are also some potential outliers in both groups for absences.
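As an optional check (not required by the assignment), a quick histogram makes the skew visible:
```{r out.width="100%", warning=FALSE, message=FALSE}
#Optional check: the right-skewed distribution of absences in both groups
ggplot(alc_data, aes(x = absences)) +
  geom_histogram(binwidth = 2, fill = "#b4b4b4", color = "black") +
  facet_wrap(~high_use, labeller = label_both) +
  labs(x = "Absences", y = "Count")
```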
***
# Logistic regression
***
Create the logistic regression model:
```{r out.width="100%", warning=FALSE, message=FALSE}
my_reg_model <- glm(high_use ~ absences + sex + goout + studytime, data = alc_data, family = "binomial")
tbl_regression(my_reg_model, exponentiate = TRUE) %>% bold_labels()
summary(my_reg_model)
```
***
## Interpretation of the model
***
In this case the intercept of the model is of no particular interest to us. All regressors have a statistically significant association with the response.
As an example of interpreting the odds ratios: each increase of one in absences was associated with an odds ratio (OR) of 1.06 (95% CI 1.03 -- 1.10) for high alcohol use.
For the categorical variables we interpret the results as follows: male students had 2.25 (95% CI 1.33 -- 3.84) times greater odds
of high alcohol use than female students.
It is crucial to note that this **does not** mean that a student with one absence was 1.06 times more likely to
have high alcohol use than a student with none! Odds ratios (usually) cannot be used to make inferences about relative probabilities; only in some edge cases can you equate RRs and ORs (e.g. under the rare disease assumption).
We can simply state that the *odds* of high alcohol use were 1.06 times greater for a student with one absence than for a student with no absences.
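A hypothetical numeric example (not from our data) shows how far apart the two measures can be when the outcome is common:
```{r}
#Hypothetical outcome probabilities in two groups
p_exposed <- 0.50
p_unexposed <- 0.25
#Risk ratio: the ratio of the probabilities
p_exposed / p_unexposed
#Odds ratio: the ratio of the odds p / (1 - p); here 3 rather than 2
(p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))
```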
The numerical value of an OR also does not equate to effect size. For example, the per-unit odds ratio for absences is very small, but the effect size in this case
also depends on the distribution of absences in the data: a student with 10 absences has an odds ratio of 1.06^10 = 1.79 for high alcohol use.
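This multiplicative scaling can be read straight off the fitted coefficients (the exact figures depend on the model fit above):
```{r}
#OR for one additional absence, and for ten additional absences
exp(coef(my_reg_model)["absences"])
exp(coef(my_reg_model)["absences"] * 10)
```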
While exact odds ratios and even risk ratios can be calculated within our data sample, there remains uncertainty when making inferences from our data
to more general populations. This uncertainty is described by the confidence intervals (CI) of our OR estimates. Various factors, such as regressor variance and sample size,
affect the width of the confidence intervals.
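For reference, these intervals can also be pulled directly from the model object; exponentiating the profile-likelihood intervals gives them on the OR scale (they should match the regression table above):
```{r warning=FALSE, message=FALSE}
#Profile-likelihood confidence intervals on the odds-ratio scale
exp(confint(my_reg_model))
```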
In conclusion, the data mostly support our hypotheses: absences, male gender, and going out were associated (OR > 1) with high alcohol consumption,
while increased study time was associated with lower odds (OR < 1) of high alcohol consumption.
***
# Model predictions and performance
***
Print a cross tabulation of outcome vs model prediction, with row and column margins:
```{r out.width="100%", warning=FALSE, message=FALSE}
library("dplyr")
#Add a column model_probability for the outcome:
alc_data <- mutate(alc_data,model_probability = predict(my_reg_model, type = "response"))
#Add a column with binary model_prediction:
alc_data <- mutate(alc_data,model_prediction = model_probability > 0.5)
#Print a cross tabulation
tbl_cross(alc_data,row = high_use, col = model_prediction, percent ="cell")
```
***
Print a scatter plot visualizing predictions vs outcome:
```{r out.width="100%", warning=FALSE, message=FALSE}
my_scatter_plot <- ggplot(alc_data, aes(x = model_probability, y = high_use, color = (model_prediction==high_use)))
my_scatter_plot <- my_scatter_plot+geom_point() + labs(x = "Model probability", y = "High alcohol use",color= "Model prediction correct")
my_scatter_plot
```
***
Calculate the training error and compare the model to a guessing strategy:
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#Calculate the percentage of wrong predictions:
mean(alc_data$high_use != alc_data$model_prediction)
#Just for "shits and giggles", run a rudimentary Monte Carlo simulation
#with 1000000 runs to calculate the percentage of wrong predictions made
#by guessing whether a student has high alcohol use, based on knowledge of
#the high_use prevalence in our data.
#I.e. we draw a "guess_vector" with the same proportion of TRUE/FALSE
#guesses as the prevalence of high_use in our data.
#Place the relevant vectors into separate variables to streamline everything
#for the Monte Carlo loop:
high_use <- alc_data$high_use
probability_of_heavy_alc <- mean(high_use)
n_of_students <- length(high_use)
i <- 0
mc_runs <- 1000000
mc_mean_pred_error <- 0
while (i < mc_runs) {
  i <- i + 1
  #Create a vector of uniform random numbers between 0 and 1;
  #essentially a guess on the high_use status of each student
  random_vector <- runif(n_of_students, 0, 1)
  #Convert the random numbers to the corresponding random binary guesses
  guess_vector <- (random_vector <= probability_of_heavy_alc)
  #Calculate the proportion of wrong guesses
  mc_mean_pred_error <- mc_mean_pred_error + mean(high_use != guess_vector)
}
#Calculate the mean over all runs
mc_mean_pred_error <- mc_mean_pred_error / mc_runs
#Expected result derived arithmetically (see the derivation below):
expected_result <- 2 * (probability_of_heavy_alc * (1 - probability_of_heavy_alc))
expected_result
#Print the result of the Monte Carlo run:
mc_mean_pred_error
#For comparison: the proportion of students without high alcohol use,
#i.e. the accuracy of always guessing FALSE
1 - mean(alc_data$high_use)
```
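The closed-form expectation used above follows directly: a guess is wrong either when the student has high use (probability $p$) and we guess FALSE (probability $1-p$), or the other way around, so

$$P(\text{wrong}) = p(1-p) + (1-p)p = 2p(1-p).$$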
***
## Interpretation of model performance
***
The training error of our model was about 21%. Our guessing strategy was intentionally non-optimal and produced the right guess about 59% of the time.
The optimal guessing strategy would be to guess that no student has high alcohol use, which would yield a right guess about 71% of the time (or a
wrong guess 29% of the time). So, our model made roughly 28% fewer wrong predictions than the "optimal" guessing strategy (0.21 vs. 0.29). However, this holds only for the data our
model was trained on; on fresh data the performance would likely be inferior.
***
# BONUS 1: Ten-fold cross validation
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#Cost function: proportion of misclassifications at a 0.5 probability threshold
cost_function <- function(response, probability) {
  mean(response != (probability > 0.5))
}
library(boot)
cross_validation <- cv.glm(data = alc_data, cost = cost_function, glmfit = my_reg_model, K = 10)
#Average proportion of wrong predictions in the cross validation:
cross_validation$delta[1]
```
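A single 10-fold CV estimate depends on the random fold assignment; an optional way to stabilize it is to average over a few repeats:
```{r warning=FALSE, message=FALSE}
#Optional: average the CV error over 20 random fold assignments
repeated_cv <- replicate(20, cv.glm(data = alc_data, cost = cost_function,
                                    glmfit = my_reg_model, K = 10)$delta[1])
mean(repeated_cv)
```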
***
Our logistic regression model had a smaller testing error (about 21%) than the
DataCamp example (about 26%) under 10-fold cross-validation.
There are various methods for selecting the best subset of variables for a GLM, and it is likely that an even better model could be found.
However, (especially step-wise) selection of regression models from large sets of candidate variables is extremely prone to bias!
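One less bias-prone alternative is penalized regression, where cross-validation tunes a single regularization parameter instead of greedily searching over variable subsets. A sketch with the glmnet package follows (my own assumption, not part of the assignment; the chunk is not evaluated):
```{r eval=FALSE}
library(glmnet)
#Design matrix from all candidate predictors (leaking variables removed)
predictors <- select(alc_data, -c(Dalc, Walc, alc_use, model_probability, model_prediction))
x <- model.matrix(high_use ~ ., data = predictors)[, -1]
y <- factor(predictors$high_use)
#LASSO-penalized logistic regression, 10-fold CV over the penalty lambda
lasso_cv <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
#Misclassification error at the CV-optimal penalty, and the surviving coefficients
min(lasso_cv$cvm)
coef(lasso_cv, s = "lambda.min")
```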
***
# BONUS 2: Training vs testing error
***
```{r}
library("tibble")
#Strip the alc_data table down to the predictor variables and the response variable
bare_dataset <- select(alc_data, -c(Dalc, Walc, alc_use, model_probability, model_prediction))
#Create a regression model with all possible predictors
my_huge_reg_model <- glm(high_use ~ ., data = bare_dataset, family = "binomial")
#Prepare a table to hold our pairs of training and testing errors
comparison_table <- data.frame(matrix(ncol = 3, nrow = 0, dimnames = list(NULL, c("error", "error_type", "n_regressors"))))
comparison_table$error_type <- as.factor(comparison_table$error_type)
#A loop for backward stepping our regression model
i <- 1
while (i < ncol(bare_dataset)) {
  i <- i + 1
  #Calculate the training error for the current regression model
  probability <- predict(my_huge_reg_model, type = "response")
  training_error <- cost_function(bare_dataset$high_use, probability)
  #Calculate the testing error for the current regression model
  #using 10-fold cross validation
  testing_error <- cv.glm(data = bare_dataset,
                          cost = cost_function,
                          glmfit = my_huge_reg_model,
                          K = 10)$delta[1]
  num_of_regressors <- length(coef(my_huge_reg_model)) - 1
  #Add the relevant data to our table
  comparison_table <- add_row(comparison_table, error = training_error,
                              error_type = "Training error",
                              n_regressors = num_of_regressors)
  comparison_table <- add_row(comparison_table, error = testing_error,
                              error_type = "Testing error",
                              n_regressors = num_of_regressors)
  #Drop one variable from our regression equation with backward stepping
  #based on the Akaike Information Criterion (AIC).
  #Set k = 100 so we continue stepping even if we drop statistically
  #significant regressors
  my_huge_reg_model <- step(my_huge_reg_model, direction = "backward", steps = 1, k = 100, trace = 0)
}
#Print the requested plot comparing training and testing errors
comparison_plot <- ggplot(data = comparison_table, aes(x = n_regressors, y = error, group = error_type, color = error_type))
comparison_plot <- comparison_plot + geom_line() + geom_point()
comparison_plot <- comparison_plot + ggtitle("Training and testing errors vs number of regressors")
comparison_plot <- comparison_plot + labs(x = "Number of predictors", y = "Prediction error", color = "")
comparison_plot
```
***
As is evident from the comparison graph, an increasing number of regressors leads to overfitting, i.e.
the models become too complex and begin to describe spurious rather than true associations in the data. As a result, the training error grows ever so slightly smaller with
each added regressor (this would be true even with random variables as predictors).
Yet the testing error starts to grow with larger numbers of
regressors after reaching a minimum at around 5 regressors. This results from the decreasing external validity of the overfitted models: the predictions
of these models do not generalize, despite the very low training errors.
The "sweet spot" for the number of regressors in our data seems to be around 5 or 6.