---
title: "Assignment 2 - Language Development in ASD - Making predictions"
author: "Riccardo Fusaroli"
date: "August 9, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Welcome to the second exciting part of the Language Development in ASD exercise
In this exercise we will delve deeper into different practices of model comparison and model selection, first by evaluating your models from last time against some new data. Do they generalize well?
Then we will learn to do better by cross-validating models and systematically comparing them.
The questions to be answered (in a separate document) are:
1- Discuss the differences in performance of your model on the training data and the test data
2- Which individual differences should be included in a model that maximizes your ability to explain/predict new data?
3- Predict a new kid's performance (Bernie) and discuss it against the expected performance of the two groups
## Learning objectives
- Critically appraise the predictive framework (contrasted to the explanatory framework)
- Learn the basics of machine learning workflows: training/testing, cross-validation, feature selection
## Let's go
N.B. There are several datasets for this exercise, so pay attention to which one you are using!
1. The (training) dataset from last time (the awesome one you produced :-) ).
2. The (test) datasets on which you can test the models from last time:
* Demographic and clinical data: https://www.dropbox.com/s/ra99bdvm6fzay3g/demo_test.csv?dl=1
* Utterance Length data: https://www.dropbox.com/s/uxtqqzl18nwxowq/LU_test.csv?dl=1
* Word data: https://www.dropbox.com/s/1ces4hv8kh0stov/token_test.csv?dl=1
### Exercise 1) Testing model performance
How did your models from last time perform? In this exercise you have to compare the results on the training data and on the test data. Report both, compare them, and discuss why they differ.
- recreate the models you chose last time (just write the model code again and apply it to your training data (from the first assignment))
- calculate performance of the model on the training data: root mean square error is a good measure. (Tip: google the function rmse())
- create the test dataset (apply the code from assignment 1 to clean up the 3 test datasets)
- test the performance of the models on the test data (Tip: google the function predict())
- optional: predictions are never certain, can you identify the uncertainty of the predictions? (e.g. google the function predictInterval())
```{r, include = FALSE}
pacman::p_load(readr, dplyr, stringr, lmerTest, Metrics, caret, tidyverse, groupdata2)

## Clean-up function, included to inspire you
CleanUpData <- function(Demo, LU, Word){
  Speech <- merge(LU, Word) %>%
    dplyr::rename(
      Child.ID = SUBJ,
      Visit = VISIT) %>%
    mutate(
      Visit = as.numeric(str_extract(Visit, "\\d")),
      Child.ID = gsub("\\.", "", Child.ID)
    ) %>%
    dplyr::select(
      Child.ID, Visit, MOT_MLU, CHI_MLU, types_MOT, types_CHI, tokens_MOT, tokens_CHI
    )
  Demo <- Demo %>%
    dplyr::select(
      Child.ID, Visit, Ethnicity, Diagnosis, Gender, Age, ADOS, MullenRaw, ExpressiveLangRaw, Socialization
    ) %>%
    mutate(
      Child.ID = gsub("\\.", "", Child.ID)
    )
  Data <- merge(Demo, Speech, all = F) # Originally (RF) all = T; changed to all = F to drop rows with NAs
  Data1 <- Data %>%
    subset(Visit == "1") %>%
    dplyr::select(Child.ID, ADOS, ExpressiveLangRaw, MullenRaw, Socialization) %>%
    dplyr::rename(Ados1 = ADOS,
                  verbalIQ1 = ExpressiveLangRaw,
                  nonVerbalIQ1 = MullenRaw,
                  Socialization1 = Socialization)
  Data <- merge(Data, Data1, all = F) %>% # Again all = F to remove NAs
    mutate(
      Child.ID = as.numeric(as.factor(as.character(Child.ID))),
      Visit = as.numeric(as.character(Visit)),
      Gender = recode(Gender,
                      "1" = "M",
                      "2" = "F"),
      Diagnosis = recode(Diagnosis,
                         "A" = "TD",
                         "B" = "ASD")
    )
  return(Data)
}
# Load training Data
training_data <- CleanUpData(Demo = read.csv("demo_train.csv"), LU = read.csv("LU_train.csv"), Word = read.csv("token_train.csv"))
#- create the test dataset (apply the code from assignment 1 or my function to clean up the 3 test datasets)
# Test data
test_data <- CleanUpData(Demo = read.csv("demo_test.csv"), LU = read.csv("LU_test.csv"), Word = read.csv("token_test.csv"))
#- recreate the models you chose last time (just write the code again and apply it to Train Data)
model1 <- lmer(CHI_MLU ~ Visit*Diagnosis*verbalIQ1 + MOT_MLU + verbalIQ1:MOT_MLU + (1+Visit|Child.ID), data = training_data, REML = F) # Dredged model from last time; kept for comparison, but we chose the non-dredged model below
model2 <- lmer(CHI_MLU ~ Visit*Diagnosis + verbalIQ1 + MOT_MLU + (1+Visit|Child.ID), data = training_data, REML = F)
summary(model2)
MuMIn::r.squaredGLMM(model2)
#- calculate performance of the model on the training data: root mean square error is a good measure. (Tip: google the function rmse())
rmse(training_data$CHI_MLU, predict(model1)) #0.340
rmse(training_data$CHI_MLU, predict(model2)) #0.342
#How rmse() works: rmse(actual values, predicted values) - here the observed CHI_MLU vs. the model's predictions on the training data
#- test the performance of the models on the test data (Tips: google the functions "predict()")
rmse(test_data$CHI_MLU, predict(model1,test_data)) #0.487 - dredged model
rmse(test_data$CHI_MLU, predict(model2,test_data)) #0.489 - own model
#- optional: predictions are never certain, can you identify the uncertainty of the predictions? (e.g. google the function predictInterval() from the merTools package)
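# A minimal sketch for the optional question, assuming the merTools package is available
# (it is not loaded by p_load above): predictInterval() simulates from the fitted model
# to put lower and upper bounds around each prediction.
if (requireNamespace("merTools", quietly = TRUE)) {
  pred_int <- merTools::predictInterval(model2, newdata = test_data, n.sims = 1000, level = 0.95)
  head(pred_int) # columns: fit, upr, lwr
}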
```
[HERE GOES YOUR ANSWER]
### Exercise 2) Model Selection via Cross-validation (N.B: ChildMLU!)
One way to reduce bad surprises when testing a model on new data is to assess the model via cross-validation already during training.
In this exercise you have to use cross-validation to calculate the predictive error of your models and use this predictive error to select the best possible model.
- Use cross-validation to compare your model from last week with the basic model (Child MLU as a function of Time and Diagnosis, and don't forget the random effects!)
- (Tip: google the function createFolds(); loop through each fold, train both models on the other folds and test them on the held-out fold)
- Now try to find the best possible predictive model of ChildMLU, that is, the one that produces the best cross-validated results.
- Bonus Question 1: What is the effect of changing the number of folds? Can you plot RMSE as a function of number of folds?
- Bonus Question 2: compare the cross-validated predictive error against the actual predictive error on the test data
```{r}
#- Create the basic model of ChildMLU as a function of Time and Diagnosis (don't forget the random effects!).
basic <- lmer(CHI_MLU ~ Visit*Diagnosis + (1+Visit|Diagnosis), training_data, REML = F)
#- Make a cross-validated version of the model. (Tips: google the function "createFolds"; loop through each fold, train a model on the other folds and test it on the fold)
set.seed(12)
#Using the function fold from the package groupdata2 - making sure that the variable diagnosis is represented in all folds and that all data from one child is in the same fold.
folds3 <- fold(training_data, k = 5, cat_col = "Diagnosis", id_col = "Child.ID")
# Cross-validation loop for the basic model
# Note: the random slopes here are grouped by Diagnosis rather than Child.ID; with
# (1+Visit|Child.ID) you would need predict(..., allow.new.levels = TRUE), because the
# children in the held-out fold are unseen by the model trained on the other folds.
result <- c() # empty vector for the loop to fill with one RMSE per fold
for (i in unique(folds3[[".folds"]])){
  test <- subset(folds3, .folds == i)
  train <- subset(folds3, .folds != i)
  model <- lmer(CHI_MLU ~ Visit*Diagnosis + (1+Visit|Diagnosis), train, REML = F)
  result[i] <- rmse(test$CHI_MLU, predict(model, test))
  print(result[i])
}
print(c("mean RMSE, basic model" = mean(result)))

# Cross-validation loop for our own model
result <- c()
for (i in unique(folds3[[".folds"]])){
  test <- subset(folds3, .folds == i)
  train <- subset(folds3, .folds != i)
  model <- lmer(CHI_MLU ~ Visit*Diagnosis + verbalIQ1 + MOT_MLU + (1+Visit|Diagnosis), train, REML = F)
  result[i] <- rmse(test$CHI_MLU, predict(model, test))
  print(result[i])
}
print(c("mean RMSE, own model" = mean(result)))
#- Report the results and comment on them.
#- Now try to find the best possible predictive model of ChildMLU, that is, the one that produces the best cross-validated results.
# Bonus Question 1: What is the effect of changing the number of folds? Can you plot RMSE as a function of number of folds?
# Bonus Question 2: compare the cross-validated predictive error against the actual predictive error on the test data
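# Bonus sketch (illustrative, not part of the original solution): wrap the CV above in a
# function and vary the number of folds, then plot the mean cross-validated RMSE against k.
cv_rmse <- function(k, data) {
  folds <- fold(data, k = k, cat_col = "Diagnosis", id_col = "Child.ID")
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    test <- subset(folds, .folds == i)
    train <- subset(folds, .folds != i)
    m <- lmer(CHI_MLU ~ Visit*Diagnosis + verbalIQ1 + MOT_MLU + (1+Visit|Diagnosis),
              train, REML = F)
    fold_rmse[i] <- rmse(test$CHI_MLU, predict(m, test))
  }
  mean(fold_rmse)
}
k_values <- c(3, 5, 8, 10)
cv_by_k <- sapply(k_values, cv_rmse, data = training_data)
plot(k_values, cv_by_k, type = "b",
     xlab = "Number of folds", ylab = "Mean cross-validated RMSE")
# Bonus 2: these cross-validated RMSEs can be held against the test-set RMSEs from Exercise 1.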
```
[HERE GOES YOUR ANSWER]
### Exercise 3) Assessing the single child
Let's get to business. This new kiddo - Bernie - has entered your clinic. This child has to be assessed according to his group's average and his expected development.
Bernie is one of the six kids in the test dataset, so make sure to extract that child alone for the following analysis.
You want to evaluate:
- how does the child fare in ChildMLU compared to the average TD child at each visit? Define the distance in terms of the absolute difference between this child and the average TD child.
- how does the child fare compared to the model predictions at Visit 6? Is the child below or above expectations? (tip: use the predict() function on Bernie's data only and compare the prediction with the actual performance of the child)
```{r}
#Filtering Bernie from test data
bernie <- filter(test_data, Child.ID == "2")
#Measuring mean MLU of the TD kids in the training data
tds <- filter(training_data, Diagnosis == "TD")
tds
#Mean CHI_MLU of the average TD kid per visit
means <- aggregate(CHI_MLU ~ Visit, data = tds, mean)
print(means)
#Distance between Bernie and the average TD child at each visit (absolute difference)
difference <- bernie$CHI_MLU - means$CHI_MLU
print(difference)
print(abs(difference))
#The average TD child increases its MLU (much) more slowly than Bernie.
#How does the child fare compared to the model predictions at Visit 6? Is the child below or above expectations? (tip: use the predict() function on Bernie's data only and compare the prediction with the actual performance of the child)
#0.50 - average of the squared residuals for each data point
#Creating a df with Bernie's data for visit 6 only
berniev6 <- filter(bernie, Visit == "6")
predict(model1, berniev6)
#3.347 - predicted CHI_MLU at visit 6; compare with Bernie's actual value, berniev6$CHI_MLU
#Note: Bernie's Child.ID (2) also occurs in the training data, so predict() borrows that training
#child's random effects; predict(model1, berniev6, re.form = NA) would use the fixed effects only.
```
[HERE GOES YOUR ANSWER]
### OPTIONAL: Exercise 4) Model Selection via Information Criteria
Another way to reduce bad surprises when testing a model on new data is to pay close attention to the relative information criteria of the models you are comparing. Let's learn how to do that!
Re-create a selection of possible models explaining ChildMLU (the ones you tested for exercise 2, but now trained on the full dataset and not cross-validated).
Then try to find the best possible predictive model of ChildMLU, that is, the one that produces the lowest information criterion (a minimal sketch of such a comparison follows below).
- Bonus question for the optional exercise: are information criteria correlated with cross-validated RMSE? That is, if you take AIC for Model 1, Model 2 and Model 3, do they co-vary with their cross-validated RMSE?
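Below is an illustrative sketch of how the comparison could look. It assumes the training data and packages loaded in the chunks above; the two candidate models are placeholders (random effects by Child.ID, as in model2), and the actual set of models to compare is up to you.
```{r}
# Illustrative only: refit two candidate models on the full training data and compare
# their information criteria (lower = better expected out-of-sample fit).
ic_basic <- lmer(CHI_MLU ~ Visit*Diagnosis + (1+Visit|Child.ID), training_data, REML = F)
ic_own   <- lmer(CHI_MLU ~ Visit*Diagnosis + verbalIQ1 + MOT_MLU + (1+Visit|Child.ID), training_data, REML = F)
AIC(ic_basic, ic_own)
BIC(ic_basic, ic_own)   # BIC penalizes extra parameters more heavily than AIC
anova(ic_basic, ic_own) # likelihood-ratio test as a complement to the information criteria
# Bonus: line these AIC values up against the cross-validated RMSE of the same models
# from Exercise 2 to check whether the two approaches rank the models the same way.
```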
### OPTIONAL: Exercise 5) Using Lasso for model selection
Welcome to the last secret exercise. If you have already solved the previous exercises and still want more, you can expand your expertise by learning about penalization. Check out this tutorial: http://machinelearningmastery.com/penalized-regression-in-r/ and make sure to google what penalization is, with a focus on the L1 and L2 norms. Then try them on your data! A rough sketch of a lasso fit on this dataset is included below.
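The following is an illustrative sketch only, assuming the glmnet package is installed (it is not loaded above). The predictor set is arbitrary, and glmnet works on a fixed-effects design matrix, so the random-effects structure used in the models above is ignored here.
```{r}
# Illustrative only: L1-penalized (lasso) regression of CHI_MLU on a set of candidate predictors.
if (requireNamespace("glmnet", quietly = TRUE)) {
  library(glmnet)
  d <- na.omit(training_data[, c("CHI_MLU", "Visit", "Diagnosis", "MOT_MLU",
                                 "verbalIQ1", "nonVerbalIQ1", "Ados1", "Socialization1")])
  X <- model.matrix(CHI_MLU ~ Visit + Diagnosis + MOT_MLU + verbalIQ1 +
                      nonVerbalIQ1 + Ados1 + Socialization1, data = d)[, -1]
  y <- d$CHI_MLU
  cv_fit <- cv.glmnet(X, y, alpha = 1) # alpha = 1 -> lasso (L1); alpha = 0 -> ridge (L2)
  coef(cv_fit, s = "lambda.min")       # coefficients shrunk toward zero; dropped predictors show as "."
}
```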