---
title: 'Coursera Assignment: JHU Machine Learning in R [fitbit data]'
author: "Jonathan Zwart [jtlz2]"
output:
  html_document: default
  github_document: default
  pdf_document: default
url: http://github.com/jtlz2/fitbit
---
# Background
This is my report for the final, peer-reviewed assignment in the JHU Machine Learning in R Coursera course.
The data are the Weight Lifting Exercise Dataset (see http://groupware.les.inf.puc-rio.br/har), recorded by accelerometers worn on the belt, forearm, arm and dumbbell of the participants. They are available from:
1. Training Set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
2. Test Set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The aim is to predict the "classe" variable in the training set, a five-level (A, B, C, D, E) indicator of how an individual carried out an exercise.
The prediction is for 20 different test cases (see below).
# Nomenclature
I will divide the data as follows:
1. Training set: 70 per cent of the "Training Set" data, used for training only
2. Test set: 30 per cent of the "Training Set" data, used for error rate estimation
3. Holdout set: the "Test Set" data, used for verification of the analysis only
# Preliminaries
1. Set the working directory and load libraries
2. Register 4 cores to attempt acceleration of the training
3. Set the random seed for training
```{r}
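setwd("/Users/jtlz2/coursera/jhu/ml/project/fitbit")  # adjust to your local copy of the project
library(caret)
library(pROC)
library(doMC)
# Parallel backend for foreach; a direct randomForest() call does not use it
doMC::registerDoMC(cores=4)
library(randomForest)
set.seed(1234)  # fix the seed so the partition and the forest are reproducible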
```
# Data selection and Preprocessing
1. Load the training data
2. Convert the "classe" variable to a factor
3. Divide the training data into training (70 per cent) and test (30 per cent) sets
```{r}
trainFrame<-read.csv("data/pml-training.csv")
trainFrame$classe<-factor(trainFrame$classe)
inTrain<-createDataPartition(trainFrame$classe,p=0.7,list=FALSE)
trainFr<-trainFrame[inTrain,]
testFr<-trainFrame[-inTrain,]
```
4. Remove any near-zero-variance variables from the training and test sets (identified from the training set only)
```{r}
nzv<-nearZeroVar(trainFr,saveMetrics=TRUE)
trainFr<-trainFr[!nzv$nzv]
testFr<-testFr[!nzv$nzv]
```
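To illustrate what `nearZeroVar` flags, here is a minimal sketch on a made-up data frame. The `toy*` names and values are assumptions for illustration only and are not part of the assignment data. With caret's default cut-offs, a constant column and a column dominated by a single value should both be marked, while an ordinary continuous column is kept.

```{r}
library(caret)

# Hypothetical toy data, deterministic so it does not touch the random seed
toy <- data.frame(
  constant = rep(1, 100),                      # a single value: zero variance
  rare     = c(rep(0, 99), 1),                 # one value dominates: near-zero variance
  varied   = seq(0.1, 10, length.out = 100)    # ordinary continuous predictor
)

# saveMetrics=TRUE returns one row per column, including a logical nzv flag
toyNzv <- nearZeroVar(toy, saveMetrics = TRUE)
toyNzv

# Keep only the columns that were not flagged
toyKept <- toy[, !toyNzv$nzv, drop = FALSE]
colnames(toyKept)
```

Computing the flags on one data frame and reusing them on another, as done above for the training and test sets, keeps the column selection independent of the data used for error estimation.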
5. The first 5 columns contain no information useful for the training, so remove them
```{r}
excludeCols=1:5
trainFr<-trainFr[,-excludeCols]
testFr<-testFr[,-excludeCols]
```
6. Finally, load the holdout set. It is needed at this stage only to find the columns common to the training and holdout sets (otherwise the holdout prediction will not work).
7. At the same time, strip the first 5 columns from the holdout set (see above) and remove its near-zero-variance variables, so that the intersection keeps only columns retained in both sets.
```{r}
holdoutFr<-read.csv("data/pml-testing.csv")
holdoutFr<-holdoutFr[,-excludeCols]
nzv2<-nearZeroVar(holdoutFr,saveMetrics=TRUE)
holdoutFr<-holdoutFr[!nzv2$nzv]
classe=trainFr$classe
classe2=testFr$classe
common_cols <- intersect(colnames(trainFr), colnames(holdoutFr))
trainFr<-trainFr[,common_cols]
testFr<-testFr[,common_cols]
holdoutFr<-holdoutFr[,common_cols]
trainFr$classe=classe
testFr$classe=classe2
```
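The column alignment can be sketched in base R on two made-up data frames. The `toy*` names are hypothetical and stand in for the training and holdout frames. `intersect` returns the shared names in the order of its first argument, so subsetting both frames by the result gives them identical column order.

```{r}
# Hypothetical frames: one carries a label column, the other an id column
toyA <- data.frame(x = 1:3, y = 4:6, label = c("A", "B", "A"))
toyB <- data.frame(y = 7:8, x = 9:10, id = 1:2)

# Names present in both frames, ordered as in toyA
toyCommon <- intersect(colnames(toyA), colnames(toyB))
toyCommon

# Subset both frames to the shared columns, in the same order
toyA2 <- toyA[, toyCommon]
toyB2 <- toyB[, toyCommon]
identical(colnames(toyA2), colnames(toyB2))
```

Because the label column is absent from the second frame, it is dropped by the subsetting and has to be re-attached afterwards, which is why `classe` is saved beforehand and restored in the report's code.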
8. Now remove incomplete cases from the training, test and holdout sets
```{r}
trainFr<-trainFr[complete.cases(trainFr),]
testFr<-testFr[complete.cases(testFr),]
holdoutFr<-holdoutFr[complete.cases(holdoutFr),]
```
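9. Preprocess the training and test sets with KNN imputation for missing values. The imputation itself is probably not necessary since we have already removed incomplete cases, but caret's "knnImpute" method also centres and scales the predictors, so the transformation fitted on the training set is applied to the test set as well.
```{r}
preObj<-preProcess(trainFr[,-ncol(trainFr)],method="knnImpute")
trainFr<-predict(preObj,trainFr)
# Reuse the training-set transformation so both sets share one centring and scaling
testFr<-predict(preObj,testFr)
```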
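As a minimal sketch of how a `preProcess` object carries its transformation from one data frame to another, the example below fits "knnImpute" on a made-up training frame and applies it to new rows. The `toy*` names and values are assumptions for illustration. Since "knnImpute" centres and scales as a side effect, the new rows are expressed relative to the training means and standard deviations, not their own.

```{r}
library(caret)

# Hypothetical toy data with no missing values
toyTrain <- data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(2, 4, 4, 6, 8, 6))
toyNew   <- data.frame(a = c(3.5, 7),           b = c(5, 10))

# Fit the transformation on the training frame only
toyPre <- preProcess(toyTrain, method = "knnImpute")

# Apply the same fitted transformation to both frames
toyTrainScaled <- predict(toyPre, toyTrain)
toyNewScaled   <- predict(toyPre, toyNew)

colMeans(toyTrainScaled)   # training columns are centred
toyNewScaled               # new rows use the training mean and sd
```

The first new row sits exactly at the training means, so it maps to zero in both columns, whereas fitting a separate object on `toyNew` would centre the new rows on their own means.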
# Training
1. Now train using a random forest (which has performed well throughout my experience of the course), omitting any NA fields (also probably now unnecessary)
```{r}
RF<-randomForest(classe ~.,data=trainFr,na.action=na.omit)
RF
```
2. We can see that the OOB error rate is < 0.3 per cent
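3. Estimate the error rate on the test set with a confusion matrix, then plot receiver operating characteristic (ROC) curves for each pair of classes. The ROC curves turn out not to be very useful because the accuracy is so high.
```{r}
# caret expects the predictions first and the reference (true) classes second
conf<-confusionMatrix(predict(RF,testFr),testFr$classe)
conf
# multiclass.roc() needs a numeric predictor, so convert the predicted classes to integer codes
predTest<-as.numeric(predict(RF, testFr, type = 'response'))
roc.multi<-multiclass.roc(testFr$classe, predTest)
rs <- roc.multi[['rocs']]
plot.roc(rs[[1]])
# Overlay the remaining pairwise curves; invisible() suppresses the printed return values
invisible(sapply(2:length(rs),function(i) lines.roc(rs[[i]],col=i)))
```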
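The test-set error estimate can be read off a confusion matrix. The sketch below uses made-up labels (the `toy*` names are hypothetical) to show the argument order of caret's `confusionMatrix`, with predictions as `data` and true classes as `reference`, and how the out-of-sample error follows from the reported accuracy.

```{r}
library(caret)

# Hypothetical true and predicted classes; one of six cases is misclassified
toyTruth <- factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C"))
toyPred  <- factor(c("A", "A", "B", "C", "C", "C"), levels = c("A", "B", "C"))

# Predictions go in `data`, the true classes in `reference`
toyConf <- confusionMatrix(data = toyPred, reference = toyTruth)
toyConf$table

# Accuracy is the share of correct predictions; the error estimate is its complement
toyAcc <- unname(toyConf$overall["Accuracy"])
toyAcc
1 - toyAcc
```

Swapping the two arguments leaves the accuracy unchanged but transposes the table, so the per-class sensitivity and specificity would be reported against the wrong reference.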
# Prediction
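1. Now that the training is done, apply the training-set preprocessing (KNN imputation, centring and scaling) to the holdout set and predict using the earlier RF model
```{r}
# Use preObj, fitted on the training set, so the holdout predictors are on the model's scale
hF<-predict(preObj,holdoutFr)
p<-predict(RF,hF)
p
```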
2. Having earlier scored 20/20 in the test, I can assert the answers in order to unit-test my code after any refactoring. If the final statement evaluates to TRUE, the code is working correctly.
```{r}
answers<-c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A", "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
all(p==answers)
```
# End of report
Jonathan Zwart
16 March 2017