forked from dlab-berkeley/Machine-Learning-in-R
# Preprocessing
## Load packages
Explicitly load the packages that we need for this analysis.
```{r packages}
library(rio)
library(ck37r)
library(caret)
```
## Load the data
Load the heart disease dataset.
```{r load_data}
# Load the heart disease dataset using import() from the rio package.
data_original = import("data-raw/heart.csv")
# Preserve the original copy
data = data_original
str(data)
```
## Read background information and variable descriptions
https://archive.ics.uci.edu/ml/datasets/heart+Disease
## Data preprocessing
Data preprocessing is an integral first step in machine learning workflows. Because different algorithms sometimes require the moving parts to be coded in slightly different ways, always research the algorithm you want to implement so that you properly set up your $y$ and $x$ variables and split your data appropriately.
> NOTE: also, use the `save` function to save your variables of interest. In the remaining walkthroughs, we will use the `load` function to load the relevant variables.
### What is one-hot encoding?
One additional preprocessing aspect to consider: datasets that contain factor (categorical) features should typically be expanded out into numeric indicators (this is often referred to as [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)). You can do this manually with the `model.matrix` R function. This makes it easier to apply a variety of algorithms to a dataset, as many algorithms handle factors poorly (decision trees being the main exception). Doing this manually is always good practice. In general, however, functions like `lm` will do this for you automatically.
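As a minimal sketch of the manual approach, the block below expands a single factor with `model.matrix`. The `toy` data frame and its values are made up for illustration and are not taken from the heart disease data. Removing the intercept (`- 1`) asks for one indicator column per factor level; with the intercept left in, the first level would be dropped as the reference category.

```r
# Toy data frame (hypothetical values, not the heart disease data)
toy = data.frame(
  age = c(63, 37, 41, 56),
  cp  = factor(c("typical", "atypical", "non_anginal", "typical"))
)

# "~ . - 1" removes the intercept, so the factor "cp" is expanded into
# one 0/1 indicator column per level; the numeric "age" column is kept as-is.
toy_onehot = model.matrix(~ . - 1, data = toy)
toy_onehot

# Each row has exactly one "cp" indicator switched on
rowSums(toy_onehot[, grepl("^cp", colnames(toy_onehot))])
```

The result is a numeric matrix rather than a data frame, so wrap it in `as.data.frame()` if a later step expects a data frame.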
## Handling missing data
Missing values need to be handled somehow. Listwise deletion (deleting any row with at least one missing value) is common, but this method throws out a lot of useful information. Many advocate for mean imputation, but arithmetic means are sensitive to outliers. Still others advocate for Chained Equation/Bayesian/Expectation Maximization imputation (e.g., the [mice](https://www.jstatsoft.org/article/view/v045i03/v45i03.pdf) and [Amelia II](https://gking.harvard.edu/amelia) R packages). K-nearest neighbor imputation can also be useful, but median imputation is demonstrated below.
However, you will want to learn about [Generalized Low Rank Models](https://stanford.edu/~boyd/papers/pdf/glrm.pdf) for missing data imputation in your research. See the `impute_missing_values` function from the ck37r package to learn more - you might need to install an h2o dependency.
First, count the number of missing values across variables in our dataset.
```{r review_missingness}
colSums(is.na(data))
```
We have no missing values, so let's introduce a few to the "oldpeak" feature for this example to see how it works:
```{r}
# Add five missing values to oldpeak in row numbers 50, 100, 150, 200, 250
data$oldpeak[c(50, 100, 150, 200, 250)] = NA
colSums(is.na(data))
colMeans(is.na(data))
```
There are now 5 missing values in the "oldpeak" feature. Now, median impute the missing values! We also want to create missingness indicators to inform us about the location of missing data. These are additional columns we will add to our data frame that represent the _locations_ within each feature that have missing values - 0 means data are present, 1 means there was a missing (and subsequently imputed) value.
```{r impute_missing_values}
result = ck37r::impute_missing_values(data, verbose = TRUE, type = "standard")
names(result)
# Use the imputed dataframe.
data = result$data
# View new columns. Note that the indicator feature "miss_oldpeak" has been added as the last column of our data frame.
str(data)
# No more missing data!
colSums(is.na(data))
```
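To make the idea concrete, here is a base R sketch of median imputation with a missingness indicator. It is not the implementation used by `ck37r::impute_missing_values`; the `toy` data frame and its values are hypothetical, and the `miss_` prefix simply mirrors the column name seen above.

```r
# Toy data frame with two missing values (hypothetical values)
toy = data.frame(oldpeak = c(2.3, NA, 1.4, 0.8, NA, 0.6))

# Missingness indicator: 1 where the value was missing, 0 where it was observed.
# This must be created BEFORE imputing, otherwise the locations are lost.
toy$miss_oldpeak = as.integer(is.na(toy$oldpeak))

# Replace the missing values with the median of the observed values
oldpeak_median = median(toy$oldpeak, na.rm = TRUE)
toy$oldpeak[is.na(toy$oldpeak)] = oldpeak_median

toy
colSums(is.na(toy))
```

In a real workflow the median would usually be computed on the training rows only and then reused for the test rows, so that no information leaks from the test set.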
The "ca", "cp", "slope", and "thal" features are categorical but are currently stored as integers, so convert them to factors. The code below also converts "sex". The other relevant variables are either continuous or are already indicators (just 1's and 0's).
```{r}
data = ck37r::categoricals_to_factors(data,
                                      categoricals = c("sex", "ca", "cp", "slope", "thal"),
                                      verbose = TRUE)
# Inspect the updated data frame
str(data)
```
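For comparison, the same kind of conversion can be sketched in base R with `factor()`. This is an illustration on a hypothetical `toy` data frame, not a description of how `ck37r::categoricals_to_factors` works internally.

```r
# Toy data frame with integer-coded categories (hypothetical values)
toy = data.frame(
  cp   = c(0L, 2L, 1L, 2L),
  thal = c(1L, 3L, 2L, 2L),
  age  = c(63, 37, 41, 56)
)

# Names of the columns that should be treated as categorical
categoricals = c("cp", "thal")

# Convert only those columns; "age" stays numeric
toy[categoricals] = lapply(toy[categoricals], factor)

str(toy)
```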
## Defining *y* outcome vectors and *x* feature dataframes
### Convert factors to indicators
Now expand "sex", "ca", "cp", "slope", and "thal" features out into indicators.
```{r factors_to_indicators}
result = ck37r::factors_to_indicators(data, verbose = TRUE)
data = result$data
str(data)
dim(data)
```
What happened?
## Regression setup
Now that the data have been imputed and properly converted, we can assign the regression outcome variable (`age`) to its own vector for the lasso **REGRESSION task**. Remember that lasso can also perform classification.
### Set seed for reproducibility
Take the simple approach to data splitting and divide our data into training and test sets; 70% of the data will be assigned to the training set and the remaining 30% will be assigned to the holdout, or test, set.
### Random versus stratified random split
Splitting data into training and test subsets is a fundamental step in machine learning. Usually, the majority of the original dataset is assigned to the training set, where the algorithms learn the relationships between the $x$ feature predictors and the $y$ outcome variable. Then, these models are given new data (the test set) to see how well they perform on data they have not yet seen.
Since age is a continuous variable and will be the outcome for the OLS and lasso regressions, we will not perform a stratified random split like we will for the classification tasks (see below). Instead, [let's randomly assign](https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function) 70% of the rows to the training set and the remaining 30% to the test set.
```{r}
# Create a list to organize our regression task
task_reg = list(
data = data,
outcome = "age"
)
# All variables can be used as covariates except the outcome ("age")
(task_reg$covariates = setdiff(names(data), task_reg$outcome))
names(task_reg)
# Define the sizes of training (70%) and test (30%) sets.
(training_size = floor(0.70 * nrow(task_reg$data)))
# Set seed for reproducibility.
set.seed(1)
# Partition the rows to be included in the training set.
training_rows = sample(nrow(task_reg$data), size = training_size)
task_reg$train_rows = training_rows
head(task_reg$train_rows)
# View our regression task list
names(task_reg)
```
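The purpose of the split can be illustrated end to end with a small sketch: fit on the training rows, then measure error on the held-out rows. The built-in `mtcars` data, the `mpg ~ wt + hp` formula, and the use of RMSE are assumptions chosen only for illustration; they are not part of the heart disease analysis.

```r
# Built-in mtcars data stands in for the heart disease data (illustration only)
set.seed(1)
n = nrow(mtcars)
train_rows = sample(n, size = floor(0.70 * n))

train = mtcars[train_rows, ]
test  = mtcars[-train_rows, ]

# Fit the model on the training set only
fit = lm(mpg ~ wt + hp, data = train)

# Predict for rows the model has not seen and compute root mean squared error
pred = predict(fit, newdata = test)
rmse = sqrt(mean((test$mpg - pred)^2))
rmse
```

Negative indexing (`-train_rows`) selects every row that was not sampled, which is the same pattern used for the test sets later in this file.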
## Classification setup
Assign the outcome variable to its own vector for the decision tree, random forest, gradient boosted tree, and SuperLearner ensemble **CLASSIFICATION tasks**. However, keep in mind that these algorithms can also perform regression!
This time, however, "target" will be our y outcome variable (1 = person has heart disease, 0 = person does not have heart disease) - the others will be our x features.
```{r}
# Store everything in a list like we did for the regression task
task_class = list(
data = data,
outcome = "target"
)
(task_class$covariates = setdiff(names(task_class$data), task_class$outcome))
# See the names of the list elements
names(task_class)
```
Our factors are still expanded into indicators from the regression setup! :)
### Stratified random split
For classification, we then use [stratified random sampling](https://stats.stackexchange.com/questions/250273/benefits-of-stratified-vs-random-sampling-for-generating-training-data-in-classi) to divide our data into training and test sets; 70% of the data will be assigned to the training set and the remaining 30% will be assigned to the holdout, or test, set.
```{r}
# Set seed for reproducibility.
set.seed(2)
# Create a stratified random split.
training_rows =
  caret::createDataPartition(task_class$data[[task_class$outcome]],
                             p = 0.70, list = FALSE)
# Partition training dataset
task_class$train_rows = training_rows
mean(task_class$data[training_rows, "target"])
table(task_class$data[training_rows, "target"])
mean(task_class$data[-training_rows, "target"])
table(task_class$data[-training_rows, "target"])
```
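The idea behind a stratified split can be sketched in base R by sampling 70% of the rows separately within each outcome class. This is a simplified illustration on a hypothetical outcome vector `y`; `caret::createDataPartition` may differ in its details.

```r
# Toy binary outcome (hypothetical): 60 ones followed by 40 zeros
y = rep(c(1, 0), times = c(60, 40))

set.seed(2)

# Group the row indices by outcome class
rows_by_class = split(seq_along(y), y)

# Within each class, draw 70% of that class's row indices.
# Indexing with sample.int() avoids the surprise that sample(x) treats a
# single number x as 1:x.
train_rows = unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE)

# The share of positive cases is the same in both subsets
mean(y[train_rows])
mean(y[-train_rows])
```

A simple random split would only match these proportions on average, which matters most when one class is rare.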
## Define training and test x and y variables for regression and classification
### Regression variables
```{r}
# Pull out regression preprocessed data for easier analysis
train_x_reg = task_reg$data[task_reg$train_rows, task_reg$covariates]
train_y_reg = task_reg$data[task_reg$train_rows, task_reg$outcome]
test_x_reg = task_reg$data[-task_reg$train_rows, task_reg$covariates]
test_y_reg = task_reg$data[-task_reg$train_rows, task_reg$outcome]
# Look at ages of first 20 individuals
head(train_y_reg, n = 20)
# Look at features for the corresponding first 6 individuals
head(train_x_reg)
```
### Classification variables
```{r}
train_x_class = task_class$data[task_class$train_rows, task_class$covariates]
train_y_class = task_class$data[task_class$train_rows, task_class$outcome]
test_x_class = task_class$data[-task_class$train_rows, task_class$covariates]
test_y_class = task_class$data[-task_class$train_rows, task_class$outcome]
```
### Save our preprocessed data
We save our preprocessed data into an RData file so that we can easily load it in the later files.
```{r save_data}
save(data, data_original,
     task_reg, task_class,
     train_x_reg, train_y_reg, test_x_reg, test_y_reg,
     train_x_class, train_y_class, test_x_class, test_y_class,
     file = "data/preprocessed.RData")
```
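The round trip between `save` and `load` can be sketched as follows. The object names and the temporary file are hypothetical and used only so the example does not touch `data/preprocessed.RData`.

```r
# Hypothetical objects and a temporary file, used only for illustration
toy_x = data.frame(a = 1:3)
toy_y = c(0, 1, 0)
tmp_file = tempfile(fileext = ".RData")

# save() stores the objects under their current names
save(toy_x, toy_y, file = tmp_file)

# load() restores them under those same names and returns the names.
# Loading into a fresh environment shows exactly what the file contains
# without overwriting anything in the workspace.
restored = new.env()
loaded_names = load(tmp_file, envir = restored)
loaded_names

identical(restored$toy_x, toy_x)
```

Note that `save` does not create missing directories, so the `data/` folder needs to exist before the chunk above is run.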