---
title: "Chapter 4: Clustering and classification"
date: "*2020-11-19*"
output:
  html_document:
    toc: true
    toc_float:
      smooth_scroll: true
      collapsed: false
    toc_depth: 2
    number_sections: true
---
```{r, echo=FALSE, results='hide'}
#Fix the random seed to make the report reproducible
set.seed(469771822)
```
***
# Description of data
The data used in this exercise contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It has previously been released to the public domain
and has been used extensively to benchmark different algorithms.
The data first appeared in: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', *J. Environ. Economics & Management*, vol. 5, 81-102, 1978.
The data was originally used to relate housing prices to various predictors.
***
## Variable decoding
For an explanation of the individual variables, please refer to the following table:
***
Variable name | Explanation
--- | -------------
*crim* | per capita crime rate by town/suburb.
*zn* | proportion of residential land zoned for lots over 25,000 sq.ft.
*indus* | proportion of non-retail business acres per town.
*chas* | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
*nox* | nitrogen oxides concentration (parts per 10 million).
*rm* | average number of rooms per dwelling.
*age* | proportion of owner-occupied units built prior to 1940.
*dis* | weighted mean of distances to five Boston employment centres.
*rad* | index of accessibility to radial highways.
*tax* | full-value property-tax rate per \$10,000.
*ptratio* | pupil-teacher ratio by town.
*black* | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
*lstat* | lower status of the population (percent).
*medv* | median value of owner-occupied homes in \$1000s.
***
## Interactive table
```{r}
#Read the Boston data
library(MASS)
data("Boston")
#Draw an interactive table that can be used to view the data and its dimensions
library(DT)
datatable(Boston)
```
***
# Graphical overview and summary
***
## Summary table of variables
```{r}
library(gtsummary)
#First print out a summary table of all variables
my_table <- tbl_summary(Boston,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})",
                                                              "{min}, {max}"),
                                         all_categorical() ~ "{n} / {N} ({p}%)"))
#Bold the variable labels and print the table
my_table %>% bold_labels()
```
***
### Comments on the data tables
***
From the table we can see that some of the variables (like crim, zn, and tax) have quite skewed distributions. By exploring the interactive
data table presented previously, we can see that tax has multiple discrete levels and that some of these levels have disproportionately large numbers of
observations (like 711). zn is 0 for most observations (median = 0), although the range is 0 to 100.
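These observations can be double-checked with a couple of quick base-R summaries:
```{r}
#Count the observations at each discrete tax level; a few levels dominate
head(sort(table(Boston$tax), decreasing = TRUE))
#zn is zero for the majority of observations
summary(Boston$zn)
```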
***
## A scatter plot matrix
***
```{r fig.width=8, fig.height=8,warning=FALSE, message=FALSE}
library(GGally)
#Print a scatter plot matrix
ggpairs(data = Boston,
        upper = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"))
```
***
### Interpretation of the scatter plot matrix
We can see various types of correlation in the data. There are examples of linear correlation for some pairs of variables (e.g. rm vs medv)
and also non-linear correlations (e.g. nox vs dis).

As a continuous variable, crim seems to have a chi-squared-like distribution with a long tail towards higher values.
Crime also shows some potentially interesting correlations: for example, somewhat curiously, higher age seems to be associated with increasing crime levels.
Other variables with a potential positive correlation with crime are nox and lstat, while dis and medv may have a negative correlation with crime.
There are also various variables with "conditional" correlations with crime, such as zn, indus, chas, rad, tax, and ptratio.
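These visual impressions can be checked numerically with the pairwise correlations of crim against the variables mentioned above (a quick base-R sketch):
```{r}
#Correlations of crim with a few variables of interest
round(cor(Boston$crim, Boston[, c("age", "nox", "lstat", "dis", "medv")]), 2)
```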
***
# Standardize and scale the data
***
```{r}
#Scale and standardize the data as instructed
boston_scaled <- scale(Boston) %>% as.data.frame
#Yet another way to summarize variables:
library(psych)
describe(boston_scaled)
```
***
### How did the variables change
***
As can be seen from the summary of boston_scaled, all variable means were shifted to 0 and the standard deviations
were scaled to equal 1. In more general statistical terms, the data was *standardized* to a mean of 0 and an sd of 1. It should be noted that standardization
does not change the shape of a variable's distribution; e.g. if the distribution was skewed before, it will still be skewed after standardization.
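A quick check confirms this: the column means are zero (up to floating-point error) and the standard deviations equal one.
```{r}
#Column means should be numerically zero and standard deviations exactly one
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```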
***
## Create categorized crime variable
***
```{r}
#Include a convenience library
library(dvmisc)
#Create a discrete variable "crime" from quartiles of "crim"
boston_scaled$crime <- quant_groups(x = boston_scaled$crim,
                                    groups = 4,
                                    cut.list = list(labels = c("low", "med_low", "med_high", "high")))
#Drop the crim variable
boston_scaled <- boston_scaled[,!(names(boston_scaled) %in% "crim")]
#Demonstrate the outcome
str(boston_scaled)
```
***
This code is more or less self-explanatory (and a bit shorter than the DataCamp example).
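For comparison, roughly the same categorization can be done in base R without dvmisc. This is only a sketch, not part of the graded solution; note that quantile-based breaks could in principle produce ties, which quant_groups handles for us.
```{r}
#Base-R alternative: cut the scaled crim variable at its quartiles
scaled_crim <- scale(Boston)[, "crim"]
crime_alt <- cut(scaled_crim,
                 breaks = quantile(scaled_crim, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE,
                 labels = c("low", "med_low", "med_high", "high"))
table(crime_alt)
```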
***
## Divide into "train" and "test" subsets
***
```{r}
#Generate the sample indices
#Note that you must use floor() to get the split size exactly right
#if n_rows * 0.8 is not a whole number
n_rows <- nrow(boston_scaled)
my_training_sample <- sample(n_rows, size = floor(n_rows * 0.8))
#Training data set
boston_train <- boston_scaled[my_training_sample, ]
#Testing data set
boston_test <- boston_scaled[-my_training_sample, ]
```
***
This code chunk should be self-explanatory.
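A quick sanity check shows that the two subsets together cover all rows exactly once:
```{r}
#The 80/20 split should account for every row of boston_scaled
nrow(boston_train)
nrow(boston_test)
nrow(boston_train) + nrow(boston_test) == n_rows
```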
***
# Linear discriminant analysis
```{r out.width="100%"}
#Generate the LDA fit for the levels of crime with all other variables
#as discriminators
my_lda <- lda(crime ~ ., data = boston_train)
#Print the lda fit
my_lda
```
***
## Interpretation of the LDA fit
***
The prior probabilities of the groups are simply the proportions of observations in each group in the training data.
These priors are printed because they are needed for calculating the posterior probabilities that are used for the classification of the observations.
In the group means table, we can see the means of all independent variables for each quartile of crime.

One way to understand the coefficients of linear discriminants is that they are the coefficients of a linear function for calculating a "score" (y) from all independent variables in the model.
The coefficients are chosen such that the group means of y are as far as possible from the overall mean of y, while the variance of y within each individual group is minimized. Thus, the values of y represent the best one-dimensional projection of all
the independent variables for telling the groups apart.
There are multiple sets of linear discriminant coefficients in the output because there are multiple levels of crime. In fact, if there are k groups, the total number of linear discriminant functions is k-1,
since k group means span at most k-1 dimensions around the overall mean.

The proportion of trace for each linear discriminant function is the proportion of the (multivariate) between-group variance that the discriminant function is able to
explain. For example, if all the group means lie on a single line in the space of the independent variables, the first discriminant function alone will explain
100% of the between-group variance. On the other hand, if all scaled group means are equidistant from the overall mean and from each other, all discriminant
functions will explain an equal share of the between-group variance. In such a case, all discriminant functions (or dimensions) are equally important for predicting the group.
***
In our case, the first linear discriminant function explained 95.8% of the between-group variance in the training data. The two remaining discriminant functions added very little value for telling
the groups apart.
Since our variables were previously scaled, the scalar values of the linear discriminant coefficients can be compared directly. In the case of LD1, for example,
we can clearly see that rad was the most important discriminating variable.
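As a sanity check, the same proportions can be computed from the singular values stored in the fit object; to my understanding this is how MASS derives the printed values:
```{r}
#Squared singular values, normalized to sum to one, give the proportion of trace
round(my_lda$svd^2 / sum(my_lda$svd^2), 3)
```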
***
```{r}
#We'll use another 3rd party library for generating the biplot
#devtools::install_github("fawda123/ggord")
library(ggord)
#Print the biplot as requested
ggord(my_lda, boston_train$crime)
```
***
## Interpretation of the biplot
***
Only the two most significant linear discriminant functions (LD1 and LD2) are used for this plot: both values are calculated for each observation, and
the observations are plotted as a scatter plot of these two values.
The arrows in the plot represent the linear discriminant coefficients of these two functions (LD1/LD2); the longer the arrow, the larger the coefficient. We can clearly see that
for the combined effect of LD1 and LD2, rad is the most important discriminating variable. Again, it should be noted that this comparison only makes sense when the independent variables are scaled.
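If ggord is unavailable, a similar plot can be sketched directly from the per-observation scores returned by predict(); the colour mapping and transparency below are arbitrary choices, not part of the original:
```{r}
library(ggplot2)
#Extract the LD1/LD2 scores of the training observations and plot them
lda_scores <- as.data.frame(predict(my_lda)$x)
lda_scores$crime <- boston_train$crime
ggplot(lda_scores, aes(x = LD1, y = LD2, colour = crime)) +
  geom_point(alpha = 0.6)
```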
***
# Validating the LDA fit
***
```{r}
library(dplyr)
#First save the correct crime classes in the test data separately, as instructed
#(although this step is completely unnecessary)
correct_crime_classes <- boston_test$crime
#Drop the correct crime class from the test data
boston_test <- boston_test[, !(names(boston_test) %in% "crime")]
#Predict crime class based on the test data and our lda fit
crime_predictions <- predict(my_lda, newdata = boston_test)
#Cross tabulate the results
results_table <- as.data.frame(correct_crime_classes)
results_table$predicted_class <- crime_predictions$class
str(results_table)
tbl_cross(results_table,
          row = correct_crime_classes,
          col = predicted_class,
          label = list(correct_crime_classes ~ "Correct class", predicted_class ~ "Predicted class"),
          percent = "cell")
#Calculate the proportion of correct classifications
mean(correct_crime_classes == crime_predictions$class)
```
***
### Interpretation
***
The LDA model was able to predict the correct crime class in 73.5% of the cases in our test data. Compared to the pre-test probability of roughly 25% (in the whole data), this performance
is very good.
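The roughly 25% baseline can be verified from the class proportions in the test set: always guessing a single class would get only about a quarter of the observations right.
```{r}
#Class proportions in the test set (the "no information" baseline)
prop.table(table(correct_crime_classes))
```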
***
# K-means clustering
***
## The distances of observations
```{r}
#Reload the boston dataset and scale it
boston_scaled <- as.data.frame(scale(Boston))
#Calculate the matrix for euclidean distances for all observations
euclidean_distances <- dist(boston_scaled)
#Print a summary
summary(euclidean_distances)
#Print info on the matrix
str(euclidean_distances)
```
***
The values in this object are the pairwise multivariate Euclidean distances (over all variables in boston_scaled) between the observations (rows) of the data table.
The full distance matrix would have 506^2 entries, but only (506^2 - 506) / 2 = 127765 unique pairwise distances exist, as can be seen from the code output.
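This count can be verified directly from the dist object, which stores only the lower triangle of the distance matrix:
```{r}
#Number of stored pairwise distances vs. the expected n*(n-1)/2
length(euclidean_distances)
nrow(boston_scaled) * (nrow(boston_scaled) - 1) / 2
```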
***
## The optimal number of clusters
```{r}
#Please note that I have previously set the "global" seed for this r-project
#Visualize the behaviour of TWCSS vs the number of clusters
qplot(x = 1:10,
      y = sapply(1:10, function(k){kmeans(boston_scaled, k)$tot.withinss}),
      geom = 'line',
      ylab = "TWCSS")
```
***
As can be seen from the graph, there is a steep drop in the total within-cluster sum of squares (TWCSS) at k = 2 but not thereafter.
Therefore, based on the k-means clustering, the data contains two distinct multivariate clusters; i.e. there are no discernible sub-clusters within
the two clusters (in which case increasing k would reduce the TWCSS dramatically).
***
## Visualize the clusters
***
```{r fig.width=8, fig.height=8,warning=FALSE, message=FALSE}
#Perform k-means clustering with the optimal number of clusters
#Save the assigned cluster into the table
boston_scaled$cluster <- as.factor(kmeans(boston_scaled, 2)$cluster)
#Print a scatter plot matrix
ggpairs(data = boston_scaled,
        upper = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"),
        mapping = aes(col = cluster, alpha = 0.3))
```
***
### Interpretation
***
It is important to recognize that k-means clustering is an unsupervised method for classifying data. In other words,
we have provided the algorithm with no predefined classes or labels. Indeed, the clusters assigned by
the algorithm are not necessarily meaningful; they simply represent multivariate "concentrations" of observations.

From the scatter plot matrix we can see that, interestingly, higher vs. lower levels of crime seem to present as two discernible multivariate
clusters in our data. Since the increased crime rates have been assigned almost exclusively to the red cluster, we can also read the rest of the scatter plot matrix
as showing which variables are associated with higher levels of crime (red).
From the scatter plot matrix we can also see various interactions in the data that separate the clusters, such as lstat vs. medv:
a combination of high lstat and low medv has been assigned to the red cluster. This is quantified below.
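The claim that high crime rates fall almost exclusively into one of the clusters can be checked with a quick cross-tabulation (a sketch; the median split of the original, unscaled crim is an arbitrary choice for illustration):
```{r}
#Cross-tabulate the assigned cluster against a median split of the crime rate
table(cluster = boston_scaled$cluster,
      high_crime = Boston$crim > median(Boston$crim))
```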
***