This repository has been archived by the owner on May 6, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 72
/
Copy path09-hclust.Rmd
206 lines (157 loc) · 7.04 KB
/
09-hclust.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
# Hierarchical Agglomerative Clustering
# Load data
Review the numeric heart dataset from 08-PCA.Rmd
```{r}
load("data/unsupervised.RData")
str(ml_num)
```
# Overview
Hierarchical agglomerative clustering is a "bottom-up" method of clustering. Each observation begins as its own cluster and forms clusters with like items as it moves up the hierarchy. That is, all leaves are their own clusters to begin with and form clusters as grouping moves up the trunk and various branches are formed.
Distance and cluster method information are usually displayed at the bottom of the graph, while the vertical axis displays the height, which refers to the distance between two clusters. We can also "cut" the dendrogram to specify a number of clusters, which is similar to defining _k_ in k-means clustering (which can also be problematic).
```{r}
library(ape)
library(pvclust)
library(mclust)
```
Start by using the `hclust` built-in function, which prefers a distance matrix via the `dist` function. This plots rows as opposed to columns like the methods further below.
```{r}
?hclust
# Create distance matrix
heart_dist = dist(ml_num[, -6], method = "euclidean")
# Fit hclust_model
system.time({
hclust_model = hclust(heart_dist, method = "complete")
})
# Plot hclust_model dendrogram
plot(hclust_model, hang = -1)
```
Data are visualized in dendrograms, or branching tree-like structures similar to decision trees, albeit with less information displayed at each node. The most similar items are found lower in the dendrogram and fuse into $n-1$ clusters as we move up the tree; the next two items to fuse into a cluster produce $n-2$ clusters and so on as we move up the tree until there is just one overarching cluster. Thus, clusters become more inclusive as we move up the hierarchy.
Dissimilarity is applied not just to single observations, but to groups as well (linkage).
You can also cut the tree to see how the tree varies:
```{r}
# If we want only 5 clusters, for example (must be a number between 1-303), since ml_num has 303 observations:
cutree(hclust_model, 5)
```
# The `ape` package
The [`ape` package](https://cran.r-project.org/web/packages/ape/index.html) provides some great functionality for constructing and plotting clusters:
```{r}
# Various plots
plot(as.phylo(hclust_model))
plot(as.phylo(hclust_model), type = "cladogram")
plot(as.phylo(hclust_model), type = "unrooted")
# Radial plot
colors = c("red", "orange", "blue", "green", "purple")
clus5 = cutree(hclust_model, 5)
plot(as.phylo(hclust_model), type = "fan", tip.color = colors[clus5], lwd = 2, cex = 1)
# These color settings apply to the other ape plots as well
```
# The `pvclust` package
The [pvclust](http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/) package offers a straightfoward way to perform hierarchical agglomerative clustering of columns with two types of p-values at each split: approximately unbiased **(AU)** and bootstrap probability **(BP)**.
## Compare different dissimilarity measures
### Ward's method: minimum variance between clusters
```{r}
system.time({
pvclust_model_ward = pvclust(ml_num[,-6],
method.hclust = "ward.D",
method.dist = "euclidean",
nboot = 1000, parallel = T)
})
plot(pvclust_model_ward)
# pvrect will draw rectangles around clusters with high or low p-values
pvrect(pvclust_model_ward, alpha = 0.95)
```
### Complete linkage: largest intercluster difference
```{r}
pvclust_model_complete = pvclust(ml_num[,-6],
method.hclust = "complete",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_complete)
pvrect(pvclust_model_complete, alpha = 0.95)
```
### Single linkage: smallest intercluster difference
```{r}
pvclust_model_single = pvclust(ml_num[,-6],
method.hclust = "single",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_single)
pvrect(pvclust_model_single, alpha = 0.95)
```
### Average linkage: mean intercluster difference
```{r}
pvclust_model_average = pvclust(ml_num[,-6],
method.hclust = "average",
method.dist = "euclidean",
nboot = 1000, parallel = T)
plot(pvclust_model_complete)
pvrect(pvclust_model_complete, alpha = 0.95)
```
### View summaries
```{r}
(clust_sum = list("Ward" = pvclust_model_ward$edges,
"Complete" = pvclust_model_complete$edges,
"Single" = pvclust_model_single$edges,
"Average" = pvclust_model_average$edges))
```
### Plot Euclidean distance linkages
```{r}
par(mfrow = c(2,2))
plot(pvclust_model_ward, main = "Ward", xlab = "", sub = "")
pvrect(pvclust_model_ward, alpha = 0.95)
plot(pvclust_model_complete, main = "Complete", xlab = "", sub = "")
pvrect(pvclust_model_complete, alpha = 0.95)
plot(pvclust_model_single, main = "Single", xlab = "", sub = "")
pvrect(pvclust_model_single, alpha = 0.95)
plot(pvclust_model_average, main = "Average", xlab = "", sub = "")
pvrect(pvclust_model_average, alpha = 0.95)
par(mfrow = c(1,1))
```
### View standard error plots:
```{r}
par(mfrow=c(2,2))
seplot(pvclust_model_ward, main = "Ward")
seplot(pvclust_model_complete, main = "Complete")
seplot(pvclust_model_single, main = "Single")
seplot(pvclust_model_average, main = "Average")
par(mfrow=c(1,1))
```
# Going further - the `mclust` package
The [`mclust`](https://cran.r-project.org/web/packages/mclust/index.html) package provides "Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference."
```{r}
# Fit model
mclust_model = Mclust(ml_num)
# View various plots
plot(mclust_model, what = "BIC")
plot(mclust_model, what = "classification")
plot(mclust_model, what = "uncertainty")
plot(mclust_model, what = "density")
```
### Return best performing model
```{r}
summary(mclust_model)
```
### Cross-validated mclust
```{r}
# sort age in decreasing order
ml_num = ml_num[order(-ml_num$age),]
head(ml_num)
# create a binary factor variable from age: "less than 0" and "greater than/equal to 0"
ml_num$class = cut(ml_num$age,
breaks = c(min(ml_num$age),
0,
max(ml_num$age)),
levels = c(1, 2),
labels = c("less than 0", "greater than/equal to 0"))
ml_num
# Define our predictors (X) and class labels (class)
X = subset(ml_num, select = -c(h.target, class))
class = ml_num$h.target
# Fit the model (EEE covariance structure, basically the same as linear discriminant analysis)
mclust_model2 = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")
# Cross-validate!
set.seed(1)
cv_mclust = cvMclustDA(mclust_model2, nfold = 20)
# View cross-validation error and standard error of the cv error
cv_mclust[c("error", "se")]
```