-
Notifications
You must be signed in to change notification settings - Fork 0
/
Clustering.Rmd
158 lines (123 loc) · 5.59 KB
/
Clustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "ClusteringLesson"
author: "Tara"
date: "2/4/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#Introduction
Today we will learn two techniques that we can use to explore our data and find patterns in our dataset. Clustering analysis is a type of unsupervised machine learning that we use to find patterns or groupings in the data.
The two methods we are learning are hierarchical agglomerative clustering and k-means clustering. The main difference between these two methods is that with hierarchical clustering, we choose the number of clusters we want a posteriori (or after we conduct the analysis), for k-means clustering, we determine the number of clusters we want a priori (prior to the anlalysis). In addition to the choice of the two clustering methods, there are a few methodological decisions we can make that will affect our results, we will dicuss some of these today.
##Setting up the data and preprocessing
```{r}
#Load in a dataset that is already in R
data("mtcars")
#Use the help function to learn about the dataset
?mtcars
#What types of variables do you see? Which ones do you find interesting? we can conduct clustering analysis on a subset of the variables or on the entire dataset because we have numeric (including two binary) variables
```
```{r}
#preprocessing the data
#1 check for missing values
summary(mtcars)
#this data set doesn't have any missing values so we can proceed with our analysis
#Normalization and Scaling, chose either:
#Normalize function from library(BBmisc)
library("BBmisc")
dfNorm<- normalize(mtcars)
#scale the data using scale()
dfScaled<- scale(mtcars)
```
##Hierarchical Agglomerative Clustering
For hierarchical agglomerative clustering begins with each observation in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. This approach can be visualized as a dendroagram, a tree-like diagram, showing how pairs of observations are fused together.
There are three different method choices that will affect the way our tree looks:
(1) distance metric. this is the calculation that determines how similar the pairs are to eachother.
(2) different algorithms for how to etermine how close pairs of observations are to eachother. This will give us different shapes of dendrogrmas.
```{r}
#Step 1. Create the distance matrix
#using Euclidean distance, this is most common for data that has continuous numerical values
dist_Euc <- dist(dfScaled, method = "euclidean")
#what other methods are there?
?dist
#we can also look at the maximum distance, manhattan distance, canberra, binari, and minkowski. Let's also try the maximum distsance, create another distance matrix adn change the method agreemeent to "maximum"
dist_Max <- dist(dfScaled, method = "maximum")
```
```{r}
#Step 2. Plot the clustering algorithms
#Run the code for how many clusters does it look like there are in each dendrogram?
#complete method
plot(hclust(dist_Euc, method = "complete"))
#average method
plot(hclust(dist_Euc, method = "average"))
#single method
plot(hclust(dist_Euc, method = "single"))
#ward.D2
plot(hclust(dist_Euc, method = "ward.D2"))
#which method gives the most balanced?
#try the exercise with the dist_Max matrix
```
```{r}
#Step 3. Choosing the number of clusters Storing the Cluster ID for each observation
#using the call bellow, we are taking our clustering and "cutting" the tree into four clusters. The call will return a vector where each observation is assigned to a cluster.
cutree(hclust(dist_Euc, method = "complete"), k=4)
#we can append this vector to a copy of the original dataframe to interpret the clusters
df_hclust<- mtcars
df_hclust$cluster<-cutree(hclust(dist_Euc, method = "complete"), k=4)
```
```{r}
#Step. 4 Interpretation
#how many cars are in each cluster?
summary(as.factor(df_hclust$cluster))
#let's summarize by the groups of clusters
#need dplyr package
library(dplyr)
#compare the mean mpg between clusters
df_hclust_summary <- df_hclust %>%
group_by(cluster) %>%
summarize(mpg_mean = mean(mpg))
#on your own compare another variable mean between clusters
#visualize the clusters
library(ggplot2)
df_hclust %>%
ggplot(aes(x =hp, y=mpg, col = as.factor(cluster)))+ geom_point()
```
##Kmeans Clustering
1. Scaling/Normalization before kmeans
- figure out the optimal k-value
```{r}
##Step 2: Figure out the optimal K-value.
#Different methodological approach: trial and error, choose a priori based on theoretical knowledge, choose based on the hierarchical agglomerative clustering result, or use an numerical approach: elbow plot, gap statistic, silohuette.
#we can use these packages to do all three of these numerical approaches
library(factoextra)
#install.packages("NbClust")
library(NbClust)
#elbow method
fviz_nbclust(dfScaled, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
#silohuette
fviz_nbclust(dfScaled, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
#gap statistic,
set.seed(123)
fviz_nbclust(dfScaled, kmeans, nstart = 10, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
```
##Run KMeans algorithm
```{r}
#set the seed for the analysis, for reproducibility
set.seed(123)
#run the algorithm to start the clusters
kclust<- kmeans(dfScaled, nstart=1, centers=2, iter.max = 10)
#try to look at what the results are in the kclust object
kclust
```
##Examining the content of kmeans clusters
```{r}
```
```{r}
#another way to visualize clusters, helpful package
fviz_cluster(kmeans(dfScaled, centers = 2), geom = "point", data = dfScaled)
```
##Heat maps if we have time