---
title: "Introduction to Clustering"
author: "Lieven Clement"
output:
  bookdown::html_document2:
    df_print: paged
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    latex_engine: xelatex
---
```{r, child="_setup.Rmd"}
```
# Introduction
## Objective
Objective: grouping of observations into **clusters**, so that
- similar observations appear in the same cluster
- dissimilar observations appear in distinct clusters
$\longrightarrow$ we need a measure of **similarity** and **dissimilarity**
## Example 1
Single-cell transcriptomics: an $n \times p$ matrix for which
- every column contains the expression levels of one of $p$ genes for $n$ cells
- every row contains the expression levels of $p$ genes for one cell (**sample**)
- Research question: look for groups of cells that have similar gene expression patterns
- Or, look for groups of genes that have similar expression levels across the different cells. This can help us understand the regulation and functionality of the genes.
$\longrightarrow$ both **observations** (rows) and **variables** (columns) can be clustered
## Example 2
Abundance studies: the abundances of $n$ plant species are counted on $p$ plots (habitats)
- look for groups of species that live in the same habitats, or for groups of habitats that have similar species communities
$\longrightarrow$ both **observations** (rows) and **variables** (columns) can be clustered
# Partition Based Cluster Analysis
- Partition-based clustering methods require the number of clusters ($k$) to be specified before the algorithm starts.
## K-means Methods
- To use the k-means clustering algorithm we have to specify $k$, the number of clusters, in advance.
- The k-means algorithm is iterative.
- The algorithm starts by defining k cluster centers (centroids).
- Then the algorithm proceeds as follows
1. First each observation is assigned to the cluster with the closest center to that observation.
2. Then the k centers are redefined using the observations in each cluster, i.e. the multivariate means (column means) of all observations in a cluster are used to define each new cluster center.
3. We repeat these two steps until the centers converge (a minimal implementation is sketched below).
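
The two steps translate almost literally into code. Below is a minimal sketch in base R; this is our own illustration of the algorithm, not the implementation behind `kmeans()`, and for simplicity it does not handle clusters that become empty.

```{r}
simpleKmeans <- function(X, k, nIter = 100) {
  X <- as.matrix(X)
  # initialise: use k randomly chosen observations as the first centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(nIter)) {
    # step 1: assign every observation to the cluster with the closest center
    d2 <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    cluster <- max.col(-d2)
    # step 2: redefine each center as the column means of its cluster
    newCenters <- t(sapply(seq_len(k),
      function(j) colMeans(X[cluster == j, , drop = FALSE])))
    # stop when the centers no longer change
    if (all(abs(newCenters - centers) < 1e-8)) break
    centers <- newCenters
  }
  list(cluster = cluster, centers = centers)
}
```

In practice we use the optimised `kmeans()` function from base R, as in the example below.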
## Example
```{r}
library(tidyverse)
# diabetes data from the mclust package: a known 3-class diagnosis
# plus 3 clinical measurements (glucose, insulin, sspg)
data(diabetes, package = "mclust")
class <- diabetes$class
table(class)
head(diabetes)
```
```{r}
# scatterplot matrix of the three measurements, coloured by the known class
mclust::clPairs(diabetes[,-1], diabetes$class)
```
```{r}
# k-means with k = 3, the number of known classes
diabetesKmeans <- kmeans(diabetes[,-1], centers = 3)
diabetesKmeans
```
```{r out.width="49%"}
mclust::clPairs(diabetes[,-1], diabetes$class, main = "Real")
mclust::clPairs(diabetes[,-1], diabetesKmeans$cluster,
  colors = c(2, 3, 4), symbols = c(0, 17, 16), main = "K-means")
```
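
Note that `kmeans()` starts from a random initialisation, so re-running it can yield a different local optimum. A common safeguard, sketched below as an addition to the notes, is to use several random starts via the `nstart` argument and keep the best solution:

```{r}
# 25 random starts; kmeans() returns the run with the smallest
# total within-cluster sum of squares (seed chosen arbitrarily)
set.seed(117)
kmeans(diabetes[,-1], centers = 3, nstart = 25)$tot.withinss
```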
# Hierarchical Cluster Analysis
- Distinction between agglomerative and divisive methods
- Agglomerative methods start from the situation where each individual observation forms its own cluster (so they start with $n$ clusters). In the next steps clusters are sequentially merged, until finally there is only one cluster containing all $n$ observations.
- Divisive methods work the other way around.
- The solution of a hierarchical clustering is thus a sequence of $n$ nested cluster solutions.
## General Algorithm of Agglomerative Hierarchical Clustering
- In step 0 each observation is considered a cluster of its own (i.e. $n$ clusters).
- Each subsequent step consists of:
1. merge the two clusters with the smallest intercluster dissimilarity
2. recalculate the intercluster dissimilarities

In step 0 the intercluster dissimilarity coincides with the dissimilarity between the corresponding observations.
$\rightarrow$ intercluster dissimilarity?
## Intercluster Dissimilarities
- Represent clusters (e.g. $C_1$ and $C_2$) as the sets of points $\mathbf{x}_i$ that belong to them
- $d(C_1,C_2)$: the intercluster dissimilarity between $C_1$ and $C_2$
We consider three intercluster dissimilarities.
### Single Linkage = Nearest Neighbour
\[
d(C_1,C_2) = \min_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\]
i.e. the dissimilarity between $C_1$ and $C_2$ is determined by the smallest dissimilarity between a point of $C_1$ and a point of $C_2$.
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("./figures/hclustNearest.png")
```
### Complete Linkage = Furthest Neighbour
\[
d(C_1,C_2) = \max_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\]
i.e. the dissimilarity between $C_1$ and $C_2$ is determined by the largest dissimilarity between a point of $C_1$ and a
point of $C_2$.
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("./figures/hclustFurthest.png")
```
### Average Linkage = Group Average
\[
d(C_1,C_2) = \frac{1}{\vert C_1 \vert \vert C_2 \vert}
\sum_{\mathbf{x}_1 \in C_1; \mathbf{x}_2 \in C_2}
d(\mathbf{x}_1,\mathbf{x}_2) ,
\]
i.e. the dissimilarity between $C_1$ and $C_2$ is determined by the average dissimilarity between all points of $C_1$ and all
points of $C_2$.
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("./figures/hclustAverage.png")
```
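
The three definitions are easy to check numerically. A small sketch on two arbitrary point sets (`C1` and `C2` are illustration data of our own, not from the notes):

```{r}
# two small clusters in the plane
C1 <- matrix(c(1.0, 1.2,
               1.5, 1.0), ncol = 2, byrow = TRUE)
C2 <- matrix(c(3.0, 2.5,
               3.5, 3.0,
               4.0, 2.0), ncol = 2, byrow = TRUE)
# all pairwise dissimilarities between points of C1 and points of C2
d12 <- as.matrix(dist(rbind(C1, C2)))[1:2, 3:5]
min(d12)   # single linkage
max(d12)   # complete linkage
mean(d12)  # average linkage
```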
## Cluster Tree
Hierarchical nature of the algorithm:
- Nested sequence of clusters $\longrightarrow$ visualisation via a tree
- Height of the branches indicates the intercluster dissimilarity at which the clusters are merged.
- Can be used as an instrument for deciding on the number of clusters in the data
## Toy example
```{r echo = FALSE}
toy <- data.frame(
  X1 = c(1.50, 2.00, 2.50, 2.00, 2.25),
  X2 = c(2.40, 2.50, 2.25, 3.00, 3.20),
  label = 1:5
)
knitr::kable(toy)
```
```{r}
toy %>%
  ggplot(aes(X1, X2, label = label)) +
  geom_point() +
  geom_text(nudge_x = .05)

# Euclidean distance matrix between the five toy observations
toy[, 1:2] %>% dist
```
## Single linkage
```{r}
toyDist <- toy[,1:2] %>% dist
toySingle <- hclust(toyDist, method = "single")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toySingle, main = "Single")
toyDist
```
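
The agglomeration loop of the general algorithm can also be written out naively. The sketch below is our own illustration (in practice `hclust()` is the function to use): it repeatedly merges the two clusters with the smallest single linkage dissimilarity, and its merge heights reproduce `toySingle$height`.

```{r}
agglomerate <- function(D) {
  D <- as.matrix(D)
  clusters <- as.list(seq_len(nrow(D)))  # step 0: every observation is a cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {
    k <- length(clusters)
    best <- c(1, 2); bestD <- Inf
    # find the pair of clusters with the smallest intercluster dissimilarity
    for (i in 1:(k - 1)) for (j in (i + 1):k) {
      dij <- min(D[clusters[[i]], clusters[[j]]])  # single linkage
      if (dij < bestD) { bestD <- dij; best <- c(i, j) }
    }
    # merge the two closest clusters and record the merge height
    heights <- c(heights, bestD)
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  heights
}
agglomerate(toyDist)
toySingle$height
```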
## Complete linkage
```{r}
toyComplete <- hclust(toyDist, method = "complete")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyComplete, main = "Complete")
toyDist
```
## Average linkage
```{r}
toyAvg <- hclust(toyDist, method = "average")
par(mfrow=c(1,2),pty="s")
plot(X2 ~ X1, toy, xlim = c(1.25,2.75),ylim = c(2,3.5))
text(toy$X1*1.05,toy$X2,label=toy$label)
plot(toyAvg, main = "Average")
toyDist
```
## Example
```{r}
diabetesDist <- dist(diabetes[,-1])
# dendrograms for the three linkages, labelled with the known class
diabetesSingle <- hclust(diabetesDist, method = "single")
plot(diabetesSingle, labels = as.double(diabetes$class), main = "single")
diabetesComplete <- hclust(diabetesDist, method = "complete")
plot(diabetesComplete, labels = as.double(diabetes$class), main = "complete")
diabetesAverage <- hclust(diabetesDist, method = "average")
plot(diabetesAverage, labels = as.double(diabetes$class), main = "average", cex = 0.5)
```
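
A dendrogram can be cut into a fixed number of clusters with `cutree()`. As an illustration (our own addition), cutting the complete linkage tree into three clusters and cross-tabulating against the known classes shows how well the hierarchy recovers them:

```{r}
# cut the complete linkage dendrogram into k = 3 clusters
table(cutree(diabetesComplete, k = 3), class)
```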
# Model-based clustering
- [Paper: Fraley and Raftery (1998). How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal, (41)8:578-588.](https://sites.stat.washington.edu/people/raftery/Research/PDF/fraley1998.pdf)
- [EM algorithm](./em.html) [[PDF](./em.pdf)]
- Example: see tutorial session
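
As a minimal sketch of what such an analysis looks like with the `mclust` package (our own addition; the worked-out analysis is left for the tutorial):

```{r}
library(mclust)
# fit Gaussian mixture models; the number of components and the
# covariance structure are selected by BIC (by default G = 1:9)
diabetesMclust <- Mclust(diabetes[,-1])
summary(diabetesMclust)
# compare the model-based clusters with the known classes
table(diabetesMclust$classification, class)
```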
```{r, child="_session-info.Rmd"}
```