<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Advanced NumPy</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Advanced NumPy</h1></a>
<h2 class="subtitle">Case study: K-means</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning objectives</h2>
</div>
<div class="panel-body">
<p>After the lesson the learner should:</p>
<ul>
<li>Be able to combine axis-based reductions, broadcasting and indexing to implement a simple clustering algorithm.</li>
<li>Understand the advantages of vectorisation and when to use it (or not).</li>
</ul>
</div>
</section>
<p>K-means is a simple algorithm to cluster data – that is, to identify groups of similar objects based only on their properties. The algorithm is best illustrated by the following figure.</p>
<div class="figure">
<img src="fig/kmeans/kmeans_illustration.png" />
</div>
<h3 id="loading-data">Loading data</h3>
<p>We first need to load the sample data. If you haven’t done so already, you can download it from <a href="data/kmeans_data.csv">here</a>.</p>
<pre><code>>>> data = np.loadtxt('kmeans_data.csv')
>>> data.shape
(30, 2)</code></pre>
<p>To visualise the data we can use the <code>scatter</code> function from the matplotlib package:</p>
<pre><code>>>> import matplotlib.pyplot as plt
>>> plt.scatter(data[:, 0], data[:, 1], s=40)
>>> plt.show()</code></pre>
<div class="figure">
<img src="fig/kmeans/generating_data.png" alt="Sample data with 3 clusters" />
<p class="caption">Sample data with 3 clusters</p>
</div>
<p>Since we are going to build up the example gradually, let us put the commands in a script:</p>
<pre><code>import numpy as np
import matplotlib.pyplot as plt
# load data
data = np.loadtxt('kmeans_data.csv')
# plot
plt.scatter(data[:, 0], data[:, 1], s=40)</code></pre>
<h3 id="initialisation">Initialisation</h3>
<p>In the first step of the algorithm we need to initialise the centers of the clusters. We will initialise them randomly but consistently with the mean and standard deviation of the data:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">K <span class="op">=</span> <span class="dv">3</span>
centroids <span class="op">=</span> np.random.randn(K, <span class="dv">2</span>)</code></pre></div>
<p>To center the cluster centroids on the data, we scale them by the standard deviation of the data and shift them by its mean:</p>
<pre><code>centroids = centroids * np.std(data, 0)
centroids = centroids + np.mean(data, 0)</code></pre>
<p>Let’s now plot the data and the random cluster centers on the same figure:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)</code></pre></div>
<div class="figure">
<img src="fig/kmeans/initialisation.png" alt="Randomly initalised cluster centers (color big dots)" />
<p class="caption">Randomly initalised cluster centers (color big dots)</p>
</div>
<p>Now you can copy and paste these lines into the script. You may find the <code>%history</code> command of the IPython console useful.</p>
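<p>For example, a minimal sketch (assuming you are working in an IPython session; the file name is only illustrative) of saving the session history to a file:</p>
<pre><code>In [10]: %history -n                    # list the commands typed so far, with line numbers
In [11]: %history -f kmeans_script.py   # write the session history to a file</code></pre>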
<h3 id="assignment">Assignment</h3>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="find-closest-centers"><span class="glyphicon glyphicon-pencil"></span>Find closest centers</h2>
</div>
<div class="panel-body">
<p>Calculate the Euclidean distance from each data point to each of the centers and then find the index of the closest center.</p>
</div>
</section>
<p>We now need to assign each point to the closest cluster center. First, we will calculate the Euclidean distance of each point to each of the centers. For this we can use <strong>broadcasting</strong>:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))</code></pre></div>
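<p>It can help to inspect the intermediate shapes to see what broadcasting has done here: <code>data[:, np.newaxis, :]</code> has shape <code>(30, 1, 2)</code> and <code>centroids</code> has shape <code>(3, 2)</code>, so the subtraction broadcasts them to a common shape (assuming the sample data above with 30 points and K = 3):</p>
<pre><code>>>> deltas.shape       # (n_points, n_clusters, n_dimensions)
(30, 3, 2)
>>> distances.shape    # one distance per point/centroid pair
(30, 3)</code></pre>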
<p>For each data point we find the center with minimum distance. We can use the <code>argmin</code> method with the <strong>axis argument</strong>:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)</code></pre></div>
<p>Now we plot the centroids and data points with colours reflecting cluster membership:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>, c<span class="op">=</span>closest)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)</code></pre></div>
<div class="figure">
<img src="fig/kmeans/assignment.png" alt="Data points assigned to closest cluster center" />
<p class="caption">Data points assigned to closest cluster center</p>
</div>
<h3 id="calculate-new-cluster-centers">Calculate new cluster centers</h3>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="cluster-center"><span class="glyphicon glyphicon-pencil"></span>Cluster center</h2>
</div>
<div class="panel-body">
<p>Given the array of cluster assignments <code>closest</code>, calculate the center coordinates of the first cluster (index 0).</p>
</div>
</section>
<p>To calculate the new centers of the clusters, we average all points belonging to each cluster. We can use a <strong>boolean mask</strong>. For example, to calculate the center of cluster 0 we will use the following instruction:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data[closest<span class="op">==</span><span class="dv">0</span>, :].mean(<span class="dv">0</span>)</code></pre></div>
<pre><code>array([ 2.90695091, 2.52099101])</code></pre>
<p>To repeat it for all clusters we can use a for loop or list comprehension. Since the number of clusters is usually much smaller than the number of data points, this for loop won’t affect the performance of our algorithm:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])</code></pre></div>
<p>Let’s check the positions of the new centers and the assignment of points to clusters.</p>
<div class="figure">
<img src="fig/kmeans/update_centers.png" alt="Re-calculated cluster centers" />
<p class="caption">Re-calculated cluster centers</p>
</div>
<h3 id="iterations">Iterations</h3>
<p>Now we can repeat the steps of assigning points to clusters and updating the cluster centers iteratively, and watch the progress of the algorithm:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> iteration <span class="op">in</span> <span class="bu">range</span>(<span class="dv">5</span>):
<span class="co"># assign points to clusters</span>
deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))
closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)
<span class="co"># calculate new centroids</span>
centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])</code></pre></div>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="stopping-criterion"><span class="glyphicon glyphicon-pencil"></span>Stopping criterion</h2>
</div>
<div class="panel-body">
<p>After each iteration, test whether any point changed its cluster membership. Stop the algorithm once convergence is reached, i.e. the cluster assignments do not change after the re-assignment step (one possible sketch is shown after this exercise).</p>
</div>
</section>
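<p>One possible sketch of such a stopping criterion (not the only solution): keep the previous assignment and compare it with the new one after each iteration.</p>
<pre><code>closest = None
for iteration in range(100):   # upper bound on the number of iterations
    # assign points to clusters
    deltas = data[:, np.newaxis, :] - centroids
    distances = np.sqrt(np.sum(deltas ** 2, 2))
    new_closest = distances.argmin(1)
    if closest is not None and np.all(new_closest == closest):
        break                  # no point changed its cluster: converged
    closest = new_closest
    # calculate new centroids
    centroids = np.array([data[closest == i, :].mean(0) for i in range(3)])</code></pre>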
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="single-cluster"><span class="glyphicon glyphicon-pushpin"></span>Single cluster?</h2>
</div>
<div class="panel-body">
<p>Note that sometimes the algorithm can produce degenerate results – all of the points will be assigned to a single cluster (or the final number of clusters will be smaller than K). This is one of the drawbacks of K-means with random initialisation. A possible solution is to repeat the algorithm with different initialisations and keep the best cluster assignment (a sketch of scoring a single run follows this callout), but better solutions exist.</p>
</div>
</aside>
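<p>To compare several random restarts you need a way to score each run. A minimal sketch (the helper name <code>within_cluster_distance</code> is only illustrative, not part of the lesson code) sums the distance of every point to its own cluster center; the restart with the smallest total wins:</p>
<pre><code>def within_cluster_distance(data, centroids, closest):
    # distance of every point to the center of the cluster it was assigned to
    deltas = data - centroids[closest, :]
    return np.sqrt(np.sum(deltas ** 2, 1)).sum()</code></pre>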
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Our final script will look like the following:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
data <span class="op">=</span> np.loadtxt(<span class="st">'kmeans_data.csv'</span>)
<span class="co"># randomly initalize the centroids</span>
K <span class="op">=</span> <span class="dv">3</span>
centroids <span class="op">=</span> np.random.randn(K, <span class="dv">2</span>)
centroids <span class="op">=</span> centroids <span class="op">*</span> np.std(data, <span class="dv">0</span>)
centroids <span class="op">=</span> centroids <span class="op">+</span> np.mean(data, <span class="dv">0</span>)
<span class="cf">for</span> iteration <span class="op">in</span> <span class="bu">range</span>(<span class="dv">5</span>):
<span class="co"># assign points to clusters</span>
deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))
closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)
<span class="co"># calculate new centroids</span>
centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])
<span class="co"># plot </span>
plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>, c<span class="op">=</span>closest)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)
plt.show()</code></pre></div>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="choice-of-k"><span class="glyphicon glyphicon-pencil"></span>Choice of K</h2>
</div>
<div class="panel-body">
<p>Check whether the algorithm works for any value of K. Try using K > 3. What happens then? (A hint follows this exercise.)</p>
</div>
</section>
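<p>A hint for experimenting (this illustrates general NumPy behaviour rather than giving the full answer): if a cluster ends up with no points assigned to it, averaging the empty selection produces NaN coordinates for that centroid.</p>
<pre><code>>>> data[closest == 10, :].mean(0)   # label 10 is hypothetical: no point carries it
array([ nan,  nan])                  # NumPy also warns about the mean of an empty slice</code></pre>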
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="memory-or-speed"><span class="glyphicon glyphicon-pencil"></span>Memory or speed</h2>
</div>
<div class="panel-body">
<p>Replace the assignment step and the calculation of new centroids with a for loop over the data points. Which implementation would be preferable for small datasets (few observations and dimensions), and which for large ones (many observations and dimensions)? A loop-based sketch of the assignment step follows this exercise as a starting point.</p>
</div>
</section>
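<p>As a starting point, a loop-based sketch of the assignment step (one of several possible answers): it avoids building the large intermediate <code>deltas</code> array at the cost of a Python loop over the points.</p>
<pre><code>closest = np.zeros(len(data), dtype=int)
for i in range(len(data)):
    # distance from point i to every centroid, computed one point at a time
    point_distances = np.sqrt(np.sum((data[i] - centroids) ** 2, 1))
    closest[i] = point_distances.argmin()</code></pre>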
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/paris-swc/advanced-numpy-lesson">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
</body>
</html>