<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Advanced NumPy</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Advanced NumPy</h1></a>
<h2 class="subtitle">Case study: K-means</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning objectives</h2>
</div>
<div class="panel-body">
<p>After the lesson the learner should:</p>
<ul>
<li>Be able to combine axis-based reductions, broadcasting and indexing to implement a simple clustering algorithm.</li>
<li>Understand the advantages of vectorisation and when to use it (or not).</li>
</ul>
</div>
</section>
<p>K-means is a simple algorithm to cluster data – that is, to identify groups of similar objects based only on their properties. The algorithm is best illustrated by the following figure.</p>
<div class="figure">
<img src="fig/kmeans/kmeans_illustration.png" />
</div>
<h3 id="loading-data">Loading data</h3>
<p>We first need to load the sample data. If you haven’t done so already, you can download it from <a href="data/kmeans_data.csv">here</a>.</p>
<pre><code>>>> data = np.loadtxt('kmeans_data.csv')
>>> data.shape
(30, 2)</code></pre>
<p>To visualise the data we can use the <code>scatter</code> function from the matplotlib package:</p>
<pre><code>>>> import matplotlib.pyplot as plt
>>> plt.scatter(data[:, 0], data[:, 1], s=40)
>>> plt.show()</code></pre>
<div class="figure">
<img src="fig/kmeans/generating_data.png" alt="Sample data with 3 clusters" />
<p class="caption">Sample data with 3 clusters</p>
</div>
<p>Since we are going to build up the example gradually, let us put the commands in a script:</p>
<pre><code>import numpy as np
import matplotlib.pyplot as plt
# load data
data = np.loadtxt('kmeans_data.csv')
# plot
plt.scatter(data[:, 0], data[:, 1], s=40)</code></pre>
<h3 id="initialisation">Initialisation</h3>
<p>In the first step of the algorithm we need to initialise the centers of the clusters. We will initialise them randomly but consistently with the mean and standard deviation of the data:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">K <span class="op">=</span> <span class="dv">3</span>
centroids <span class="op">=</span> np.random.randn(K, <span class="dv">2</span>)</code></pre></div>
<p>To center the cluster centroids on the data, we scale them by the standard deviation of the data and shift them by its mean:</p>
<pre><code>centroids = centroids * np.std(data, 0)
centroids = centroids + np.mean(data, 0)</code></pre>
<p>Let’s now plot the data and the random cluster centers on the same figure:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)</code></pre></div>
<div class="figure">
<img src="fig/kmeans/initialisation.png" alt="Randomly initalised cluster centers (color big dots)" />
<p class="caption">Randomly initalised cluster centers (color big dots)</p>
</div>
<p>Now you can copy and paste these lines into the script. You may find the <code>%history</code> command of the IPython console useful.</p>
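<p>For example, a minimal sketch (assuming you are working in an IPython session; the file name is only illustrative) of saving the session history to a file:</p>
<pre><code>In [10]: %history -n                    # list the commands typed so far, with line numbers
In [11]: %history -f kmeans_script.py   # write the session history to a file</code></pre>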
<h3 id="assignment">Assignment</h3>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="find-closest-centers"><span class="glyphicon glyphicon-pencil"></span>Find closest centers</h2>
</div>
<div class="panel-body">
<p>Calculate the Euclidean distance from each data point to each of the centers and then find the index of the closest center.</p>
</div>
</section>
<p>We now need to assign each point to the closest cluster center. First, we will calculate the Euclidean distance of each point to each of the centers. For this we can use <strong>broadcasting</strong>:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))</code></pre></div>
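<p>It can help to inspect the intermediate shapes to see what broadcasting has done here: <code>data[:, np.newaxis, :]</code> has shape <code>(30, 1, 2)</code> and <code>centroids</code> has shape <code>(3, 2)</code>, so the subtraction broadcasts them to a common shape (assuming the sample data above with 30 points and K = 3):</p>
<pre><code>>>> deltas.shape       # (n_points, n_clusters, n_dimensions)
(30, 3, 2)
>>> distances.shape    # one distance per point/centroid pair
(30, 3)</code></pre>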
<p>For each data point we find the center with minimum distance. We can use the <code>argmin</code> method with the <strong>axis argument</strong>:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)</code></pre></div>
<p>Now we plot the centroids and data points with colours reflecting cluster membership:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>, c<span class="op">=</span>closest)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)</code></pre></div>
<div class="figure">
<img src="fig/kmeans/assignment.png" alt="Data points assigned to closest cluster center" />
<p class="caption">Data points assigned to closest cluster center</p>
</div>
<h3 id="calculate-new-cluster-centers">Calculate new cluster centers</h3>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="cluster-center"><span class="glyphicon glyphicon-pencil"></span>Cluster center</h2>
</div>
<div class="panel-body">
<p>Given the array of cluster assignments <code>closest</code>, calculate the center coordinates of the first cluster (index 0).</p>
</div>
</section>
<p>To calculate the new centers of the clusters, we average all points belonging to each cluster. We can use a <strong>boolean mask</strong>. For example, to calculate the center of cluster 0 we will use the following instruction:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data[closest<span class="op">==</span><span class="dv">0</span>, :].mean(<span class="dv">0</span>)</code></pre></div>
<pre><code>array([ 2.90695091, 2.52099101])</code></pre>
<p>To repeat it for all clusters we can use a for loop or list comprehension. Since the number of clusters is usually much smaller than the number of data points, this for loop won’t affect the performance of our algorithm:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])</code></pre></div>
<p>Let’s check the positions of the new centers and the assignment of points to clusters.</p>
<div class="figure">
<img src="fig/kmeans/update_centers.png" alt="Re-calculated cluster centers" />
<p class="caption">Re-calculated cluster centers</p>
</div>
<h3 id="iterations">Iterations</h3>
<p>Now we can repeat the steps of assigning points to clusters and updating the cluster centers iteratively, and watch the progress of the algorithm:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> iteration <span class="op">in</span> <span class="bu">range</span>(<span class="dv">5</span>):
<span class="co"># assign points to clusters</span>
deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))
closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)
<span class="co"># calculate new centroids</span>
centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])</code></pre></div>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="stopping-criterion"><span class="glyphicon glyphicon-pencil"></span>Stopping criterion</h2>
</div>
<div class="panel-body">
<p>After each iteration, test whether any point changed its cluster membership. Stop the algorithm once convergence is reached, i.e. the cluster assignments do not change after the re-assignment step (one possible sketch is shown after this exercise).</p>
</div>
</section>
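<p>One possible sketch of such a stopping criterion (not the only solution): keep the previous assignment and compare it with the new one after each iteration.</p>
<pre><code>closest = None
for iteration in range(100):   # upper bound on the number of iterations
    # assign points to clusters
    deltas = data[:, np.newaxis, :] - centroids
    distances = np.sqrt(np.sum(deltas ** 2, 2))
    new_closest = distances.argmin(1)
    if closest is not None and np.all(new_closest == closest):
        break                  # no point changed its cluster: converged
    closest = new_closest
    # calculate new centroids
    centroids = np.array([data[closest == i, :].mean(0) for i in range(3)])</code></pre>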
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="single-cluster"><span class="glyphicon glyphicon-pushpin"></span>Single cluster?</h2>
</div>
<div class="panel-body">
<p>Note that sometimes the algorithm can produce degenerate results – all of the points will be assigned to a single cluster (or the final number of clusters will be smaller than K). This is one of the drawbacks of K-means with random initialisation. A possible solution is to repeat the algorithm with different initialisations and keep the best cluster assignment (a sketch of scoring a single run follows this callout), but better solutions exist.</p>
</div>
</aside>
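<p>To compare several random restarts you need a way to score each run. A minimal sketch (the helper name <code>within_cluster_distance</code> is only illustrative, not part of the lesson code) sums the distance of every point to its own cluster center; the restart with the smallest total wins:</p>
<pre><code>def within_cluster_distance(data, centroids, closest):
    # distance of every point to the center of the cluster it was assigned to
    deltas = data - centroids[closest, :]
    return np.sqrt(np.sum(deltas ** 2, 1)).sum()</code></pre>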
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Our final script will look like the following:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt
data <span class="op">=</span> np.loadtxt(<span class="st">'kmeans_data.csv'</span>)
<span class="co"># randomly initalize the centroids</span>
K <span class="op">=</span> <span class="dv">3</span>
centroids <span class="op">=</span> np.random.randn(K, <span class="dv">2</span>)
centroids <span class="op">=</span> centroids <span class="op">*</span> np.std(data, <span class="dv">0</span>)
centroids <span class="op">=</span> centroids <span class="op">+</span> np.mean(data, <span class="dv">0</span>)
<span class="cf">for</span> iteration <span class="op">in</span> <span class="bu">range</span>(<span class="dv">5</span>):
<span class="co"># assign points to clusters</span>
deltas <span class="op">=</span> data[:, np.newaxis, :] <span class="op">-</span> centroids
distances <span class="op">=</span> np.sqrt(np.<span class="bu">sum</span>((deltas) <span class="op">**</span> <span class="dv">2</span>, <span class="dv">2</span>))
closest <span class="op">=</span> distances.argmin(<span class="dv">1</span>)
<span class="co"># calculate new centroids</span>
centroids <span class="op">=</span> np.array([data[closest <span class="op">==</span> i, :].mean(<span class="dv">0</span>) <span class="cf">for</span> i <span class="op">in</span> <span class="bu">range</span>(<span class="dv">3</span>)])
<span class="co"># plot </span>
plt.scatter(data[:, <span class="dv">0</span>], data[:, <span class="dv">1</span>], s<span class="op">=</span><span class="dv">40</span>, c<span class="op">=</span>closest)
plt.scatter(centroids[:, <span class="dv">0</span>], centroids[:, <span class="dv">1</span>], c<span class="op">=</span>np.arange(<span class="dv">3</span>), s<span class="op">=</span><span class="dv">100</span>)
plt.show()</code></pre></div>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="choice-of-k"><span class="glyphicon glyphicon-pencil"></span>Choice of K</h2>
</div>
<div class="panel-body">
<p>Check whether the algorithm works for any value of K. Try using K > 3. What happens then? (A hint follows this exercise.)</p>
</div>
</section>
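<p>A hint for experimenting (this illustrates general NumPy behaviour rather than giving the full answer): if a cluster ends up with no points assigned to it, averaging the empty selection produces NaN coordinates for that centroid.</p>
<pre><code>>>> data[closest == 10, :].mean(0)   # label 10 is hypothetical: no point carries it
array([ nan,  nan])                  # NumPy also warns about the mean of an empty slice</code></pre>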
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="memory-or-speed"><span class="glyphicon glyphicon-pencil"></span>Memory or speed</h2>
</div>
<div class="panel-body">
<p>Replace the assignment step and the calculation of new centroids with a for loop over the data points. Which implementation would be preferable for small datasets (few observations and dimensions), and which for large ones (many observations and dimensions)? A loop-based sketch of the assignment step follows this exercise as a starting point.</p>
</div>
</section>
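<p>As a starting point, a loop-based sketch of the assignment step (one of several possible answers): it avoids building the large intermediate <code>deltas</code> array at the cost of a Python loop over the points.</p>
<pre><code>closest = np.zeros(len(data), dtype=int)
for i in range(len(data)):
    # distance from point i to every centroid, computed one point at a time
    point_distances = np.sqrt(np.sum((data[i] - centroids) ** 2, 1))
    closest[i] = point_distances.argmin()</code></pre>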
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/paris-swc/advanced-numpy-lesson">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
</body>
</html>