API/ENH: rm main, add time distance scaling

Add tools to read an image file's timestamp, using EXIF tags if possible (new dependency: piexif). Add `fmt` kwd to tests.ImagedirCtx.__init__() to test png and jpg images. Add modules exceptions.py and io.py, and examples/inria_holiday.sh. Replace most of main's logic (check for existing db files) by io.get_image_data(). Move other IO-related things to io.py. Use timestamps to add time-distance scaling.
elcorto · Jun 3, 2019 · f9ad719
1 parent adc81d3
commit f9ad719
Showing 13 changed files with 416 additions and 327 deletions.
diff --git a/README.rst b/README.rst
@@ -2,83 +2,23 @@ About
 =====
 
 Package for clustering images by content. We use a pre-trained deep
-convolutional neural network to calculate image fingerprints, which are then
-used to cluster similar images.
+convolutional neural network to calculate image fingerprints which represent
+content. Those are used to cluster similar images. In addition to pure
+image content, it is possible to mix in timestamp information which improves
+clustering for temporally uncorrelated images.
 
 Usage
 =====
 
-The package is designed as a library. Here is what you can do:
-
-.. code:: python
-
-    from imagecluster import calc as ic
-    from imagecluster import postproc as pp
-
-    # Create image database in memory. This helps to feed images to the NN model
-    # quickly.
-    ias = ic.image_arrays('pics/', size=(224,224))
-
-    # Create Keras NN model.
-    model = ic.get_model()
-
-    # Feed images through the model and extract fingerprints (feature vectors).
-    fps = ic.fingerprints(ias, model)
-
-    # Optionally run a PCA on the fingerprints to compress the dimensions. Use a
-    # cumulative explained variance ratio of 0.95.
-    fps = ic.pca(fps, n_components=0.95)
-
-    # Run clustering on the fingerprints.  Select clusters with similarity index
-    # sim=0.5
-    clusters = ic.cluster(fps, sim=0.5)
-
-    # Create dirs with links to images. Dirs represent the clusters the images
-    # belong to.
-    pp.make_links(clusters, 'pics/imagecluster/clusters')
-
-    # Plot images arranged in clusters.
-    pp.visualize(clusters, ias)
-
-See also ``imagecluster.main.main()``. It does the same as the code above, but
-also saves/loads the image database and the fingerprints to/from disk, such
-that you can re-run the clustering and post-processing again without
-re-calculating fingerprints.
-
-Example session:
-
-.. code:: python
-
-    >>> from imagecluster import main
-    >>> main.main('pics/', sim=0.5, vis=True)
-    no fingerprints database pics/imagecluster/fingerprints.pk found
-    create image array database pics/imagecluster/images.pk
-    pics/140301.jpg
-    pics/140601.jpg
-    pics/140101.jpg
-    pics/140400.jpg
-    pics/140801.jpg
-    [...]
-    running all images through NN model ...
-    pics/140301.jpg
-    pics/140503.jpg
-    pics/140601.jpg
-    pics/140901.jpg
-    pics/140101.jpg
-    [...]
-    clustering ...
-    #images : #clusters
-    2 : 7
-    3 : 1
-    #images in clusters total:  17
-    cluster dir: pics/imagecluster/clusters
-
-If you run this again on the same directory, only the clustering (which is very
-fast) and the post-processing (links, visualization) will be repeated.
+The package is designed as a library. See ``examples/example_api.py``.
 
-For this example, we use a very small subset of the `Holiday image dataset
-<holiday_>`_ (25 images (all named 140*.jpg) of 1491 total images in the
-dataset).
+.. Here is what you can do:
+
+.. .. code:: python
+.. example_api.py
+
+The bottleneck is ``~imagecluster.calc.fingerprints``; all other
+operations have negligible relative cost.
 
 Have a look at the clusters (as dirs with symlinks to the relevant files):
 
@@ -119,7 +59,16 @@ at the clusters:
 
 .. image:: doc/clusters.png
 
-Here is the result of using a larger subset of 292 images from the same dataset.
+For this example, we use a very small subset of the `Holiday image dataset
+<holiday_>`_ (25 images (all named 140*.jpg) of 1491 total images in the
+dataset). See ``examples/inria_holiday.sh`` for how to select such a subset:
+
+.. code:: sh
+
+    $ /path/to/imagecluster/examples/inria_holiday.sh jpg/140*
+
+Here is the result of using a larger subset of 292 images from the same dataset
+(``/inria_holiday.sh jpg/14*``):
 
 .. image:: doc/clusters_many.png
 
@@ -136,8 +85,6 @@ can be grouped together depending on their similarity (y-axis).
 
 .. image:: doc/dendrogram.png
 
-
-
 One can now cut through the dendrogram tree at a certain height (``sim``
 parameter 0...1, y-axis) to create clusters of images with that level of
 similarity. ``sim=0`` is the root of the dendrogram (top in the plot) where
@@ -164,6 +111,28 @@ use (`thanks for the hint! <alexcnwy_>`_) the activations of the second to last
 fully connected layer ('fc2', 4096 nodes) as image fingerprints (numpy 1d array
 of shape ``(4096,)``) by default.
 
+Content and time distance
+-------------------------
+
+Image fingerprints represent content. Clustering based on content ignores time
+correlations. Say we have two images of some object that look similar and will
+thus be put into the same cluster. However, they might be in fact pictures of
+different objects, taken at different times -- which is our original holiday
+image use case (e.g. two images of a church from different cities, taken on
+separate trips). In this case, we want the images to end up in different
+clusters. We have a feature to mix content distance (``d_c``) and time distance
+(``d_t``) such that
+
+::
+
+    d = (1 - alpha) * d_c + alpha * d_t
+
+One can thus do pure content-based clustering (``alpha=0``) or pure time-based
+(``alpha=1``). The effect of the mixing is that fingerprint points representing
+content get pushed further apart when the corresponding images' time distance
+is large. That way, we achieve a transparent addition of time information
+without changing the clustering method.
+
 
 Quality of clustering & parameters to tune
 ------------------------------------------
@@ -211,13 +180,7 @@ Install
 
     $ pip3 install -e .
 
-or if you have the ``requirements.txt`` already installed (e.g. by your system's
-package manager)
-
-.. code:: sh
-
-    $ pip3 install -e . --no-deps
-
+See also samplepkg_.
 
 Contributions
 =============
@@ -246,3 +209,4 @@ Related projects
 .. _curse: https://en.wikipedia.org/wiki/Curse_of_dimensionality
 .. _gh_beleidy: https://github.com/beleidy/unsupervised-image-clustering
 .. _commit_pfx: https://github.com/elcorto/libstuff/blob/master/commit_prefixes
+.. _samplepkg: https://github.com/elcorto/samplepkg
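The mixing described in the README's "Content and time distance" section can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from ``imagecluster.calc``: it assumes the intended formula is the weighted sum of the two distances (consistent with ``alpha=0`` being pure content and ``alpha=1`` pure time), and it assumes both distance matrices are scaled to [0, 1] before mixing so that they are comparable. The helper names ``pairwise_distances`` and ``mix_distances`` and the toy data are hypothetical.

```python
import numpy as np


def pairwise_distances(points):
    """Return the full symmetric matrix of Euclidean distances between rows."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def mix_distances(d_c, d_t, alpha):
    """Blend content distance d_c and time distance d_t.

    alpha=0 uses content only, alpha=1 uses time only. Each matrix is
    scaled to [0, 1] first (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    d_c = np.asarray(d_c, dtype=float)
    d_t = np.asarray(d_t, dtype=float)
    d_c = d_c / d_c.max() if d_c.max() > 0 else d_c
    d_t = d_t / d_t.max() if d_t.max() > 0 else d_t
    return (1 - alpha) * d_c + alpha * d_t


# Hypothetical toy data: three 2-d "fingerprints" and POSIX timestamps (seconds).
fingerprints = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
timestamps = np.array([0.0, 86400.0, 3600.0])

d_c = pairwise_distances(fingerprints)
d_t = pairwise_distances(timestamps[:, None])
d = mix_distances(d_c, d_t, alpha=0.2)
```

In this toy data the first two images are close in content but were taken a day apart, so their mixed distance is larger than their scaled content distance alone. The mixed matrix could then be handed to any distance-based clustering method.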
diff --git a/examples/example_api.py b/examples/example_api.py
@@ -1,27 +1,42 @@
+#!/usr/bin/python3
+
 from imagecluster import calc as ic
+from imagecluster import io as icio
 from imagecluster import postproc as pp
 
-# Create image database in memory. This helps to feed images to the NN model
-# quickly.
-ias = ic.image_arrays('pics/', size=(224,224))
-
-# Create Keras NN model.
-model = ic.get_model()
-
-# Feed images through the model and extract fingerprints (feature vectors).
-fps = ic.fingerprints(ias, model)
+# # Create image database in memory. This helps to feed images to the NN model
+# # quickly.
+# image_arrays = icio.read_image_arrays('pics/', size=(224,224))
+#
+# # Create Keras NN model.
+# model = ic.get_model()
+#
+# # Feed images through the model and extract fingerprints (feature vectors).
+# fingerprints = ic.fingerprints(image_arrays, model)
+#
+# # Optionally run a PCA on the fingerprints to compress the dimensions. Use a
+# # cumulative explained variance ratio of 0.95.
+# fingerprints = ic.pca(fingerprints, n_components=0.95)
+#
+# # Read image timestamps. Need that to calculate the time distance, can be used
+# # in clustering.
+# timestamps = icio.read_timestamps('pics/')
 
-# Optionally run a PCA on the fingerprints to compress the dimensions. Use a
-# cumulative explained variance ratio of 0.95.
-fps = ic.pca(fps, n_components=0.95)
+# XXX where on disk? add to README
+# Convenience function to perform the steps above. Check for existing
+# `image_arrays` and `fingerprints` database files on disk and load them if
+# present. Running this again only loads data from disk, which is fast.
+image_arrays,fingerprints,timestamps = icio.get_image_data(
+    'pics/',
+    pca_kwds=dict(n_components=0.95))
 
-# Run clustering on the fingerprints.  Select clusters with similarity index
-# sim=0.5
-clusters = ic.cluster(fps, sim=0.5)
+# Run clustering on the fingerprints. Select clusters with similarity index
+# sim=0.5. Mix 80% content distance with 20% timestamp distance (alpha=0.2).
+clusters = ic.cluster(fingerprints, sim=0.5, timestamps=timestamps, alpha=0.2)
 
 # Create dirs with links to images. Dirs represent the clusters the images
 # belong to.
 pp.make_links(clusters, 'pics/imagecluster/clusters')
 
 # Plot images arranged in clusters.
-pp.visualize(clusters, ias)
+pp.visualize(clusters, image_arrays)
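The "load from disk if present, else compute and save" behaviour that the comment above attributes to ``icio.get_image_data()`` follows a common caching pattern. The sketch below shows that pattern only; it is not the library's implementation. The helper name ``load_or_compute`` is hypothetical, and the use of pickle files is an assumption (suggested by the ``.pk`` file names in the removed README text).

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the object pickled at `path`, if that file exists.

    Otherwise call `compute()`, write its result to `path`, and return it.
    """
    if os.path.exists(path):
        with open(path, "rb") as fd:
            return pickle.load(fd)
    result = compute()
    # Create the database directory on first use.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as fd:
        pickle.dump(result, fd)
    return result
```

A call such as ``load_or_compute('pics/imagecluster/fingerprints.pk', expensive_function)`` would run the expensive step only once; later calls read the file. Note that unpickling is only safe for files you created yourself.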
diff --git a/examples/example_main.py b/examples/example_main.py
diff --git a/examples/inria_holiday.sh b/examples/inria_holiday.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+# select 25 images
+#   ./this.sh jpg/100*
+#
+# select 274 images
+#   ./this.sh jpg/10*
+
+if ! [ -d jpg ]; then
+    for name in jpg1 jpg2; do
+        wget ftp://ftp.inrialpes.fr/pub/lear/douze/data/${name}.tar.gz
+        tar -xzf ${name}.tar.gz
+    done
+fi
+
+mkdir -p pics
+rm -rf pics/*
+for x in "$@"; do
+    f=$(echo "$x" | sed -re 's|jpg/||')
+    ln -s "$(readlink -f "jpg/$f")" "pics/$f"
+done
+
+echo "#images: $(ls pics | wc -l)"
diff --git a/examples/plot_dendrogram.py b/examples/plot_dendrogram.py
@@ -1,19 +1,20 @@
+#!/usr/bin/python3
+
 from matplotlib import pyplot as plt
 import numpy as np
 from scipy.cluster.hierarchy import dendrogram
 
 from imagecluster import calc as ic
+from imagecluster import io as icio
 
-ias = ic.image_arrays('pics/', size=(224,224))
+image_arrays = icio.read_image_arrays('pics/', size=(224,224))
 model = ic.get_model()
-fps = ic.fingerprints(ias, model)
-clusters,extra = ic.cluster(fps, sim=0.5, extra_out=True)
+fingerprints = ic.fingerprints(image_arrays, model)
+clusters,extra = ic.cluster(fingerprints, sim=0.5, extra_out=True)
 
 # linkage matrix Z
-Z = extra['Z']
-
 fig,ax = plt.subplots()
-dendrogram(Z, ax=ax)
+dendrogram(extra['Z'], ax=ax)
 
 # Adjust yaxis labels (values from Z[:,2]) to our definition of the `sim`
 # parameter.