DOC: add sphinx docs, update examples

elcorto · Jun 16, 2019 · 303e100
1 parent f9ad719
commit 303e100
Showing 27 changed files with 787 additions and 294 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,8 @@
 *.pyc
 *.egg-info
 *.swp
-*.bak
+*.bak*
 build/
+doc/build
+doc/source/generated
+doc/sphinx-autodoc
diff --git a/README.rst b/README.rst
@@ -1,212 +1,7 @@
-About
-=====
+``imagecluster`` is a package for clustering images by content. We use a
+pre-trained deep convolutional neural network to calculate image fingerprints
+which represent content. Those are used to cluster similar images. In addition
+to pure image content, it is possible to mix in timestamp information which
+improves clustering for temporally uncorrelated images.
 
-Package for clustering images by content. We use a pre-trained deep
-convolutional neural network to calculate image fingerprints which represent
-content. Those are used to cluster similar images. In addition to pure
-image content, it is possible to mix in timestamp information which improves
-clustering for temporally uncorrelated images.
-
-Usage
-=====
-
-The package is designed as a library. See ``examples/example_api.py``.
-
-.. Here is what you can do:
-
-.. .. code:: python
-.. example_api.py
-
-The bottleneck is ``~imagecluster.calc.fingerprints``, all other
-operations have negligible relative cost.
-
-Have a look at the clusters (as dirs with symlinks to the relevant files):
-
-.. code:: sh
-
-    $ tree pics/imagecluster/clusters/
-    pics/imagecluster/clusters/
-    ├── cluster_with_2
-    │   ├── cluster_0
-    │   │   ├── 140100.jpg -> /path/to/pics/140100.jpg
-    │   │   └── 140101.jpg -> /path/to/pics/140101.jpg
-    │   ├── cluster_1
-    │   │   ├── 140600.jpg -> /path/to/pics/140600.jpg
-    │   │   └── 140601.jpg -> /path/to/pics/140601.jpg
-    │   ├── cluster_2
-    │   │   ├── 140400.jpg -> /path/to/pics/140400.jpg
-    │   │   └── 140401.jpg -> /path/to/pics/140401.jpg
-    │   ├── cluster_3
-    │   │   ├── 140501.jpg -> /path/to/pics/140501.jpg
-    │   │   └── 140502.jpg -> /path/to/pics/140502.jpg
-    │   ├── cluster_4
-    │   │   ├── 140000.jpg -> /path/to/pics/140000.jpg
-    │   │   └── 140001.jpg -> /path/to/pics/140001.jpg
-    │   ├── cluster_5
-    │   │   ├── 140300.jpg -> /path/to/pics/140300.jpg
-    │   │   └── 140301.jpg -> /path/to/pics/140301.jpg
-    │   └── cluster_6
-    │       ├── 140200.jpg -> /path/to/pics/140200.jpg
-    │       └── 140201.jpg -> /path/to/pics/140201.jpg
-    └── cluster_with_3
-        └── cluster_0
-            ├── 140801.jpg -> /path/to/pics/140801.jpg
-            ├── 140802.jpg -> /path/to/pics/140802.jpg
-            └── 140803.jpg -> /path/to/pics/140803.jpg
-
-So there are some clusters with 2 images each, and one with 3 images. Let's look
-at the clusters:
-
-.. image:: doc/clusters.png
-
-For this example, we use a very small subset of the `Holiday image dataset
-<holiday_>`_ (25 images (all named 140*.jpg) of 1491 total images in the
-dataset). See ``examples/inria_holiday.sh`` for how to select such a subset:
-
-.. code:: sh
-
-    $ /path/to/imagecluster/examples/inria_holiday.sh jpg/140*
-
-Here is the result of using a larger subset of 292 images from the same dataset
-(``/inria_holiday.sh jpg/14*``):
-
-.. image:: doc/clusters_many.png
-
-Methods
-=======
-
-Clustering and similarity index
--------------------------------
-
-We use `hierarchical clustering <hc_>`_ (``calc.cluster()``), which compares
-the image fingerprints (4096-dim vectors) using a distance metric and produces
-a `dendrogram <dendro_>`_ as an intermediate result. This shows how the images
-can be grouped together depending on their similarity (y-axis).
-
-.. image:: doc/dendrogram.png
-
-One can now cut through the dendrogram tree at a certain height (``sim``
-parameter 0...1, y-axis) to create clusters of images with that level of
-similarity. ``sim=0`` is the root of the dendrogram (top in the plot) where
-there is only one node (= all images in one cluster). ``sim=1`` is equal to the
-end of the dendrogram tree (bottom in the plot), where each image is its own
-cluster. By varying the index between 0 and 1, we thus increase the number of
-clusters from 1 to the number of images. However, note that we only report
-clusters with at least 2 images, such that ``sim=1`` will in fact produce no
-results at all (unless there are completely identical images).
-
-Image fingerprints
-------------------
-
-The task of the fingerprints (feature vectors) is to represent an image's
-content (mountains, car, kitchen, person, ...). Deep convolutional neural
-networks trained on many different images have developed an internal
-representation of objects in higher layers, which we use for that purpose.
-
-To this end, we use a pre-trained NN (VGG16_ as implemented by Keras_). The
-weights will be downloaded *once* by Keras automatically upon first import and
-placed into ``~/.keras/models/``. The network was trained on ImageNet_ and is
-able to categorize images into 1000 classes (the last layer has 1000 nodes). We
-use (`thanks for the hint! <alexcnwy_>`_) the activations of the second to last
-fully connected layer ('fc2', 4096 nodes) as image fingerprints (numpy 1d array
-of shape ``(4096,)``) by default.
-
-Content and time distance
--------------------------
-
-Image fingerprints represent content. Clustering based on content ignores time
-correlations. Say we have two images of some object that look similar and will
-thus be put into the same cluster. However, they might be in fact pictures of
-different objects, taken at different times -- which is our original holiday
-image use case (e.g. two images of a church from different cities, taken on
-separate trips). In this case, we want the images to end up in different
-clusters. We have a feature to mix content distance (``d_c``) and time distance
-(``d_t``) such that
-
-::
-
-    d = (1 - alpha) * d_c + alpha * d_t
-
-One can thus do pure content-based clustering (``alpha=0``) or pure time-based
-(``alpha=1``). The effect of the mixing is that fingerprint points representing
-content get pushed further apart when the corresponding images' time distance
-is large. That way, we achieve a transparent addition of time information w/o
-changing the clustering method.
-
-
-Quality of clustering & parameters to tune
-------------------------------------------
-
-You may have noticed that in the example above, only 17 out of 25 images are
-put into clusters. The others are not assigned to any cluster. Technically they
-are in clusters of size 1, which we don't report by default (unless you use
-``calc.cluster(..., min_csize=1)``). One can now start to lower ``sim`` to
-find a good balance of clustering accuracy and the tolerable amount of
-dissimilarity among images within a cluster.
-
-Also, the parameters of the clustering method itself are worth tuning. ATM, we
-expose only some in ``calc.cluster()``. We tested several distance metrics and
-linkage methods, but this could nevertheless use a more elaborate evaluation.
-See ``calc.cluster()`` for "method", "metric" and "criterion" and the scipy
-functions called. If you do this and find settings which perform much better --
-PRs welcome!
-
-Additionally, some other implementations do not use any of the inner fully
-connected layers as features, but instead the output of the last pooling
-layer (layer 'flatten' in Keras' VGG16). We tested that briefly (see
-``get_model(... layer='fc2')`` or ``main(..., layer='fc2')``) and found our
-default 'fc2' to perform well enough. 'fc1' performs almost the same, while
-'flatten' seems to do worse. But again, a quantitative analysis is in order.
-
-PCA: Because of the `Curse of dimensionality <curse_>`_, it may be helpful to
-perform a PCA on the fingerprints before clustering to reduce the feature
-vector dimensions to, say, a few 100, thus making the distance metrics used in
-clustering more effective. However, our tests so far show no substantial change
-in clustering results, in accordance with what `others have found
-<gh_beleidy_>`_. See ``examples/example_api.py`` and ``calc.pca()``.
-
-
-Tests
-=====
-
-See ``imagecluster/tests/``. Use a test runner such as ``nosetests`` or
-``pytest``.
-
-
-Install
-=======
-
-.. code:: sh
-
-    $ pip3 install -e .
-
-See also samplepkg_.
-
-Contributions
-=============
-
-Contributions are welcome. To streamline the git log, consider using one of
-the prefixes mentioned `here <commit_pfx_>`_ in your commit message.
-
-
-Related projects
-================
-
-* https://artsexperiments.withgoogle.com/tsnemap/
-* https://github.com/YaleDHLab/pix-plot
-* https://github.com/beleidy/unsupervised-image-clustering
-* https://github.com/zegami/image-similarity-clustering
-* https://github.com/sujitpal/holiday-similarity
-
-
-.. _VGG16: https://arxiv.org/abs/1409.1556
-.. _Keras: https://keras.io
-.. _ImageNet: http://www.image-net.org/
-.. _alexcnwy: https://github.com/alexcnwy
-.. _hc: https://en.wikipedia.org/wiki/Hierarchical_clustering
-.. _dendro: https://en.wikipedia.org/wiki/Dendrogram
-.. _holiday: http://lear.inrialpes.fr/~jegou/data.php
-.. _curse: https://en.wikipedia.org/wiki/Curse_of_dimensionality
-.. _gh_beleidy: https://github.com/beleidy/unsupervised-image-clustering
-.. _commit_pfx: https://github.com/elcorto/libstuff/blob/master/commit_prefixes
-.. _samplepkg: https://github.com/elcorto/samplepkg
+Check the `documentation <https://elcorto.github.io/imagecluster>`_.
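
Note on the removed README text: the content/time mixing ``d = (1 - alpha) * d_c + alpha * d_t`` and the ``sim`` cut through the dendrogram can be illustrated with a small sketch. This is not the package's ``calc.cluster()`` implementation. The function names ``mixed_distance`` and ``cluster_by_similarity``, the normalization of both distances to [0, 1], and the default parameter values are assumptions made for illustration only. The scipy calls (``pdist``, ``linkage``, ``fcluster``) are standard.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def mixed_distance(fingerprints, timestamps, alpha=0.3):
    """Return the condensed distance vector d = (1 - alpha) * d_c + alpha * d_t.

    Hypothetical helper: both distances are scaled to [0, 1] before mixing,
    which is an assumption of this sketch.
    """
    # Content distance between fingerprint vectors (one row per image).
    d_c = distance.pdist(np.asarray(fingerprints, dtype=float), metric="euclidean")
    # Time distance between scalar timestamps (e.g. seconds since epoch).
    ts = np.asarray(timestamps, dtype=float)[:, None]
    d_t = distance.pdist(ts, metric="euclidean")
    # Normalize so that alpha weighs comparable quantities.
    if d_c.max() > 0:
        d_c = d_c / d_c.max()
    if d_t.max() > 0:
        d_t = d_t / d_t.max()
    return (1.0 - alpha) * d_c + alpha * d_t


def cluster_by_similarity(dists, sim=0.5, method="average"):
    """Cut the dendrogram at height (1 - sim) * max_height.

    sim=0 cuts at the root (all images in one cluster), sim=1 cuts at the
    leaves (each image alone unless distances are exactly zero).
    """
    Z = hierarchy.linkage(dists, method=method)
    max_height = Z[:, 2].max()
    return hierarchy.fcluster(Z, t=(1.0 - sim) * max_height, criterion="distance")


if __name__ == "__main__":
    # Two pairs of similar "images" (toy 2-dim fingerprints), taken at
    # different times.
    fps = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
    times = [0.0, 10.0, 1000.0, 1010.0]
    labels = cluster_by_similarity(mixed_distance(fps, times, alpha=0.3), sim=0.5)
    print(labels)
```

Choosing ``alpha=0`` reduces this to pure content-based clustering and ``alpha=1`` to pure time-based clustering, matching the description in the README text.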
diff --git a/TODO b/TODO
@@ -6,3 +6,7 @@
 * Add option to main() for passing a global fingerprints.pk file, which gets
   updated. Remove hard-coded location
   /path/to/pics/imagecluster/fingerprints.pk
+
+* We need an option to pass a list of file names to io.get_image_data(),
+  examples/inria_holiday.sh (old select.sh) is just stooopid. In case we pass
+  in file names, put db files in ./imagecluster/ by default. The end.
diff --git a/doc/Makefile b/doc/Makefile
@@ -0,0 +1,19 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/doc/clusters.png b/doc/clusters.png
diff --git a/doc/generate-apidoc.sh b/doc/generate-apidoc.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+err(){
+    echo "error: $*" >&2
+    exit 1
+}
+
+if [ $# -eq 1 ]; then
+    autodoc=$(readlink -f "$1")
+    [ -e "$autodoc" ] || err "not found: $autodoc"
+else
+    local_loc=sphinx-autodoc/sphinx-autodoc.py
+    std_loc=$HOME/soft/git/sphinx-autodoc/sphinx-autodoc.py
+    if which sphinx-autodoc.py; then
+        autodoc=sphinx-autodoc.py
+    elif [ -f $std_loc ]; then
+        autodoc=$std_loc
+    elif [ -f $local_loc ]; then
+        autodoc=$local_loc
+    else
+        git clone https://github.com/elcorto/sphinx-autodoc
+        autodoc=$local_loc
+    fi
+fi
+
+# ensure a clean generated tree
+find ../ \( -name "*.pyc" -o -name "__pycache__" \) -prune -exec rm -rfv {} +
+make clean
+rm -rfv build/ source/generated/
+
+# generate API doc rst files
+echo "using: $autodoc"
+$autodoc -s source -a generated/api \
+         -X 'test\.(test_|check_dep.*|utils|testenv)' imagecluster
+
+### make heading the same level as in source/written/index.rst
+##sed -i -re '/^API.*/,/[-]+/ s/-/=/g' source/generated/api/index.rst
diff --git a/doc/source/_static/clusters.png b/doc/source/_static/clusters.png
diff --git a/doc/clusters_many.png b/doc/source/_static/clusters_many.png
diff --git a/doc/dendrogram.png b/doc/source/_static/dendrogram.png