diff --git a/README.rst b/README.rst
index ee8c9f2..c29b01c 100644
--- a/README.rst
+++ b/README.rst
@@ -2,83 +2,23 @@ About
 =====
 
 Package for clustering images by content. We use a pre-trained deep
-convolutional neural network to calculate image fingerprints, which are then
-used to cluster similar images.
+convolutional neural network to calculate image fingerprints which represent
+content. Those are used to cluster similar images. In addition to pure image
+content, it is possible to mix in timestamp information, which helps to
+separate visually similar images taken at different times.
 
 Usage
 =====
-The package is designed as a library. Here is what you can do:
-
-.. code:: python
-
-    from imagecluster import calc as ic
-    from imagecluster import postproc as pp
-
-    # Create image database in memory. This helps to feed images to the NN model
-    # quickly.
-    ias = ic.image_arrays('pics/', size=(224,224))
-
-    # Create Keras NN model.
-    model = ic.get_model()
-
-    # Feed images through the model and extract fingerprints (feature vectors).
-    fps = ic.fingerprints(ias, model)
-
-    # Optionally run a PCA on the fingerprints to compress the dimensions. Use a
-    # cumulative explained variance ratio of 0.95.
-    fps = ic.pca(fps, n_components=0.95)
-
-    # Run clustering on the fingerprints. Select clusters with similarity index
-    # sim=0.5
-    clusters = ic.cluster(fps, sim=0.5)
-
-    # Create dirs with links to images. Dirs represent the clusters the images
-    # belong to.
-    pp.make_links(clusters, 'pics/imagecluster/clusters')
-
-    # Plot images arranged in clusters.
-    pp.visualize(clusters, ias)
-
-See also ``imagecluster.main.main()``. It does the same as the code above, but
-also saves/loads the image database and the fingerprints to/from disk, such
-that you can re-run the clustering and post-processing again without
-re-calculating fingerprints.
-
-Example session:
-
-.. code:: python
-
-    >>> from imagecluster import main
-    >>> main.main('pics/', sim=0.5, vis=True)
-    no fingerprints database pics/imagecluster/fingerprints.pk found
-    create image array database pics/imagecluster/images.pk
-    pics/140301.jpg
-    pics/140601.jpg
-    pics/140101.jpg
-    pics/140400.jpg
-    pics/140801.jpg
-    [...]
-    running all images through NN model ...
-    pics/140301.jpg
-    pics/140503.jpg
-    pics/140601.jpg
-    pics/140901.jpg
-    pics/140101.jpg
-    [...]
-    clustering ...
-    #images : #clusters
-    2 : 7
-    3 : 1
-    #images in clusters total: 17
-    cluster dir: pics/imagecluster/clusters
-
-If you run this again on the same directory, only the clustering (which is very
-fast) and the post-processing (links, visualization) will be repeated.
+The package is designed as a library. See ``examples/example_api.py``.
 
-For this example, we use a very small subset of the `Holiday image dataset
-`_ (25 images (all named 140*.jpg) of 1491 total images in the
-dataset).
+.. Here is what you can do:
+
+.. .. code:: python
+.. example_api.py
+
+The bottleneck is ``imagecluster.calc.fingerprints``; all other
+operations have negligible cost by comparison.
 
 Have a look
 at the clusters (as dirs with symlinks to the relevant files):
@@ -119,7 +59,16 @@ at the clusters:
 
 .. image:: doc/clusters.png
 
-Here is the result of using a larger subset of 292 images from the same dataset.
+For this example, we use a very small subset of the `Holiday image dataset
+`_ (25 images, all named ``140*.jpg``, of 1491 total images in the
+dataset). See ``examples/inria_holiday.sh`` for how to select such a subset:
+
+.. code:: sh
+
+   $ /path/to/imagecluster/examples/inria_holiday.sh jpg/140*
+
+Here is the result of using a larger subset of 292 images from the same dataset
+(``/inria_holiday.sh jpg/14*``):
 
 .. image:: doc/clusters_many.png
@@ -136,8 +85,6 @@ can be grouped together depending on their similarity (y-axis).
 
 .. image:: doc/dendrogram.png
 
-
-
 One can now cut through the dendrogram tree at a certain height (``sim``
 parameter 0...1, y-axis) to create clusters of images with that level of
 similarity. ``sim=0`` is the root of the dendrogram (top in the plot) where
@@ -164,6 +111,28 @@
 use (`thanks for the hint! `_) the activations of the second to last
 fully connected layer ('fc2', 4096 nodes) as image fingerprints (numpy 1d
 array of shape ``(4096,)``) by default.
 
+Content and time distance
+-------------------------
+
+Image fingerprints represent content. Clustering based on content ignores time
+correlations. Say we have two images of some object that look similar and will
+thus be put into the same cluster. However, they might in fact be pictures of
+different objects, taken at different times -- which is our original holiday
+image use case (e.g. two images of a church from different cities, taken on
+separate trips). In this case, we want the images to end up in different
+clusters. We have a feature to mix content distance (``d_c``) and time
+distance (``d_t``) such that
+
+::
+
+    d = (1 - alpha) * d_c + alpha * d_t
+
+One can thus do pure content-based clustering (``alpha=0``) or pure time-based
+clustering (``alpha=1``). The effect of the mixing is that fingerprint points
+representing content get pushed further apart when the corresponding images'
+time distance is large. That way, we achieve a transparent addition of time
+information w/o changing the clustering method.
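+
+For illustration, here is a minimal self-contained sketch of this mixing, as
+done internally by ``calc.cluster()`` on normalized condensed distance
+matrices (toy data; the variable names are placeholders, not package API):
+
+.. code:: python
+
+    import numpy as np
+    from scipy.spatial import distance
+
+    # toy data: 4 images, 3 fingerprint dimensions, timestamps in seconds
+    X = np.random.rand(4,3)
+    stamps = np.array([0., 60., 3600., 7200.])
+    alpha = 0.3
+
+    d_c = distance.pdist(X, metric='euclidean')
+    d_t = distance.pdist(stamps[:,None], metric='euclidean')
+    # normalize each to 0..1, then mix
+    d = (1 - alpha) * d_c/d_c.max() + alpha * d_t/d_t.max()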
 
 Quality of clustering & parameters to tune
 ------------------------------------------
@@ -211,13 +180,7 @@ Install
 
    $ pip3 install -e .
 
-or if you have the ``requirements.txt`` already installed (e.g. by your system's
-package manager)
-
-.. code:: sh
-
-   $ pip3 install -e . --no-deps
-
+See also samplepkg_.
 
 Contributions
 =============
@@ -246,3 +209,4 @@ Related projects
 .. _curse: https://en.wikipedia.org/wiki/Curse_of_dimensionality
 .. _gh_beleidy: https://github.com/beleidy/unsupervised-image-clustering
 .. _commit_pfx: https://github.com/elcorto/libstuff/blob/master/commit_prefixes
+.. _samplepkg: https://github.com/elcorto/samplepkg
diff --git a/examples/example_api.py b/examples/example_api.py
old mode 100644
new mode 100755
index babd9da..41a8792
--- a/examples/example_api.py
+++ b/examples/example_api.py
@@ -1,27 +1,42 @@
+#!/usr/bin/python3
+
 from imagecluster import calc as ic
+from imagecluster import io as icio
 from imagecluster import postproc as pp
 
-# Create image database in memory. This helps to feed images to the NN model
-# quickly.
-ias = ic.image_arrays('pics/', size=(224,224))
-
-# Create Keras NN model.
-model = ic.get_model()
-
-# Feed images through the model and extract fingerprints (feature vectors).
-fps = ic.fingerprints(ias, model)
+# # Create image database in memory. This helps to feed images to the NN model
+# # quickly.
+# image_arrays = icio.read_image_arrays('pics/', size=(224,224))
+#
+# # Create Keras NN model.
+# model = ic.get_model()
+#
+# # Feed images through the model and extract fingerprints (feature vectors).
+# fingerprints = ic.fingerprints(image_arrays, model)
+#
+# # Optionally run a PCA on the fingerprints to compress the dimensions. Use a
+# # cumulative explained variance ratio of 0.95.
+# fingerprints = ic.pca(fingerprints, n_components=0.95)
+#
+# # Read image timestamps, needed to calculate the time distance, which can be
+# # used in clustering.
+# timestamps = icio.read_timestamps('pics/')
 
-# Optionally run a PCA on the fingerprints to compress the dimensions. Use a
-# cumulative explained variance ratio of 0.95.
-fps = ic.pca(fps, n_components=0.95)
+# XXX where on disk? add to README
+# Convenience function to perform the steps above. Check for existing
+# `image_arrays` and `fingerprints` database files on disk (stored in
+# 'pics/imagecluster/') and load them if present. Running this again only
+# loads data from disk, which is fast.
+image_arrays,fingerprints,timestamps = icio.get_image_data(
+    'pics/',
+    pca_kwds=dict(n_components=0.95))
 
-# Run clustering on the fingerprints. Select clusters with similarity index
-# sim=0.5
-clusters = ic.cluster(fps, sim=0.5)
+# Run clustering on the fingerprints. Select clusters with similarity index
+# sim=0.5. Mix 80% content distance with 20% timestamp distance (alpha=0.2).
+clusters = ic.cluster(fingerprints, sim=0.5, timestamps=timestamps, alpha=0.2)
 
 # Create dirs with links to images. Dirs represent the clusters the images
 # belong to.
 pp.make_links(clusters, 'pics/imagecluster/clusters')
 
 # Plot images arranged in clusters.
-pp.visualize(clusters, ias)
+pp.visualize(clusters, image_arrays)
diff --git a/examples/example_main.py b/examples/example_main.py
deleted file mode 100644
index 0f1ec6d..0000000
--- a/examples/example_main.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from imagecluster import main
-
-main.main('pics/', sim=0.65, vis=True, max_csize=10, pca=True)
diff --git a/examples/inria_holiday.sh b/examples/inria_holiday.sh
new file mode 100755
index 0000000..e369d00
--- /dev/null
+++ b/examples/inria_holiday.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+# select 25 images
+#   ./this.sh jpg/100*
+#
+# select 274 images
+#   ./this.sh jpg/10*
+
+if ! [ -d jpg ]; then
+    for name in jpg1 jpg2; do
+        wget ftp://ftp.inrialpes.fr/pub/lear/douze/data/${name}.tar.gz
+        tar -xzf ${name}.tar.gz
+    done
+fi
+
+mkdir -p pics
+rm -rf pics/*
+for x in "$@"; do
+    f=$(echo "$x" | sed -re 's|jpg/||')
+    ln -s $(readlink -f jpg/$f) pics/$f
+done
+
+echo "#images: $(ls pics | wc -l)"
diff --git a/examples/plot_dendrogram.py b/examples/plot_dendrogram.py
old mode 100644
new mode 100755
index 220367f..256d224
--- a/examples/plot_dendrogram.py
+++ b/examples/plot_dendrogram.py
@@ -1,19 +1,20 @@
+#!/usr/bin/python3
+
 from matplotlib import pyplot as plt
 import numpy as np
 from scipy.cluster.hierarchy import dendrogram
 
 from imagecluster import calc as ic
+from imagecluster import io as icio
 
-ias = ic.image_arrays('pics/', size=(224,224))
+image_arrays = icio.read_image_arrays('pics/', size=(224,224))
 model = ic.get_model()
-fps = ic.fingerprints(ias, model)
-clusters,extra = ic.cluster(fps, sim=0.5, extra_out=True)
+fingerprints = ic.fingerprints(image_arrays, model)
+clusters,extra = ic.cluster(fingerprints, sim=0.5, extra_out=True)
 
 # linkage matrix Z
-Z = extra['Z']
-
 fig,ax = plt.subplots()
-dendrogram(Z, ax=ax)
+dendrogram(extra['Z'], ax=ax)
 
 # Adjust yaxis labels (values from Z[:,2]) to our definition of the `sim`
 # parameter.
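diff --git a/examples/example_time_distance.py b/examples/example_time_distance.py
new file mode 100755
--- /dev/null
+++ b/examples/example_time_distance.py
@@ -0,0 +1,18 @@
+#!/usr/bin/python3
+
+# Hypothetical editor's sketch (this file is not part of the original patch):
+# compare pure content-based clustering (alpha=0), the mix used in
+# example_api.py (alpha=0.2) and pure time-based clustering (alpha=1).
+# Only API shown elsewhere in this patch is used.
+
+from imagecluster import calc as ic
+from imagecluster import io as icio
+
+# loads cached image arrays/fingerprints from pics/imagecluster/ if present
+image_arrays, fingerprints, timestamps = icio.get_image_data('pics/')
+
+# d = (1 - alpha) * d_c + alpha * d_t
+for alpha in (0, 0.2, 1):
+    print(f"alpha={alpha}")
+    clusters = ic.cluster(fingerprints, sim=0.5, timestamps=timestamps,
+                          alpha=alpha)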
diff --git a/imagecluster/calc.py b/imagecluster/calc.py
index c7c90a6..dabd9f2 100644
--- a/imagecluster/calc.py
+++ b/imagecluster/calc.py
@@ -1,26 +1,18 @@
 import os
-
-import multiprocessing as mp
-import functools
 from collections import OrderedDict
 
-from PIL import Image
+import numpy as np
 from scipy.spatial import distance
 from scipy.cluster import hierarchy
-import numpy as np
 from sklearn.decomposition import PCA
 
 from keras.applications.vgg16 import VGG16, preprocess_input
-from keras.preprocessing import image
 from keras.models import Model
 
-from . import common
-
 pj = os.path.join
 
-
 def get_model(layer='fc2'):
     """Keras Model of the VGG16 network, with the output layer set to `layer`.
@@ -53,64 +45,6 @@ def get_model(layer='fc2'):
     return model
 
 
-def load_img_rgb(fn):
-    return Image.open(fn).convert('RGB')
-
-
-# keras.preprocessing.image.load_img() uses img.rezize(shape) with the default
-# interpolation of Image.resize() which is pretty bad (see
-# imagecluster/play/pil_resample_methods.py). Given that we are restricted to
-# small inputs of 224x224 by the VGG network, we should do our best to keep as
-# much information from the original image as possible. This is a gut feeling,
-# untested. But given that model.predict() is 10x slower than PIL image loading
-# and resizing .. who cares.
-#
-# (224, 224, 3)
-##img = image.load_img(fn, target_size=size)
-##... = image.img_to_array(img)
-def _img_worker(fn, size):
-    # Handle PIL error "OSError: broken data stream when reading image file".
-    # See https://github.com/python-pillow/Pillow/issues/1510 . We have this
-    # issue with smartphone panorama JPG files. But instead of bluntly setting
-    # ImageFile.LOAD_TRUNCATED_IMAGES = True and hoping for the best (is the
-    # image read, and till the end?), we catch the OSError thrown by PIL and
-    # ignore the file completely. This is better than reading potentially
-    # undefined data and process it. A more specialized exception from PILs
-    # side would be good, but let's hope that an OSError doesn't cover too much
-    # ground when reading data from disk :-)
-    try:
-        print(fn)
-        return fn, image.img_to_array(load_img_rgb(fn).resize(size, 3),
-                                      dtype=int)
-    except OSError as ex:
-        print(f"skipping {fn}: {ex}")
-        return fn, None
-
-
-def image_arrays(imagedir, size, ncores=mp.cpu_count()):
-    """Load images from `imagedir` and resize to `size`.
-
-    Parameters
-    ----------
-    imagedir : str
-    size : sequence length 2
-        (width, height), used in ``Image.open(filename).resize(size)``
-    ncores : int
-        run that many parallel processes
-
-    Returns
-    -------
-    dict
-        {filename: 3d array (height, width, 3),
-         ...
-        }
-    """
-    _f = functools.partial(_img_worker, size=size)
-    with mp.Pool(ncores) as pool:
-        ret = pool.map(_f, common.get_files(imagedir))
-    return {k: v for k,v in ret if v is not None}
-
-
 def fingerprint(img_arr, model):
     """Run image array (3d array) through `model` (keras.models.Model).
@@ -165,18 +99,18 @@
 ## return fn, fingerprint(img_arr, model)
 ##
 ##
-##def fingerprints(ias, model):
+##def fingerprints(image_arrays, model):
 ##    _f = functools.partial(_worker, model=model)
 ##    with mp.Pool(int(mp.cpu_count()/2)) as pool:
-##        ret = pool.map(_f, ias.items())
+##        ret = pool.map(_f, image_arrays.items())
 ##    return dict(ret)
 
 
-def fingerprints(ias, model):
-    """Calculate fingerprints for all image arrays in `ias`.
+def fingerprints(image_arrays, model):
+    """Calculate fingerprints for all image arrays in `image_arrays`.
 
     Parameters
     ----------
-    ias : see :func:`image_arrays`
+    image_arrays : see :func:`~imagecluster.io.read_image_arrays`
     model : see :func:`fingerprint`
 
     Returns
     -------
     dict
         {filename1: array([...]),
         ...
         }
     """
-    fps = {}
-    for fn,img_arr in ias.items():
+    fingerprints = {}
+    for fn,img_arr in image_arrays.items():
         print(fn)
-        fps[fn] = fingerprint(img_arr, model)
-    return fps
+        fingerprints[fn] = fingerprint(img_arr, model)
+    return fingerprints
 
 
-def pca(fps, n_components=0.9, **kwds):
+def pca(fingerprints, n_components=0.9, **kwds):
     if 'n_components' not in kwds.keys():
         kwds['n_components'] = n_components
     # Yes in recent Pythons, dicts are ordered in CPython, but still.
-    _fps = OrderedDict(fps)
-    X = np.array(list(_fps.values()))
+    _fingerprints = OrderedDict(fingerprints)
+    X = np.array(list(_fingerprints.values()))
     Xp = PCA(**kwds).fit(X).transform(X)
-    return {k:v for k,v in zip(_fps.keys(), Xp)}
+    return {k:v for k,v in zip(_fingerprints.keys(), Xp)}
 
 
-def cluster(fps, sim=0.5, method='average', metric='euclidean',
-            extra_out=False, print_stats=True, min_csize=2):
+def cluster(fingerprints, sim=0.5, timestamps=None, alpha=0.3, method='average',
+            metric='euclidean', extra_out=False, print_stats=True, min_csize=2):
     """Hierarchical clustering of images based on image fingerprints.
 
     Parameters
     ----------
-    fps : dict
+    fingerprints : dict
         output of :func:`fingerprints`
     sim : float 0..1
         similarity index
+    timestamps : dict
+        output of :func:`~imagecluster.io.read_timestamps`
+    alpha : float
+        mixing parameter of image content distance and time distance, ignored
+        if `timestamps` is None
     method : see scipy.hierarchy.linkage(), all except 'centroid' produce
         pretty much the same result
     metric : see scipy.hierarchy.linkage(), make sure to use 'euclidean' in
@@ -240,14 +179,30 @@ def cluster(fps, sim=0.5, method='average', metric='euclidean',
         if `extra_out` is True
     """
     assert 0 <= sim <= 1, "sim not 0..1"
+    assert 0 <= alpha <= 1, "alpha not 0..1"
     assert min_csize >= 1, "min_csize must be >= 1"
-    files = list(fps.keys())
+    files = list(fingerprints.keys())
     # array(list(...)): 2d array
     # [[... fingerprint of image1 (4096,) ...],
     #  [... fingerprint of image2 (4096,) ...],
     #  ...
     # ]
-    dfps = distance.pdist(np.array(list(fps.values())), metric)
+    dfps = distance.pdist(np.array(list(fingerprints.values())), metric)
+    if timestamps is not None:
+        # Sanity check, as long as we don't have a single data struct to
+        # keep fingerprints and timestamps, as well as image data. This is not
+        # pretty, but at least a safety hook.
+        set_files = set(files)
+        set_tsfiles = set(timestamps.keys())
+        set_diff = set_files.symmetric_difference(set_tsfiles)
+        assert len(set_diff) == 0, (f"files in fingerprints and timestamps do "
+                                    f"not match: diff={set_diff}")
+        # use 'files' to make sure we have the same order as in 'fingerprints'
+        tsarr = np.array([timestamps[k] for k in files])[:,None]
+        dts = distance.pdist(tsarr, metric)
+        dts = dts / dts.max()
+        dfps = dfps / dfps.max()
+        # d = (1 - alpha) * d_c + alpha * d_t
+        dfps = dfps * (1 - alpha) + dts * alpha
     # hierarchical/agglomerative clustering (Z = linkage matrix, construct
     # dendrogram), plot: scipy.cluster.hierarchy.dendrogram(Z)
     Z = hierarchy.linkage(dfps, method=method, metric=metric)
diff --git a/imagecluster/common.py b/imagecluster/common.py
deleted file mode 100644
index ea9585c..0000000
--- a/imagecluster/common.py
+++ /dev/null
@@ -1,19 +0,0 @@
-import re
-import pickle
-import os
-
-
-def read_pk(fn):
-    with open(fn, 'rb') as fd:
-        ret = pickle.load(fd)
-    return ret
-
-
-def write_pk(obj, fn):
-    with open(fn, 'wb') as fd:
-        pickle.dump(obj, fd)
-
-
-def get_files(dr, ext='jpg|jpeg|bmp|png'):
-    rex = re.compile(r'^.*\.({})$'.format(ext), re.I)
-    return [os.path.join(dr,base) for base in os.listdir(dr) if rex.match(base)]
diff --git a/imagecluster/exceptions.py b/imagecluster/exceptions.py
new file mode 100644
index 0000000..a484ddf
--- /dev/null
+++ b/imagecluster/exceptions.py
@@ -0,0 +1,6 @@
+class ICError(Exception):
+    pass
+
+
+class ICExifReadError(ICError):
+    pass
diff --git a/imagecluster/io.py b/imagecluster/io.py
new file mode 100644
index 0000000..7e754d5
--- /dev/null
+++ b/imagecluster/io.py
@@ -0,0 +1,178 @@
+import datetime
+import functools
+import multiprocessing as mp
+import os
+import pickle
+import re
+
+from keras.preprocessing import image
+import PIL.Image
+
+from . import exceptions
+from . import calc as ic
+
+pj = os.path.join
+
+ic_base_dir = 'imagecluster'
+
+
+def read_pk(filename):
+    with open(filename, 'rb') as fd:
+        ret = pickle.load(fd)
+    return ret
+
+
+def write_pk(obj, filename):
+    os.makedirs(os.path.dirname(filename), exist_ok=True)
+    with open(filename, 'wb') as fd:
+        pickle.dump(obj, fd)
+
+
+def get_files(dr, ext='jpg|jpeg|bmp|png'):
+    rex = re.compile(r'^.*\.({})$'.format(ext), re.I)
+    return [os.path.join(dr,base) for base in os.listdir(dr) if rex.match(base)]
+
+
+def exif_timestamp(filename):
+    # PIL lazy-loads the image data, so this open and _getexif() is fast.
+    img = PIL.Image.open(filename)
+    if ('exif' not in img.info.keys()) or (not hasattr(img, '_getexif')):
+        raise exceptions.ICExifReadError(f"no EXIF data found in {filename}")
+    # Avoid constructing the whole EXIF dict just to extract the DateTime field.
+    # DateTime -> key 306 is in the EXIF standard, so let's use that directly.
+    ## date_time = {TAGS[k] : v for k,v in exif.items()}['DateTime']
+    exif = img._getexif()
+    key = 306
+    if key not in exif.keys():
+        raise exceptions.ICExifReadError(f"key 306 (DateTime) not found in "
+                                         f"EXIF data of file {filename}")
+    # '2019:03:10 22:42:42'
+    date_time = exif[key]
+    if date_time.count(':') != 4:
+        msg = f"unsupported EXIF DateTime format in '{date_time}' of {filename}"
+        raise exceptions.ICExifReadError(msg)
+    # '2019:03:10 22:42:42' -> ['2019', '03', '10', '22', '42', '42']
+    date_time_str = date_time.replace(':', ' ').split()
+    names = ('year', 'month', 'day', 'hour', 'minute', 'second')
+    stamp = datetime.datetime(**{nn:int(vv) for nn,vv in zip(names,date_time_str)},
+                              tzinfo=datetime.timezone.utc).timestamp()
+    return stamp
+
+
+def stat_timestamp(filename):
+    return os.stat(filename).st_mtime
+
+
+def timestamp(filename, source='auto'):
+    if source == 'auto':
+        try:
+            return exif_timestamp(filename)
+        except exceptions.ICExifReadError:
+            return stat_timestamp(filename)
+    elif source == 'stat':
+        return stat_timestamp(filename)
+    elif source == 'exif':
+        return exif_timestamp(filename)
+    else:
+        raise ValueError("source not in ['stat', 'exif', 'auto']")
+
+
+# TODO some code dups below, fix later by fancy factory functions
+
+# keras.preprocessing.image.load_img() uses img.resize(shape) with the default
+# interpolation of Image.resize() which is pretty bad (see
+# imagecluster/play/pil_resample_methods.py). Given that we are restricted to
+# small inputs of 224x224 by the VGG network, we should do our best to keep as
+# much information from the original image as possible. This is a gut feeling,
+# untested. But given that model.predict() is 10x slower than PIL image loading
+# and resizing .. who cares.
+#
+# (224, 224, 3)
+##img = image.load_img(filename, target_size=size)
+##... = image.img_to_array(img)
+def _img_arr_worker(filename, size):
+    # Handle PIL error "OSError: broken data stream when reading image file".
+    # See https://github.com/python-pillow/Pillow/issues/1510 . We have this
+    # issue with smartphone panorama JPG files. But instead of bluntly setting
+    # ImageFile.LOAD_TRUNCATED_IMAGES = True and hoping for the best (is the
+    # image read, and till the end?), we catch the OSError thrown by PIL and
+    # ignore the file completely. This is better than reading potentially
+    # undefined data and processing it. A more specialized exception from PIL's
+    # side would be good, but let's hope that an OSError doesn't cover too much
+    # ground when reading data from disk :-)
+    try:
+        print(filename)
+        # resample=3 is PIL.Image.BICUBIC
+        img = PIL.Image.open(filename).convert('RGB').resize(size, resample=3)
+        arr = image.img_to_array(img, dtype=int)
+        return filename, arr
+    except OSError as ex:
+        print(f"skipping {filename}: {ex}")
+        return filename, None
+
+
+def _timestamp_worker(filename, source):
+    try:
+        return filename, timestamp(filename, source)
+    except OSError as ex:
+        print(f"skipping {filename}: {ex}")
+        return filename, None
+
+
+def read_image_arrays(imagedir, size, ncores=mp.cpu_count()):
+    """Load images from `imagedir` and resize to `size`.
+
+    Parameters
+    ----------
+    imagedir : str
+    size : sequence length 2
+        (width, height), used in ``Image.open(filename).resize(size)``
+    ncores : int
+        run that many parallel processes
+
+    Returns
+    -------
+    dict
+        {filename: 3d array (height, width, 3), ...}
+    """
+    _f = functools.partial(_img_arr_worker, size=size)
+    with mp.Pool(ncores) as pool:
+        ret = pool.map(_f, get_files(imagedir))
+    return {k: v for k,v in ret if v is not None}
+
+
+def read_timestamps(imagedir, source='auto', ncores=mp.cpu_count()):
+    _f = functools.partial(_timestamp_worker, source=source)
+    with mp.Pool(ncores) as pool:
+        ret = pool.map(_f, get_files(imagedir))
+    return {k: v for k,v in ret if v is not None}
+
+
+# TODO fingerprints and timestamps may have different images which have been
+# skipped -> we need a data struct to hold all image data and mask out the
+# skipped ones. For now we have a check in calc.cluster()
+def get_image_data(imagedir, model_kwds=dict(layer='fc2'),
+                   img_kwds=dict(size=(224,224)),
+                   timestamps_kwds=dict(source='auto'), pca_kwds=None):
+    """Return all image data needed for clustering."""
+    fingerprints_fn = pj(imagedir, ic_base_dir, 'fingerprints.pk')
+    image_arrays_fn = pj(imagedir, ic_base_dir, 'images.pk')
+    if os.path.exists(image_arrays_fn):
+        print(f"reading image arrays {image_arrays_fn} ...")
+        image_arrays = read_pk(image_arrays_fn)
+    else:
+        print(f"create image arrays {image_arrays_fn}")
+        image_arrays = read_image_arrays(imagedir, **img_kwds)
+        write_pk(image_arrays, image_arrays_fn)
+    if os.path.exists(fingerprints_fn):
+        print(f"reading fingerprints {fingerprints_fn} ...")
+        fingerprints = read_pk(fingerprints_fn)
+    else:
+        print(f"create fingerprints {fingerprints_fn}")
+        fingerprints = ic.fingerprints(image_arrays, ic.get_model(**model_kwds))
+        if pca_kwds is not None:
+            fingerprints = ic.pca(fingerprints, **pca_kwds)
+        write_pk(fingerprints, fingerprints_fn)
+    if timestamps_kwds is not None:
+        print("reading timestamps ...")
+        timestamps = read_timestamps(imagedir, **timestamps_kwds)
+    else:
+        # no timestamps requested, clustering will use content only
+        timestamps = None
+    return image_arrays, fingerprints, timestamps
diff --git a/imagecluster/main.py b/imagecluster/main.py
deleted file mode 100644
index 85c0818..0000000
--- a/imagecluster/main.py
+++ /dev/null
@@ -1,84 +0,0 @@
-import os
-
-from imagecluster import calc as ic
-from imagecluster import common as co
-from imagecluster import postproc as pp
-
-pj = os.path.join
-
-
-ic_base_dir = 'imagecluster'
-
-
-def main(imagedir, sim=0.5, layer='fc2', size=(224,224), links=True, vis=False,
-         max_csize=None, pca=False, pca_params=dict(n_components=0.9)):
-    """Example main app using this library.
-
-    Upon first invocation, the image and fingerprint databases are built and
-    written to disk. Each new invocation loads those and only repeats
-        * clustering
-        * creation of links to files in clusters
-        * visualization (if `vis=True`)
-
-    This is good for playing around with the `sim` parameter, for
-    instance, which only influences clustering.
-
-    Parameters
-    ----------
-    imagedir : str
-        path to directory with images
-    sim : float (0..1)
-        similarity index (see :func:`calc.cluster`)
-    layer : str
-        which layer to use as feature vector (see
-        :func:`calc.get_model`)
-    size : tuple
-        input image size (width, height), must match `model`, e.g. (224,224)
-    links : bool
-        create dirs with links
-    vis : bool
-        plot images in clusters
-    max_csize : max number of images per cluster for visualization (see
-        :mod:`~postproc`)
-    pca : bool
-        Perform PCA on fingerprints before clustering, using `pca_params`.
- pca_params : dict - kwargs to sklearn's PCA - - Notes - ----- - imagedir : To select only a subset of the images, create an `imagedir` and - symlink your selected images there. In the future, we may add support - for passing a list of files, should the need arise. But then again, - this function is only an example front-end. - """ - fps_fn = pj(imagedir, ic_base_dir, 'fingerprints.pk') - ias_fn = pj(imagedir, ic_base_dir, 'images.pk') - ias = None - if not os.path.exists(fps_fn): - print(f"no fingerprints database {fps_fn} found") - os.makedirs(os.path.dirname(fps_fn), exist_ok=True) - model = ic.get_model(layer=layer) - if not os.path.exists(ias_fn): - print(f"create image array database {ias_fn}") - ias = ic.image_arrays(imagedir, size=size) - co.write_pk(ias, ias_fn) - else: - ias = co.read_pk(ias_fn) - print("running all images through NN model ...") - fps = ic.fingerprints(ias, model) - co.write_pk(fps, fps_fn) - else: - print(f"loading fingerprints database {fps_fn} ...") - fps = co.read_pk(fps_fn) - if pca: - fps = ic.pca(fps, **pca_params) - print("pca dims:", list(fps.values())[0].shape[0]) - print("clustering ...") - clusters = ic.cluster(fps, sim) - if links: - pp.make_links(clusters, pj(imagedir, ic_base_dir, 'clusters')) - if vis: - if ias is None: - ias = co.read_pk(ias_fn) - pp.visualize(clusters, ias, max_csize=max_csize) diff --git a/imagecluster/postproc.py b/imagecluster/postproc.py index d3a0ce8..f54e754 100644 --- a/imagecluster/postproc.py +++ b/imagecluster/postproc.py @@ -1,5 +1,6 @@ import os import shutil +import functools from matplotlib import pyplot as plt import numpy as np @@ -9,15 +10,15 @@ pj = os.path.join -def plot_clusters(clusters, ias, max_csize=None, mem_limit=1024**3): - """Plot `clusters` of images in `ias`. +def plot_clusters(clusters, image_arrays, max_csize=None, mem_limit=1024**3): + """Plot `clusters` of images in `image_arrays`. For interactive work, use :func:`visualize` instead. 
 
     Parameters
     ----------
     clusters : see :func:`calc.cluster`
-    ias : see :func:`calc.image_arrays`
+    image_arrays : see :func:`~imagecluster.io.read_image_arrays`
     max_csize : int
         plot clusters with at most this many images
     mem_limit : float or int, bytes
@@ -32,7 +33,7 @@
     ncols = stats[:,1].sum()  # csize (number of images per cluster)
     nrows = stats[:,0].max()
-    shape = ias[list(ias.keys())[0]].shape[:2]
+    shape = image_arrays[list(image_arrays.keys())[0]].shape[:2]
     mem = nrows * shape[0] * ncols * shape[1] * 3
     if mem > mem_limit:
         raise Exception(f"size of plot array ({mem/1024**2} MiB) > mem_limit "
@@ -45,7 +46,7 @@
         for cluster in clusters[csize]:
             icol += 1
             for irow, filename in enumerate(cluster):
-                img_arr = ias[filename]
+                img_arr = image_arrays[filename]
                 arr[irow*shape[0]:(irow+1)*shape[0],
                     icol*shape[1]:(icol+1)*shape[1], :] = img_arr
     print(f"plot array ({arr.dtype}) size: {arr.nbytes/1024**2} MiB")
@@ -56,6 +57,7 @@
     return fig,ax
 
 
+@functools.wraps(plot_clusters)
 def visualize(*args, **kwds):
     plot_clusters(*args, **kwds)
     plt.show()
diff --git a/imagecluster/tests/tests.py b/imagecluster/tests/tests.py
index 5df42e0..709dff2 100644
--- a/imagecluster/tests/tests.py
+++ b/imagecluster/tests/tests.py
@@ -1,15 +1,17 @@
 import logging
 import os
-import pickle
 import shutil
 import tempfile
+import copy
+import datetime
 
 import numpy as np
 from matplotlib.pyplot import imsave
 import PIL.Image
+import piexif
 
-from imagecluster import main
 from imagecluster import calc as ic
+from imagecluster import io as icio
 
 
 # https://stackoverflow.com/a/39708493
@@ -17,11 +19,19 @@
 pj = os.path.join
 
 
+# TODO re-use ImagedirCtx where possible, we write files in each context,
+# re-use ctxs which don't alter the files
 class ImagedirCtx:
-    def __init__(self):
+    def __init__(self, fmt='png'):
+        assert fmt in ['jpg', 'png']
+        date_time_base_dct = dict(year=2019,
+                                  month=12,
+                                  day=31,
+                                  hour=23,
+                                  minute=42)
         imagedir = tempfile.mkdtemp(prefix='imagecluster_')
-        dbfn = pj(imagedir, main.ic_base_dir, 'fingerprints.pk')
+        dbfn = pj(imagedir, icio.ic_base_dir, 'fingerprints.pk')
         arr = np.ones((500,600,3), dtype=np.uint8)
         white = np.ones_like(arr) * 255
         black = np.zeros_like(arr)
@@ -32,19 +42,33 @@
                       black=[black]*4)
         image_fns = []
         clusters = {}
+        second = 0
         for color, arrs in images.items():
             nimg = len(arrs)
             clus = clusters.get(nimg, [])
             for idx, arr in enumerate(arrs):
-                fn = pj(imagedir, f'image_{color}_{idx}.png')
-                imsave(fn, arr)
+                if fmt == 'png':
+                    fn = pj(imagedir, f'image_{color}_{idx}.png')
+                    imsave(fn, arr)
+                elif fmt == 'jpg':
+                    fn = pj(imagedir, f'image_{color}_{idx}.jpg')
+                    img = PIL.Image.fromarray(arr, mode='RGB')
+                    # just the DateTime field
+                    date_time_dct = copy.deepcopy(date_time_base_dct)
+                    date_time_dct.update(second=second)
+                    exif_date_time_fmt = '{year}:{month}:{day} {hour}:{minute}:{second}'
+                    exif_date_time_str = exif_date_time_fmt.format(**date_time_dct)
+                    piexif_exif_dct = {'0th': {306: exif_date_time_str}}
+                    img.save(fn, exif=piexif.dump(piexif_exif_dct))
                 image_fns.append(fn)
                 clus.append(fn)
+                second += 1
             clusters[nimg] = [clus]
         self.imagedir = imagedir
        self.dbfn = dbfn
         self.image_fns = image_fns
         self.clusters = clusters
+        self.date_time_base_dct = date_time_base_dct
         print(clusters)
 
     def __enter__(self):
@@ -54,31 +78,37 @@ def __exit__(self, *args):
         shutil.rmtree(self.imagedir)
 
 
-def test_main_basic():
+def test_api_get_image_data():
     with ImagedirCtx() as ctx:
         # run 1: create fingerprints database, run clustering
-        main.main(ctx.imagedir)
-        # run 2: only run clustering, should be much faster, this time also use PCA
-        main.main(ctx.imagedir, pca=True)
-        with open(ctx.dbfn, 'rb') as fd:
-            fps = pickle.load(fd)
-        assert len(fps.keys()) == len(ctx.image_fns)
-        assert set(fps.keys()) == set(ctx.image_fns)
-        for kk,vv in fps.items():
-            assert isinstance(vv, np.ndarray)
-            assert len(vv) == 4096
+        image_arrays,fingerprints,timestamps = icio.get_image_data(ctx.imagedir)
+        # run 2: only loads data from disk, should be much faster; this time
+        # use all kwds (test API)
+        image_arrays,fingerprints,timestamps = icio.get_image_data(
+            ctx.imagedir,
+            pca_kwds=dict(n_components=0.95),
+            model_kwds=dict(layer='fc2'),
+            img_kwds=dict(size=(224,224)),
+            timestamps_kwds=dict(source='auto'))
+        assert len(fingerprints.keys()) == len(ctx.image_fns)
+        assert set(fingerprints.keys()) == set(ctx.image_fns)
 
 
-def test_cluster():
-    # use API
+def test_low_level_api_and_clustering():
+    # use low level API (same as get_image_data) but call all funcs
     # test clustering
     with ImagedirCtx() as ctx:
-        ias = ic.image_arrays(ctx.imagedir, size=(224,224))
+        image_arrays = icio.read_image_arrays(ctx.imagedir, size=(224,224))
         model = ic.get_model()
-        fps = ic.fingerprints(ias, model)
-        fps = ic.pca(fps, n_components=0.95)
-        clusters = ic.cluster(fps, sim=0.5)
+        fingerprints = ic.fingerprints(image_arrays, model)
+        for kk,vv in fingerprints.items():
+            assert isinstance(vv, np.ndarray)
+            assert len(vv) == 4096, len(vv)
+        fingerprints = ic.pca(fingerprints, n_components=0.95)
+        clusters = ic.cluster(fingerprints, sim=0.5)
         assert set(clusters.keys()) == set(ctx.clusters.keys())
+        assert len(fingerprints.keys()) == len(ctx.image_fns)
+        assert set(fingerprints.keys()) == set(ctx.image_fns)
         for nimg in ctx.clusters.keys():
             for val_clus, ref_clus in zip(clusters[nimg], ctx.clusters[nimg]):
                 msg = f"ref_clus: {ref_clus}, val_clus: {val_clus}"
@@ -91,9 +121,9 @@ def test_png_rgba_io():
     shape2d = (123,456)
     shape = shape2d + (3,)
     rgb = (np.random.rand(*shape) * 255).astype(np.uint8)
-    alpha1 = np.ones(shape2d, dtype=np.uint8) * 255 # white
-    alpha2 = np.zeros(shape2d, dtype=np.uint8) # black
-    alpha3 = (np.random.rand(*shape2d) * 255).astype(np.uint8) # noise
+    alpha1 = np.ones(shape2d, dtype=np.uint8) * 255            # white
+    alpha2 = np.zeros(shape2d, dtype=np.uint8)                 # black
+    alpha3 = (np.random.rand(*shape2d) * 255).astype(np.uint8) # noise
     for alpha in [alpha1, alpha2, alpha3]:
         rgba = np.empty(shape2d + (4,), dtype=np.uint8)
         rgba[..., :3] = rgb
@@ -103,9 +133,29 @@
         img = PIL.Image.open(fn)
         assert img.mode == 'RGBA', img.mode
         assert img.format == 'PNG', img.format
-        rgb2 = np.array(ic.load_img_rgb(fn))
+        rgb2 = np.array(PIL.Image.open(fn).convert('RGB'))
         assert (rgb == rgb2).all()
         assert rgb.dtype == rgb2.dtype
     finally:
         if os.path.exists(fn):
             os.remove(fn)
+
+
+def test_img_timestamp():
+    with ImagedirCtx(fmt='jpg') as ctx:
+        for second, fn in enumerate(ctx.image_fns):
+            stamp = icio.exif_timestamp(fn)
+            dct = copy.deepcopy(ctx.date_time_base_dct)
+            dct.update(second=second)
+            ref = datetime.datetime(**dct, tzinfo=datetime.timezone.utc).timestamp()
+            assert stamp is not None
+            assert stamp == ref, f"stamp={stamp} ref={ref}"
+            # try EXIF first
+            assert stamp == icio.timestamp(fn, source='auto')
+            assert stamp == icio.timestamp(fn, source='exif')
+
+    with ImagedirCtx(fmt='png') as ctx:
+        fn = ctx.image_fns[0]
+        assert icio.stat_timestamp(fn) is not None
+        assert icio.timestamp(fn, source='auto') is not None
+        assert icio.timestamp(fn, source='auto') == icio.stat_timestamp(fn)
diff --git a/requirements.txt b/requirements.txt
index 7f88ea7..6fc2c64 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,3 +4,4 @@ keras
 Pillow
 scikit-learn
 matplotlib
+piexif