Clustering and Quantization
Using photographs as visual input
Significant colors in a photograph.
This started as a very simple exploration of the simplest clustering algorithm in use, but I can see that a more comprehensive coverage of algorithms may be very valuable. Kandinsky aims to cover:
I. Basic building blocks
- Similarity/Distance Measures:
  - Euclidean Distance (Cartesian)
  - Manhattan Distance
  - Cosine Distance
  - Mahalanobis Distance
  - Domain-specific Distances
- Data Preprocessing:
  - Feature Scaling and Normalization
  - Dimensionality Reduction (e.g., PCA, t-SNE)
- Cluster Evaluation:
  - Internal Measures (Cohesion, Separation)
    - Silhouette Coefficient
    - Davies-Bouldin Index
  - External Measures (vs. Ground Truth)
    - Purity, Rand Index, Adjusted Rand Index
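The distance measures listed above can be sketched in a few lines; this is a minimal NumPy version, not a production implementation:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (Cartesian) distance.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance).
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; compares direction, ignores magnitude.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
```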
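For the Silhouette Coefficient above, a quick sketch of how an internal measure scores a clustering, assuming scikit-learn is available (the two-blob data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2-D.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette ranges from -1 to 1; well-separated clusters score near 1.
score = silhouette_score(X, labels)
print(round(score, 3))
```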
II. Clustering Algorithms
- Partitioning-Based
  - K-Means (hard assignments)
  - K-Medoids (more robust to outliers)
  - Fuzzy C-Means (soft assignments)
- Hierarchical
  - Agglomerative (Bottom-up)
    - Various linkage methods (single, complete, average)
  - Divisive (Top-down)
- Density-Based
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise, discovers clusters of varying shapes)
  - OPTICS (Ordering Points To Identify the Clustering Structure, extension of DBSCAN, provides reachability plot)
  - HDBSCAN (Improved density clustering, handles varying densities)
- Distribution-Based
  - Gaussian Mixture Models (GMM) (assumes data follows a mixture of Gaussian distributions)
- Grid-Based
  - STING (Statistical Information Grid-based Clustering)
  - CLIQUE (Clustering In QUEst)
- Neural Network-Based
  - Autoencoders (Variational, Denoising, etc.)
    - Learn latent representations for clustering
  - Self-Organizing Maps (SOMs)
    - Preserve neighborhood relationships in a grid-like space
  - Deep Embedded Clustering (DEC)
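Tying the K-Means entry back to the project's theme of finding significant colors in a photograph: a minimal sketch of K-Means color quantization, assuming scikit-learn is available. The random pixel array is a stand-in for a real image (e.g. one loaded via PIL with `np.asarray(Image.open("photo.jpg"))` - the filename is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for a real photo: a (64, 64, 3) array of random RGB pixels.
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Flatten to one row per pixel and cluster in RGB space.
pixels = image.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pixels)

# The cluster centers are the 5 "significant" colors (the palette);
# replacing each pixel with its center gives the quantized image.
palette = km.cluster_centers_.astype(np.uint8)
quantized = palette[km.labels_].reshape(image.shape)
print(palette.shape)  # (5, 3)
```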
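DBSCAN's defining behavior - growing clusters from dense regions and labelling isolated points as noise - can be seen in a small synthetic sketch, again assuming scikit-learn; the `eps` and `min_samples` values are chosen for this toy data, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(0, 0.2, (100, 2))            # one dense blob
outliers = np.array([[5.0, 5.0], [-5.0, 5.0]])  # two isolated points
X = np.vstack([dense, outliers])

# Points with >= min_samples neighbors within eps seed a cluster;
# points reachable from no dense region get the noise label -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(labels[-2:])  # the two isolated points are labelled -1 (noise)
```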
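And for the distribution-based entry: unlike K-Means' hard assignments, a GMM gives each point a probability of belonging to each component. A minimal sketch, assuming scikit-learn and synthetic 1-D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two 1-D Gaussian blobs centered at 0 and 4.
X = np.vstack([rng.normal(0, 0.5, (50, 1)),
               rng.normal(4, 0.5, (50, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # soft memberships; each row sums to 1
print(probs.shape)  # (100, 2)
```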
III. Additional Stuff to tackle when I get time and braincycles to spare...
- Clustering High-Dimensional Data: Image data often results in high-dimensional feature vectors, so techniques for dimensionality reduction become crucial. It is easy to see that distances like Euclidean lose their meaning as we go into higher-dimensional data. Also think about situations where one dimension may span a much smaller range than another - e.g., considering age and salary, age may only go from 0 to 100, while salary may range from 0 to 1 million, so the salary axis dominates the distance (hint: specifically for this example, prefer Manhattan distance over Euclidean).
- Clustering Large-Scale Data: When you have many images, scalable clustering algorithms (e.g., sampling or mini-batch variations of standard methods) are essential.
- Spectral Clustering (Flexible approach, particularly effective on non-convex cluster shapes)
- Graph-Based Clustering
- Hybrid Approaches (Combining traditional algorithms with neural networks)
- Affinity Propagation (Finds clusters based on message-passing between data points)
...so yeah, there's a bunch of work needed!
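The age/salary point above is easy to demonstrate: without scaling, the salary axis swamps Euclidean distance entirely. A minimal sketch (the spread values used for standardizing are assumed, purely for illustration):

```python
import numpy as np

# Two people: similar salary, very different age.  [age, salary]
a = np.array([20.0, 500_000.0])
b = np.array([70.0, 510_000.0])

# Unscaled: the 50-year age gap barely registers next to the salary gap.
d_raw = np.linalg.norm(a - b)

# Divide each feature by a rough spread (assumed values) before measuring.
scale = np.array([30.0, 250_000.0])
d_scaled = np.linalg.norm((a - b) / scale)

print(round(d_raw), round(d_scaled, 2))
```

After scaling, the age difference dominates - which matches intuition, since a 50-year age gap is far more significant than a 2% salary gap.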
- 00 Prep the Pictures
- 01 K-Means
- 015 Color Models
Kandinsky helped with the cinematography for our feature film Eight Down Toofaan Mail.
- Trailer for Eight Down Toofaan Mail on YouTube
- After a successful awards run and theatrical distribution in India, the film is now on YouTube (with English Subtitles)
- Opening day audience reactions YouTube Shorts
- Press release from Ministry of Information and Broadcasting, Govt. of India at IFFI 2021
- Full press conference at IFFI 2021 (YouTube)
- at PyCascades Seattle 2024: (unedited live-stream) YouTube, schedule
- at Vidyalankar Institute of Technology LinkedIn Post (no video)
- Some presentations use The Inter typeface family