Topic detection aims to identify the different topics discussed in a corpus of textual documents, describing each topic through a set of keywords that helps us understand it. Many approaches exist for this purpose: probabilistic generative models such as topic models, soft clustering techniques like Gaussian Mixture Models, and hard clustering algorithms like the well-known K-Means. This project implements text clustering with K-Means++ on a large dataset. The goal is to improve the efficiency of the K-Means algorithm without losing effectiveness.
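A minimal sketch of such a pipeline with scikit-learn is shown below; the file name, number of clusters, and vectorizer settings are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch: TF-IDF vectorization + K-Means with k-means++ seeding.
# "reviews.txt" (one review per line) and n_clusters=2 are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open("reviews.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f]

# TF-IDF representation of the reviews
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(docs)

# K-Means with the k-means++ seeding strategy (scikit-learn's default init)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
```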
The dataset is not available in this repository because of its large size. It contains 314808 reviews, mostly in English. Cluster labels were available for evaluation, which was performed using the accuracy metric.
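Since K-Means assigns arbitrary cluster ids, computing accuracy requires mapping clusters onto the ground-truth labels. One common convention (shown here as an assumption; the report may use a different mapping) is a best matching via the Hungarian algorithm:

```python
# Clustering accuracy with cluster-to-label matching (Hungarian algorithm).
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    # Negate the counts so the assignment maximizes correctly matched pairs
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()
```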
We compared the results obtained with four approaches (a sketch of these variants follows the list):
- Standard K-Means on the original dataset, which took about 1 hour of computation
- Dimensionality reduction through Truncated SVD followed by standard K-Means
- Mini-batch K-Means on the original dataset
- Random sampling (10% of the original dataset) with dimensionality reduction and standard K-Means
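The sketch below outlines how these four variants could be set up with scikit-learn. It reuses `X` from the first snippet, and the number of clusters, SVD components, batch size, and sample size are assumptions rather than the project's exact parameters.

```python
# Rough sketch of the four compared variants (parameters are illustrative).
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

k = 2  # assumed number of clusters

# 1) Standard K-Means on the full TF-IDF matrix (slowest variant)
km_full = KMeans(n_clusters=k, init="k-means++", n_init=10,
                 random_state=42).fit(X)

# 2) Truncated SVD (LSA) to a low-dimensional dense space, then K-Means
svd = TruncatedSVD(n_components=100, random_state=42)
X_svd = svd.fit_transform(X)
km_svd = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=42).fit(X_svd)

# 3) Mini-batch K-Means directly on the sparse TF-IDF matrix
mbk = MiniBatchKMeans(n_clusters=k, init="k-means++", batch_size=1024,
                      random_state=42).fit(X)

# 4) Random 10% sample + Truncated SVD + standard K-Means
rng = np.random.default_rng(42)
sample_idx = rng.choice(X.shape[0], size=X.shape[0] // 10, replace=False)
svd_sample = TruncatedSVD(n_components=100, random_state=42)
X_sample = svd_sample.fit_transform(X[sample_idx])
km_sample = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   random_state=42).fit(X_sample)
```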
Clusters were analyzed by printing the most relevant keywords in the centroids and through WordCloud visualizations. All techniques produced very similar results in terms of centroid information and accuracy, and proved to be much faster than standard K-Means on the original dataset.
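As an illustration of this analysis step, the snippet below prints the top centroid terms and renders a word cloud per cluster. It assumes the `km_full` and `vectorizer` objects from the earlier sketches and is not the project's exact code; for the SVD variants, centroids would first be mapped back to term space with `svd.inverse_transform`.

```python
# Inspect clusters: top TF-IDF terms per centroid and a word cloud per cluster.
from wordcloud import WordCloud

terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km_full.cluster_centers_):
    top = centroid.argsort()[::-1][:10]  # indices of the 10 heaviest terms
    print(f"Cluster {i}: {', '.join(terms[j] for j in top)}")

    # Word cloud weighted by the centroid's term weights
    weights = {terms[j]: centroid[j] for j in top}
    WordCloud(width=800, height=400).generate_from_frequencies(weights).to_file(
        f"cluster_{i}_wordcloud.png"
    )
```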
We obtained one cluster of reviews about pet products and one cluster of reviews about baby products.
More details can be found in the report file.