Topic detection aims to identify the different topics discussed in a corpus of textual documents, describing each topic through a set of keywords that helps us understand it. Many approaches exist for this purpose: probabilistic generative models such as topic models, soft clustering techniques like Gaussian Mixture Models, and hard clustering algorithms like the well-known K-Means. This project implements text clustering with K-Means++ on a large dataset. The goal is to improve the efficiency of the K-Means algorithm without losing effectiveness.
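A minimal sketch of such a pipeline with scikit-learn is shown below; the file name, number of clusters, and vectorizer settings are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch: TF-IDF vectorization + K-Means with k-means++ seeding.
# "reviews.txt" (one review per line) and n_clusters=2 are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open("reviews.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f]

# TF-IDF representation of the reviews
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(docs)

# K-Means with the k-means++ seeding strategy (scikit-learn's default init)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
```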
The dataset is not available in this repository because of its large size. It contains 314808 reviews, mostly in English. Cluster labels were available for evaluation, which was performed using the accuracy metric.
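Since K-Means assigns arbitrary cluster ids, computing accuracy requires mapping clusters onto the ground-truth labels. One common convention (shown here as an assumption; the report may use a different mapping) is a best matching via the Hungarian algorithm:

```python
# Clustering accuracy with cluster-to-label matching (Hungarian algorithm).
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    # Negate the counts so the assignment maximizes correctly matched pairs
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()
```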
We compared the results obtained with four approaches (a sketch of these variants follows the list):
- Standard K-Means on the original dataset, which took about 1 hour of computation
- Dimensionality reduction through Truncated SVD followed by standard K-Means
- Mini-batch K-Means on the original dataset
- Random sampling (10% of the original dataset) with dimensionality reduction and standard K-Means
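The sketch below outlines how these four variants could be set up with scikit-learn. It reuses `X` from the first snippet, and the number of clusters, SVD components, batch size, and sample size are assumptions rather than the project's exact parameters.

```python
# Rough sketch of the four compared variants (parameters are illustrative).
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

k = 2  # assumed number of clusters

# 1) Standard K-Means on the full TF-IDF matrix (slowest variant)
km_full = KMeans(n_clusters=k, init="k-means++", n_init=10,
                 random_state=42).fit(X)

# 2) Truncated SVD (LSA) to a low-dimensional dense space, then K-Means
svd = TruncatedSVD(n_components=100, random_state=42)
X_svd = svd.fit_transform(X)
km_svd = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=42).fit(X_svd)

# 3) Mini-batch K-Means directly on the sparse TF-IDF matrix
mbk = MiniBatchKMeans(n_clusters=k, init="k-means++", batch_size=1024,
                      random_state=42).fit(X)

# 4) Random 10% sample + Truncated SVD + standard K-Means
rng = np.random.default_rng(42)
sample_idx = rng.choice(X.shape[0], size=X.shape[0] // 10, replace=False)
svd_sample = TruncatedSVD(n_components=100, random_state=42)
X_sample = svd_sample.fit_transform(X[sample_idx])
km_sample = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   random_state=42).fit(X_sample)
```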
Clusters were analyzed by printing the most relevant keywords in the centroids and through WordCloud visualizations. All techniques produced very similar results in terms of centroid information and accuracy, and proved to be much faster than standard K-Means on the original dataset.
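As an illustration of this analysis step, the snippet below prints the top centroid terms and renders a word cloud per cluster. It assumes the `km_full` and `vectorizer` objects from the earlier sketches and is not the project's exact code; for the SVD variants, centroids would first be mapped back to term space with `svd.inverse_transform`.

```python
# Inspect clusters: top TF-IDF terms per centroid and a word cloud per cluster.
from wordcloud import WordCloud

terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km_full.cluster_centers_):
    top = centroid.argsort()[::-1][:10]  # indices of the 10 heaviest terms
    print(f"Cluster {i}: {', '.join(terms[j] for j in top)}")

    # Word cloud weighted by the centroid's term weights
    weights = {terms[j]: centroid[j] for j in top}
    WordCloud(width=800, height=400).generate_from_frequencies(weights).to_file(
        f"cluster_{i}_wordcloud.png"
    )
```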
We obtained one cluster of reviews about pet products and one cluster of reviews about baby products.
More details can be found in the report file.