
Text Clustering with DBSCAN

For this project, I implemented the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm from scratch to cluster text data (news records). DBSCAN is an unsupervised, density-based clustering algorithm: it groups points that are closely packed together and marks points in low-density regions as outliers. The input data consists of 8,580 text records in document-term sparse matrix (CSR) format, with no labels provided. For evaluation (leaderboard ranking), clustering solutions are scored with Normalized Mutual Information (NMI), an external index metric for evaluating clustering solutions.
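To make the density-based idea concrete, here is a minimal sketch of DBSCAN over a CSR matrix. It is not the code in `DBSCAN-text-clustering.py`: the function names, the use of cosine distance on L2-normalized rows, and the toy data and parameter values are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

NOISE = -1       # label for outliers
UNVISITED = -2   # internal marker for points not yet processed


def region_query(X, i, eps):
    """Return indices of rows within cosine distance eps of row i.

    Assumes every row of X is L2-normalized, so a dot product is the
    cosine similarity (an assumption of this sketch).
    """
    sims = X.dot(X[i].T).toarray().ravel()
    return np.flatnonzero(1.0 - sims <= eps)


def dbscan(X, eps, min_pts):
    """Label each row of X with a cluster id, or NOISE for outliers."""
    n = X.shape[0]
    labels = np.full(n, UNVISITED)
    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster            # i is a core point: start a cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
        cluster += 1
    return labels


# Toy data (hypothetical): two tight groups of directions plus one outlier.
dense = np.array([
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.05, 0.0],
    [0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [0.05, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
dense /= np.linalg.norm(dense, axis=1, keepdims=True)  # L2-normalize rows
X = csr_matrix(dense)

labels = dbscan(X, eps=0.05, min_pts=2)
print(labels.tolist())
```

On this toy input, the first three rows form one cluster, the next three form a second, and the last row is marked as noise. Each `region_query` call scans every row, so the sketch is quadratic in the number of records. It also omits the preprocessing and dimensionality reduction covered in the report.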

Rank & NMI

My current rank on the CLP public leaderboard is 2nd, with an NMI (normalized mutual information) score of 0.6238.

Update: My rank on the CLP final leaderboard is 3rd, with an NMI score of 0.6249. This score is calculated on the whole test set.

Note: This assignment follows a format similar to Kaggle competitions in terms of ranking and scoring.
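For reference, NMI compares a predicted clustering against ground-truth labels and does not depend on how cluster ids are numbered. The sketch below is a pure-Python illustration on made-up labels. It normalizes by the arithmetic mean of the two entropies, which is an assumption, since the exact variant used by the leaderboard is not stated here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Convention assumed here: two single-cluster labelings count as identical.
    return mi / denom if denom > 0 else 1.0


# Hypothetical labels: same grouping with swapped ids, then an unrelated grouping.
same_grouping = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, unrelated)
```

A clustering that matches the true grouping scores 1 even when its cluster ids are permuted, and a grouping that is statistically independent of the truth scores 0.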

Report

For more implementation details (data preprocessing, dimensionality reduction, model implementation, etc.), see the detailed report for this project, located here: report
