Text Document Classification & Clustering(code of KNN & clustring algorithm from scratch)

Note: before running the code, kindly check the path 'bbcsports' folder in the directory.

DataSet:

Total Documents: 737, Terms: 4613

The data contain BBC-Sports documents related to 5 different sports:

Athletics
Cricket
Football
Rugby
Tennis

Algorithms Implemented

Assignment Objective

This assignment focuses on the tasks of classification and clustering.

Classification

First task is related to document classification. We have discussed three methods for the task of text/document classification. These are Rocchio’s, Naïve Bayesian and KNN. In this assignment you need to work on KNN using VSM model to hold documents and Euclidian distance measure to estimate closeness of the instances in neighborhood. There is no training required for the KNN algorithm, all you need to divide your data into suitable splits of train and test. Using the training data, you need to check the labels of test instances using k=3. From the 5 classes in the dataset you need to create you train and test sets from 737 instances. Using your idea of a good split set you are free to create the train and test set. The evaluation of classification will be performed on Accuracy of classification for the test split. For a detail description of KNN read the chapter 14 of the textbook.

Clustering

For the task of clustering you need to implement a K-means clustering algorithm, assuming all documents are represented by VSM of suitable feature space. There are 4613 features in the dataset. The feature selection is what you need to perform as per your understanding. The evaluation of clustering will be performed by measure purity of the dataset. For a detail description of K-means read the chapter 16 of the textbook.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.ipynb_checkpoints		.ipynb_checkpoints
bbcsport		bbcsport
CS317-IR-A3-Spring 2020.pdf		CS317-IR-A3-Spring 2020.pdf
Q1_Text classification_gui-Input.ipynb		Q1_Text classification_gui-Input.ipynb
Q2_Clustering.ipynb		Q2_Clustering.ipynb
Readme.md		Readme.md
TFIDFdict_allDocs.txt		TFIDFdict_allDocs.txt
TestDocs_allDocs.txt		TestDocs_allDocs.txt
TrainDocs_vector.txt		TrainDocs_vector.txt
clustering output.PNG		clustering output.PNG

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text Document Classification & Clustering(code of KNN & clustring algorithm from scratch)

DataSet:

Total Documents: 737, Terms: 4613

Algorithms Implemented

Assignment Objective

Classification

Clustering

Output:

Clustering

About

Releases

Packages

Languages

EishaMazhar/Text-classification-with-Clustering-

Folders and files

Latest commit

History

Repository files navigation

Text Document Classification & Clustering(code of KNN & clustring algorithm from scratch)

DataSet:

Total Documents: 737, Terms: 4613

Algorithms Implemented

Assignment Objective

Classification

Clustering

Output:

Clustering

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages