In this brief project we're going to explore a few NLP tools using a scikit-learn dataset. The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.
In this project we'll only try to predict four of the dataset's categories:
- alt.atheism
- talk.religion.misc
- comp.graphics
- sci.space
Feel free to check the dataset documentation to learn more about it.
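
As a minimal sketch of how the two subsets could be loaded for just these four categories, the snippet below wraps scikit-learn's `fetch_20newsgroups`. The names `CATEGORIES` and `load_newsgroups` are hypothetical helpers introduced here for illustration, not part of scikit-learn, and the import is placed inside the function only so that defining the helper stays cheap.

```python
# Hypothetical constant: the four categories this project focuses on.
CATEGORIES = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]


def load_newsgroups(subset):
    """Fetch one subset ("train" or "test") restricted to CATEGORIES.

    Returns a scikit-learn Bunch whose .data holds the raw post texts,
    .target the integer labels and .target_names the category names.
    """
    # Imported here so that scikit-learn is only loaded when data is requested.
    from sklearn.datasets import fetch_20newsgroups

    # The first call downloads the dataset and caches it locally.
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES)
```

Calling `load_newsgroups("train")` and `load_newsgroups("test")` would then give the predefined, date-based split described above, so there is no need to split the data by hand.
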

The project covers the following:
- Introduction to the dataset and its exploration
- Bag of words model: what it is and how to apply it
- Exploring the most common words in several ways
- Examining our model's confusion matrix
- Using hashing and TF-IDF: theoretical introduction and application
- A comparison of classifiers