Introduction to Natural Language Processing

Introduction

In this brief project we explore a few NLP tools using a scikit-learn dataset. The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

In this project we only try to predict four categories of the scikit-learn dataset:

alt.atheism
talk.religion.misc
comp.graphics
sci.space
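
As a minimal sketch of how these four categories could be loaded, assuming scikit-learn is installed and the dataset can be downloaded on first use. The helper name `load_subset` is hypothetical and does not come from the notebook:

```python
from sklearn.datasets import fetch_20newsgroups

# The four newsgroups this project works with.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_subset(subset):
    """Fetch one predefined split ("train" or "test") restricted to CATEGORIES.

    Hypothetical helper for illustration; the first call downloads the
    dataset and caches it locally, so it needs network access once.
    """
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES)
```

Calling `load_subset("train")` and `load_subset("test")` returns the two date-based splits. Each result exposes the posts as `.data`, the integer labels as `.target`, and the category names as `.target_names`.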

Feel free to check the dataset documentation to learn more about it.

What you'll find in this repository

Introduction to the dataset and its exploration
Bag of words model: what it is and application
Exploring most common words in several ways
Looking at the confusion matrix of our model
Using Hashing and TF-IDF: theoretical introduction and application
A comparison of classifiers

abhijithrajan/NLP_with_20newsgroups
