This is the code for classifying 20 news group dataset using support vector machine and natural language processing.
Basically we create a data clean function using nltk which removes non-alpha words (like abc1234) or characters, punctuatons and popular names(like John, James using nltk.corpus.names). And then each word is lemmatized ({close, closely, closed, closer} => close ) using WordNetLemmatizer.
- nltk
- sklearn
- numpy
Just run the given jupyter notebook in your browser.