Tweets Sentiment Classification Using PySpark's MlLib NaiveBayes.

sohaibomr/tweet-sentiment-pyspark

tweet-sentiment-pyspark

Tweet sentiment classification using PySpark's NaiveBayes. This project performs sentiment analysis on a tweets dataset with a binary NaiveBayes classification model, using the bag-of-words technique to build the feature vectors that are fed to NaiveBayes.
Tweets are classified as positive (1) or negative (0).
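
To illustrate the idea behind bag of words plus Naive Bayes, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is not the script from this repository: the function names (`tokenize`, `train`, `predict`), the tokenizing regex, and the toy sentences in the usage note are all assumptions made for illustration. It is written for Python 3 and assumes both classes appear in the training data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (illustrative regex)."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples, alpha=1.0):
    """Fit multinomial Naive Bayes on (text, label) pairs with labels 0 or 1.

    alpha is the Laplace smoothing constant. Both classes must be present.
    """
    word_counts = {0: Counter(), 1: Counter()}  # bag-of-words counts per class
    doc_counts = Counter()                      # number of tweets per class
    for text, label in samples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1

    vocab = set(word_counts[0]) | set(word_counts[1])
    total_docs = sum(doc_counts.values())
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label in (0, 1):
        model["log_prior"][label] = math.log(doc_counts[label] / total_docs)
        # Smoothed denominator: class token count plus alpha per vocabulary word.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
    return model


def predict(model, text):
    """Return the label (0 or 1) with the higher log-posterior score."""
    scores = {}
    for label in (0, 1):
        score = model["log_prior"][label]
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are ignored
                score += model["log_likelihood"][label][word]
        scores[label] = score
    return max(scores, key=scores.get)
```

For example, after `model = train([("I love this phone", 1), ("what a great day", 1), ("I hate this traffic", 0), ("this is a terrible day", 0)])`, calling `predict(model, "I hate terrible traffic")` returns `0`. Working in log space avoids numeric underflow when many word probabilities are multiplied.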
The dataset contains 1,578,627 classified tweets; each row is marked 1 for positive sentiment or 0 for negative sentiment. The dataset can be downloaded from this Tweets Dataset.
With this script I achieved up to 60% accuracy on the unseen test dataset.
It took about 20 minutes at most on my i5-2.3u, 4 GB RAM machine to train on 90% of the dataset and test on the remaining 10%.
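
The 90/10 train/test workflow described above could be sketched roughly as follows. This is an assumption-laden sketch, not the repository's script: it uses the DataFrame-based `pyspark.ml` API rather than the RDD-based MLlib API, it targets Python 3, and the column names `text` and `label` as well as the helpers `clean_tweet` and `train_and_evaluate` are hypothetical.

```python
import re


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions and non-letter characters."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def train_and_evaluate(df, seed=42):
    """Train on 90% of df and return accuracy on the remaining 10%.

    df is assumed to be a Spark DataFrame with a string column "text"
    (already cleaned) and a numeric column "label" (0.0 or 1.0).
    Both column names are hypothetical.
    """
    # Imported here so clean_tweet can be used without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import CountVectorizer, Tokenizer

    train_df, test_df = df.randomSplit([0.9, 0.1], seed=seed)

    pipeline = Pipeline(stages=[
        # Split each tweet into lowercase words.
        Tokenizer(inputCol="text", outputCol="words"),
        # Bag of words: one count per vocabulary term.
        CountVectorizer(inputCol="words", outputCol="features"),
        # Multinomial Naive Bayes with Laplace smoothing.
        NaiveBayes(featuresCol="features", labelCol="label",
                   smoothing=1.0, modelType="multinomial"),
    ])

    model = pipeline.fit(train_df)
    predictions = model.transform(test_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(predictions)
```

The cleaning helper would be applied to each tweet before the DataFrame is handed to `train_and_evaluate`; for instance, `clean_tweet("@bob I LOVE this!! http://t.co/x")` gives `"i love this"`. Because `randomSplit` is random, the exact split sizes and the resulting accuracy will vary with the seed.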


Dependencies
  • Apache Spark and pyspark
  • Pandas
  • Python 2.7
TODO
Accuracy could be improved further by lemmatising the dataset and using the word2vec technique, which I am still working on. You can also try different classification models such as Random Forest or SVM, or even deep learning models such as CNNs and RNNs.
