A tag classification data science project using NLP and Stack Overflow posts

QED0711/stack_overflow_nlp

Stack Overflow Tag Predictor

Classifying Posts Using NLP

See here for the companion web app


Authors:

Quinn Dizon
Mindy Zhou


Summary

Using raw text data retrieved from Stack Overflow posts, we predict the main programming language tag for each post.

We begin by performing natural language processing (NLP) using the NLTK library to extract feature data from the raw posts. We then train and measure the accuracy of a number of different machine learning models.
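The feature-extraction step can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than the project's code: the project uses NLTK, whereas here a regular expression stands in for the tokenizer and `STOP_WORDS` is a tiny hypothetical stop-word list. The names `tokenize` and `bag_of_words` are likewise assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "how", "do", "i"}

# Keep word-like tokens, including language names such as "c#" and "c++".
TOKEN_PATTERN = re.compile(r"[a-z][a-z0-9_]*(?:\+\+|#)?")


def tokenize(text):
    """Lowercase the text, split it into tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(text):
    """Count how often each remaining token occurs in the text."""
    return Counter(tokenize(text))


post = "How do I sort a list of tuples in Python? The list is large."
print(bag_of_words(post))
```

Token counts like these are one simple way to turn raw post text into numeric features that a classifier can consume.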

Our top three models were logistic regression, multinomial naive Bayes, and a random forest classifier. Each produced an accuracy score of around 80%. Combining the models in a majority vote raised accuracy to about 83%.
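To make the majority-vote idea concrete, here is a minimal sketch in plain Python. The predictions are made-up data, and the tie-breaking rule (the first model listed wins) is an assumption; the README does not say how ties were resolved or how the vote was implemented. scikit-learn's `VotingClassifier` with `voting="hard"` offers the same idea for fitted estimators.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models for one post.

    Ties go to the label that appears first in `predictions`, so listing the
    strongest model first makes it the tie-breaker (an assumed rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical per-model predictions for four posts (made-up data).
logreg = ["python", "java", "c#", "c++"]
nb = ["python", "javascript", "c#", "java"]
forest = ["java", "java", "c++", "python"]
y_true = ["python", "java", "c#", "c++"]

# Combine the three models post by post, then score the ensemble.
ensemble = [majority_vote(votes) for votes in zip(logreg, nb, forest)]
print(ensemble)
print(accuracy(y_true, ensemble))
```

A vote can beat each individual model when the models make different mistakes, since one model's error on a post can be outvoted by the other two.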

As a secondary analysis, we attempted to perform topic clustering on the processed dataset. The results for this clustering analysis were inconclusive.

Conclusion

Our final conclusion is that, while we are able to get relatively good results in predicting language, topics within or among languages are numerous, share many common words, and are difficult to distinguish.

If you would like to see the final model (logistic regression, 81% accuracy) in action, see our companion web app for this project.

For a visual slide deck summary, see here


Dataset

All data was retrieved directly from Stack Overflow using Google BigQuery.

We limited our dataset to just over 32,000 unique posts, each tagged with one of five of the most popular programming languages:

Java | C# | JavaScript | Python | C++


File Structure

Final Analysis:

Our final, high-level analysis can be found in:

/notebooks/Stack_Overflow_NLP_Summary_Notebook.ipynb


Cleaned Dataset:

The dataset we used in our final analysis can be found in:

/data/final/text_target.pkl
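A minimal sketch for loading this file from the repository root is shown below. The helper name `load_dataset` is hypothetical, and the structure of the stored object is not documented here, so the example only prints its type. If the file holds a pandas object, pandas must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path


def load_dataset(path):
    """Load a pickled object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Path relative to the repository root, as listed above.
    dataset_path = Path("data/final/text_target.pkl")
    if dataset_path.exists():
        data = load_dataset(dataset_path)
        # The object's structure is not assumed; inspect it before use.
        print(type(data))
```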


Primary Classes and Functions

We wrote custom classes and helper functions to handle text preprocessing/NLP and the formation and evaluation of our model pipelines. The code for those classes can be found in the respective folders listed below:

A notebook demonstrating the use of each class can be found in:

/notebooks/class_demonstration.ipynb


Final Report (PDF):

A PDF version of the final report can be found in:

/data/reports/Stack_Overflow_Tag_Predictor.pdf


Acknowledgements

In doing research for this project, we found the following articles very helpful:

Topic Modeling and Latent Dirichlet Allocation (LDA) in Python
A basic exploration of and tutorial for LDA in Python

Gensim Tutorial – A Complete Beginners Guide
A guide to text preprocessing and analysis using the Gensim library
