See here for the companion web app
Using raw text data retrieved from Stack Overflow posts, we predict the main programming language tag for each post.
We begin by performing natural language processing (NLP) using the NLTK library to extract feature data from the raw posts. We then train and measure the accuracy of a number of different machine learning models.
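The feature-extraction step can be illustrated with a minimal sketch. It is not the project's code: it uses only the Python standard library as a stand-in for NLTK, and the `STOPWORDS` set, `TOKEN_PATTERN`, `preprocess`, and `bag_of_words` names are hypothetical. It lowercases a post, tokenizes it, drops stopwords, and counts the remaining words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; NLTK ships a much fuller one.
STOPWORDS = {"a", "an", "and", "do", "how", "i", "in", "is", "of", "the", "to"}

# Letters optionally followed by '+' or '#', so "c++" and "c#" survive as tokens.
TOKEN_PATTERN = re.compile(r"[a-z]+[+#]*")


def preprocess(post: str) -> list[str]:
    """Lowercase a raw post, split it into tokens, and drop stopwords."""
    tokens = TOKEN_PATTERN.findall(post.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def bag_of_words(post: str) -> Counter:
    """Count how often each remaining token occurs in the post."""
    return Counter(preprocess(post))


# Made-up example post, not taken from the dataset.
features = bag_of_words("How do I sort a std::vector in C++? The vector is large.")
```

Keeping `+` and `#` inside tokens is a design choice for this domain: a plain word tokenizer would reduce both "C++" and "C#" to "c", which discards exactly the signal the classifier needs.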
Our top three models were logistic regression, multinomial naive Bayes, and a random forest classifier. Each produced an accuracy score of around 80%. Combining all three in a majority vote raised accuracy to about 83%.
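As a rough illustration of how a majority vote over three models works, here is a standard-library sketch. The function names, the tie-breaking rule, and the toy labels are assumptions for this example and are not the project's results; a library ensemble such as scikit-learn's `VotingClassifier` would normally take this role in a real pipeline.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists into one list by hard (majority) voting.

    Ties go to the earliest model in argument order (an assumption of
    this sketch, not a rule taken from the project).
    """
    combined = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in model order, that reached the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, made up for illustration only.
y_true = ["python", "c++", "c#", "java"]
lr_pred = ["python", "java", "c#", "java"]        # logistic regression
nb_pred = ["python", "c++", "javascript", "c++"]  # multinomial naive Bayes
rf_pred = ["java", "c++", "c++", "java"]          # random forest

ensemble_pred = majority_vote(lr_pred, nb_pred, rf_pred)
ensemble_accuracy = accuracy(y_true, ensemble_pred)
```

With five possible labels and three voters, a three-way disagreement is possible, so some tie-breaking rule is needed. Listing the strongest single model first makes its prediction the fallback.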
As a secondary analysis, we attempted to perform topic clustering on the processed dataset. The results for this clustering analysis were inconclusive.
Our final conclusion is that, while we are able to get relatively good results in predicting language, topics within or among languages are numerous, share many common words, and are difficult to distinguish.
If you would like to see the final model (logistic regression, 81% accuracy) in action, see our companion web app for this project.
For a visual slide deck summary, see here
All data was retrieved directly from Stack Overflow using Google BigQuery.
We limited our dataset to just over 32,000 unique posts, each tagged with one of five of the most popular programming languages:
Java | C# | JavaScript | Python | C++
Our final, high-level analysis can be found in:
The dataset we used in our final analysis can be found in:
We wrote custom classes and helper functions to handle text preprocessing/NLP and the formation and evaluation of our model pipelines. The code for those classes can be found in the respective folders listed below:
A notebook demonstrating the use of each class can be found in:
A PDF version of the final report can be found in:
In doing research for this project, we found the following articles very helpful:
Topic Modeling and Latent Dirichlet Allocation (LDA) in Python
A basic exploration and tutorial for LDA in Python
Gensim Tutorial – A Complete Beginners Guide
A guide to text preprocessing and analysis using the Gensim library