For this project, we scraped data from GitHub repository README files using GitHub's API.
language: Programming language used for repository project
category: The category within the energy sector
repo: The specific repo referenced with that observation
readme_contents: Description of each repository containing keywords used to make predictions
clean_tokes: README content normalized removing any uppercased characters, special characters, non-alpha characters, and alpha strings with 2 or less characters
clean_stemmed: README content reducing each word to its root stem and then removes any stopwords
clean_lemmatized: README content reducing each word to its root word and then removes any stopwords
word_count: The total word count for that observation
The primary goal for this project was to build an NLP model that can predict the primary language of REPO using the text in the README file.
As a secondary goal, we decided to pull repos from a specific sector, or topic, to see if an industry we were interested is utilizing a program language we're familiar with.
Due to the first and second goals, this repo can be used as a means to research an potenial industry you might be interested in entering, and knowing what programming language you'll need to have familiarity with.
A well-documented jupyter notebook that contains your analysis.
Three to four slide deck suitable for a general audience that summarize your findings - including well-labelled visualizations
Some of the questions answered in either the notebook or deck:
-
What are the most common words in READMEs?
-
What does the distribution of IDFs look like for the most common words?
-
Does the length of the README vary by programming language?
-
Do different programming languages use a different number of unique words?
How top performing model used the top 4 languages in the corpus, a TF-IDF Vectorizor, and a Random Forest Classifier. The results were 92.31% accuracy on the train dataset and 60.47% accuracy on the test dataset.
- Python
- Jupyter Notebook
- VS Code
- Various Data Science libraries (Pandas, Numpy, Matplotlib, Seaborn, Sklearn, etc.)
- BeautifulSoup
- nltk
- Stats (Hypothesis testing, correlation tests)
- Classification Models (Decision Tree, Random Forest, KNN)
Github READMEs can be good predictors of the programming languages of the repos. With NLP models language is important, so it isn't a surprise that the longer and more in-depth a README is, the better we can get at predicting the primary language.
Whenever possible, words that are unique to a particular programming language, including the name of that language, make for key identifiers of a language.
As with most data work, NLP modeling is no different, imbalances in your data affect everything, most importantly model performance. To improve our models, we'd want to ensure a better language distribution throughout the corpus, and gathering more observations generally speaking will also possitively impact model performace.
You can use our python files to recreate the work done in our notebook. You'll also need to get create an env.py file that includes your GitHub username and GitHub token.
Be sure to include a .gitignore file that includes that env.py