LanguageAnalysis

I wanted to build something in Scala but didn't know what people typically used it for. GitHub tags each repository with its primary language, so I decided to cluster programming languages by the keywords in their repositories' READMEs, and to do it in Scala. The project has two core components: extraction and analysis. It's a bit of a mess, because I started in Scala and switched to Python partway through, when I began the analysis component.

Extraction

I created a Scala wrapper for the GitHub API.
Using Slick, I wrote an incremental extraction system to fetch results from the API.
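To illustrate what "incremental" means here, the sketch below stores a cursor next to the fetched rows so that a later run resumes where the previous one stopped. It is only an illustration under assumptions: the original was written in Scala with Slick, while this uses Python with the standard-library sqlite3 module, and the table names, `fetch_page`, `fake_fetch`, and `FAKE_REPOS` are all hypothetical stand-ins rather than the project's code or the real API.

```python
import sqlite3


def extract_incrementally(conn, fetch_page, max_pages=10):
    """Fetch pages after the stored cursor, saving rows and the new cursor together."""
    conn.execute("CREATE TABLE IF NOT EXISTS repos (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (k TEXT PRIMARY KEY, last_id INTEGER)")
    row = conn.execute("SELECT last_id FROM checkpoint WHERE k = 'repos'").fetchone()
    last_id = row[0] if row else 0

    for _ in range(max_pages):
        page = fetch_page(last_id)  # hypothetical callable: repos with id > last_id
        if not page:
            break
        last_id = max(repo["id"] for repo in page)
        with conn:  # one transaction per page: rows and cursor are committed together
            conn.executemany(
                "INSERT OR REPLACE INTO repos (id, name) VALUES (?, ?)",
                [(repo["id"], repo["name"]) for repo in page],
            )
            conn.execute(
                "INSERT OR REPLACE INTO checkpoint (k, last_id) VALUES ('repos', ?)",
                (last_id,),
            )
    return last_id


# Hypothetical stand-in for the API: returns up to `size` repos with id > since.
FAKE_REPOS = [{"id": i, "name": f"repo{i}"} for i in range(1, 8)]


def fake_fetch(since, size=3):
    return [repo for repo in FAKE_REPOS if repo["id"] > since][:size]


conn = sqlite3.connect(":memory:")
first_cursor = extract_incrementally(conn, fake_fetch, max_pages=1)  # stops after one page
final_cursor = extract_incrementally(conn, fake_fetch)  # resumes from the checkpoint
```

Committing the cursor in the same transaction as the rows means an interrupted run can be restarted without re-fetching pages that were already stored.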

Analysis

I made a simple Tokenizer that used the Java Wiktionary Library to filter words by part of speech.
I implemented the TextRank algorithm to extract keywords. This algorithm proved to be much slower than tf-idf, so I ended up not using it.
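For readers unfamiliar with TextRank, here is a minimal Python sketch of the keyword-extraction variant: tokens that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores them. This is not the project's Scala implementation; the function name, the default parameters, and the sample tokens are assumptions chosen for illustration.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    # Link each token to the distinct tokens among the next `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Repeated application of the unweighted TextRank update rule.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties fall back to alphabetical order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


# Made-up token stream, already lowercased and filtered.
sample = "scala wrapper github api scala slick api scala database".split()
keywords = textrank_keywords(sample, top_n=2)
```

Each document needs its own graph and many passes over it, whereas tf-idf only needs term counts, which is consistent with the speed difference noted above.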
At this point, I decided to switch to Python to take advantage of its existing libraries. I used NLTK for tokenizing and scikit-learn to vectorize repositories with tf-idf and then cluster them with k-means. I experimented with the parameters to improve the clustering results. Below is a sample, listing the top ten stems for each cluster along with the number of repositories for each language in each cluster. Some clusters were much better defined than others (e.g. the last cluster is clearly about machine learning).
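To make the tf-idf and k-means step concrete, here is a dependency-free Python sketch on made-up stems. It is an illustration only, not the project's scikit-learn code and not its results: the idf formula follows scikit-learn's default smoothed form, but the helper names, the toy documents, and the choice to seed k-means with the first k rows (to keep the run deterministic) are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf rows) for tokenised documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Lloyd's algorithm, seeded with the first k rows to stay deterministic."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows]
        # Move each centroid to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Terms with the largest weight in a cluster centroid."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:n]]


# Made-up, already-stemmed README tokens (hypothetical data).
docs = [
    ["neural", "network", "train", "model"],
    ["web", "server", "http", "request"],
    ["model", "train", "dataset", "neural"],
    ["http", "server", "rout", "web"],
]
vocab, rows = tfidf_vectors(docs)
labels, centroids = kmeans(rows, k=2)
cluster_stems = [top_terms(vocab, c) for c in centroids]
```

Reading the heaviest terms off each centroid is one simple way to produce a per-cluster list of top stems; in practice scikit-learn's `TfidfVectorizer` and `KMeans` handle sparse data and initialisation far better than this toy version.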
