I wanted to build something in Scala, but didn't know what people typically used Scala for. GitHub tags each repository with its primary language, so I had the idea to cluster programming languages by the keywords in repositories' READMEs, and to do it in Scala. The project has two core components: extraction and analysis. It's a bit of a mess, because I started out in Scala and switched to Python when I began working on the analysis component.
- I created a Scala wrapper for the GitHub API.
- Using Slick, I wrote an incremental extraction system to fetch results from the API.
- I made a simple Tokenizer that used the Java Wiktionary Library to filter words by part of speech.
- I implemented the TextRank algorithm to extract keywords. This algorithm proved to be much slower than tf-idf, so I ended up not using it.
- At this point, I decided to switch to Python to take advantage of the libraries it already had. I used nltk for tokenizing, and scikit-learn to vectorize repositories with tf-idf and then cluster them with k-means. I played around with the parameters to improve the clustering results. Below is a sample, listing the top ten stems for each cluster along with the number of repositories for each language in each cluster. Some clusters were much better defined than others (e.g. the last cluster is clearly about machine learning).