DSouzaM/LanguageAnalysis

LanguageAnalysis

I wanted to build something in Scala but didn't know what people typically use Scala for. GitHub tags each repository with its primary language, so I had the idea to cluster programming languages based on keywords in repositories' READMEs, in Scala. The project has two core components: extraction and analysis. It's a bit of a mess because I started out in Scala and switched to Python when I began working on the analysis component.

Extraction

  1. I created a Scala wrapper for the GitHub API.
  2. Using Slick, I wrote an incremental extraction system to fetch results from the API.
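The incremental part of the extraction can be illustrated with a small sketch. The original was written in Scala with Slick; the version below is a Python stand-in that uses `sqlite3`, and everything in it is an assumption made for illustration: the `repos` table layout, the `fetch_incrementally` and `fake_api` names, and the idea of resuming from the highest repository id already stored. A real run would replace `fake_api` with a call to the GitHub API.

```python
import sqlite3


def fetch_incrementally(conn, fetch_page, max_pages=10):
    """Fetch pages of repositories, resuming after the highest id already stored.

    fetch_page(since) is a hypothetical callable that returns a list of dicts
    with "id", "name" and "language" keys for repositories whose id > since.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos "
        "(id INTEGER PRIMARY KEY, name TEXT, language TEXT)"
    )
    for _ in range(max_pages):
        # The checkpoint is simply the largest id we have already saved.
        (last_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM repos"
        ).fetchone()
        page = fetch_page(last_id)
        if not page:
            break  # nothing new to fetch
        conn.executemany(
            "INSERT OR IGNORE INTO repos VALUES (?, ?, ?)",
            [(r["id"], r["name"], r.get("language")) for r in page],
        )
        conn.commit()  # commit per page so an interrupted run can resume
    (count,) = conn.execute("SELECT COUNT(*) FROM repos").fetchone()
    return count


def fake_api(since, page_size=2):
    """Stand-in for the real API: five made-up repositories, two per page."""
    data = [
        {"id": i, "name": f"repo{i}", "language": "Scala"} for i in range(1, 6)
    ]
    return [r for r in data if r["id"] > since][:page_size]
```

Calling `fetch_incrementally` a second time on the same connection picks up where the first call stopped, because the checkpoint is read back from the database rather than held in memory.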

Analysis

  1. I made a simple Tokenizer that used the Java Wiktionary Library to filter words by part of speech.
  2. I implemented the TextRank algorithm to extract keywords. This algorithm proved to be much slower than tf-idf, so I ended up not using it.
  3. At this point, I decided to switch to Python to take advantage of its existing libraries. I used nltk for tokenizing, and scikit-learn to vectorize repositories with tf-idf and then cluster them using k-means. I played around with the parameters to improve the clustering results. Below is a sample, listing the top ten stems for each cluster along with the number of repositories for each language in each cluster. Some clusters were a lot more defined than others (e.g. the last cluster is clearly about machine learning).

Figure: Clustering results on GitHub repository data
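To make the last step concrete, here is a minimal sketch of the tf-idf plus k-means idea using only the Python standard library. It is not the project's code, which used nltk and scikit-learn. The assumptions are: a regex tokenizer with no stemming stands in for nltk, the idf formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1`, the first `k` documents seed the centroids, and the four sample READMEs are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphabetic tokens; a simple stand-in for nltk (no stemming).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=10):
    # Terms with the largest centroid weight characterise the cluster.
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for _, term in ranked[:n]]


# Made-up README snippets: two about machine learning, two about web servers.
readmes = [
    "neural network training model learning",
    "web server http request routing",
    "deep learning model training dataset",
    "http web framework server middleware",
]
vocab, vectors = tfidf_vectors(readmes)
labels, centroids = kmeans(vectors, k=2)
for c in range(2):
    print(c, top_terms(vocab, centroids[c], n=3))
```

Seeding from the first `k` documents keeps the sketch deterministic. On real data the result depends on the initialisation and on the number of clusters, which is part of why the parameters needed tuning.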
