LanguageAnalysis

I wanted to build something in Scala but didn't know what people typically used Scala for. GitHub tags each repository with its primary language, so I had the idea to cluster programming languages based on keywords in repositories' READMEs, and to do it in Scala. The project has two core components: extraction and analysis. It's a bit of a mess because I started out in Scala and switched to Python when I began working on the analysis component.

Extraction

I created a Scala wrapper for the GitHub API.
Using Slick, I wrote an incremental extraction system to fetch results from the API.
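
The idea of incremental extraction can be sketched roughly as follows. This is a minimal illustration in Python with `sqlite3`, not the project's Scala/Slick code: the `fetch_page` function, the `repos` table layout, and the "resume from the highest stored id" strategy are all assumptions made for the example, and the fake in-memory data stands in for real API responses.

```python
import sqlite3


def fetch_page(since_id, page_size=3):
    """Hypothetical stand-in for an API call: return repos with id > since_id."""
    fake_api = [{"id": i, "name": f"repo{i}", "language": "Scala"} for i in range(1, 8)]
    return [repo for repo in fake_api if repo["id"] > since_id][:page_size]


def extract_incrementally(conn, fetch=fetch_page, max_pages=None):
    """Fetch pages until the source is exhausted, resuming from what is already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos (id INTEGER PRIMARY KEY, name TEXT, language TEXT)"
    )
    pages = 0
    while max_pages is None or pages < max_pages:
        # The highest stored id acts as the checkpoint (assumed strategy)
        (last_id,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM repos").fetchone()
        batch = fetch(last_id)
        if not batch:
            break
        conn.executemany(
            "INSERT OR IGNORE INTO repos VALUES (:id, :name, :language)", batch
        )
        conn.commit()
        pages += 1
    return conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]
```

Because the checkpoint is read back from the database on every iteration, an interrupted run can simply be restarted and will continue where it left off.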

Analysis

I made a simple Tokenizer that used the Java Wiktionary Library to filter words by part of speech.
I implemented the TextRank algorithm to extract keywords. This algorithm proved to be much slower than tf-idf, so I ended up not using it.
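
For reference, keyword extraction with TextRank can be sketched as below. This is a simplified Python illustration rather than the Scala implementation from this project: it links words that co-occur within a small window and ranks them with a PageRank-style iteration. The function name, the default window size, damping factor, and iteration count are assumptions chosen for the example.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank tokens by a PageRank-style score over a co-occurrence graph."""
    # Undirected graph: connect tokens that appear within `window` positions
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

The repeated passes over the whole graph for every document are one plausible reason such an approach runs slower than a single tf-idf computation.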
At this point, I decided to switch to Python to take advantage of its existing libraries. I used nltk for tokenizing, and scikit-learn to vectorize repositories with tf-idf and then cluster them using k-means. I played around with the parameters to improve the clustering results. Below is a sample, listing the top ten stems for each cluster along with the number of repositories for each language in each cluster. Some clusters were a lot more defined than others (e.g. the last cluster is clearly about machine learning).
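
As a rough illustration of the tf-idf weighting step described above, here is a small standard-library sketch. It is not the project's code and does not use scikit-learn: it applies the plain textbook formula (term frequency times the log of inverse document frequency), whereas scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers would differ. The function names `tfidf` and `top_terms` are made up for the example.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weighted


def top_terms(weights, k=10):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]
```

Terms that appear in every document get a weight of zero, which is why common words drop out of the top stems while distinctive ones rise.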

DSouzaM/LanguageAnalysis