A frequently asked question on the Apache Spark user email list concerns where to find data sets for evaluating the code. Oddly enough, the collection of archived messages for this email list provides an excellent data set for evaluating Spark capabilities, e.g., machine learning, graph algorithms, text analytics, time-series analysis, etc.
Herein, an open source developer community considers itself algorithmically.
This project shows work-in-progress for how to surface data insights from
the developer email forums for an Apache open source project.
It leverages advanced technologies for natural language processing, machine
learning, graph algorithms, time series analysis, etc.
As an example, we use data from the Apache Spark user email list archives
to help understand its community better.
See these talks about Exsto:
- DataDayTexas 2015 session talk, [Microservices, Containers, and Machine Learning](http://www.slideshare.net/pacoid/microservices-containers-and-machine-learning)
- Scala Days EU 2015 session, [GraphX: Graph analytics for insights about developer communities](http://www.slideshare.net/pacoid/graphx-graph-analytics-for-insights-about-developer-communities)
In particular, we show production use of NLP tooling in Python, integrated with MLlib (machine learning) and GraphX (graph algorithms) in Apache Spark. Machine learning approaches used include: Word2Vec, TextRank, Connected Components, Streaming K-Means, etc.
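To make one of the approaches named above concrete, here is a minimal sketch of the TextRank idea in plain Python: build a co-occurrence graph over the words of a message, then rank the words with a PageRank-style iteration. It is an illustration only, not this project's implementation. The stopword list, the window size, the damping factor, and all function names are assumptions chosen for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "with", "it"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration on an undirected, unweighted graph."""
    if not graph:
        return {}
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's current rank.
        ranks = {
            node: (1.0 - damping) / n
            + damping * sum(ranks[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return ranks


def top_keywords(text, k=3):
    """Return the k highest-ranked words, breaking ties alphabetically."""
    ranks = textrank(build_graph(tokenize(text)))
    return sorted(ranks, key=lambda w: (-ranks[w], w))[:k]
```

Calling `top_keywords(message_text)` on the body of an email returns candidate keywords for that message. Words that co-occur with many different neighbors tend to rank highest.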
Keep in mind that "One Size Fits All" is an anti-pattern, especially for Big Data tools. This project illustrates how to leverage microservices and containers to scale-out the code+data components that do not fit well in Spark, Hadoop, etc.
In addition to Spark, other technologies used include: Mesos, Docker, Anaconda, Flask, NLTK, TextBlob.
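As a rough sketch of the microservice pattern described above, the following wraps a small text-processing function behind an HTTP endpoint. To keep the example free of extra installs it is written as a plain WSGI callable using only the Python standard library, whereas the project itself names Flask for this role. The `/parse` route, the JSON payload shape, and the naive sentence splitter are all assumptions made for illustration.

```python
import json
import re


def split_sentences(text):
    """Naive sentence splitter; a stand-in for a real NLP library call."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def respond(start_response, status, payload):
    """Serialize `payload` as JSON and send it with the given HTTP status."""
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def app(environ, start_response):
    """WSGI app: POST /parse (hypothetical route) with JSON {"text": "..."}."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/parse":
        return respond(start_response, "404 Not Found", {"error": "not found"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        text = payload["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return respond(start_response, "400 Bad Request",
                       {"error": "expected JSON with a text field"})
    return respond(start_response, "200 OK", {"sentences": split_sentences(text)})


# To serve locally (the port is an arbitrary choice):
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8080, app).serve_forever()
```

A service shaped like this can be packaged in a container and scaled independently of the Spark job, which calls it over HTTP for the steps that do not parallelize well inside Spark.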
    conda config --add channels https://conda.binstar.org/sloria
    conda install textblob
    python -m textblob.download_corpora
    python -m nltk.downloader -d ~/nltk_data all
    pip install -U textblob textblob-aptagger
    pip install lxml
    pip install python-dateutil
    pip install Flask
NLTK and TextBlob require some data downloads which may also require updating the NLTK data path:
    import os
    import nltk
    # "~" is not expanded automatically, so resolve it before appending
    nltk.data.path.append(os.path.expanduser("~/nltk_data/"))
To change the project configuration, edit the defaults.cfg file.
    ./scrape.py data/foo.json
    ./parse.py data/foo.json parsed/foo.json
Example data from the Apache Spark email list is available as JSON:
- original: https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json
- parsed: https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json
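For a first look at one of these files, a schema-agnostic sketch like the one below may help. Because the record layout is not documented here, the code assumes nothing about field names. It accepts either a single JSON document or one JSON object per line (both are guesses about the file layout) and reports which top-level keys appear. The function names are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_text(url):
    """Download a URL and decode the response body as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def load_records(text):
    """Parse text as one JSON document, else as one JSON object per line."""
    try:
        data = json.loads(text)
    except ValueError:
        # Fall back to JSON Lines: one record per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def field_counts(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

For example, `field_counts(load_records(fetch_text(url)))` with one of the URLs above gives a quick inventory of the fields available before writing any Spark code against them.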
The word exsto is the Latin verb meaning "to stand out", in its present active form.