There are two components: ChallengeApp & TraversalApp.
To run the solution to the tree traversal problem, compile with sbt:
challenge.git$ sbt compile
Then run TraversalApp with sbt, selecting it at the prompt:
challenge.git$ sbt run
Multiple main classes detected, select one to run:
[1] ChallengeApp
[2] TraversalApp
Enter number: 2
[info] Running TraversalApp
To get started, download and install Scala 2.10, sbt, and Spark 1.6.1.
Download the prebuilt Spark 1.6.1 package from the Apache Spark downloads page.
Move it to the standard installation directory on your machine.
Set the $SPARK_HOME environment variable to this directory.
To build from source, execute the package command from sbt:
challenge.git$ sbt package
Copy the training.tsv and test.tsv files into the resources directory. The expected paths are:
resources/training.tsv
resources/test.tsv
To generate output files, submit the jar you just created with spark-submit. As written, the job runs locally on a single machine.
challenge.git$ $SPARK_HOME/bin/spark-submit target/scala-2.10/interview-challenge_2.10-1.0.jar
The activity types most useful in predicting which user will convert in the future are:
EmailOpen
FormSubmit
EmailClickthrough
WebVisit
PageView
The counts for these activity types were used as features in the statistical model.
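The count features described above can be sketched in plain Scala as follows. This is a minimal illustration, not the project's actual code: the `Activity` record, the `CountFeatures` object, and the assumption that each log row reduces to a user ID and an activity type are all hypothetical.

```scala
// Hypothetical record type; the real column layout of training.tsv is not shown here.
case class Activity(userId: String, activityType: String)

object CountFeatures {
  // Activity types listed above, in a fixed order so every vector has the same layout.
  val FeatureTypes: Seq[String] =
    Seq("EmailOpen", "FormSubmit", "EmailClickthrough", "WebVisit", "PageView")

  // Build one count vector per user; activity types outside FeatureTypes are ignored.
  def countFeatures(activities: Seq[Activity]): Map[String, Seq[Double]] =
    activities.groupBy(_.userId).map { case (userId, events) =>
      val counts = events.groupBy(_.activityType).map { case (t, hits) => t -> hits.size.toDouble }
      userId -> FeatureTypes.map(t => counts.getOrElse(t, 0.0))
    }
}
```

In a Spark job, each resulting sequence could be wrapped with MLlib's `Vectors.dense` before being passed to a classifier.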
The relevant output file can be found here:
output/purchasers.txt
This file contains a list of the userIds most likely to convert, sorted from most likely to least likely.
While this sample code runs on a single node, the driver could easily be modified to operate on a full Spark cluster, whether standalone or Hadoop-based.
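One way such a change might look is sketched here, assuming the driver builds its own SparkConf (the actual driver code is not shown in this README). The idea is to fall back to a local master only when none is supplied, so that the `--master` option of spark-submit decides where the job runs. The object name and application name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver skeleton; requires Spark 1.6.x on the classpath.
object ClusterDriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("interview-challenge")
      // Use all local cores only if spark-submit did not provide a master.
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      println(s"Running against master: ${sc.master}")
      // The existing feature extraction and model training would go here.
    } finally {
      sc.stop()
    }
  }
}
```

With a driver written this way, passing `--master spark://<host>:7077` to spark-submit would target a standalone cluster, while omitting the option falls back to local mode.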
This represents an initial attempt to predict whether users will make a purchase based on their behavior. The following approaches are also worth considering:
- Incorporating the sequence of activity types rather than simple counts.
- Incorporating days between events. This data set could be formed into a time-series structure, which would preserve the time dimension included in the logs, not just the sequence of events.
- Comparing additional classifiers. For this example, logistic regression and random forest classifiers were compared. Random forest was slightly more accurate on a random sample of the training set, but the two performed very similarly. It may be worthwhile to evaluate other types of binary classifiers, which MLlib's pipeline structure makes straightforward.
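The first two ideas in the list above could be prototyped in plain Scala before moving them into Spark. The sketch below is illustrative only: the `Event` record, its `epochDay` field (days since some fixed origin), and the `SequenceFeatures` object are assumptions, and both functions expect the events of a single user.

```scala
// Hypothetical event record; field names and the day-based timestamp are assumptions.
case class Event(userId: String, activityType: String, epochDay: Long)

object SequenceFeatures {
  // Count consecutive activity-type pairs (bigrams) in time order.
  def bigramCounts(events: Seq[Event]): Map[(String, String), Int] = {
    val ordered = events.sortBy(_.epochDay).map(_.activityType)
    ordered.zip(ordered.drop(1))
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }
  }

  // Days elapsed between consecutive events, in time order.
  def dayGaps(events: Seq[Event]): Seq[Long] = {
    val days = events.map(_.epochDay).sorted
    days.zip(days.drop(1)).map { case (prev, next) => next - prev }
  }
}
```

Bigram counts keep some ordering information at low cost, while the day gaps could feed summary features such as the mean or most recent interval between events.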