texasmichelle/challenge-purchase-predict

Interview Challenge

There are two components: ChallengeApp & TraversalApp.

TraversalApp

To run the solution to the tree traversal problem, compile with sbt:

challenge.git$ sbt compile

Run the TraversalApp with sbt:

challenge.git$ sbt run  
Multiple main classes detected, select one to run:  
 [1] ChallengeApp  
 [2] TraversalApp  
Enter number: 2  
[info] Running TraversalApp  

ChallengeApp

To get started, download and install Scala 2.10, sbt, and Spark 1.6.1.

Download and install Spark

Download the prebuilt Spark 1.6.1 package from the Download Spark page.
Move it to the standard installation directory on your machine.
Set the $SPARK_HOME environment variable to this directory.

Compilation

To build from source, execute the package command from sbt:

challenge.git$ sbt package

Input files

Copy the training.tsv and test.tsv files into the resources directory. The expected path is:

resources/training.tsv  
resources/test.tsv

Execution

To generate the output files, submit the jar you just built to Spark. The job runs locally on a single machine.

challenge.git$ $SPARK_HOME/bin/spark-submit target/scala-2.10/interview-challenge_2.10-1.0.jar

Results

Activity Types

The activity types most useful in predicting which users will convert in the future are:

EmailOpen  
FormSubmit  
EmailClickthrough  
WebVisit  
PageView

The counts for these activity types were used as features in the statistical model.
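
As a rough illustration of what "counts as features" can look like, the sketch below turns a list of activity records into one count vector per user. It uses plain Scala collections rather than Spark so that it stands alone. The `Activity` case class, its field names, and the `ActivityCounts` object are hypothetical, since the column layout of training.tsv is not described here, and this is not necessarily how ChallengeApp builds its features.

```scala
// Hypothetical record shape: the real column layout of training.tsv is not
// described in this README, so these field names are assumptions.
case class Activity(userId: String, activityType: String)

object ActivityCounts {
  // The five activity types listed above, in a fixed order so that every
  // user's feature vector has the same layout.
  val featureTypes: Seq[String] =
    Seq("EmailOpen", "FormSubmit", "EmailClickthrough", "WebVisit", "PageView")

  // Build one count vector per user. Activity types that are not in
  // featureTypes are ignored.
  def countFeatures(activities: Seq[Activity]): Map[String, Seq[Double]] =
    activities.groupBy(_.userId).map { case (userId, events) =>
      // Number of events of each type for this user.
      val counts = events.groupBy(_.activityType).map { case (t, es) => t -> es.size }
      // Missing types become 0.0 so all vectors have five entries.
      userId -> featureTypes.map(t => counts.getOrElse(t, 0).toDouble)
    }
}
```

Keeping the feature order fixed means every user maps to a vector of the same length, which is what a classifier expects as input.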

Conversion

The relevant output file can be found here:

output/purchasers.txt  

This file contains a list of the userIds most likely to convert, sorted from most likely to least likely.

Scalability

While this sample code runs on a single node, the driver could easily be modified to operate on a full Spark cluster, whether standalone or Hadoop-based.

Future considerations

This represents an initial attempt to predict whether users will make a purchase based on their behavior. The following approaches are also worth considering:

  • Incorporating sequence rather than simple counts of activity types.
  • Incorporating days between events. This data set could be formed into a time-series structure, which would preserve the time dimension included in the logs, not just the sequence of events.
  • Trying other classifiers. For this example, logistic regression and random forest classifiers were compared. Random forest was slightly more accurate on a random sample of the training set, but the two performed very similarly. It may be worthwhile to evaluate additional types of binary classifiers. MLlib's pipeline structure makes swapping them in straightforward.
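
The first two ideas in the list above (event order and days between events) could be prototyped along the following lines. This is a minimal sketch in plain Scala, not part of the existing code. The `TimedActivity` case class, the epoch-millisecond timestamp field, and the `TemporalFeatures` object are all assumptions made for illustration, because the timestamp format in the logs is not described here.

```scala
// Hypothetical record shape: the timestamp field and its unit (epoch
// milliseconds) are assumptions, not taken from the data files.
case class TimedActivity(userId: String, activityType: String, timestampMs: Long)

object TemporalFeatures {
  private val MsPerDay: Double = 24 * 60 * 60 * 1000.0

  // Order one user's events by time and return the activity types in sequence.
  def sequence(events: Seq[TimedActivity]): Seq[String] =
    events.sortBy(_.timestampMs).map(_.activityType)

  // Days elapsed between each pair of consecutive events for one user.
  // Fewer than two events yields an empty result.
  def daysBetween(events: Seq[TimedActivity]): Seq[Double] = {
    val times = events.map(_.timestampMs).sorted
    times.zip(times.drop(1)).map { case (earlier, later) =>
      (later - earlier) / MsPerDay
    }
  }

  // Compute both the ordered sequence and the gaps for every user.
  def byUser(activities: Seq[TimedActivity]): Map[String, (Seq[String], Seq[Double])] =
    activities.groupBy(_.userId).map { case (userId, events) =>
      userId -> ((sequence(events), daysBetween(events)))
    }
}
```

Both outputs vary in length from user to user, so they would still need to be summarized (for example, mean gap or most recent activity type) or padded before being passed to a classifier.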
