Project status: 1.0 release
This project collects data on Amazon reviewers in an attempt to determine which reviewers write consistently bad reviews. Using a Chrome extension, a shopper can select criteria for judging poor reviewers and hide any reviews by those reviewers. Current criteria are:
- Reviewer consistently gives low reviews.
- Reviewer consistently gives high reviews.
- Reviewer consistently writes extremely short reviews.
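
As a rough illustration, the three criteria could be checked from a reviewer's totals as sketched below. This is only a sketch: the thresholds, the field names, and the function `matches_criteria` are hypothetical and do not come from the project, and averages are used here as a simple stand-in for "consistently".

```python
from dataclasses import dataclass


@dataclass
class ReviewerTotals:
    """Per-reviewer totals (field names are illustrative, not the project's)."""
    reviews: int  # number of reviews written
    stars: int    # sum of star ratings across all reviews
    words: int    # sum of word counts across all reviews


# Hypothetical thresholds; the project does not document its actual values.
LOW_STAR_AVG = 2.0
HIGH_STAR_AVG = 4.5
SHORT_WORD_AVG = 10.0


def matches_criteria(t, hide_low=True, hide_high=True, hide_short=True):
    """Return the names of the enabled criteria that this reviewer matches."""
    if t.reviews == 0:
        return []  # no reviews on record, so there is nothing to judge
    avg_stars = t.stars / t.reviews
    avg_words = t.words / t.reviews
    matched = []
    if hide_low and avg_stars <= LOW_STAR_AVG:
        matched.append("low")
    if hide_high and avg_stars >= HIGH_STAR_AVG:
        matched.append("high")
    if hide_short and avg_words <= SHORT_WORD_AVG:
        matched.append("short")
    return matched
```

A reviewer would be hidden when the returned list is non-empty; each `hide_*` flag corresponds to one criterion a shopper might switch on or off.
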
The project pipeline is:
- Data comes out of S3 in .gz files.
- It is processed through Spark:
  - Imported into an RDD.
  - Stepped through to total the numbers of stars and words in reviews.
  - Reduced by key to total these.
  - Counted by key.
  - Key counts joined to the other counts.
- Goes into a Redis database as key:value pairs, where the key is the user ID and the value is a hash with fields for the number of reviews, number of stars, and number of words.
- The Chrome extension:
  - Sets a default state.
  - Does a DOM pass and collects a list of all users on the page.
  - Watches for updates to the popup.
  - On update:
    - Queries the Apache server to get counts for each user_id.
    - Does quick math to get average star ratings and review lengths.
    - Updates the DOM to hide users who match the criteria set in the popup.
- WSGI running on Apache:
  - Checks the GET request for bad values.
  - Queries Redis.
  - Returns JSON to the Chrome extension.
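
The Spark steps above amount to a per-user aggregation. Here is a minimal sketch of that aggregation in plain Python, with no Spark or Redis involved. The record layout `(user_id, stars, text)` and the function name `aggregate_reviews` are assumptions made for illustration, not the project's actual schema or code.

```python
from collections import defaultdict


def aggregate_reviews(reviews):
    """Total the number of reviews, stars, and words per user.

    `reviews` is an iterable of (user_id, stars, text) tuples; this layout
    is assumed for illustration only.
    """
    totals = defaultdict(lambda: {"reviews": 0, "stars": 0, "words": 0})
    for user_id, stars, text in reviews:
        entry = totals[user_id]
        entry["reviews"] += 1                # corresponds to "count by key"
        entry["stars"] += stars              # "reduce by key": star total
        entry["words"] += len(text.split())  # "reduce by key": word total
    return dict(totals)


# Tiny made-up sample, not real review data.
sample = [
    ("A1", 5, "Great product"),
    ("A1", 4, "Works well enough"),
    ("B2", 1, "Bad"),
]
result = aggregate_reviews(sample)
```

Each per-user dictionary carries the same three fields as the Redis hash described in the pipeline, so dividing `stars` or `words` by `reviews` gives the averages the extension needs.
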
There are three sets of source:
- `src/spark/spark_run.py`, which contains all the code for the Spark pipeline: pull from S3, process, push to Redis.
- `src/wsgi script/wsgy.py`, which runs on a remote server and presents an API to the outside world. It takes in a GET request, queries the Redis server, and returns JSON.
- `src/Review-hide_extension/`, which is a set of JavaScript, HTML and CSS files for the Chrome extension.
For each of these source files, please see the respective README files in the src folders for pseudocode.
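
To give a rough picture of the server-side piece (GET request in, lookup, JSON out), here is a minimal WSGI sketch. It is not the project's code: the `user_id` query parameter, the alphanumeric validation rule, and the `make_app` helper are all assumptions, and a plain dictionary stands in for the Redis client so the example runs on its own.

```python
import json
from urllib.parse import parse_qs


def make_app(store):
    """Build a WSGI app around `store`, a dict-like stand-in for Redis.

    `store` maps user_id -> {"reviews": ..., "stars": ..., "words": ...}.
    """
    def application(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        user_ids = params.get("user_id", [])  # hypothetical parameter name
        headers = [("Content-Type", "application/json")]
        # Reject missing or suspicious IDs (assumed rule: alphanumeric only).
        if not user_ids or not all(u.isalnum() for u in user_ids):
            start_response("400 Bad Request", headers)
            return [json.dumps({"error": "bad user_id"}).encode("utf-8")]
        # Unknown users map to None, which serializes as JSON null.
        body = {u: store.get(u) for u in user_ids}
        start_response("200 OK", headers)
        return [json.dumps(body).encode("utf-8")]
    return application
```

Under Apache with mod_wsgi, the module-level callable is conventionally named `application`, so a deployment along these lines might end with `application = make_app(store)`, where `store` wraps the real Redis connection.
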

Directory | Description of Contents |
---|---|
src | main code base |
src/Review-hide_extension | client-side code for Chrome extension |
src/spark | module that runs Spark specifically |
src/wsgi script | WSGI server-side code for Apache |
run.sh | shell script to run source |
test | profiling and prototyping code |