Program that performs textual analysis of Reddit data (approx. 300 GB) preprocessed by another team member. Uses Hadoop MapReduce to classify comments as either positive or negative based on keywords, negation, etc.

dboston1/Reddit-Sentiment-Analysis

Reddit-Sentiment-Analysis

Two-part MapReduce program that performs textual analysis of Reddit data (approx. 300 GB of JSON data) preprocessed by another team member. It runs sentiment analysis on Reddit comments that preprocessing identified as discussing Donald Trump, Hillary Clinton, or both, and then summarizes the results.

The preprocessing is assumed to have already screened comments by date and topic (Trump and Clinton). Per the project specifications, we limited our scope to comments made from July 19, 2016 through November 8, 2016.
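
For readers who want to reproduce the date screening, here is a minimal sketch of the scope check in plain Java. It is not the preprocessing code used in this project. It assumes each comment carries a Unix epoch timestamp in seconds and that days are counted in UTC; the class and method names (`DateWindow`, `inScope`) are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

class DateWindow {
    // Project scope stated above: July 19, 2016 through November 8, 2016 (both days included).
    static final LocalDate START = LocalDate.of(2016, 7, 19);
    static final LocalDate END = LocalDate.of(2016, 11, 8);

    // Assumption: the timestamp is Unix epoch seconds and the calendar day is taken in UTC.
    static boolean inScope(long epochSeconds) {
        LocalDate day = Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate();
        return !day.isBefore(START) && !day.isAfter(END);
    }

    public static void main(String[] args) {
        long lastDay = END.atStartOfDay(ZoneOffset.UTC).toEpochSecond();
        System.out.println(inScope(lastDay));          // midnight UTC on the last day in scope
        System.out.println(inScope(lastDay + 86_400)); // exactly one day later
    }
}
```

If the real timestamps use a different time zone or format, only the conversion inside `inScope` would need to change.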

PART 1:

TO COMPILE PROGRAM:

$ mkdir build

$ $HADOOP_HOME/bin/hadoop com.sun.tools.javac.Main *.java -d build -Xlint

$ jar -cvf SentimentAnalysis.jar -C build/ .

$ rm -r build

TO RUN:

This assumes you have all text files (ExampleInput.txt, negate-words.txt, pos-words.txt, and neg-words.txt) in the /sentimentAnalysis directory in HDFS. Modify the paths to reflect any differences.

$ $HADOOP_HOME/bin/hadoop jar SentimentAnalysis.jar org.SentimentAnalysis.Driver /sentimentAnalysis/ExampleInput.txt /sentimentAnalysis/out -negation /sentimentAnalysis/negate-words.txt -pos /sentimentAnalysis/pos-words.txt -neg /sentimentAnalysis/neg-words.txt

As-is, the command reads /sentimentAnalysis/ExampleInput.txt, runs the program, and stores the results in /sentimentAnalysis/out. To process a directory of input files instead, replace /sentimentAnalysis/ExampleInput.txt with /your-HDFS-Directory/
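
To give a feel for how the three word lists could work together, here is a rough sketch of keyword scoring with negation in plain Java, without Hadoop. It is not the project's mapper. The rule that a negation word flips only the very next word, the `neutral` result for ties, the tiny word sets, and the names `SentimentSketch`, `score`, and `classify` are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SentimentSketch {
    // Hypothetical stand-ins for the contents of pos-words.txt, neg-words.txt, and negate-words.txt.
    static final Set<String> POS = new HashSet<>(Arrays.asList("good", "great", "love"));
    static final Set<String> NEG = new HashSet<>(Arrays.asList("bad", "awful", "hate"));
    static final Set<String> NEGATE = new HashSet<>(Arrays.asList("not", "never", "no"));

    // Returns a score: above 0 leans positive, below 0 leans negative, 0 is undecided.
    static int score(String comment) {
        int score = 0;
        boolean negated = false;
        for (String token : comment.toLowerCase().split("[^a-z']+")) {
            if (token.isEmpty()) continue;
            if (NEGATE.contains(token)) {
                negated = true;
                continue;
            }
            int polarity = POS.contains(token) ? 1 : NEG.contains(token) ? -1 : 0;
            if (polarity != 0) {
                score += negated ? -polarity : polarity;
            }
            negated = false; // assumed rule: a negation only affects the very next word
        }
        return score;
    }

    static String classify(String comment) {
        int s = score(comment);
        return s > 0 ? "positive" : s < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        System.out.println(classify("This is not good"));
        System.out.println(classify("I love it"));
    }
}
```

In a MapReduce job, logic like `classify` would typically sit inside the mapper, with the word lists loaded once during setup.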

PART 2:

Part 2 takes the output from Part 1 and summarizes the data. It is hardcoded to use the partitions defined in details.md, but could easily be altered to read partition data from a file instead.
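
As a rough illustration of the summarizing step, the sketch below counts positive and negative labels per key in plain Java. It is not the Part 2 reducer: the Part 1 output format and the partitions in details.md are not shown here, so the tab-separated `key<TAB>label` line format, the sample keys, and the names `SummarySketch` and `summarize` are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SummarySketch {
    // Assumed input: one "key<TAB>label" line per comment, where label is "positive" or "negative".
    // Returns key -> {positiveCount, negativeCount}, sorted by key.
    static Map<String, int[]> summarize(List<String> lines) {
        Map<String, int[]> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) continue; // skip malformed lines
            int[] c = counts.computeIfAbsent(parts[0], k -> new int[2]);
            if (parts[1].equals("positive")) {
                c[0]++;
            } else if (parts[1].equals("negative")) {
                c[1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; real keys would come from the Part 1 output.
        List<String> sample = Arrays.asList(
            "partitionA\tpositive", "partitionA\tnegative",
            "partitionB\tnegative", "partitionA\tnegative");
        summarize(sample).forEach((key, c) ->
            System.out.println(key + "\tpositive=" + c[0] + "\tnegative=" + c[1]));
    }
}
```

Reading partition definitions from a file would amount to mapping each record to its key before this counting step.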

TO COMPILE PROGRAM:

$ mkdir build

$ $HADOOP_HOME/bin/hadoop com.sun.tools.javac.Main *.java -d build -Xlint

$ jar -cvf Summary.jar -C build/ .

$ rm -r build

TO RUN:

$ $HADOOP_HOME/bin/hadoop jar Summary.jar org.Summary.Driver /sentimentAnalysis/out /sentimentAnalysis/summary
