First, clone the source code to your local machine and install all the dependencies with the command pip3 install -r requirements.txt, assuming Python 3 is used.
Use the command python3 prediction.py to generate the prediction results, which are stored in answer.txt.
If there are any errors with the nltk package when running the code, please try installing the suggested additional dependencies to resolve the issue. We also welcome our reviewers to schedule a live demo.
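For example, missing nltk data packages can usually be installed from within Python; which resources are needed depends on the error message, so 'punkt' below is only an illustration:

```python
import nltk

# If nltk complains about a missing resource, downloading it usually
# resolves the error; substitute the resource named in the message.
nltk.download('punkt')
```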
As we planned in the project proposal, the first method we tried is a classifier based on Naive Bayes. The equation to compute the probability is

$$P(C \mid W_1, \dots, W_n) \propto P(C) \prod_{i=1}^{n} P(W_i \mid C)$$

where $C$ is either "SARCASM" or "NOT_SARCASM", $W_1, \dots, W_n$ are the words of a tweet, and the class with the higher probability is predicted.
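As a minimal sketch of this decision rule (not the exact code in naive_bayes.py; prior and p_dict are hypothetical inputs), the product is usually evaluated in log space:

```python
import math

def score(tweet_words, prior, p_dict):
    # Evaluate log(P(C)) + sum_i log(P(W_i | C)); summing logs avoids
    # floating-point underflow on long tweets.
    log_p = math.log(prior)
    for w in tweet_words:
        # Unseen words fall back to the UNK probability introduced by
        # the Laplace smoothing described below.
        log_p += math.log(p_dict.get(w, p_dict['UNK']))
    return log_p
```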
We used Laplace smoothing when calculating the probability of every single key in the "SARCASM" and "NOT_SARCASM" dictionaries. The equations for calculating the probabilities are

$$P(W \mid D) = \frac{Count(W) + \alpha}{N_D + \alpha (V + 1)} \quad \text{and} \quad P(UNK \mid D) = \frac{\alpha}{N_D + \alpha (V + 1)}.$$

UNK stands for the words that we have not seen in the training data. D stands for the dictionary whose word probabilities are being calculated; it can be either the "SARCASM" dictionary or the "NOT_SARCASM" dictionary. $\alpha$ stands for the Laplace smoothing parameter we set before training, which defaults to 1.0. Count(W) stands for the number of times a specific word W appeared in the training data. V stands for the size of the corresponding dictionary, and $N_D$ stands for the total number of word occurrences in D.
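A minimal sketch of how these smoothed probabilities could be computed (the function name is hypothetical, and the exact bookkeeping in naive_bayes.py may differ):

```python
def laplace_probabilities(counts, alpha=1.0):
    # counts maps each word W to Count(W) for one dictionary D.
    n_d = sum(counts.values())       # N_D: total word occurrences in D
    v = len(counts)                  # V: size of the dictionary
    denom = n_d + alpha * (v + 1)    # the +1 reserves mass for UNK
    probs = {w: (c + alpha) / denom for w, c in counts.items()}
    probs['UNK'] = alpha / denom
    return probs
```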
reader.py provides helpers to load the datasets into proper data structures to be used by the algorithm:
loadFile(name,stemming,sarcasm,training): The helper function to load training data and test data. The parameter "name" indicates the path to the data file. The parameter "sarcasm", a boolean, indicates whether the training data is labelled "SARCASM" or "NOT_SARCASM". The parameter "training", a boolean, indicates whether the input file contains training or test data. "stemming" is provided as an optional parameter to enable stemming. It returns a list containing the tweets.
load_dataset(train_dir,dev_dir,stemming): It loads the data using loadFile() and forms structures that can be used by the Naive Bayes algorithm. It returns lists indicating the label of each data entry (see the usage sketch below).
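As a hypothetical illustration of calling these helpers (the paths and the unpacked return values are assumptions, not the actual repository layout):

```python
# Load one labelled portion of the training data, without stemming.
train_sarcasm = loadFile('data/train_sarcasm.txt', stemming=False,
                         sarcasm=True, training=True)

# Load everything at once for the Naive Bayes code.
train_set, train_labels, dev_set = load_dataset('data/train', 'data/dev',
                                                stemming=False)
```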
main(args): A wrapper function that extracts the data, runs the Naive Bayes algorithm, and outputs the prediction results using the functions provided by reader.py and naive_bayes.py.
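A minimal sketch of this flow, assuming the attribute names on args and the one-label-per-line format of answer.txt:

```python
def main(args):
    # Load data, run Naive Bayes, and write one predicted label per line.
    train_set, train_labels, dev_set = load_dataset(args.train_dir,
                                                    args.dev_dir,
                                                    args.stemming)
    predictions = naiveBayes(train_set, train_labels, dev_set,
                             smoothing_parameter=1.0)
    with open('answer.txt', 'w') as f:
        for label in predictions:
            f.write(str(label) + '\n')
```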
naiveBayes(train_set, train_labels, dev_set, smoothing_parameter): The wrapper function for our implementation of the Naive Bayes algorithm.
do_unigram(train_set, train_labels, dev_set, smoothing_parameter): It applies the unigram model and predicts the labels of the tweets (a sketch follows this list).
get_probability(tweet,p_dict): It calculates the sum of the probabilities of a tweet's words under a given probability dictionary.
get_probability_dict(some_dict,word_count,smoothing_parameter): It calculates the probability of every single key in the "SARCASM" and "NOT_SARCASM" dictionaries, including the UNK entry for words that we have not seen in the training data. We used the equations in the section above to calculate the probabilities.
get_dicts(train_set, train_labels): It creates the "SARCASM" dictionary and the "NOT_SARCASM" dictionary and stores the counts of the occurrences of all the "SARCASM" words and all the "NOT_SARCASM" words.
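A minimal sketch of how these functions fit together in do_unigram (the return shapes of get_dicts and the label strings are assumptions):

```python
def do_unigram_sketch(train_set, train_labels, dev_set, smoothing_parameter):
    # Build per-class counts, convert them to smoothed probabilities,
    # then pick the class with the higher score for each dev tweet.
    sarcasm_counts, not_counts = get_dicts(train_set, train_labels)
    p_sar = get_probability_dict(sarcasm_counts,
                                 sum(sarcasm_counts.values()),
                                 smoothing_parameter)
    p_not = get_probability_dict(not_counts,
                                 sum(not_counts.values()),
                                 smoothing_parameter)
    labels = []
    for tweet in dev_set:
        is_sarcasm = get_probability(tweet, p_sar) > get_probability(tweet, p_not)
        labels.append('SARCASM' if is_sarcasm else 'NOT_SARCASM')
    return labels
```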
Based on the training data given, we first tried to predict the labels using only the response tweets, without the context tweets, as our dictionary. The accuracy was around 0.69. Then we included the context tweets in our dictionary, and this time the accuracy beat the baseline. We tried tuning the Laplace smoothing parameter, and it turned out that the default value (1.0) gave the highest accuracy.
We also tried stemming using the PorterStemmer of the nltk package, but it did not improve the accuracy considerably.
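For reference, the nltk PorterStemmer is used like this:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stemmer.stem('running')  # -> 'run'
```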
We worked through the design of the algorithms, the coding, and the documentation together.