This repository contains the replication project for the Winter 2020 Data Reproducibility course in the Master of Data Science program at the University of Washington.
Recently, considerable research effort has gone into identifying abusive or offensive content on online platforms and social media. A relatively large and reliable dataset, ‘Hate and Abusive Speech on Twitter’, was recently published. As data scientists, we understand the need to find the best methods and data for identifying such content and flagging it as inappropriate.
In this repository, we aim to replicate some of the findings of a research paper that presents a comparative study of methods for classifying hate and abusive speech using Twitter data and suggests additional features and data for improving such classification. Using the data and code provided by the authors, we aim to replicate the efficacy and accuracy of the Logistic Regression model presented in the paper. The original paper compared five machine learning and deep learning algorithms. For our replication, we chose the Logistic Regression model with word-level features because the authors state that it outperformed all the other machine learning techniques and achieved an F1-score equivalent to that of the best CNN model. We also had limited computational resources, which put the other machine learning and deep learning models out of scope.
Citation: Lee, Y., Yoon, S., & Jung, K. (2018). Comparative studies of detecting abusive language on twitter. arXiv preprint arXiv:1808.10245.
URL: https://arxiv.org/abs/1808.10245
Git Repository: https://github.com/younggns/comparative-abusive-lang/blob/master/README.md
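
For readers unfamiliar with the F1-score used above to compare models, the following is a minimal sketch in R of how a per-class F1-score can be computed from true and predicted labels. The helper `f1_score` and the toy label vectors are hypothetical and are not taken from the paper's code or from this project's scripts; the paper's exact averaging scheme across classes is not reproduced here.

```r
# Hypothetical helper (not from the paper or this repository):
# F1-score for a single class, given true and predicted label vectors.
f1_score <- function(truth, predicted, positive) {
  tp <- sum(predicted == positive & truth == positive)  # true positives
  fp <- sum(predicted == positive & truth != positive)  # false positives
  fn <- sum(predicted != positive & truth == positive)  # false negatives

  precision <- if (tp + fp > 0) tp / (tp + fp) else 0
  recall    <- if (tp + fn > 0) tp / (tp + fn) else 0

  # F1 is the harmonic mean of precision and recall.
  if (precision + recall == 0) return(0)
  2 * precision * recall / (precision + recall)
}

# Toy labels for illustration only; these are not real data.
truth     <- c("abusive", "normal", "hateful", "abusive", "normal", "spam")
predicted <- c("abusive", "normal", "abusive", "abusive", "spam",   "spam")

abusive_f1 <- f1_score(truth, predicted, positive = "abusive")
print(abusive_f1)
```

In this toy example, the "abusive" class has precision 2/3 and recall 1, so its F1-score is 0.8.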
All data files required for our replication project can be found in the 'data' directory in this repository. URL: https://github.com/UW-MSDS-DATA-598-Reproducibility-WI20/goel-modi-moroney-ramprasad-replication-project/tree/master/Data
This directory contains details about the original data used by the authors of the paper, as well as the data that was sampled and processed for this replication study. Please refer to the README.md in the data directory for additional details.
The analysis directory contains the R Markdown report detailing the procedure and results of this replication study. This directory also contains the intermediate outputs, R scripts, data and images required to knit the R Markdown report successfully. For additional details, please refer to the README.md in this directory. URL: https://github.com/UW-MSDS-DATA-598-Reproducibility-WI20/goel-modi-moroney-ramprasad-replication-project/tree/master/analysis
OS type and version: Windows 10 Pro, Version 1903, OS build 18362.535
System type: 64-bit OS, x64-based processor
R version: >=3.6.2
R packages and versions:
R Package | Version |
---|---|
caret | 6.0-84 |
future | 1.16.0 |
tm | 0.7-7 |
quanteda | 1.5.2 |
LiblineaR | 2.10-8 |
stringr | 1.4.0 |
here | 0.1 |
ggplot2 | 3.2.1 |
wordcloud | 2.6 |
bookdown | 0.17 |
dplyr | 0.8.3 |
knitr | 1.28 |
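
Before knitting the report, it may help to compare your local environment against the versions listed above. The following is a minimal sketch in R, assuming the CRAN package names `caret` and `LiblineaR` for the first and fifth table entries; the `required` vector, the `installed_version` helper and the `report` data frame are illustrative names and are not part of this repository's scripts.

```r
# Versions listed in this README (package names as published on CRAN).
required <- c(
  caret = "6.0-84", future = "1.16.0", tm = "0.7-7", quanteda = "1.5.2",
  LiblineaR = "2.10-8", stringr = "1.4.0", here = "0.1", ggplot2 = "3.2.1",
  wordcloud = "2.6", bookdown = "0.17", dplyr = "0.8.3", knitr = "1.28"
)

# Hypothetical helper: installed version as a string, or NA if not installed.
installed_version <- function(pkg) {
  if (nzchar(system.file(package = pkg))) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

report <- data.frame(
  package   = names(required),
  listed    = unname(required),
  installed = vapply(names(required), installed_version, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# TRUE if the running R version satisfies the ">=3.6.2" requirement.
r_version_ok <- getRversion() >= "3.6.2"

print(report)
print(r_version_ok)
```

Rows where `installed` is `NA` indicate packages that still need to be installed; rows where `installed` differs from `listed` may still work, but results are only known to reproduce with the listed versions.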
This project is licensed under the MIT License. Please read our license details.
Text and figures: MIT + file LICENSE
Code: MIT + file LICENSE
Data: MIT + file LICENSE
We welcome contributions from everyone. If you would like to make a contribution, please read our contributor guidelines. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.