Popbots-mTurk-HITS

A code repository for the HIT code (i.e., HTML, CSS, JavaScript, jQuery) we use on mTurk to collect and QA stressful sentences, along with the data pipeline for analyzing votes and outputting the CSV files used for training and testing the predictive models in the Popbots system.

The following is the typical order of the data curation, processing, and classification pipeline:

1. mTurk HITS

Explanation of HITs: https://www.loom.com/share/226d565b3dc846fcbce164905991229b
Copy the Collection and/or QA code into the mTurk interface, and copy back/push any changes made there.

2. mTurk Data Processing (mturk-automation folder)

The CSV file returned from the Amazon Mechanical Turk QA is run through a Python script (mturk_revised_analysis.ipynb) that assigns stressor labels to the sentences generated by the Mechanical Turkers. It creates one or more CSV files containing the weight each stressor label had on each sentence, whether or not the sentence was a stressor, and stressor statistics (e.g., the mean and standard deviation of the stressor's severity). It also plots the distribution of the number of labels assigned to a sentence for each label; this plot is useful for understanding how many Turker votes are needed to produce reliable data for training and testing an algorithm.
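The notebook's exact schema is not documented here, but as a minimal sketch of the kind of aggregation it performs (assuming a hypothetical votes CSV with sentence, label, and severity columns):

    import pandas as pd

    # Hypothetical input: one row per Turker vote, with columns
    # "sentence", "label", and "severity" (the notebook's real column
    # names may differ).
    votes = pd.read_csv("qa_votes.csv")

    # Weight of each label on a sentence = its share of that sentence's votes.
    counts = votes.groupby(["sentence", "label"]).size()
    weights = counts / counts.groupby(level="sentence").transform("sum")

    # Severity statistics per sentence (mean, standard deviation).
    stats = votes.groupby("sentence")["severity"].agg(["mean", "std"])

    # One row per sentence: label weights plus severity statistics.
    weights.unstack(fill_value=0).join(stats).to_csv("vote_analysis.csv")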

3. Dataframe Filtering (mturk-automation folder)

This script is to be run from the command line. The CSV file(s) returned by the mTurk data processing step are used as input to the filtering script (filter_file.py). The script lets the user configure:

- the size of the test set;
- the number of times the contents of the dataframe are shuffled;
- the sample size (number of sentences) for each label;
- whether the dataframe contains stressor sentences, non-stressor sentences, or both, and the confidence percentages required for those votes;
- whether the dataframe contains COVID-related sentences, non-COVID sentences, or both, and the confidence percentages required for those votes;
- which labels to include;
- which sources to include;
- whether to keep sentences that were used as seed sentences for Inquire; and
- whether to distribute the test set across each of the sources rather than randomly over all sentences.

It then returns the dataframe that was passed in, with the chosen edits/filters applied. A sketch of a few of these filters appears below.
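filter_file.py's actual flags and column names are not documented here; as an illustration only, the core of such filtering might look like this hedged pandas sketch (column names such as main_label and is_stressor_conf are assumptions):

    import pandas as pd

    def filter_dataframe(df, labels, min_stressor_conf=0.6,
                         test_size=200, n_shuffles=5, seed=42):
        """Sketch of the kind of filtering filter_file.py performs;
        column names here are assumptions, not the script's real schema."""
        # Keep only the requested labels and sufficiently confident
        # stressor votes.
        df = df[df["main_label"].isin(labels)]
        df = df[df["is_stressor_conf"] >= min_stressor_conf]

        # Shuffle the dataframe the requested number of times, then
        # split off a test set.
        for i in range(n_shuffles):
            df = df.sample(frac=1, random_state=seed + i)
        test = df.iloc[:test_size]
        train = df.iloc[test_size:]
        return train, test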

4. Scikit Classification (scikit-pipeline folder)

This script (scikit-script.py) is to be run from the command line. The CSV file returned by the filtering script is used as input to the scikit classification script. The script is edited manually to change the classifier's algorithm settings; it then runs the classifier on the training and test sets returned by the dataframe filtering script. It returns the classification model, the prediction probabilities, and a text file containing the classification report and confusion matrix.
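The concrete estimator settings live in the script itself; a minimal scikit-learn pipeline in the same spirit (the column names text and main_label are assumptions) might look like:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("train.csv")  # outputs of filter_file.py (assumed names)
    test = pd.read_csv("test.csv")

    # TF-IDF features + logistic regression; the real script's settings
    # are edited manually and may use a different estimator.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train["text"], train["main_label"])

    preds = clf.predict(test["text"])
    probs = clf.predict_proba(test["text"])

    with open("classification_report.txt", "w") as f:
        f.write(classification_report(test["main_label"], preds))
        f.write(str(confusion_matrix(test["main_label"], preds)))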

Prerequisites

Requires Python 3 and several potentially non-standard libraries, including pandas, numpy, statistics, matplotlib, ipywidgets, pyoperators, and tkintertable, which can be installed using pip or a similar installer. To run the notebook locally, install Jupyter Notebook (https://jupyter.org/install). Change to the local directory on the command line and launch Jupyter Notebook by typing "jupyter notebook"; this should automatically open a browser window/tab (or you can browse) to http://localhost:8888/tree. When opening the notebook, ensure the Python 3 kernel is running. You will also need access to the Stanford Commuter server or another server with high computing power.

Inquire Scraping (inquire-scraping folder)

The Inquire scraper is to be run from the command line. You will need a Stanford Commuter server account and the Cisco AnyConnect Secure Mobility Client installed.

1. Connect to the VPN at su-vpn.stanford.edu via the Cisco application.
2. Open a terminal and ssh into {your SUNet ID}@commuter.stanford.edu.
3. Once connected to Commuter, change directories into commuter (e.g., cd /commuter).
4. Activate the virtual environment by running source inquire_venv/bin/activate.
5. Change directories into the Inquire backend and run python run_server.py.
6. With that running in the background, open another terminal instance and navigate locally to (/commuter/PopBots/NLP) ~/Popbots-mTurk-HITS/inquire-scrapping.
7. Before running scrapper.py, edit the script's 'SCRAPED_CATEGORIES' variable to reflect the labels you want to scrape more sentences for (see the sketch below).
8. Run python3 scrapper.py.

Upon running this script, two folders are created in the inquire-scrapping folder on your computer: 'inquired_scraped_data' and 'final_data'. Only the final_data folder matters. In its csv sub-folder you will find a CSV for each label you scraped, containing the sentence used to scrape the Inquire sentence, the Inquire sentence itself, and the cosine similarity of the two sentences. In the 'get_stressors' function you can change the model used in the 'params' variable; it starts set to 'bert', but you can change it to 'w2v' as well.
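As an illustration only (the variable layout in scrapper.py may differ, and the label names below are placeholders), the edits in step 7 and the model switch amount to something like:

    # In scrapper.py -- a hypothetical sketch; the script's real layout
    # may differ.
    SCRAPED_CATEGORIES = ["Work", "Health", "Family"]  # placeholder labels to scrape

    # Inside get_stressors(): choose the similarity model.
    params = {"model": "bert"}  # or "w2v"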

BERT Pipeline (bert-pipeline folder)

The multilabel.py and single-label.py scripts take in the CSV file returned from the dataframe filtering script (filter_file.py). They return a classification report text file consisting of a list of lists, plus a text file of evaluation information, which should be a list of two dictionaries; these text files are used for further analysis of BERT's sentence classification performance. single-label.py trains the algorithm on the label that received the most Turker votes per sentence, while multilabel.py trains on the vote weights across all the label choices the Turkers voted on per sentence. To run the scripts, VPN through su-vpn.stanford.edu via Cisco, then open a terminal and ssh into {sunet}@commuter.stanford.edu. Once in Commuter, cd into /commuter/PopBots/NLP/Popbots-mTurk-HITS/bert-pipeline, run source /commuter/thierrylincoln/Tf1.1_py36/bin/activate, and run export LD_LIBRARY_PATH='/usr/local/cuda-10.0/lib64'. After completing those commands, set EXPERIMENT_NAME to the name of the folder you want the files to go to, DATASET_NAME to the name of the CSV file that came out of the dataframe filtering script, TODAY_DATE to the date you are running the experiments, and boostrap_nb to 20 (see the sketch below). Changing these variables each time you run an experiment ensures that your files are saved and stored properly and not overwritten. The script has a 1.5-2 hour run time.
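For illustration, the per-run edits might look like the following; the values shown are placeholders, not defaults from the repository:

    # Experiment bookkeeping variables edited before each run
    # (placeholder values; the real assignments live in the scripts).
    EXPERIMENT_NAME = "bert_multilabel_run1"  # output folder name
    DATASET_NAME = "filtered_dataframe.csv"   # output of filter_file.py
    TODAY_DATE = "2020-06-15"                 # date of the experiment
    boostrap_nb = 20                          # number of bootstrap runs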

Additional Dataset Analysis (mturk-automation folder)

The pair_analysis.ipynb file reads in a dataframe returned from the mturk_revised_analysis script. Its purpose is to create visualizations for examining the distributions of label assignments over sentences. It returns an image of several graphs comparing, for all 9 label categories, the counts of the main label category against the counts of the second-most-voted label for each sentence. The LDA_PerLabel_Analysis script measures the coherence and perplexity of the sentences assigned to each label; it also reads in a dataframe from the mturk_revised_analysis script. The initial motivation for measuring coherence and perplexity was to find a measure of how many more sentences were needed to improve classifier performance. This file prints the coherence and perplexity scores of each label's sentences in the Jupyter notebook cell.
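The notebook's exact preprocessing is not shown here; a minimal gensim sketch of per-label coherence and perplexity, assuming sentences holds one label's sentences, might look like:

    from gensim import corpora
    from gensim.models import CoherenceModel, LdaModel

    # `sentences` is assumed: the list of raw sentences for one label.
    tokens = [s.lower().split() for s in sentences]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(t) for t in tokens]

    lda = LdaModel(corpus, num_topics=5, id2word=dictionary, random_state=42)

    # Coherence (higher is better) and log perplexity (lower is better).
    coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    perplexity = lda.log_perplexity(corpus)
    print(coherence, perplexity)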
