- classes_two.npy: contains two classes, "toxic" and "non-toxic"
- classes.npy: contains three classes, "hatespeech", "normal" and "offensive"
Code to load the encoder:
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Replace <name_of_the_file> with the path to classes.npy or classes_two.npy
encoder.classes_ = np.load(<name_of_the_file>, allow_pickle=True)
Sample Entry:
"24198545_gab": {
  "post_id": "24198545_gab",
  "annotators": [
    {
      "label": "hatespeech",
      "annotator_id": 4,
      "target": ["African"]
    },
    {
      "label": "hatespeech",
      "annotator_id": 3,
      "target": ["African"]
    },
    {
      "label": "offensive",
      "annotator_id": 5,
      "target": ["African"]
    }
  ],
  "post_tokens": ["and","this","is","why","i","end","up","with","nigger","trainee","doctors","who","can","not","speak","properly","lack","basic","knowledge","of","biology","it","truly","scary","if","the","public","only","knew"]
}
🔹post_id : Unique id for each post
🔹annotators : The list of annotations from each annotator
🔹annotators[label] : The label assigned by the annotator to this post. Possible values: [hatespeech, offensive, normal]
🔹annotators[annotator_id] : The unique Id assigned to each annotator
🔹annotators[target] : The list of target communities present in the post
🔹rationales : A list of rationales selected by annotators. Each rationale is a list of 0/1 values, one per token. A value of 1 means that the token is part of the rationale selected by the annotator. To recover the token itself, use the same index position in "post_tokens"
🔹post_tokens : The list of tokens representing the post which was annotated
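The fields above can be combined in a few lines. The following is a minimal sketch, not the repository's own preprocessing: the entry is made up for illustration, the helper names `majority_label` and `rationale_tokens` are hypothetical, and collapsing the annotations by majority vote is an assumption about how one might derive a single label per post.

```python
from collections import Counter

# Hypothetical entry in the layout described above; the tokens, labels,
# targets and rationale masks are invented and are not from the dataset.
entry = {
    "post_id": "example_post",
    "annotators": [
        {"label": "offensive", "annotator_id": 1, "target": ["Other"]},
        {"label": "offensive", "annotator_id": 2, "target": ["Other"]},
        {"label": "normal", "annotator_id": 3, "target": ["None"]},
    ],
    "rationales": [
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ],
    "post_tokens": ["you", "are", "so", "stupid"],
}


def majority_label(entry):
    """Return the most frequent annotator label, or None when the top two tie."""
    counts = Counter(a["label"] for a in entry["annotators"]).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no single label has a majority
    return counts[0][0]


def rationale_tokens(entry):
    """For each rationale mask, return the tokens at positions marked 1."""
    tokens = entry["post_tokens"]
    return [
        [tok for tok, flag in zip(tokens, mask) if flag == 1]
        for mask in entry["rationales"]
    ]


print(majority_label(entry))    # offensive
print(rationale_tokens(entry))  # [['so', 'stupid'], ['stupid']]
```

Each rationale mask is aligned with "post_tokens" by index, so `zip` is enough to recover the highlighted words.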
Post_id_divisions is a dictionary containing the train, valid and test post ids, which are used to split the dataset into train, validation and test sets in the ratio 8:1:1.
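Applying such a split could look like the sketch below. It assumes the divisions dictionary maps each split name to a list of post ids and that the dataset is a dictionary keyed by post id; the key names ("train", "val", "test"), the toy data and the helper `split_by_post_id` are hypothetical and should be checked against the actual file.

```python
def split_by_post_id(dataset, divisions):
    """Group dataset entries under the split whose id list contains their post id."""
    return {
        split: {pid: dataset[pid] for pid in post_ids if pid in dataset}
        for split, post_ids in divisions.items()
    }


# Toy stand-ins for the real files (assumed layout, invented ids).
dataset = {f"post_{i}": {"post_id": f"post_{i}"} for i in range(10)}
divisions = {
    "train": [f"post_{i}" for i in range(8)],
    "val": ["post_8"],
    "test": ["post_9"],
}

splits = split_by_post_id(dataset, divisions)
print({name: len(items) for name, items in splits.items()})
# {'train': 8, 'val': 1, 'test': 1}
```

Because the function iterates over whatever keys the divisions dictionary holds, it does not depend on the exact split names.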
We use the Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download) pretrained word vectors from the GloVe repo. They are required only if you plan to run the non-BERT deep learning models (cnn-gru, birnn, birnn-scrat). One click download
- Extract glove.840B.300d.txt into this folder (Data/)
- Run this python file to convert the GloVe model into a gensim model.
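For orientation, the GloVe text file stores one token per line followed by its space-separated vector values. The sketch below only illustrates reading that format with the standard library; it is not the repository's conversion script and does not produce a gensim model. The helper `read_glove` and the three-dimensional sample lines are hypothetical (the real file would use `dim=300`). Splitting from the right is a precaution in case a token itself contains spaces.

```python
import io


def read_glove(lines, dim):
    """Parse GloVe-style text lines into a {token: vector} dictionary."""
    vectors = {}
    for line in lines:
        # Split from the right so that only the last `dim` fields are values.
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip lines that do not carry a token plus `dim` values
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Invented 3-dimensional sample standing in for glove.840B.300d.txt.
sample = io.StringIO("the 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n")
vectors = read_glove(sample, dim=3)
print(vectors["the"])    # [0.1, 0.2, 0.3]
print(vectors[". . ."])  # [0.4, 0.5, 0.6]
```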
🟢🟢 You are ready to roll!!! 🟢🟢
The data uses the label "homosexual" as defined at collection time. Other sexual and gender orientation categories have been pruned from the data due to low incidence. The published version of the paper wrongly mentions the LGBTQ category.