Data and code for paper: Identifying Distributional Perspective Differences from Colingual Groups, to appear at SocialNLP@NAACL 2021
The raw data is extracted from Wikipedia, with negative samples created by flipping adjectives. Each line has the format:
ID text one-hot label
Here, ID can be any number or string. For the one-hot labels, 1 0 represents positive and 0 1 represents negative.
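As a minimal sketch of reading this format, the helper below splits one line into its ID, text and label. It assumes the fields are separated by whitespace (tabs or spaces), that the ID contains no whitespace, and that the last two tokens are the one-hot label; the name `parse_line` is hypothetical and not part of the released code.

```python
from typing import Tuple


def parse_line(line: str) -> Tuple[str, str, str]:
    """Split one raw line into (ID, text, label name).

    Assumption: whitespace-separated fields, ID first, one-hot label last.
    """
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"expected ID, text and two label fields: {line!r}")

    sample_id = tokens[0]
    text_tokens = tokens[1:-2]
    one_hot = tokens[-2:]

    # "1 0" is positive and "0 1" is negative, as described above.
    if one_hot == ["1", "0"]:
        label = "positive"
    elif one_hot == ["0", "1"]:
        label = "negative"
    else:
        raise ValueError(f"unrecognised one-hot label: {one_hot}")

    # Joining with single spaces normalises any repeated whitespace in the text.
    return sample_id, " ".join(text_tokens), label
```

If the files turn out to use a fixed delimiter such as a tab, splitting on that delimiter instead would preserve the original spacing inside the text field.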
Note that these are raw data files, so the number of lines in each file does not match the number of training samples reported in the paper. You can balance the data in whatever way you prefer.
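One possible way to balance the data is to randomly downsample every class to the size of the smallest one. The sketch below is only an illustration of that idea, not the procedure used in the paper; the `downsample` function and the `(ID, text, label)` tuple layout are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str, str]  # (ID, text, label name), a hypothetical layout


def downsample(samples: Sequence[Sample], seed: int = 0) -> List[Sample]:
    """Return a shuffled subset with the same number of samples per label."""
    by_label: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[2]].append(sample)
    if not by_label:
        return []

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced: List[Sample] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data; oversampling the minority class or weighting the loss are alternatives if the raw files are small.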
** Sample data (useful for inspection)
- EN_philosophy.xlsx - original wiki data, the data after negation using antonyms, and the data after applying back-translation
- Lang-pos_PageName.txt, Lang-pos_PageName.txt - paired positive and negative data for English, Chinese and Japanese
- pytorch_pretrained_bert
- csv, tqdm, numpy, pickle
- PyTorch 1.0
- matplotlib
Download the code and data
Put pre-trained BERT model in:
vocab_path: pybert/model/pretrain/uncased_L-12_H-768_A-12/vocab.txt
bert_config_file: pybert/model/pretrain/uncased_L-12_H-768_A-12/bert_config.json
pytorch_model_path: pybert/model/pretrain/pytorch_pretrain/pytorch_model.bin
bert_model_dir: pybert/model/pretrain/pytorch_pretrain
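Before training, it can help to confirm that the pre-trained model files are where the configuration expects them. The following is a small sketch that checks the four paths listed above relative to a repository root; the `missing_paths` helper is hypothetical and not part of the released code.

```python
from pathlib import Path
from typing import List

# Paths copied from the list above, relative to the repository root.
EXPECTED_FILES = [
    "pybert/model/pretrain/uncased_L-12_H-768_A-12/vocab.txt",
    "pybert/model/pretrain/uncased_L-12_H-768_A-12/bert_config.json",
    "pybert/model/pretrain/pytorch_pretrain/pytorch_model.bin",
]
EXPECTED_DIRS = ["pybert/model/pretrain/pytorch_pretrain"]


def missing_paths(root: str = ".") -> List[str]:
    """Return the expected files and directories that are absent under root."""
    base = Path(root)
    missing = [p for p in EXPECTED_FILES if not (base / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (base / p).is_dir()]
    return missing


if __name__ == "__main__":
    for path in missing_paths():
        print(f"missing: {path}")
```

Running the script from the repository root prints one line per missing path and nothing when everything is in place.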
Modify the data and model paths in
, preprocess the data when necessary -
to fine-tune the BERT model. -
to predict new data.
Credit: The BERT classifier is borrowed and modified from lonePatient.