Data and code for paper: Identifying Distributional Perspective Differences from Colingual Groups, to appear at SocialNLP@NAACL 2021
Raw data extracted from Wikipedia and negative samples created by flipping adjectives. In the format of:
ID text one-hot label
Here ID can be any number or string. For one-hot labels, 1 0 represents positive, while 0 1 represents negative
Note that these are raw data files. So the lines of each file does not match the number of training samples reported in the paper. You can choose to balance the data in whatever way you prefer.
** Sample data (useful for inspection)
- EN_philosophy.xlsx - original wiki data, data after negated using antonyms and after applying backtranslation
- Lang-pos_PageName.txt, Lang-pos_PageName.txt - paired positive and negative data for English, Chinese and Japanese
- pytorch_pretrained_bert
- csv, tqdm, numpy, pickle
- PyTorch 1.0
- matplotlib
-
Download the code and data
-
Put pre-trained BERT model in:
vocab_path: pybert/model/pretrain/uncased_L-12_H-768_A-12/vocab.txt
bert_config_file: pybert/model/pretrain/uncased_L-12_H-768_A-12/bert_config.json
pytorch_model_path: pybert/model/pretrain/pytorch_pretrain/pytorch_model.bin
bert_model_dir: pybert/model/pretrain/pytorch_pretrain
-
Modify data and model path in
pybert/config/basic_config.py
, preprocess the data when necessary -
Run
python train_bert_classify.py
to fine tuning bert model. -
Run
python inference.py
to predict new data.Credit: Bert classifier is borrowed and modified from lonePatient