Skip to content

Data and code for paper: Identifying Distributional Perspective Differences from Colingual Groups, SocialNLP@NAACL 2021

Notifications You must be signed in to change notification settings

NinaTian98369/CLUSTER

 
 

Repository files navigation

CLUSTER

Data and code for paper: Identifying Distributional Perspective Differences from Colingual Groups, to appear at SocialNLP@NAACL 2021

Multi-lingual Wikipedia Data

Raw data extracted from Wikipedia and negative samples created by flipping adjectives. In the format of:

ID	text	one-hot label

Here ID can be any number or string. For one-hot labels, 1 0 represents positive, while 0 1 represents negative

Note that these are raw data files. So the lines of each file does not match the number of training samples reported in the paper. You can choose to balance the data in whatever way you prefer.

** Sample data (useful for inspection)

- EN_philosophy.xlsx - original wiki data, data after negated using antonyms and after applying backtranslation
- Lang-pos_PageName.txt, Lang-pos_PageName.txt - paired positive and negative data for English, Chinese and Japanese

Requirements

  • pytorch_pretrained_bert
  • csv, tqdm, numpy, pickle
  • PyTorch 1.0
  • matplotlib

How to run the code:

  1. Download the code and data

  2. Put pre-trained BERT model in:

      vocab_path: pybert/model/pretrain/uncased_L-12_H-768_A-12/vocab.txt
      bert_config_file: pybert/model/pretrain/uncased_L-12_H-768_A-12/bert_config.json
      pytorch_model_path: pybert/model/pretrain/pytorch_pretrain/pytorch_model.bin
      bert_model_dir: pybert/model/pretrain/pytorch_pretrain
  1. Modify data and model path in pybert/config/basic_config.py, preprocess the data when necessary

  2. Run python train_bert_classify.py to fine tuning bert model.

  3. Run python inference.py to predict new data.

    Credit: Bert classifier is borrowed and modified from lonePatient

About

Data and code for paper: Identifying Distributional Perspective Differences from Colingual Groups, SocialNLP@NAACL 2021

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%