This is the code for the NLPCC 2020 paper *Label-Wise Document Pre-Training for Multi-Label Text Classification*.
- Ubuntu 16.04
- Python >= 3.6.0
- PyTorch >= 1.3.0
We provide the preprocessed RMSC and AAPD datasets, as well as pretrained checkpoints of the LW-LSTM+PT+FT and HLW-LSTM+PT+FT models, to ensure reproducibility. Please download them from the link and decompress them to the root directory of this repository, producing the `data` and `outputs` directories shown below:
```
data
|--aapd
|  |--label_test
|  |--label_train
|  |--...
|--rmsc
|  |--rmsc.data.test.json
|  |--rmsc.data.train.json
|  |--rmsc.data.valid.json
|--aapd_word2vec.model
|--aapd_word2vec.model.wv.vectors.npy
|--aapd.meta.json
|--aapd.pkl
|--rmsc_word2vec.model
|--rmsc_word2vec.model.wv.vectors.npy
|--rmsc.meta.json
|--rmsc.pkl
outputs
|--aapd
|--rmsc
```
Note that `data/aapd` and `data/rmsc` contain the initial datasets. Here we provide a split of RMSC (i.e. RMSC-V2).
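For reference, multi-label targets such as those in `label_test`/`label_train` are typically converted to multi-hot vectors before training. Below is a minimal sketch, assuming each line holds one document's labels separated by spaces (the actual file format may differ):

```python
def build_label_vocab(label_lines):
    """Collect all distinct labels and assign each an index."""
    labels = sorted({lab for line in label_lines for lab in line.split()})
    return {lab: i for i, lab in enumerate(labels)}

def to_multi_hot(line, vocab):
    """Turn one space-separated label line into a multi-hot vector."""
    vec = [0] * len(vocab)
    for lab in line.split():
        vec[vocab[lab]] = 1
    return vec

# Toy AAPD-style label lines (illustrative, not real dataset content).
lines = ["cs.cl cs.lg", "cs.lg", "math.st cs.lg"]
vocab = build_label_vocab(lines)
print([to_multi_hot(l, vocab) for l in lines])
# → [[1, 1, 0], [0, 1, 0], [0, 1, 1]]
```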
- Testing on AAPD:

```shell
python classification.py -config=aapd.yaml -in=aapd -gpuid [GPU_ID] -test
```

- Testing on RMSC:

```shell
python classification.py -config=rmsc.yaml -in=rmsc -gpuid [GPU_ID] -test
```
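MLTC results are usually reported as micro-averaged precision/recall/F1 over the multi-hot predictions. A self-contained sketch of micro-F1 (for illustration only; not the repository's evaluation code):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-hot label matrices (lists of 0/1 rows)."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            tp += t and p            # predicted and gold
            fp += (not t) and p      # predicted but not gold
            fn += t and (not p)      # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(round(micro_f1(y_true, y_pred), 4))  # → 0.6667
```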
If you want to preprocess the datasets yourself, run the following command with the name of the dataset (e.g. RMSC or AAPD):

```shell
PYTHONHASHSEED=1 python preprocess.py -data=[RMSC/AAPD]
```
Note that `PYTHONHASHSEED` is used by word2vec; fixing it keeps the string hashing, and hence the trained word vectors, deterministic across runs.
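Fixing `PYTHONHASHSEED` matters because CPython randomizes string hashes per interpreter run unless the seed is pinned, and word2vec uses string hashing when seeding word vectors. A quick demonstration, independent of this repository:

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Hash a word in a fresh interpreter with the given PYTHONHASHSEED."""
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('multi-label'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    )
    return result.stdout.strip()

print(string_hash("1") == string_hash("1"))  # True: a fixed seed reproduces the hash
print(string_hash("1") == string_hash("2"))  # almost certainly False: seeds diverge
```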
Pre-train the LW-PT model:

```shell
python pretrain.py -config=[CONFIG_NAME] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
```

- `CONFIG_NAME`: `aapd.yaml` or `rmsc.yaml`
- `OUT_INFIX`: infix of the outputs directory that contains logs and checkpoints
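The core idea behind the pre-training is a label-wise document representation: each label attends over the document's token states and pools its own document vector. A toy numpy sketch of that pooling step (an illustration of the concept, not the paper's exact model):

```python
import numpy as np

def label_wise_pool(token_states, label_queries):
    """token_states: (T, d) encoder hidden states; label_queries: (L, d),
    one learned query per label. Returns (L, d): one pooled vector per label."""
    scores = label_queries @ token_states.T              # (L, T) attention logits
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)              # softmax over tokens
    return attn @ token_states                           # (L, d) pooled vectors

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 tokens, hidden size 4 (toy dimensions)
Q = rng.normal(size=(3, 4))   # 3 labels
D = label_wise_pool(H, Q)
print(D.shape)  # → (3, 4)
```

Each row of `D` is a convex combination of the token states, weighted by how relevant each token is to that label.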
Train the downstream model for the MLTC task:

```shell
python classification.py -config=[CONFIG_NAME] -in=[IN_INFIX] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
```

- `IN_INFIX`: infix of the inputs directory that contains the pre-trained checkpoints
- Build a static document representation to facilitate downstream tasks:

```shell
python build_doc_rep.py -config=[CONFIG_NAME] -in=[IN_INFIX] -gpuid [GPU_ID]
```
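Once static document representations are built, downstream code can consume them as fixed features without rerunning the encoder. A hedged sketch with numpy, where the file name, shapes, and retrieval use are all illustrative rather than the script's actual output format:

```python
import os
import tempfile
import numpy as np

# Hypothetical output: 100 documents, 256-dim representations.
doc_reps = np.random.default_rng(0).normal(size=(100, 256)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "doc_reps.npy")
np.save(path, doc_reps)

# A downstream task reloads the frozen features.
feats = np.load(path)

def most_similar(feats, i):
    """Index of the document most similar to document i by cosine similarity."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed[i]
    sims[i] = -np.inf                 # exclude the query document itself
    return int(np.argmax(sims))

print(feats.shape, most_similar(feats, 0))
```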
The following scripts are optional; use them only when needed:

- Make the RMSC-V2 dataset: `tests/make_rmsc.py`
- Visualize document embeddings: `tests/visual_emb.py`
- Visualize per-label F1 scores: `tests/visual_label_f1.py`
- Case study: `tests/case_study.py`
If you find our work useful, please cite the paper:

```
@inproceedings{liu2020label,
  title={Label-Wise Document Pre-Training for Multi-Label Text Classification},
  author={Liu, Han and Yuan, Caixia and Wang, Xiaojie},
  booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
  year={2020}
}
```