You may want to start the Ray service first, e.g.,
ray start --head --port=6379
To reproduce the data processing:
Human Labeled Data:
We download the NER data from the BLURB zip (`BC5CDR-chem`, `BC5CDR-disease`, and `NCBI-disease`, placed in `data_generation/data/`). This should be identical to the data in this GitHub repo.
Run `sh ./data/download_labeled.sh`
Weakly Supervised Data:
# Copy `data/download_pubmed.sh` into the directory where you want to save the data before running it. The download takes a long time and requires a large amount of disk space.
mkdir tasks/unlabeled
cp data/download_pubmed.sh tasks/unlabeled/
# Unlabeled in-domain data: Dump PubMed using `data/download_pubmed.sh`.
# You may need to change the file names and the number of files in the script, because the PubMed annual baseline changes every year (i.e., `GZFILE`, `XMLFILE` and `$(seq -f "%04g" 1 1015)`). See the current year's data at https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
cd tasks/unlabeled/
sh download_pubmed.sh
cat *.txt > all_text
rm *.txt
mv all_text all_text.txt
# Put `data/Annotate.ipynb`, `data/chem_dict.txt`, `data/disease_dict.txt` into the directory.
cd ../..
cp data/Annotate.ipynb tasks/unlabeled/
cp data/chem_dict.txt tasks/unlabeled/
cp data/disease_dict.txt tasks/unlabeled/
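The file range used by `download_pubmed.sh` depends on the baseline year. As a rough sketch of how the zero-padded indices from `seq -f "%04g"` map to archive names, assuming the `pubmedYYnNNNN.xml.gz` naming shown in the NCBI FTP listing (the helper `baseline_files` and the year value below are illustrative and are not part of the script):

```shell
# Print baseline archive names for indices FIRST..LAST of a given year.
# Hypothetical helper; check the names against the FTP listing before use.
baseline_files() {
    year="$1"; first="$2"; last="$3"
    for i in $(seq -f "%04g" "$first" "$last"); do
        echo "pubmed${year}n${i}.xml.gz"
    done
}

# Example: the first three archives for a two-digit year "20" (assumed value).
baseline_files 20 1 3
```

Comparing the last name printed for your chosen range with the last file in the FTP listing is a quick way to confirm that the upper bound (1015 in the original script) is still correct.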
Annotate the data with the dictionaries using `Annotate.ipynb`. Change `TGT_ENTITY_TYPE` to generate data for different tasks; see the notebook for details.
# Create soft links (run from `tasks/`, the parent directory of `unlabeled/`)
cd BC5CDR-chem
ln -s ../unlabeled/chem_weak.txt weak.txt
cd ../BC5CDR-disease
ln -s ../unlabeled/disease_weak.txt weak.txt
cd ../NCBI-disease
ln -s ../unlabeled/disease_weak.txt weak.txt
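After linking, it can help to confirm that every task directory sees a readable `weak.txt`, since a dangling link only fails later during profiling. A minimal check, assuming the three task directories sit under a common root such as `tasks/` (the function name `check_weak_links` is illustrative):

```shell
# Report task directories whose weak.txt is missing or a dangling link.
# Returns non-zero if any task fails the check.
check_weak_links() {
    root="$1"; status=0
    for task in BC5CDR-chem BC5CDR-disease NCBI-disease; do
        # -e follows symlinks, so a broken link counts as missing.
        if [ -e "$root/$task/weak.txt" ]; then
            echo "ok: $task/weak.txt"
        else
            echo "missing or broken: $task/weak.txt" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (from the directory that contains tasks/):
# check_weak_links tasks
```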
├── data # Scripts for pre-processing
│ ├── download_labeled.sh # Download human-labeled data
│ ├── download_pubmed.sh # Download PubMed Data
│ ├── Annotate.ipynb # Get weak annotation
│ ├── disease_dict.txt # Dictionary Files for Disease
│ └── chem_dict.txt # Dictionary Files for Chemical
├── weak_weighted_selftrain.sh # Script for weakly-supervised training (Stage II)
├── finetune.sh # Fine-tune a model with human labeled data
├── profile.sh # Create profile data
├── profile2refinedweakdata.sh # Turn profile data into refined weakly supervised data (weak label refinement)
└── supervised.sh # Fine-tune from a publicly available pre-trained model (e.g., BioBERT)
Before using any script in this folder, set `TASK` to one of the following:
TASK="BC5CDR-chem"
TASK="BC5CDR-disease"
TASK="NCBI-disease"
The default max length is 256: MAX_LENGTH=256
For other parameters, see Hyperparameter Explanation in `../README.md`
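A typo in `TASK` would point the scripts at a non-existent data directory. As a small guard you could place at the top of a script, assuming only the three task names listed above are supported (the function `is_valid_task` is a hypothetical helper, not something the scripts provide):

```shell
# Return 0 if the given name is one of the three supported tasks.
is_valid_task() {
    case "$1" in
        BC5CDR-chem|BC5CDR-disease|NCBI-disease) return 0 ;;
        *) return 1 ;;
    esac
}

TASK="BC5CDR-chem"
MAX_LENGTH=256   # default from this README

if is_valid_task "$TASK"; then
    echo "TASK=$TASK MAX_LENGTH=$MAX_LENGTH"
else
    echo "Unknown TASK: $TASK" >&2
fi
```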
Method | BC5CDR-chem | BC5CDR-disease | NCBI-disease |
---|---|---|---|
Previous SOTA (F1-score) | | | |
BERT | 89.99 | 79.92 | 85.87 |
BioBERT | 92.85 | 84.70 | 89.13 |
SciBERT | 92.51 | 84.70 | 88.25 |
ClinicalBERT | 90.80 | 83.04 | 86.32 |
BlueBERT | 91.19 | 83.69 | 88.04 |
PubMedBERT | 93.33 | 85.62 | 87.82 |
Re-implementation (P/R/F1) | | | |
BioBERT | 92.64/93.28/92.96 | 83.73/86.80/85.23 | 87.18/91.35/89.22 |
Ours | 93.21/95.12/94.17 | 87.99/93.56/90.69 | 91.76/92.81/92.28 |
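The P/R/F1 triples follow the usual definition of F1 as the harmonic mean of precision and recall. A quick sanity check with `awk`, using two rows from the table above (the function name `f1` is illustrative; recomputing from the rounded P and R values can differ from the reported F1 by 0.01 in some rows):

```shell
# F1 = 2PR / (P + R), printed to two decimals.
f1() {
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 2 * p * r / (p + r) }'
}

f1 87.99 93.56   # "Ours" on BC5CDR-disease, prints 90.69
f1 91.76 92.81   # "Ours" on NCBI-disease, prints 92.28
```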
Since BioBERT is already an in-domain BERT, we do not perform additional MLM pre-training.
An example script for MLM pre-training is in `roberta_mlm_pretrain.sh`
sh ./bio_script/supervised.sh
Stage II:
Initial NER model
./bio_script/supervised.sh 0,1
with
TASK="BC5CDR-chem"
NUM_EPOCHS=20
BATCH_SIZE=32
Profiling
./bio_script/profile.sh 0,1
with
TASK="BC5CDR-chem"
BERT_CKP=${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_20_BSZ_32
PROFILE_FILE=$DATA_DIR/dev.txt
./bio_script/profile.sh 0,1,2,3,4,5,6,7
with
TASK="BC5CDR-chem"
BERT_CKP=${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_20_BSZ_32
PROFILE_FILE=$DATA_DIR/weak.txt
Refine Weak Labels
./bio_script/profile2refinedweakdata.sh
with
TASK="BC5CDR-chem"
BERT_CKP=${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_20_BSZ_32
WEI_RULE=avgaccu_weak_non_O_promote
PRED_RULE=non_O_overwrite
NA-WSL
./bio_script/weak_weighted_selftrain.sh 0,1
with
TASK="BC5CDR-chem"
BERT_CKP=${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_20_BSZ_32
FBA_RULE=weak_non_O_overwrite-WEI_avgaccu_weak_non_O_promote
DISTRIBUTE_GPU=true
LOSSFUNC=corrected_nll
MAX_WEIGHT=0.95
USE_DA=false
Stage III: fine-tune
./bio_script/finetune.sh 0,1
with
TASK="BC5CDR-chem"
BERT_CKP=${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_20_BSZ_32/selftrain/weak_non_O_overwrite-WEI_avgaccu_weak_non_O_promote_EPOCH_1_MAXWEI_0.95_LOSS_corrected_nll_distributed
NUM_EPOCHS=15
SEED=10
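The checkpoint paths above appear to be assembled from the settings of the earlier steps. The sketch below rebuilds the Stage III `BERT_CKP` from the Stage II values; the naming pattern is inferred from the example paths in this README rather than taken from the scripts, and `SELFTRAIN_EPOCHS` and `STAGE3_CKP` are hypothetical variable names:

```shell
# Settings copied from the steps above.
TASK="BC5CDR-chem"
NUM_EPOCHS=20
BATCH_SIZE=32
PRED_RULE=non_O_overwrite
WEI_RULE=avgaccu_weak_non_O_promote
LOSSFUNC=corrected_nll
MAX_WEIGHT=0.95
SELFTRAIN_EPOCHS=1   # hypothetical name; matches "EPOCH_1" in the example path

# Initial NER model checkpoint (inferred pattern).
BERT_CKP="${TASK}/crf-dmis-lab-biobert-v1.1_EPOCH_${NUM_EPOCHS}_BSZ_${BATCH_SIZE}"

# FBA_RULE seems to combine the prediction rule and the weighting rule.
FBA_RULE="weak_${PRED_RULE}-WEI_${WEI_RULE}"

# Self-training output directory; the "_distributed" suffix is assumed to
# correspond to DISTRIBUTE_GPU=true.
STAGE3_CKP="${BERT_CKP}/selftrain/${FBA_RULE}_EPOCH_${SELFTRAIN_EPOCHS}_MAXWEI_${MAX_WEIGHT}_LOSS_${LOSSFUNC}_distributed"

echo "$STAGE3_CKP"
```

If the printed path does not match a directory produced by `weak_weighted_selftrain.sh`, check the script's output location directly instead of relying on this pattern.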