Transfer Learning for Biomedical Relation Extraction Seminar: applying BioBERT & SciBERT to relation extraction (protein-protein interaction).
Clone the repository, create a Python virtual environment and install the requirements.
Download the train and test data (AIMed, BioInfer) into `TL_Bio_RE/data/raw`.
Use `Korpusdaten-Bearbeiten.ipynb` to process the corpora, do the train-dev-test split and transform the data according to the papers:
- Lee, Jinhyuk, et al. "BioBERT: pre-trained biomedical language representation model for biomedical text mining." arXiv preprint arXiv:1901.08746 (2019).
- Lin, Chen, et al. "A BERT-based universal model for both within-and cross-sentence clinical temporal relation extraction." Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019.
- Wu, Shanchan, and Yifan He. "Enriching pre-trained language model with entity information for relation classification." Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019.
(The text in the notebook is written in German, but the code is self-explanatory. Also keep in mind that we combined the corpora to create our train-dev-test split; we could have instead created a separate split for each corpus.)
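For orientation, a minimal sketch of such a combined split (this is not the notebook code; the variable `pairs` and the 70/15/15 ratio are assumptions):

```python
# Minimal sketch of a combined train-dev-test split, assuming `pairs` holds the
# merged AIMed and BioInfer examples; the 70/15/15 ratio is an assumption.
from sklearn.model_selection import train_test_split

train, rest = train_test_split(pairs, test_size=0.3, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
```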
Download BioBERT: https://github.com/naver/biobert-pretrained
We have chosen BioBERT v1.1 (+ PubMed 1M). The archive contains TensorFlow checkpoints, the BERT configuration and the vocabulary.
It is based on the original TensorFlow BERT implementation from Google. Since we want to work with Hugging Face Transformers, we converted the TensorFlow checkpoint to a PyTorch model.
Conversion based on: huggingface/transformers#457 (comment)
You need to adjust the path to the directory where you saved BioBERT.
```bash
export BERT_BASE_DIR=/path/to/biobert_v1.1_pubmed

transformers bert \
  $BERT_BASE_DIR/model.ckpt-1000000 \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin
```
Our approach was to save `pytorch_model.bin`, `bert_config.json` and `vocab.txt` to a new directory `biobert_v1.1._pubmed_pytorch`. We had to rename `bert_config.json` to `config.json` to be able to load it with the `transformers` package.
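If the command-line entry point above is not available in your `transformers` version, the same conversion can also be done from Python. A minimal sketch, assuming TensorFlow is installed and the placeholder paths match the layout above:

```python
# Sketch of the same TF -> PyTorch conversion from Python (requires TensorFlow);
# the paths are placeholders matching the directory layout described above.
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("/path/to/biobert_v1.1_pubmed/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "/path/to/biobert_v1.1_pubmed/model.ckpt-1000000")
torch.save(model.state_dict(), "biobert_v1.1._pubmed_pytorch/pytorch_model.bin")
```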
Download SciBERT: https://github.com/allenai/scibert
We have chosen SciBERT Scivocab Cased (PyTorch HuggingFace) because BioBERT is based on cased BERT.
There is no need to manually download the original BERT; the Hugging Face Transformers code handles that automatically when we load the model with `BertModel.from_pretrained('bert-base-cased', config=bert_config)`.
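For reference, a minimal loading sketch; the local directory names are assumptions based on the conversion and download steps above:

```python
# Minimal loading sketch; the local directory names are assumptions based on
# the conversion/download steps described above.
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-cased")                  # downloaded automatically
biobert = BertModel.from_pretrained("biobert_v1.1._pubmed_pytorch")  # converted above
scibert = BertModel.from_pretrained("scibert_scivocab_cased")        # unpacked SciBERT archive

biobert_tokenizer = BertTokenizer.from_pretrained("biobert_v1.1._pubmed_pytorch")
```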
We have implemented a simple BertForSequenceClassification model (based on the Hugging Face implementation) for the approaches described by Lee et al. (2019) and Lin et al. (2019):

- Lee et al. (2019) anonymize the entities (with `@PROTEIN$`)
- Lin et al. (2019) add positional markers to the relevant entities (with `ps` and `pe`), as illustrated below
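A rough, purely illustrative sketch of the two input formats (the exact marker placement is defined by the preprocessing notebook, not by this snippet):

```python
# Rough, illustrative sketch of the two input formats; the sentence and entity
# names are made up, and the exact marker placement follows the notebook code.
sentence = "PROT1 interacts with PROT2 in vitro."
entities = ("PROT1", "PROT2")

# Lee et al. (2019): anonymize both protein mentions with @PROTEIN$
lee_input = sentence.replace(entities[0], "@PROTEIN$").replace(entities[1], "@PROTEIN$")
# -> "@PROTEIN$ interacts with @PROTEIN$ in vitro."

# Lin et al. (2019): surround both protein mentions with positional markers ps/pe
lin_input = sentence
for e in entities:
    lin_input = lin_input.replace(e, f"ps {e} pe")
# -> "ps PROT1 pe interacts with ps PROT2 pe in vitro."
```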
Use the `train_bert_simple.sh` script (while in the `TL_Bio_RE` directory) to train the BertForSequenceClassification model. You need to change the absolute paths in the script.
In addition, we implemented RBERT (based on the following repository: https://github.com/monologg/R-BERT), described by Wu and He (2019), and adapted the code to work with our corpora. They add positional markers to the entities:

- `$` to mark the start and end of the first entity
- `#` to mark the start and end of the second entity
In addition to the [CLS] token (the pooled output for the whole sentence), they use the averaged representations of the two entities for classification.
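A conceptual sketch of such a classification head (this is not the R-BERT code from the repository above; names, layer layout and masks are illustrative assumptions):

```python
# Conceptual sketch of an R-BERT-style head: the pooled [CLS] output and the
# averaged entity-span representations are combined for classification.
import torch
import torch.nn as nn

class RBertStyleHead(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.cls_fc = nn.Linear(hidden_size, hidden_size)   # transforms pooled [CLS]
        self.ent_fc = nn.Linear(hidden_size, hidden_size)   # shared layer for both entities
        self.classifier = nn.Linear(3 * hidden_size, num_labels)

    def forward(self, sequence_output, pooled_output, e1_mask, e2_mask):
        # average the token representations inside each entity span
        def masked_mean(hidden, mask):
            mask = mask.unsqueeze(-1).float()                         # (batch, seq, 1)
            return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

        e1 = torch.tanh(self.ent_fc(self.dropout(masked_mean(sequence_output, e1_mask))))
        e2 = torch.tanh(self.ent_fc(self.dropout(masked_mean(sequence_output, e2_mask))))
        cls = torch.tanh(self.cls_fc(self.dropout(pooled_output)))
        return self.classifier(self.dropout(torch.cat([cls, e1, e2], dim=-1)))
```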
Use the `train_rbert.sh` script to train the RBERT model.
Our training parameters are as follows:
Parameter | Value |
---|---|
Batch size | 16 |
Max sequence length* | 286 |
Learning rate | 2e-5 |
Number of epochs | 5 |
Dropout rate | 0.1 |
*Note: Corpus analysis yielded a maximum sequence length of 281 after BERT tokenization. We chose the value 286 to allow some slack.
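A sketch of that length analysis (the variable `sentences` and the tokenizer directory are assumptions):

```python
# Sketch of the length analysis mentioned above; `sentences` is assumed to hold
# the preprocessed sentences, and the tokenizer path is a placeholder.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("biobert_v1.1._pubmed_pytorch")
max_len = max(len(tokenizer.encode(s)) for s in sentences)  # includes [CLS] and [SEP]
print(max_len)  # the analysis reported 281 for our corpora
```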
Evaluation is done automatically at the end of training because of the `--do_eval` flag.
If you only want to do evaluation, you can remove the `--do_train` flag or use the `eval_rbert.sh` script. It simply loads a model and evaluates it on the test data.
The `pred_rbert.sh` script loads a model and writes its predictions into `.csv` files (one for each corpus: AIMed, BioInfer) according to the specifications of the seminar:

- Pair id
- Label (True/False)
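For illustration only, a sketch of what such a prediction file could look like when written from Python (the pair ids and labels are made up; this is not the code used by `pred_rbert.sh`):

```python
# Purely illustrative sketch of the two-column prediction format; the pair ids,
# labels and file name are made up and not produced by the project scripts.
import csv

predictions = [("AIMed.d0.s0.p0", True), ("AIMed.d0.s0.p1", False)]  # hypothetical rows
with open("predictions_AIMed.csv", "w", newline="") as f:
    csv.writer(f).writerows(predictions)
```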
Our results on the hold-out sets:
Model | ACC | P | R | F1 |
---|---|---|---|---|
Lee-Bert | 86.9 | 58.7 | 84.8 | 69.4 |
Lee-SciBert | 89.4 | 68.7 | 72.3 | 70.4 |
Lee-BioBert | 90.2 | 67.2 | 85.9 | 75.4 |
Lin-Bert | 89.6 | 67.4 | 78.0 | 72.3 |
Lin-SciBert | 87.4 | 63.2 | 66.5 | 64.8 |
Lin-BioBert | 87.8 | 61.8 | 78.0 | 69.0 |
WuHe-Bert | 86.8 | 60.3 | 72.3 | 65.7 |
WuHe-SciBert | 90.4 | 71.1 | 75.9 | 73.4 |
WuHe-BioBert | 89.7 | 69.9 | 71.7 | 70.8 |

Model | ACC | P | R | F1 |
---|---|---|---|---|
Lee-Bert | 85.0 | 74.1 | 65.9 | 69.7 |
Lee-SciBert | 87.4 | 83.1 | 64.9 | 72.9 |
Lee-BioBert | 86.8 | 81.3 | 64.5 | 71.9 |
Lin-Bert | 84.8 | 71.9 | 68.7 | 70.3 |
Lin-SciBert | 87.1 | 82.7 | 64.0 | 72.1 |
Lin-BioBert | 87.5 | 81.5 | 67.5 | 73.9 |
WuHe-Bert | 84.8 | 75.4 | 62.6 | 68.4 |
WuHe-SciBert | 87.8 | 83.1 | 67.1 | 74.2 |
WuHe-BioBert | 86.5 | 83.9 | 60.0 | 70.0 |