BERT + CRF for RuREBus
The main goal of this task is to train a BERT-CRF model for solving the Named Entity Recognition and Relation Extraction tasks on the RuREBus dataset (Russian Relation Extraction for Business).
- datasets – implementations of torch datasets.
- models – model implementations (BERT + CRF, classifier for the RE task).
- re_utils – various useful utilities (e.g. for working with files, the NER data structure, model training).
- resources – materials for the design of the repository; the data for training and testing the models is also supposed to be stored here.
- RuREBus – repository with the original task.
- scripts – scripts for preparing data for training and evaluation.
- ner_experiments.ipynb – training different models to solve the NER task.
- re_experiments.ipynb – training the model to solve the RE task.
Create a virtual environment with venv or conda and install the requirements:
pip install -r requirements.txt
Or build and run a Docker container:
./run_docker.sh
"The corpus contains regional reports and strategic plans. A part of the corpus is annotated with named entities (8 classes) and semantic relations (11 classes). In total there are approximately 300 annotated documents. The annotation schema and the guidelines for annotators can be found in here (in Russian)."
from RuREBus repo
There are a lot of pretrained Russian language models; the most popular one is sberbank-ai/ruBert-base. In order to get higher quality when solving NER and RE on the RuREBus dataset, we applied masked language modeling to the sberbank-ai/ruBert-base model. The dataset for fine-tuning was chosen from the same domain: https://disk.yandex.ru/d/9uKbo3p0ghdNpQ
- Create a masked dataset for BERT fine-tuning:
$ python scripts/mask_texts.py
- Run the training script on the created dataset:
$ python scripts/mlm.py
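The heavy lifting is done by scripts/mask_texts.py and scripts/mlm.py. For orientation only, a minimal sketch of masked-language-model fine-tuning with the Hugging Face Trainer is shown below; the corpus path texts.txt, the output directory and all hyperparameters are illustrative assumptions, not the values used by the scripts.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/ruBert-base")
model = AutoModelForMaskedLM.from_pretrained("sberbank-ai/ruBert-base")

# "texts.txt" is a placeholder for the domain corpus downloaded from the link above.
raw = load_dataset("text", data_files={"train": "texts.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# The collator masks 15% of the tokens on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rurebus-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
model.save_pretrained("rurebus-bert")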
The training data is provided in the RuREBus repository. The data needs to be unpacked; this can be done using the script:
./unzip_data.sh
Next, the data must be processed by the script scripts/tokenize_texts.py:
python -m scripts.tokenize_texts
This script tokenizes the texts using the tokenizer passed in the --hf-tokenizer parameter (default: sberbank-ai/ruBert-base). The script also breaks the texts into pieces whose size does not exceed --max-seq-len (default: 512) tokens. The script is careful not to break a word consisting of several tokens in the middle; likewise, a named entity consisting of several words is never broken in the middle (see the sketch below). The resulting pieces can be of different sizes; when the dataloader processes this data, the shorter sequences are padded.
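The exact splitting logic lives in scripts/tokenize_texts.py; the hypothetical helper below only illustrates the constraint, cutting sequences solely at positions where a new word starts and no entity is continued (word_ids and the BIO labels are assumed to be already aligned to the tokens):

from typing import List, Tuple

def split_into_chunks(word_ids: List[int], labels: List[str],
                      max_seq_len: int = 512) -> List[Tuple[int, int]]:
    """Hypothetical helper: return (start, end) token spans of length <= max_seq_len
    that never cut inside a word (same word id) or inside an entity ("I-" label)."""
    chunks, start, last_safe = [], 0, 0
    for i in range(1, len(labels) + 1):
        # A cut before position i is safe if token i starts a new word
        # and does not continue a named entity.
        if i == len(labels) or (word_ids[i] != word_ids[i - 1]
                                and not labels[i].startswith("I-")):
            if i - start > max_seq_len and last_safe > start:
                chunks.append((start, last_safe))  # close the chunk at the last safe cut
                start = last_safe
            last_safe = i
    chunks.append((start, len(labels)))
    return chunks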
This script creates 4 files in the same directory as the text and annotation data (--dir parameter, default: resources/data/train):

- labeled_texts.jsonl – the file consists of lines, each of the following form:

  {"input_ids": [113, 1947, 672, 73324, ..., 152, 64306], "text_labels": ["O", "O", "B-QUA", "I-QUA", ..., "O", "O"], "labels": [0, 0, 15, 10, ..., 0, 0], "id": 0}

  - id – index of a piece of text in the dataset
  - input_ids – token ids produced by the tokenizer
  - text_labels – named entity labels for the tokens, assigned according to the BIO scheme: if a token does not belong to a named entity, it is marked with the "O" label; the label of the first token of a named entity gets the "B-" prefix; the labels of the remaining tokens of the entity get the "I-" prefix. The prefix is followed by a tag denoting the class of the named entity.
  - labels – text_labels converted to numbers
- label2id.jsonl – mapping from a label's text representation to a number, e.g.:

  {"O": 0, "B-ECO": 1, "B-CMP": 2, "I-SOC": 3, "I-INST": 4, "B-INST": 5, ...}

- relations.jsonl, retag2id.jsonl – more details about these files are described in the Data subsection of the Relation Extraction section.
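The torch datasets in the datasets package consume these files. As a rough sketch of how labeled_texts.jsonl can be read and batched with padding (the class and function names are illustrative; the pad id 0 and label id 0 = "O" follow the example above):

import json
from typing import Dict, List

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class LabeledTextsDataset(Dataset):
    """Illustrative reader for labeled_texts.jsonl (not the repository's implementation)."""

    def __init__(self, path: str):
        with open(path, encoding="utf-8") as f:
            self.items = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        item = self.items[idx]
        return {
            "input_ids": torch.tensor(item["input_ids"]),
            "labels": torch.tensor(item["labels"]),
        }


def collate(batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Shorter sequences are padded up to the longest sequence in the batch
    # (pad token id 0 and label 0 == "O" are assumptions based on the example above).
    input_ids = pad_sequence([b["input_ids"] for b in batch], batch_first=True, padding_value=0)
    labels = pad_sequence([b["labels"] for b in batch], batch_first=True, padding_value=0)
    attention_mask = (input_ids != 0).long()
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


loader = DataLoader(LabeledTextsDataset("resources/data/train/labeled_texts.jsonl"),
                    batch_size=8, collate_fn=collate)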
The outputs of the BERT model pretrained on the corpus of business texts are processed with a Conditional Random Field (CRF).
The essence of CRF is to build a probabilistic model $p(y \mid x)$ of the label sequence $y$ given the input token sequence $x$.

The key idea of CRF is the definition of a feature vector $\Phi(x, y) \in \mathbb{R}^d$. The function maps a pair of the input sequence and the label sequence to some feature vector in d-dimensional space.

The probabilistic model is built as follows:

$$p(y \mid x) = \frac{\exp(\Phi(x, y))}{\sum\limits_{y' \in \mathcal{Y}^{m}} \exp(\Phi(x, y'))}$$

The function $\Phi(x, y)$ is a sum of emission and transition potentials: the emission potentials $\log \psi_{\texttt{EMIT}}(y_i \rightarrow x_i)$ are produced from the BERT output for token $x_i$ as a vector of size num type of labels, and the transition potentials $\log \psi_{\texttt{TRANS}}(y_{i-1} \rightarrow y_i)$ are stored in a learned matrix of size num type of labels × num type of labels:

$$\Phi(x, y) = \sum\limits_{i = 0}^{m - 1} \log \psi_{\texttt{EMIT}}(y_i \rightarrow x_i) + \log \psi_{\texttt{TRANS}}(y_{i - 1} \rightarrow y_i),$$

where $y_{-1}$ denotes a special START state (and a transition to a special END state is added after the last label).

During training, negative log-likelihood is minimized:

$$-\log p(y \mid x) = \log \sum\limits_{y' \in \mathcal{Y}^{m}} \exp(\Phi(x, y')) - \Phi(x, y)$$

The question is how to efficiently calculate the sum over all possible label sequences $y' \in \mathcal{Y}^{m}$ in the denominator.

Let $\mathcal{Y}$ be the set of labels and $\pi$ a dynamic-programming table in which $\pi[i][j]$ is the logarithm of the sum, over all label prefixes of length $i + 1$ whose last label is $\mathcal{Y}[j]$, of the exponentiated potentials. The base case is $\pi[0][j] = \log \psi_{\texttt{TRANS}}(\texttt{START} \rightarrow \mathcal{Y}[j]) + \log \psi_{\texttt{EMIT}}(\mathcal{Y}[j] \rightarrow x_0)$ (the line pi = self.tr_start + x[0] in the code below).

The calculation for the indices 1..m - 1 is easier to follow with the figure in mind:
$$\pi[i-1][j] = \log \sum\limits_{y' \in \mathcal{Y}^{i},\, y'_{-1} = \mathcal{Y}[j]} \exp\left(\sum\limits_{k = 0}^{i - 1} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right)$$

$$\displaylines{\pi[i][j] = \log \sum\limits_{t = 0}^{|\mathcal{Y}| - 1} \exp \left(\log \sum\limits_{y' \in \mathcal{Y}^{i},\, y'_{-1} = \mathcal{Y}[t]} \exp\left(\sum\limits_{k = 0}^{i - 1} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right) \\ + \log \psi_{\texttt{EMIT}} (y_i \rightarrow x_i) + \log \psi_{\texttt{TRANS}} (\mathcal{Y}[t] \rightarrow \mathcal{Y}[j])\right)}$$

$$\displaylines{\pi[i][j] = \log \sum\limits_{t = 0}^{|\mathcal{Y}| - 1} \exp \left(\log \left(\sum\limits_{y' \in \mathcal{Y}^{i},\, y'_{-1} = \mathcal{Y}[t]} \exp\left(\sum\limits_{k = 0}^{i - 1} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right) \\ \cdot \psi_{\texttt{EMIT}} (y_i \rightarrow x_i) \cdot \psi_{\texttt{TRANS}} (\mathcal{Y}[t] \rightarrow \mathcal{Y}[j])\right)\right)}$$

$$\displaylines{\pi[i][j] = \log \sum\limits_{t = 0}^{|\mathcal{Y}| - 1} \sum\limits_{y' \in \mathcal{Y}^{i},\, y'_{-1} = \mathcal{Y}[t]} \exp\left(\sum\limits_{k = 0}^{i - 1} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right) \\ \cdot \psi_{\texttt{EMIT}} (y_i \rightarrow x_i) \cdot \psi_{\texttt{TRANS}} (\mathcal{Y}[t] \rightarrow \mathcal{Y}[j])}$$

$$\displaylines{\pi[i][j] = \log \sum\limits_{t = 0}^{|\mathcal{Y}| - 1} \sum\limits_{y' \in \mathcal{Y}^{i},\, y'_{-1} = \mathcal{Y}[t]} \exp\left(\log\psi_{\texttt{EMIT}} (y_i \rightarrow x_i) + \log\psi_{\texttt{TRANS}} (\mathcal{Y}[t] \rightarrow \mathcal{Y}[j]) \\ + \sum\limits_{k = 0}^{i - 1} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right)}$$

$$\pi[i][j] = \log \sum\limits_{y' \in \mathcal{Y}^{i + 1},\, y'_{-1} = \mathcal{Y}[j]} \exp\left(\sum\limits_{k = 0}^{i} \log \psi_{\texttt{EMIT}} (y'_k \rightarrow x_k) + \log \psi_{\texttt{TRANS}} (y'_{k - 1} \rightarrow y'_k)\right)$$
At the end, we add a potential vector which is responsible for ending the sequence on the corresponding label.
Since recalculating the vector $\pi[i]$ only requires $\pi[i-1]$, it is enough to keep a single vector of size $|\mathcal{Y}|$ and update it in place. Taking into account that gradient descent works with batches, the calculation of the denominator of the negative log-likelihood can be implemented as follows:
def compute_log_denominator(self, x: torch.Tensor) -> torch.Tensor:
    m = x.shape[0]
    pi = self.tr_start + x[0]
    # x.shape == [seq_len, batch_size, num_type_of_labels]
    # pi.shape == [batch_size, num_type_of_labels]
    # self.tr.shape == [num_type_of_labels, num_type_of_labels]
    for i in range(1, m):
        pi = torch.logsumexp(
            pi.unsqueeze(2) + self.tr + x[i].unsqueeze(1),
            dim=1,
        )
    pi += self.tr_end
    return torch.logsumexp(pi, dim=1)
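The denominator is only half of the loss; the other half is the log-potential of the gold label sequence. A sketch of that part, assuming the same self.tr, self.tr_start and self.tr_end parameters, gold labels of shape [seq_len, batch_size] and no padding mask, could look like this:

def compute_log_numerator(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Sketch, not the repository's exact code.
    # x.shape == [seq_len, batch_size, num_type_of_labels] -- emission scores
    # y.shape == [seq_len, batch_size] -- gold label ids
    m = x.shape[0]
    score = self.tr_start[y[0]] + x[0].gather(1, y[0].unsqueeze(1)).squeeze(1)
    for i in range(1, m):
        score += self.tr[y[i - 1], y[i]] + x[i].gather(1, y[i].unsqueeze(1)).squeeze(1)
    score += self.tr_end[y[-1]]
    return score  # shape == [batch_size]

def neg_log_likelihood(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # NLL = log-denominator - log-numerator, averaged over the batch.
    return (self.compute_log_denominator(x) - self.compute_log_numerator(x, y)).mean()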
To get a sequence of labels for the tokens from the hidden representation produced by BERT, we need to find the most probable label sequence:

$$\hat{y} = \arg\max\limits_{y \in \mathcal{Y}^{m}} p(y \mid x)$$

We can simplify the expression as follows: the denominator does not depend on $y$, so

$$\hat{y} = \arg\max\limits_{y \in \mathcal{Y}^{m}} \Phi(x, y)$$

The decoding problem is to find an entire sequence of labels such that the sum of potentials is maximized. This problem is also solved using dynamic programming (the Viterbi algorithm).

All formulas are similar to those used when calculating the denominator of the negative log-likelihood, with logsumexp replaced by a maximum while remembering, for every step, which previous label gave that maximum:
def viterbi_decode(self, x: torch.Tensor) -> List[List[int]]:
    m, bs, num_type_of_labels = x.shape
    pi = self.tr_start + x[0]
    backpointers = torch.empty_like(x)
    # x.shape == [seq_len, batch_size, num_type_of_labels]
    # pi.shape == [batch_size, num_type_of_labels]
    # self.tr.shape == [num_type_of_labels, num_type_of_labels]
    for i in range(1, m):
        pi = (
            pi.unsqueeze(2) + self.tr + x[i].unsqueeze(1)
        )  # shape == [batch_size, num_type_of_labels, num_type_of_labels]
        # for each next label, determine from which label it is most profitable to come
        pi, indices = pi.max(dim=1)
        backpointers[i] = indices
    backpointers = backpointers[1:].int()
    pi += self.tr_end
From the obtained backpointers, it is easy to restore the path and the corresponding sequence of labels.
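The restoration step is not shown in the snippet above; a possible continuation at the end of viterbi_decode (a sketch using the pi, backpointers and bs defined there) is:

    # Sketch of the restoration step: follow the backpointers from the best
    # final label back to the start of the sequence, for every batch element.
    best_paths = []
    best_last = pi.argmax(dim=1)  # shape == [batch_size]
    for b in range(bs):
        path = [best_last[b].item()]
        for i in range(backpointers.shape[0] - 1, -1, -1):
            path.append(backpointers[i, b, path[-1]].item())
        path.reverse()
        best_paths.append(path)
    return best_paths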
Running the script scripts/tokenize_texts.py described in the Data subsection of the NER section also produces two files that describe the relations between entities in the data:
- relations.jsonl – for each text, all relations are described:

  {"id": 3, "relations": [{"arg1_tag": "BIN", "arg2_tag": "ACT", "arg1_pos": [135, 136], "arg2_pos": [142, 149], "re_tag": "TSK", "tag": 1}]}

  - id – index of a piece of text in the dataset
  - arg1_tag – the tag of the named entity that is the first argument of the relation
  - arg2_tag – the tag of the named entity that is the second argument of the relation
  - arg1_pos – the position of the first argument in the text, given as token indexes (not characters and not words!); the named entity occupies the half-open interval [arg1_pos[0], arg1_pos[1])
  - arg2_pos – the same for the second argument
  - re_tag – the string value of the relation tag
  - tag – the numeric id of the relation tag
- retag2id.jsonl – mapping from a relation tag's text representation to a number, e.g.:

  {"PPS": 0, "TSK": 1, "NNG": 2, "FNG": 3, "GOL": 4, "FPS": 5, ...}
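As a small illustration of the half-open token intervals, the snippet below (an assumed usage example relying on the default data directory) decodes the first argument of every relation of the first text:

import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/ruBert-base")

# Link labeled_texts.jsonl and relations.jsonl through the shared "id" field.
with open("resources/data/train/labeled_texts.jsonl", encoding="utf-8") as f:
    texts = {item["id"]: item for item in map(json.loads, f)}
with open("resources/data/train/relations.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        tokens = texts[rec["id"]]["input_ids"]
        for rel in rec["relations"]:
            start, end = rel["arg1_pos"]  # half-open interval [start, end)
            print(rel["re_tag"], rel["arg1_tag"], tokenizer.decode(tokens[start:end]))
        break  # only the first text, as an illustration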
These files, as well as the trained NER model, are used to compile a dataset for training the relation-detection model. The data is prepared using the script scripts/prepare_data_for_re.py:
python -m scripts.prepare_data_for_re
This script creates two files:

- re_data.jsonl – relations between the entities identified by the named entity recognition model. Each line of the file is a dictionary with the following fields:
  - id – index of a piece of text in the dataset
  - seq_embedding – embedding of the entire piece of text
  - entities_embeddings – embeddings of the named entities extracted by the NER model
  - relation_matrix – matrix of relations between the entities extracted by the NER model: relation_matrix[i][j] = tag_id if there is a relation between the $i^{th}$ and $j^{th}$ entities from the array entities_tags whose tag has id tag_id (according to the file retag2id.jsonl)
  - entities_tags – tags of the named entities extracted by the NER model (without the B-, I- prefixes)
  - entities_positions – positions of the entities extracted by the NER model, in the same format as the keys arg1_pos, arg2_pos in the file relations.jsonl
- entity_tag_to_id.json – mapping from an entity tag's text representation to a number, e.g.:

  {"SOC": 0, "ECO": 1, "ACT": 2, "CMP": 3, "MET": 4, ...}
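To make the relation_matrix convention concrete, here is an illustrative (not the repository's) way of turning one line of re_data.jsonl into pairwise training examples; the file path and the encoding of pairs without a relation are assumptions that depend on the preparation script:

import json

import torch

# Enumerate ordered entity pairs from one line of re_data.jsonl (path assumed).
with open("resources/data/train/re_data.jsonl", encoding="utf-8") as f:
    item = json.loads(f.readline())

seq_emb = torch.tensor(item["seq_embedding"])
ent_embs = torch.tensor(item["entities_embeddings"])
relation_matrix = item["relation_matrix"]

pair_features, pair_labels = [], []
for i in range(len(ent_embs)):
    for j in range(len(ent_embs)):
        if i == j:
            continue
        # One example per ordered pair: text embedding + both entity embeddings.
        pair_features.append(torch.cat([seq_emb, ent_embs[i], ent_embs[j]]))
        # Target is the relation tag id from retag2id.jsonl; how the "no relation"
        # case is encoded in relation_matrix depends on the preparation script.
        pair_labels.append(relation_matrix[i][j])

if pair_features:
    features = torch.stack(pair_features)
    targets = torch.tensor(pair_labels)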
The idea for the model was taken from the article Enriching Pre-trained Language Model with Entity Information for Relation Classification
As input, the model receives the fields seq_embedding, entities_embeddings and entities_tags from the file re_data.jsonl and predicts the relation tag (or the absence of a relation) for each pair of named entities, following the scheme from the figure above. A relation matrix is built, and the cross-entropy between the prediction matrix and the ground-truth relation_matrix from the file re_data.jsonl is calculated.
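A minimal sketch of such a pairwise classifier is given below; the layer sizes, the use of an entity-tag embedding and the exact way the embeddings are concatenated are assumptions, while the actual architecture in models follows the scheme from the paper above.

import torch
from torch import nn


class RelationClassifier(nn.Module):
    """Illustrative pairwise relation classifier (not the repository's exact model)."""

    def __init__(self, emb_dim: int, num_entity_tags: int, num_relation_tags: int,
                 tag_emb_dim: int = 32, hidden_dim: int = 512):
        super().__init__()
        self.tag_embedding = nn.Embedding(num_entity_tags, tag_emb_dim)
        # Input: text embedding + two entity embeddings + two entity-tag embeddings.
        self.classifier = nn.Sequential(
            nn.Linear(3 * emb_dim + 2 * tag_emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_relation_tags),  # incl. a "no relation" class
        )

    def forward(self, seq_emb, ent_embs, ent_tags):
        # seq_emb: [emb_dim], ent_embs: [n, emb_dim], ent_tags: [n]
        n = ent_embs.shape[0]
        tag_embs = self.tag_embedding(ent_tags)          # [n, tag_emb_dim]
        first = torch.cat([ent_embs, tag_embs], dim=1)   # [n, emb_dim + tag_emb_dim]
        # Build features for every ordered pair (i, j) of entities.
        left = first.unsqueeze(1).expand(n, n, -1)
        right = first.unsqueeze(0).expand(n, n, -1)
        seq = seq_emb.expand(n, n, -1)
        pair_features = torch.cat([seq, left, right], dim=-1)
        return self.classifier(pair_features)            # [n, n, num_relation_tags]

The [n, n, num_relation_tags] output can then be compared with the ground-truth relation_matrix via cross-entropy after flattening both to shape [n * n, ...].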
The ner_experiments.ipynb notebook provides code for training different versions of the BERT model. Our fine-tuned BERT with a CRF layer shows the best f1-micro score.
|  | ruBERT | ruBERT + CRF | ruREBus-BERT | ruREBus-BERT + CRF |
|---|---|---|---|---|
| f1-micro | 0.8046 | 0.8092 | 0.8119 | 0.8128 |
Since the text was broken into pieces no larger than 512 tokens, for some relations the arguments ended up in different pieces of text.
For the test dataset, there were 248 such lost relations. We consider them undetected by the model and count them as false negatives.

The remaining false negatives were calculated by comparing the file relations.jsonl with the relation matrix predicted by the model. True positives and false positives were calculated by comparing the predicted relation matrix with the ground-truth relation matrix from the re_data.jsonl file.

Due to the lost relations and the errors of the NER model, the f1-micro value turned out to be rather low:
f1-micro = 0.2636
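For reference, the counts described above combine into the reported score as ordinary micro-averaged F1, with the lost relations added to the false negatives (a sketch of the standard formula, not the repository's evaluation code):

def f1_micro(tp: int, fp: int, fn: int, lost: int = 0) -> float:
    # Lost relations (arguments in different text pieces) are counted as false negatives.
    fn += lost
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0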