Entity Linking with BERT through joint modeling of mention detection and entity disambiguation.
This project has two companion Makefiles: one to build and run the Docker container, and one, used from inside the container, to run the Python code.
You will need a machine with CUDA to run this code. The Docker container is configured to use CUDA 10.1, cuDNN 7 (following the AD machine "Flavus").
The Docker container also relies on files and folders located at /nfs/students/amund-faller-raheim/ma_thesis_el_bert.
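A quick way to verify the environment from inside the container is a check like the following. This snippet is not part of the repository; it assumes the container's Python environment ships PyTorch.

    import os
    import torch

    # CUDA must be visible inside the container.
    assert torch.cuda.is_available(), "No CUDA device visible"
    print("CUDA device:", torch.cuda.get_device_name(0))

    # The NFS folder the container relies on must be mounted.
    nfs_root = "/nfs/students/amund-faller-raheim/ma_thesis_el_bert"
    assert os.path.isdir(nfs_root), "NFS folder not mounted: " + nfs_root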
Most scripts read from the 'config.ini' file located at the root of the project. To change the behaviour of the program, look there before reaching for the CLI arguments of each script, particularly with regard to training behaviour.
For example, to pretrain a model on a different dataset, change the path at "DATA" -> "Annotated Dataset" and "DATA" -> "[Cased/Uncased] Input Vectors Dir", and make sure you are using a Knowledge Base with the same IDs as the dataset (i.e. Wikidata or Wikipedia). Update "DATA" -> "Data Split" to split the dataset as preferred.
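As a minimal sketch of how the scripts are assumed to read these values (the section and key names below are taken from this README; the actual scripts may read them differently):

    from configparser import ConfigParser

    config = ConfigParser()
    config.read("config.ini")

    # configparser matches option names case-insensitively.
    annotated_dataset = config["DATA"]["Annotated Dataset"]
    input_vectors_dir = config["DATA"]["Cased Input Vectors Dir"]  # or "Uncased ..."
    data_split = config["DATA"]["Data Split"]
    print(annotated_dataset, input_vectors_dir, data_split)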
From the root of the project folder (where this file is located), run
make build && make run
These commands run Wharfer with the correct paths and container name, and start the container.
Inside the container, you will be greeted by a new Makefile. To generate missing files, run
make setup
This will run three scripts to generate missing files:
- the Candidate Generation files /ex_data/entity_dict.json and /ex_data/alias_dict.json, generated from the file at config's "KNOWLEDGE BASE" -> "Alias Mapping", which should point to the file prob_yago_crosswikis_wikipedia_p_e_m.txt with the candidate sets from Ganea & Hofmann (2017); see the parsing sketch after this list;
- the Wikipedia-annotated AIDA-CoNLL file /ex_data/annotated_datasets/conll-wikipedia-iob-annotations, generated from a Wikidata-annotated AIDA-CoNLL file, a file /ex_data/AIDA-YAGO2-annotations.tsv with Wikidata and Wikipedia annotations, and an additional mapping from Wikidata to Wikipedia located at /ex_data/wikidata_wikipedia_mapping.csv;
- a compact Wikipedia2vec Knowledge Base at /ex_data/knowledgebases/wikipedia_score_15, used when evaluating without Candidate Generation, generated from a Wikipedia2vec file defined by config's "KNOWLEDGE BASE" -> "Wikipedia2vec Directory".
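The following is a hedged sketch of how the Candidate Generation dictionaries could be built from the p_e_m file. It is not the repository's script, and it assumes each line has the form mention<TAB>total_count<TAB>entity_id,count,entity_name<TAB>..., which may deviate from the actual format of the Ganea & Hofmann file.

    import json

    alias_dict = {}  # mention -> list of [entity_id, count]

    with open("prob_yago_crosswikis_wikipedia_p_e_m.txt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            mention = fields[0]
            candidates = []
            for entry in fields[2:]:
                parts = entry.split(",", 2)  # entity names may contain commas
                if len(parts) == 3:
                    entity_id, count, _name = parts
                    # The count/probability is kept as a string here;
                    # the real script may parse and normalize it.
                    candidates.append([entity_id, count])
            alias_dict[mention] = candidates

    with open("/ex_data/alias_dict.json", "w", encoding="utf-8") as f:
        json.dump(alias_dict, f)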
To generate the data vectors digested by the model, run
make data-generation
This script uses the file defined by config's "DATA" -> "Annotated Dataset" to generate vectorized data as digested by BERT. The script uses the tokenizer for the model defined by "MODEL" -> "Model ID" and infers "cased" or "uncased" from it. The script writes to "DATA" -> "(Un)Cased Input Vectors Dir".
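A sketch of the assumed casing inference and tokenization, based on standard Hugging Face model IDs such as "bert-base-uncased" (the actual script may differ):

    from transformers import BertTokenizer

    model_id = "bert-base-uncased"  # config "MODEL" -> "Model ID"
    casing = "uncased" if "uncased" in model_id else "cased"
    print("Inferred casing:", casing)

    tokenizer = BertTokenizer.from_pretrained(model_id)
    encoding = tokenizer("Angela Merkel visited Paris .", return_tensors="pt")
    print(encoding["input_ids"].shape)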
To run the unit tests, run
make unittest
To train a new model, run
make train
The model architecture is defined by parameters at config's "MODEL" -> "Model ID", "Hidden Output Layers", "Dropout After BERT".
The training procedure is influenced by the parameters at config's "TRAINING" -> "Epochs", "Batch Size", "Initial Learning Rate", "Loss Lambda", "Early Stopping".
The new model will be saved in /models/trained/<train-start>/<checkpoint-saved>.
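This README does not spell out how "Loss Lambda" enters the objective. Since the model jointly learns mention detection (MD) and entity disambiguation (ED), one plausible reading, sketched here purely as an assumption, is a lambda-weighted sum of the two task losses:

    import torch

    def joint_loss(md_loss: torch.Tensor, ed_loss: torch.Tensor,
                   loss_lambda: float) -> torch.Tensor:
        # Assumed combination; the actual training script may differ.
        return loss_lambda * md_loss + (1.0 - loss_lambda) * ed_loss

    # Dummy scalar losses for illustration:
    print(joint_loss(torch.tensor(0.7), torch.tensor(1.2), loss_lambda=0.5))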
To evaluate the latest trained model on AIDA-CoNLL Test with Candidate Generation, run
make eval-test
Or on the AIDA-CoNLL Validation dataset
make eval-val
Or without Candidate Generation (takes much longer)
make eval-test-nocg
make eval-val-nocg
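The speed difference comes from the size of the search space. The sketch below is an assumption, not the repository's code: with Candidate Generation, each mention is scored only against its small candidate set from alias_dict.json; without it, every entity in the Knowledge Base must be considered.

    import json

    with open("/ex_data/alias_dict.json", encoding="utf-8") as f:
        alias_dict = json.load(f)

    def candidate_ids(mention, knowledge_base, use_cg=True):
        if use_cg:
            # A handful of candidate entities per mention.
            return [eid for eid, _count in alias_dict.get(mention, [])]
        # Without Candidate Generation, the full Knowledge Base.
        return list(knowledge_base.keys())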
To evaluate the most important models in the thesis, run the following commands (assuming you have access to the models).
Removing the flag '-c' will evaluate without Candidate Generation.
- The base model (best performing without pretraining):
python3 evaluate_model.py -c -d test --no_eval_unseen "/models/trained/5_4_3-table_11-base_model/epoch_180"
- The pretrained model:
python3 evaluate_model.py -c -d test --no_eval_unseen "/models/trained/5_4_3-table_11-pretrained_model/epoch_180"
- Our version of the model of Chen et al.:
python3 evaluate_model.py -c -d test --no_eval_unseen "/models/trained/6_1_1-table_12-our_chen/epoch_180"
There are a number of evaluation scripts used to harvest statistics for the thesis. These can be accessed with make as well.
For Section 5.2.2 in the thesis:
make compare-datasets
For Section 5.3.1 in the thesis:
make evaluate-cg
For Section 5.3.2 in the thesis:
make evaluate-kb
For Section 6.2.1 in the thesis:
make evaluate-by-cat
For Section 6.2.1 in the thesis:
make evaluate-unseen
For Section 6.2.2 in the thesis:
make popularity-corr