Cillian Berragan [@cjberragan]1*, Alex Singleton [@alexsingleton]1, Alessia Calafiore [@alel_domi]1, Jeremy Morley [@jeremy_morley]2
1 Geographic Data Science Lab, University of Liverpool,
Liverpool, United Kingdom
2 Ordnance Survey Limited, Explorer House, Adanac Drive,
Nursling, Southampton, United Kingdom
* Correspondence: [email protected]
Place names embedded in online natural language text present a useful source of geographic information. Despite this, many methods for the extraction of place names from text use pre-trained models that were not explicitly designed for this task. Our paper builds five custom Named Entity Recognition (NER) models and evaluates them against three popular pre-built models for place name extraction. The models are evaluated on a set of manually annotated Wikipedia articles using the F1 score. Our best performing model achieves an F1 score of 0.939, compared with 0.730 for the best performing pre-built model. Our model is then used to extract all place names from Wikipedia articles in Great Britain, demonstrating the ability to more accurately capture unknown place names from volunteered sources of online geographic information.
A fine-tuned DistilBERT transformer model is presented for the identification of place names from text. This repository contains the code used to build the model using AllenNLP and the code used to compare the fine-tuned model against existing models used in geoparsing systems.
ger_wiki
├── configs # allennlp model configurations
│ ├── wiki_bert.jsonnet
│ ├── wiki_crf_basic.jsonnet
│ ├── wiki_crf.jsonnet
│ ├── wiki_distil.jsonnet
│ └── wiki_roberta.jsonnet
├── data_processing # scripts relating to data processing
│ ├── preprocess
│ │ ├── dbpedia_query.py
│ │ └── doccano_functions.py
│ ├── oog
│ │ ├── distil_oog.py
│ │ └── spacy_oog.py
├── Dockerfile
├── eval # scripts relating to evaluation between models
│ ├── crossval.py
│ ├── significance.py
│ └── visualise.py
├── ger_streamlit.py # model dashboard
├── ger_wiki # allennlp model scripts
│ ├── batch_predictor.py
│ ├── optimisation.py
│ └── predictor.py
├── main.py # main cli script
├── paper
└── tests
- Using a Docker image
  docker build . -t ger_wiki
  docker run -it --rm --gpus all ger_wiki
  If the machine has no GPU, run `docker run -it --rm ger_wiki`
- Using a Python virtual environment
  - Extract `ger_wiki.tar.gz` (or use `git clone`).
  - Ensure the current Python version is 3.8.5; `pyenv` is a useful tool for this.
  - Configure the virtual environment using either the `requirements.txt` file (e.g. with `conda`/`venv`) or the `pyproject.toml` file with `poetry`.
  - Run `Rscript paper/setup.R` from the base directory. This will install the R dependencies used for figure creation. R major version 4 is likely required. Several external libraries are required for some R packages (e.g. gdal). If required, the `Dockerfile` may be consulted to see what may be needed.
Assuming a correct environment setup, both methods should yield the same results in the following sections.
All CoNLL formatted data used to train the entity recognition models is included in `data_processing/data/processed`, allowing the models to run. These models have much of their configuration contained within the `configs/` directory as `.jsonnet` files. These contain hyperparameters and model configurations, with one file per model.
WARNING: These commands will likely require a GPU
- Run `python3 main.py wiki_distil`: train and evaluate the DistilBERT model on Wikipedia data. Replace `wiki_distil` with any other configuration in the `configs/` directory for other models (e.g. `configs/wiki_{model}.jsonnet`).
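Since each model name passed to `main.py` corresponds to a file in `configs/`, the available names can be listed with a few lines of Python. This is only a convenience sketch and not part of the repository; the helper name `available_models` is hypothetical, and it assumes the script is run from the base directory.

```python
from pathlib import Path


def available_models(config_dir="configs"):
    """Return the config file stems, e.g. 'wiki_distil' for wiki_distil.jsonnet."""
    return sorted(p.stem for p in Path(config_dir).glob("*.jsonnet"))


if __name__ == "__main__":
    # Print one candidate training command per configuration file found.
    for name in available_models():
        print(f"python3 main.py {name}")
```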
Once the models have finished running, the metrics shown in Table 3 are given as output. This output is also saved in each `models/{model}` directory as `metrics.json`. Because Table 3 is created from these `.json` files, the metrics in the main paper will always be the most up to date.
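To compare runs without rebuilding the paper, the `metrics.json` files can be read directly. The sketch below assumes only the `models/{model}/metrics.json` layout described above; the helper name `collect_metrics` is hypothetical, and no particular metric keys are assumed, so it prints whatever each file contains.

```python
import json
from pathlib import Path


def collect_metrics(models_dir="models"):
    """Map each model directory name to the parsed contents of its metrics.json."""
    results = {}
    for metrics_file in sorted(Path(models_dir).glob("*/metrics.json")):
        with metrics_file.open() as f:
            results[metrics_file.parent.name] = json.load(f)
    return results


if __name__ == "__main__":
    for model, metrics in collect_metrics().items():
        print(model)
        for key, value in metrics.items():
            print(f"  {key}: {value}")
```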
The main setup for these models is contained within the `ger_wiki` directory as Python scripts.
- `reader.py` reads the `.conll` files into a format readable by these models (`Instances`); it is also designed to preserve metadata associated with the text (e.g. Wikipedia article title).
- `predictor.py` contains a single-instance predictor that feeds plain text into a trained model and outputs the results. This class is used by the `ger_streamlit.py` file, which may be used to demo the model.
- `batch_predictor.py` contains a class used by `main.py` to read in the Wikipedia `CSV` that was queried from DBpedia. This class parallelises batches of text and outputs the place names identified by the model into a further `CSV` file.
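As a rough illustration of what reading CoNLL data involves, the sketch below groups token and tag lines into sentences using plain Python. It is not the repository's `reader.py`, which builds AllenNLP `Instances`. It assumes whitespace-separated columns with the token first and the entity tag last, and blank lines between sentences. The `PLACE` tag name and the example text are made up for illustration.

```python
def read_conll(lines):
    """Group 'token ... tag' lines into (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        # A blank line (or a document marker) ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])  # first column: token
        tags.append(columns[-1])   # last column: entity tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


example = """Liverpool B-PLACE
is O
north O
of O
Chester B-PLACE

It O
rains O
""".splitlines()

for sentence_tokens, sentence_tags in read_conll(example):
    print(list(zip(sentence_tokens, sentence_tags)))
```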
The main place names dataset is provided in `data_processing/data/results/predictions.csv` and may be recreated in this section. If the `wiki_distil` model has been fully trained as above, the following section will work. However, if model training was not possible, the provided `model.tar.gz` will need to be added to `models/wiki_distil/`. For the Docker container, run this command from another shell while the container is running:
INFO: To find `container_id`, run `docker ps`.
docker cp model.tar.gz (container_id):/project/models/wiki_distil/
To create predictions (runs on either CPU or GPU):
- Run `python3 main.py wiki_distil --predict`: use the trained Wiki model to extract place names from all Wikipedia articles queried from DBpedia. This will be very slow to complete on CPU. It relies on `data_processing/data/raw/wiki/wiki_info.csv`, which was created by the DBpedia query.
- Run `streamlit run ger_streamlit.py`: host an interactive app to explore outputs from the model. This may be used to recreate images shown in Section 5.
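The prediction step turns per-token tags into place name strings. A minimal sketch of that decoding is shown below, assuming a BIO tagging scheme where `B-` opens an entity and `I-` continues it. The `PLACE` tag name and the function name `extract_places` are hypothetical, and the repository's predictors may decode tags differently.

```python
def extract_places(tokens, tags):
    """Join contiguous B-/I- tagged tokens into place name strings."""
    places, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                places.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            # 'O' tags (and I- tags with no open entity) end the current entity.
            if current:
                places.append(" ".join(current))
            current = []
    if current:
        places.append(" ".join(current))
    return places


tokens = ["The", "River", "Mersey", "flows", "through", "Liverpool", "."]
tags = ["O", "B-PLACE", "I-PLACE", "O", "O", "B-PLACE", "O"]
print(extract_places(tokens, tags))  # ['River Mersey', 'Liverpool']
```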
All code relating to table and figure production is contained within `paper/tables` and `paper/figures` respectively. Changes to results will also update all figures and tables. Each `R` script may be run independently, as they are completely self-contained.
- For example, run `Rscript paper/figures/figure2_wiki_dist.R`. This may not produce output; it is likely easier to reproduce these figures using an R REPL.
The paper itself uses R Markdown to render figures and tables directly alongside text, source code, and data. To knit this document and produce an updated PDF:
Rscript -e 'rmarkdown::render("paper/main.Rmd")'
NOTE: If models have partially completed, files required for the creation of figures and tables may have been removed. If this occurs it is likely best to restart the Docker container, or start with the fresh archive.
Contained within `data_processing/preprocess` are the Python preprocessing scripts. To evaluate the preprocessing, it is best to inspect these files manually, as they were not intended to be run sequentially.
To check that the functions were correctly implemented, `pytest` may be run from the base directory; it uses toy data for the functions used here and should not throw errors. The `tests` directory contains the toy data and the unit tests relating to this preprocessing.
The preprocessing scripts are:
- `data_processing/preprocess/dbpedia_query.py`: DBpedia query and text cleaning.
- `data_processing/preprocess/doccano_functions.py`: Helper functions to convert model predictions to Doccano formatted files, and these to CoNLL.
These files were not run more than once and are only kept for completeness, to demonstrate the data preprocessing involved. The query script may be run to retrieve a more up-to-date Wikipedia corpus, but will be very slow: `python -m data_processing.preprocess.dbpedia_query` will update the raw Wikipedia data.
The directory `eval` contains three scripts for the evaluation of the chosen model against pre-built solutions. Please note that the pre-built models must be downloaded first: the spaCy models with `python -m spacy download en_core_web_sm` and `python -m spacy download en_core_web_lg`, and Stanza's model from Python with `import stanza; stanza.download('en')`.
- `crossval.py`: Obtain F1 scores for 10 subsets of the validation data for each model.
- `significance.py`: Significance testing between models.
- `visualise.py`: Visualises the NER output from pre-built models + DistilBERT.
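For readers unfamiliar with the evaluation, the sketch below shows one way to compute entity-level F1 scores over several subsets of annotated sentences. It is an illustration under stated assumptions, not the logic of `crossval.py`: entities are represented as hypothetical `(sentence_id, start, end)` tuples, a prediction counts only on an exact match, and the interleaved split is an arbitrary choice.

```python
def f1_score(gold, predicted):
    """Entity-level F1 between two collections of (sentence_id, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_per_subset(sentence_ids, gold, predicted, n_subsets=10):
    """Split the sentence ids into interleaved subsets and score each one."""
    scores = []
    for i in range(n_subsets):
        subset = set(sentence_ids[i::n_subsets])
        scores.append(f1_score(
            [span for span in gold if span[0] in subset],
            [span for span in predicted if span[0] in subset],
        ))
    return scores


gold_spans = [(0, 0, 1), (1, 2, 3)]
predicted_spans = [(0, 0, 1), (1, 2, 4)]  # the second span has a wrong end offset
print(f1_score(gold_spans, predicted_spans))
print(f1_per_subset([0, 1], gold_spans, predicted_spans, n_subsets=2))
```

The per-subset scores are the kind of paired samples a significance test between two models would take as input.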
`data_processing/oog` contains the processing that identifies place names which do not appear in the GeoNames gazetteer, for both the DistilBERT model and spaCy. Its output is used in Table 5.
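The core idea of an out-of-gazetteer check can be sketched as a set lookup. The example below assumes a simple case-insensitive exact match after trimming whitespace; the matching rules used by the scripts in `data_processing/oog` may differ, and the function name and the example place names are made up for illustration.

```python
def out_of_gazetteer(predicted_names, gazetteer_names):
    """Return predicted place names with no case-insensitive match in the gazetteer."""
    known = {name.strip().lower() for name in gazetteer_names}
    missing = {
        name.strip()
        for name in predicted_names
        if name.strip().lower() not in known
    }
    return sorted(missing)


predicted = ["Liverpool", "Scouse Alley", "liverpool "]
gazetteer = ["Liverpool", "Chester"]
print(out_of_gazetteer(predicted, gazetteer))  # ['Scouse Alley']
```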
The included Dockerfile may be used to build the Docker image from scratch:
- `docker build -t ger_wiki .`
- `docker run -it --rm --gpus all ger_wiki`