Code for my B.Sc. thesis on Natural Language Processing (NLP) for the Ancient Greek language. The paper can be found here.
With the rise of the Transformer neural network architecture, Language Models (LMs) have been created and trained for most known languages. When fine-tuned on downstream tasks, these models surpass all previous baselines by a significant margin. Thus, it can be argued that pre-training LMs has become of paramount importance for achieving good performance on NLP tasks.
Surprisingly though, not many LMs have been created for Ancient Greek. After a thorough search, only two models were found:
- A character-level BERT by Brennan Nicholson, which can be found here.
- A token-level BERT by Pranaydeep et al., which can be found here.
The former achieved decent results, but since it's character-level, it doesn't have many use cases apart from Masked Language Modelling. The latter also achieved good results when fine-tuned on the downstream task of Part-of-Speech (PoS) Tagging.
The goal of the thesis was to create an improved Ancient Greek LM, by leveraging
new BERT-like architectures and pre-training methods. The results can be found
in the paper.
It is strongly advised to use a Unix-like OS for this project, as most scripts have been written with such systems in mind.
- Download the repo using the command
git clone https://github.com/AndrewSpano/BSc-Thesis.git && cd BSc-Thesis
- Create and activate a virtual environment using the command
conda create --name ag-nlp-venv python=3.8 && conda activate ag-nlp-venv
- Install the required packages using the command
pip install -r requirements.txt
To download the data, clean it, preprocess it and train a BPE tokenizer, simply run the script
python download_and_process_data.py
This script will
- Download, clean the text data and save it in data/plain-text/ (directory will be overwritten if needed).
- Train a tokenizer and save it in objects/bpe_tokenizer/ (directory will be overwritten if needed).
- Use the tokenizer to create input IDs from the plain-text data and save it in data/processed-data/ (directory will be overwritten if needed).
- Train a sklearn LabelEncoder on the PoS tags and save it in objects/le.pkl (overwriting it if needed).
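To give an intuition for the tokenizer-training step above, here is a toy Byte-Pair Encoding (BPE) trainer in pure Python. It is only a sketch of the general algorithm, not the code used by download_and_process_data.py; the function names (`train_bpe`, `merge_pair`, `encode`) and the tiny corpus are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse each occurrence of `pair` in `symbols` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical miniature corpus, for illustration only.
corpus = ["λόγος", "λόγου", "λόγον", "θεός", "θεοῦ"]
merges = train_bpe(corpus, num_merges=3)
print(merges)
print(encode("λόγος", merges))
```

A production tokenizer additionally handles pre-tokenization, special tokens and a fixed vocabulary size, which this sketch leaves out.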
The RoBERTa family of models from the transformers open-source library has been used. The training process was implemented with two different frameworks: PyTorch Lightning (PL) and the Trainer API from huggingface.
The RoBERTa model uses the auxiliary task of Masked Language Modelling (MLM) to pre-train a Language Model.
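As a rough illustration of the MLM objective, the sketch below corrupts a sequence of token IDs following the usual BERT/RoBERTa recipe: about 15% of the tokens are selected, and of those 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens`, its parameters and the example IDs are assumptions made for this example; the masking settings actually used in this repo may differ.

```python
import random

# PyTorch's cross-entropy loss ignores targets equal to -100 by default,
# so this value is commonly used for positions that should not be predicted.
IGNORE_INDEX = -100


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_prob=0.15, rng=None):
    """Return (corrupted_ids, labels) for one sequence of token IDs."""
    if rng is None:
        rng = random.Random()
    corrupted = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens (e.g. start/end/padding) are never selected.
        if token in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Hypothetical IDs: 0 and 2 play the role of start/end tokens, 4 is the mask.
ids = [0, 11, 12, 13, 14, 15, 16, 17, 18, 2]
corrupted, labels = mask_tokens(ids, mask_id=4, vocab_size=100,
                                special_ids={0, 2}, rng=random.Random(0))
print(corrupted)
print(labels)
```

Calling the function anew each time a sequence is served gives a fresh mask per epoch, which is the "dynamic masking" idea introduced with RoBERTa.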
The PL model for MLM is implemented in the LitRoBERTaMLM class located in mlm_model.py. To pre-train the LM with PL, the script pl_train_mlm.py can be used. Its arguments can be found in the function parse_pl_mlm_input located in cmd_args.py.
An example of running the script is:
python pl_train_mlm.py \
--logdir logs/pl-mlm/ \
--config-path configurations/pl-mlm-example-config.ini \
--savedir objects/PL-AG-RoBERTa/ \
--plot-savepath plots/pl-mlm.png \
--device cuda \
--distributed \
--seed random
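The --config-path argument points to an .ini file. Its contents are not listed in this README, so the following is only a sketch of how such a file could be read with Python's standard configparser module. The section names, keys and values below are hypothetical and are not taken from configurations/pl-mlm-example-config.ini.

```python
import configparser

# Hypothetical configuration; the real example config may use other keys.
EXAMPLE_INI = """
[model]
hidden-size = 768
num-hidden-layers = 12

[training]
learning-rate = 1e-4
batch-size = 32
max-epochs = 10
"""


def load_config(text):
    """Parse .ini text into a flat dict with properly typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "hidden_size": parser.getint("model", "hidden-size"),
        "num_hidden_layers": parser.getint("model", "num-hidden-layers"),
        "learning_rate": parser.getfloat("training", "learning-rate"),
        "batch_size": parser.getint("training", "batch-size"),
        "max_epochs": parser.getint("training", "max-epochs"),
    }


config = load_config(EXAMPLE_INI)
print(config)
```

To read from disk instead, `parser.read(path)` can replace `parser.read_string(text)`.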
To pre-train the LM with the Trainer API from huggingface, the script hf_train_mlm.py can be used. Its arguments can be found in the function parse_hf_mlm_input located in cmd_args.py.
An example of running the script is:
python hf_train_mlm.py \
--logdir logs/hf-mlm/ \
--config-path configurations/hf-mlm-example-config.ini \
--savedir objects/HF-AG-RoBERTa/ \
--plot-savepath plots/hf-mlm.png \
--seed 3407
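The examples above pass either an integer (3407) or the word random to --seed. How cmd_args.py handles this is not shown here; the sketch below is one possible way to accept both forms with argparse. The helper names `seed_arg` and `build_parser` are made up for this example.

```python
import argparse
import random


def seed_arg(value):
    """argparse type: accept an integer or the literal string 'random'."""
    if value == "random":
        # Draw a fresh seed so that it can still be logged and reused later.
        return random.SystemRandom().randrange(2 ** 32)
    try:
        return int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected an integer or 'random', got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Seed handling sketch")
    # String defaults are passed through `type`, so omitting --seed
    # also yields a freshly drawn integer.
    parser.add_argument("--seed", type=seed_arg, default="random",
                        help="integer seed, or 'random' to draw one")
    return parser


args = build_parser().parse_args(["--seed", "3407"])
random.seed(args.seed)  # a training script would also seed its ML framework
print(args.seed)
```

Resolving random to a concrete integer up front means the value can be written to the logs, which keeps a run reproducible after the fact.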
Once an LM has been pre-trained, its performance can be evaluated by fine-tuning it on a downstream task and assessing the results. In this repo, the chosen downstream task is Part-of-Speech (PoS) Tagging.
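One practical detail of PoS Tagging with a subword tokenizer is that tags are given per word, while the model predicts per token. A common convention, sketched below, is to label only the first subword of each word and to ignore the rest in the loss. Whether pos_model.py follows exactly this convention is an assumption; the function name `align_labels` is hypothetical, and the `word_ids` format mirrors what the fast tokenizers of the transformers library return from `word_ids()`.

```python
# Positions labelled -100 are skipped by PyTorch's cross-entropy loss by default.
IGNORE_INDEX = -100


def align_labels(word_ids, word_labels):
    """Expand per-word labels to per-token labels.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Three words split into 2, 1 and 3 subwords, wrapped in special tokens.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [3, 1, 7]  # encoded PoS tags, one per word
print(align_labels(word_ids, word_labels))
# → [-100, 3, -100, 1, 7, -100, -100, -100]
```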
The PL model for PoS Tagging is implemented in the class PoSRoBERTa which is located in pos_model.py. To fine-tune the LM with PL, the script pl_train_pos.py can be used. Its arguments can be found in the function parse_pl_pos_input located in cmd_args.py. An example of running the script is:
python pl_train_pos.py \
--logdir logs/pl-pos/ \
--config-path configurations/pl-pos-example-config.ini \
--pre-trained-model objects/PL-AG-RoBERTa/ \
--savedir objects/PL-PoS-AG-RoBERTa \
--plot-savepath plots/pl-pos.png \
--confusion-matrix plots/pl-pos-cm.png \
--device cuda \
--distributed
To fine-tune the LM on PoS Tagging with the Trainer API from huggingface, the script hf_train_pos.py can be used. Its arguments can be found in the function parse_hf_pos_input located in cmd_args.py. An example of running the script is:
python hf_train_pos.py \
--logdir logs/hf-pos/ \
--config-path configurations/hf-pos-example-config.ini \
--pre-trained-model objects/HF-AG-RoBERTa/ \
--savedir objects/HF-PoS-AG-RoBERTa \
--plot-savepath plots/hf-pos.png \
--confusion-matrix plots/hf-pos-cm.png \
--seed 3
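The --confusion-matrix argument stores a plot of the tagging errors. As a minimal sketch of what such a plot is built from, the code below counts (true tag, predicted tag) pairs in pure Python and derives the accuracy from the diagonal. The function names and the example tags are illustrative only and do not come from the repo.

```python
def confusion_matrix(true_tags, pred_tags, tagset):
    """Rows are true tags, columns are predicted tags."""
    index = {tag: i for i, tag in enumerate(tagset)}
    matrix = [[0] * len(tagset) for _ in tagset]
    for true, pred in zip(true_tags, pred_tags):
        matrix[index[true]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of tokens on the diagonal, i.e. tagged correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical gold and predicted tags for five tokens.
tagset = ["NOUN", "VERB", "ADJ"]
true_tags = ["NOUN", "VERB", "NOUN", "ADJ", "VERB"]
pred_tags = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN"]

matrix = confusion_matrix(true_tags, pred_tags, tagset)
print(matrix)            # → [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
print(accuracy(matrix))  # → 0.6
```

Off-diagonal cells show which tags get confused with each other, for instance the NOUN token predicted as ADJ in the first row.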
Hyperparameter tuning that minimizes the MLM validation loss during pre-training has also been implemented for both frameworks, using the HyperOpt library. The search spaces can be found in the files pl_tune_mlm.py and hf_tune_mlm.py. The only command line argument these scripts accept is the maximum number of evaluations to perform, which defaults to 100 if not provided. An example of running the scripts is:
- PyTorch Lightning
python pl_tune_mlm.py --max-evals 150
- Huggingface
python hf_tune_mlm.py
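The overall shape of such a tuning loop can be sketched without any dependencies. Note that the sketch below uses plain random search as a stand-in, not HyperOpt and not its TPE algorithm. The search space and the objective are hypothetical; a real objective would pre-train the model and return its MLM validation loss.

```python
import argparse
import math
import random


def sample_params(rng):
    """Draw one configuration from a hypothetical search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -3),  # log-uniform in [1e-5, 1e-3]
        "batch_size": rng.choice([16, 32, 64]),
        "weight_decay": rng.uniform(0.0, 0.1),
    }


def validation_loss(params):
    """Stand-in objective with a known optimum near learning_rate = 1e-4."""
    return (math.log10(params["learning_rate"]) + 4) ** 2 + params["weight_decay"]


def tune(max_evals, seed=0):
    """Evaluate `max_evals` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = sample_params(rng)
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Random-search sketch")
    parser.add_argument("--max-evals", type=int, default=100,
                        help="maximum number of evaluations to perform")
    return parser.parse_args(argv)


args = parse_args([])  # no flag given, so the default of 100 applies
best_params, best_loss = tune(args.max_evals)
print(best_params, best_loss)
```

Compared with random search, a model-based method such as the one HyperOpt provides uses the results of earlier evaluations to decide where to sample next.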