Data Science Exam - MSc Cognitive Science at Aarhus University - Spring 2022
Ida Bang Hansen, Sara Kolding & Katrine Nymann
Access our model through Hugging Face.
This repository contains the code for creating an automatic abstractive summarisation tool in Danish. We fine-tuned a language-specific pruned mT5 model on an abstractive subset of the DaNewsroom dataset.
The model can be used to summarise individual news articles, either with this notebook or through the Hugging Face API.
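To give an idea of what pruning a multilingual vocabulary involves, here is a minimal sketch. It is not the code used in this repository: the function name `prune_vocabulary`, the toy vocabulary, and the plain-list embedding matrix are all hypothetical. The general idea is to keep only the tokens that occur in a target-language corpus (plus special tokens) and to keep the matching rows of the embedding matrix.

```python
def prune_vocabulary(vocab, embeddings, corpus_tokens,
                     always_keep=("<pad>", "</s>", "<unk>")):
    """Keep only tokens seen in the corpus, plus special tokens.

    vocab:         dict mapping token -> row index in `embeddings`
    embeddings:    list of embedding rows (one list of floats per token)
    corpus_tokens: iterable of tokens observed in the target-language corpus
    always_keep:   special tokens retained even if unseen (assumed names)
    """
    used = set(corpus_tokens) | set(always_keep)
    # Preserve the original ordering so special tokens keep low ids.
    kept = sorted((tok for tok in vocab if tok in used), key=vocab.get)
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    # Copy over the embedding row of every retained token.
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy example: a six-token "multilingual" vocabulary with 1-d embeddings.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "hus": 3, "house": 4, "bil": 5}
toy_embeddings = [[float(i)] for i in range(6)]
danish_tokens = ["hus", "bil", "hus"]

pruned_vocab, pruned_embeddings = prune_vocabulary(
    toy_vocab, toy_embeddings, danish_tokens
)
```

In this toy example the English token `house` is dropped and `bil` moves from id 5 to id 4. A real model would also need its tokenizer and output layer updated to match the new ids.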
Automatic abstractive text summarisation is a challenging task in the field of natural language processing. This paper aims to further develop and refine previous work by the authors in domain-specific automatic summarisation for Danish news articles. We extend that work by cleaning the data, pruning the vocabulary of a multilingual model, and improving the parameter tuning and model selection, as well as evaluating results using additional metrics.
We fine-tune a pruned mT5 model on a cleaned subset of the DaNewsroom dataset consisting of abstractive summary-article pairs. The resulting model is evaluated quantitatively using ROUGE, BERTScore and density measures, and qualitatively by comparing the generated summaries to our previous work. We find that although model refinements increase quantitative and qualitative performance, the model is prone to hallucinations, and the resulting ROUGE scores are in the lower range of comparable abstractive summarisation efforts in other languages. A discussion of the limitations of the current evaluation methods for automatic abstractive summarisation underlines the need for improved metrics and transparency within the field. Future work could employ methods for detecting and reducing hallucinations in model output, as well as methods for reference-less evaluation of summaries.
Key words: automatic summarisation, transformers, Danish, natural language processing
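The density measure used in the evaluation can be illustrated with a short sketch. It assumes the extractive-fragment definitions from the English Newsroom dataset (Grusky et al., 2018); whether this matches the exact implementation in this repository is an assumption. The function names and the whitespace tokenisation are illustrative only.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token spans the summary shares with the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0  # length of the longest match starting at summary position i
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # token does not occur in the article at all
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for a whitespace-tokenised article/summary pair."""
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    # Coverage: share of summary tokens that are copied from the article.
    coverage = sum(len(f) for f in fragments) / len(summary_tokens)
    # Density: average squared fragment length, which rewards long copied spans.
    density = sum(len(f) ** 2 for f in fragments) / len(summary_tokens)
    return coverage, density
```

A low density indicates that the summary copies few long spans from the article, which is the sense in which a summary-article pair counts as abstractive.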
These are the quantitative results (mean F1 scores) of our model-generated summaries:
| Metric | Mean F1 |
|---|---|
| BERTScore | 71.41 |
| ROUGE-1 | 23.10 |
| ROUGE-2 | 7.53 |
| ROUGE-L | 18.52 |
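For readers unfamiliar with ROUGE, the sketch below shows how a ROUGE-N F1 score is computed from n-gram overlap between a reference summary and a generated summary. It is a simplified illustration rather than the evaluation code behind the numbers above: it assumes lowercased whitespace tokenisation with no stemming, and the function names are hypothetical. It returns scores between 0 and 1, whereas the table values appear to be on a 0-100 scale.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F1 between a reference and a candidate summary."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n_f1(reference, candidate, n=1)
rouge_2 = rouge_n_f1(reference, candidate, n=2)
```

ROUGE-L is not covered by this sketch; it is based on the longest common subsequence rather than fixed-length n-grams.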
- The DaNewsroom dataset can be accessed upon request (https://github.com/danielvarab/da-newsroom)
- Clone the repo
git clone https://github.com/idabh/data-science-exam
- Install required modules
pip install -r requirements.txt
Ida Bang Hansen - [email protected]
Sara Kolding - [email protected]
Katrine Nymann - [email protected]
- Thank you to Daniel Varab for providing us with access to DaNewsroom
- DAT5 icon created with OpenAI's DALL-E 2