
Automatic Abstractive Summarisation in Danish

Data Science Exam - MSc Cognitive Science at Aarhus University - Spring 2022
Ida Bang Hansen, Sara Kolding & Katrine Nymann
Access our model through Hugging Face

About The Project

This repository contains the code for building an automatic abstractive summarisation tool for Danish. We fine-tuned an mT5 model, with its vocabulary pruned for Danish, on an abstractive subset of the DaNewsroom dataset.
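
The sketch below outlines this kind of fine-tuning setup with the Hugging Face transformers Trainer. It is an illustration rather than our exact training script: google/mt5-small stands in for the pruned checkpoint, and the toy dataset for the cleaned DaNewsroom subset.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                              DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    # "google/mt5-small" stands in for the vocabulary-pruned mT5 checkpoint.
    checkpoint = "google/mt5-small"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Toy stand-in for the abstractive DaNewsroom subset (article/summary pairs).
    data = Dataset.from_dict({
        "text": ["En lang dansk nyhedsartikel ..."],
        "summary": ["Et kort abstraktivt resumé ..."],
    })

    def tokenise(batch):
        # Tokenise articles as inputs and summaries as labels.
        inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train_dataset = data.map(tokenise, batched=True, remove_columns=data.column_names)

    args = Seq2SeqTrainingArguments(
        output_dir="mt5-danewsroom",  # hypothetical output directory
        learning_rate=5e-5,
        num_train_epochs=1,
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()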

The model can be used to summarise individual news articles via this notebook, or through the Hugging Face API.
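
For example, a summary can be generated with the transformers pipeline API. The model id below is a placeholder; substitute the id from the Hugging Face link above.

    from transformers import pipeline

    # "<huggingface-model-id>" is a placeholder for our published model id.
    summariser = pipeline("summarization", model="<huggingface-model-id>")

    article = "En dansk nyhedsartikel ..."  # full article text
    result = summariser(article, max_length=128, min_length=15, truncation=True)
    print(result[0]["summary_text"])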

Abstract

Automatic abstractive text summarisation is a challenging task in natural language processing. This paper develops and refines our previous work on domain-specific automatic summarisation of Danish news articles. We extend that work by cleaning the data, pruning the vocabulary of a multilingual model, improving the parameter tuning and model selection, and evaluating results with additional metrics. We fine-tune a pruned mT5 model on a cleaned subset of the DaNewsroom dataset consisting of abstractive summary-article pairs. The resulting model is evaluated quantitatively using ROUGE, BERTScore and density measures, and qualitatively by comparing the generated summaries to our previous work. We find that although the model refinements improve quantitative and qualitative performance, the model remains prone to hallucinations, and the resulting ROUGE scores are in the lower range of comparable abstractive summarisation efforts in other languages. A discussion of the limitations of current evaluation methods for automatic abstractive summarisation underlines the need for improved metrics and transparency within the field. Future work could employ methods for detecting and reducing hallucinations in model output, and explore reference-less evaluation of summaries.

Keywords: automatic summarisation, transformers, Danish, natural language processing

Model performance

The quantitative results (mean F1 scores) for our model-generated summaries:

Metric      Result
BERTScore    71.41
ROUGE-1      23.10
ROUGE-2       7.53
ROUGE-L      18.52
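
Scores of this kind can be computed with the Hugging Face evaluate library; the sketch below illustrates the metrics, not necessarily our exact evaluation configuration.

    import evaluate

    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")

    predictions = ["genereret resumé ..."]  # model-generated summaries
    references = ["reference-resumé ..."]   # gold summaries from DaNewsroom

    rouge_scores = rouge.compute(predictions=predictions, references=references)
    bert_scores = bertscore.compute(predictions=predictions,
                                    references=references, lang="da")

    print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
    print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1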

Get started

  • The DaNewsroom dataset can be accessed upon request (https://github.com/danielvarab/da-newsroom)
  • Clone the repo and enter its directory
    git clone https://github.com/idabh/data-science-exam
    cd data-science-exam
  • Install the required modules
    pip install -r requirements.txt

Contact

Ida Bang Hansen - [email protected]
Sara Kolding - [email protected]
Katrine Nymann - [email protected]

Acknowledgments

  • Thank you to Daniel Varab for providing us with access to DaNewsroom
  • DAT5 icon created with OpenAI's DALL-E 2
