
Automatic Abstractive Summarisation in Danish

Data Science Exam - MSc Cognitive Science at Aarhus University - Spring 2022
Ida Bang Hansen, Sara Kolding & Katrine Nymann
Access our model through Hugging Face

About The Project

This repository contains the code for building an automatic abstractive summarisation tool for Danish. We fine-tuned an mT5 model, with its vocabulary pruned for Danish, on an abstractive subset of the DaNewsroom dataset.
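
The sketch below outlines this kind of fine-tuning setup with the Hugging Face transformers Trainer. It is an illustration rather than our exact training script: google/mt5-small stands in for the pruned checkpoint, and the toy dataset for the cleaned DaNewsroom subset.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                              DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    # "google/mt5-small" stands in for the vocabulary-pruned mT5 checkpoint.
    checkpoint = "google/mt5-small"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Toy stand-in for the abstractive DaNewsroom subset (article/summary pairs).
    data = Dataset.from_dict({
        "text": ["En lang dansk nyhedsartikel ..."],
        "summary": ["Et kort abstraktivt resumé ..."],
    })

    def tokenise(batch):
        # Tokenise articles as inputs and summaries as labels.
        inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train_dataset = data.map(tokenise, batched=True, remove_columns=data.column_names)

    args = Seq2SeqTrainingArguments(
        output_dir="mt5-danewsroom",  # hypothetical output directory
        learning_rate=5e-5,
        num_train_epochs=1,
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()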

The model can be used to summarise individual news articles via this notebook, or through the Hugging Face API.
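
For example, a summary can be generated with the transformers pipeline API. The model id below is a placeholder; substitute the id from the Hugging Face link above.

    from transformers import pipeline

    # "<huggingface-model-id>" is a placeholder for our published model id.
    summariser = pipeline("summarization", model="<huggingface-model-id>")

    article = "En dansk nyhedsartikel ..."  # full article text
    result = summariser(article, max_length=128, min_length=15, truncation=True)
    print(result[0]["summary_text"])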

Abstract

Automatic abstractive text summarisation is a challenging task in natural language processing. This paper develops and refines our previous work on domain-specific automatic summarisation of Danish news articles. We extend that work by cleaning the data, pruning the vocabulary of a multilingual model, improving the parameter tuning and model selection, and evaluating results with additional metrics. We fine-tune a pruned mT5 model on a cleaned subset of the DaNewsroom dataset consisting of abstractive summary-article pairs. The resulting model is evaluated quantitatively using ROUGE, BERTScore and density measures, and qualitatively by comparing the generated summaries to our previous work. We find that although the model refinements improve quantitative and qualitative performance, the model remains prone to hallucinations, and the resulting ROUGE scores are in the lower range of comparable abstractive summarisation efforts in other languages. A discussion of the limitations of current evaluation methods for automatic abstractive summarisation underlines the need for improved metrics and transparency within the field. Future work could employ methods for detecting and reducing hallucinations in model output, and explore reference-less evaluation of summaries.

Keywords: automatic summarisation, transformers, Danish, natural language processing

Model performance

The quantitative results (mean F1 scores) for our model-generated summaries:

Metric      Result
BERTScore    71.41
ROUGE-1      23.10
ROUGE-2       7.53
ROUGE-L      18.52
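
Scores of this kind can be computed with the Hugging Face evaluate library; the sketch below illustrates the metrics, not necessarily our exact evaluation configuration.

    import evaluate

    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")

    predictions = ["genereret resumé ..."]  # model-generated summaries
    references = ["reference-resumé ..."]   # gold summaries from DaNewsroom

    rouge_scores = rouge.compute(predictions=predictions, references=references)
    bert_scores = bertscore.compute(predictions=predictions,
                                    references=references, lang="da")

    print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
    print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1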

Get started

  • The DaNewsroom dataset can be accessed upon request (https://github.com/danielvarab/da-newsroom)
  • Clone the repo and enter its directory
    git clone https://github.com/idabh/data-science-exam
    cd data-science-exam
  • Install the required modules
    pip install -r requirements.txt

Contact

Ida Bang Hansen - [email protected]
Sara Kolding - [email protected]
Katrine Nymann - [email protected]

Acknowledgments

  • Thank you to Daniel Varab for providing us with access to DaNewsroom
  • DAT5 icon created with OpenAI's DALL-E 2
