LANTERN: A Language ANnotation Tool to undERstand Narratives
This repository contains a pipeline for computational narrative analysis assisted by Large Language Models (LLMs).
LANTERN can preprocess, annotate, and analyse entire collections of books, and identify which parts of a book express the following types of narrative information:
- Events, i.e., all that happens in the narrative world;
- Subjective experiences, i.e., all that happens within a character, such as memories, emotions, and perceptions;
- Contextual information, i.e., additional details that contextualize the story, such as characters' relationships or sceneries.
Clone this repository and install the required dependencies.
$ gh repo clone cltl/event-detection-tool
$ pip3 install -r requirements.txt
Download Meta-Llama-3-8B-Instruct-GGUF and store it in ./llms/.
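If you prefer to script the download, here is a minimal sketch using huggingface_hub; the repo_id and filename are assumptions (community GGUF mirrors and quantization levels vary), so point them at whichever Meta-Llama-3-8B-Instruct-GGUF build you actually use.

```python
# Sketch: fetch a quantized Llama-3 GGUF into ./llms/ with huggingface_hub.
# repo_id and filename are assumptions; adjust them to your preferred GGUF build.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",   # assumed mirror
    filename="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",        # assumed quantization
    local_dir="./llms",
)
```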
1. Preprocess the book to split it into paragraphs, sentences, and clauses.
python3 scripts/preprocess/preprocess_book.py --paragraphs --sentences --clauses
2. Annotate each clause with one of three types of information: events, subjective experiences, or contextual information.
python3 scripts/annotate/tag.py
This step will produce corpus.tsv in the output folder, where each row corresponds to an annotated clause.
If you prefer to annotate sentences, run
python3 scripts/annotate/tag.py --sentences
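For a quick sanity check of the annotation output, here is a minimal sketch with pandas; the output path and the column names ("clause", "label") are assumptions, so adjust them to the actual header of your corpus.tsv.

```python
# Sketch: inspect the annotated clauses in corpus.tsv.
# The path and the column names ("clause", "label") are assumptions.
import pandas as pd

corpus = pd.read_csv("output/corpus.tsv", sep="\t")
print(corpus.head())                   # first few annotated clauses
print(corpus["label"].value_counts())  # events vs. subjective experiences vs. contextual information
```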
3. Analyse stories to observe their structure in terms of sequences of events, subjective experiences, and contextual information. Run
python3 scripts/annotate/tag.py --clauses
if you want to analyse how clauses have been annotated, or
python3 scripts/annotate/tag.py --sentences
to do the same at the level of sentences.
This step visualizes
- the distribution of events, subjective experiences, and contextual information in the book,
- their frequency across chapters and book chunks,
- their entropy.
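To give an idea of how such statistics can be derived directly from corpus.tsv, here is a minimal sketch that computes the per-chapter Shannon entropy of the labels; the column names ("chapter_id", "label") are assumptions and should be matched to your output file.

```python
# Sketch: per-chapter entropy of the label distribution in corpus.tsv.
# Column names ("chapter_id", "label") are assumptions.
import numpy as np
import pandas as pd

corpus = pd.read_csv("output/corpus.tsv", sep="\t")

def shannon_entropy(labels: pd.Series) -> float:
    # Entropy (in bits) of the label distribution within one group.
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

per_chapter = corpus.groupby("chapter_id")["label"].agg(shannon_entropy)
print(per_chapter)  # one entropy value per chapter
```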
Here is an example of the frequency of the three labels in the book Max Havelaar, annotated at the clause level with OpenAI gpt-4-1106-preview.
Right now, LANTERN runs on Max Havelaar by Multatuli and Nooit meer slapen by Hermans, and it uses a quantized version of Llama-3 for clause splitting and annotation. But you can apply this pipeline to different books (in either English or Dutch) and with other LLMs.
NOTE: For copyright reasons, we make available only the results obtained on Hermans' book, and not the book itself.
Using a different LLM is possible, as long as it is supported by llama-cpp.
Store your LLM in the folder ./llms/, and specify its name in config.ini. In config.ini, you can also change the system and user prompts.
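Before wiring a new model into config.ini, you may want to verify that the GGUF file loads under llama-cpp at all. Here is a minimal sketch with llama-cpp-python; the filename and the prompt are placeholders, not LANTERN's actual prompts.

```python
# Sketch: sanity-check that a GGUF model in ./llms/ loads under llama-cpp-python.
# The filename and the prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./llms/your-model.gguf", n_ctx=4096, verbose=False)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Label this clause: 'He opened the door.'"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```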
To run LANTERN on a different book:
- Write the book title and language in config.ini.
- Specify the URL to the .txt of your book in config.ini, for instance

  [book]
  title = "Max Havelaar"
  path = "https://www.gutenberg.org/cache/epub/11024/pg11024.txt"

  If you already have a file containing your book, put it in ./inputs/, and specify its location/name in config.ini. The file can either be:
  - a .txt file
  - a .tsv file where each row contains a paragraph, with the following columns:

    | Column Name | Description |
    | --- | --- |
    | paragraph_id | Integer identifying a paragraph. |
    | chapter_id | Integer indicating the unique identifier for each chapter. |
    | paragraphs | The actual text content of each paragraph. |
  - a .tsv file where each row is a sentence in the book, with the following columns:

    | Column Name | Description |
    | --- | --- |
    | sentence_id | Unique identifier for each sentence. |
    | paragraph_id | Unique identifier for each paragraph. |
    | chapter_id | Unique identifier for each chapter. |
    | sentences | The actual text content of each sentence. |
- You're ready to follow these steps.
Note: if you already have the file containing paragraphs, you can preprocess the book by running
python3 scripts/preprocess/preprocess_book.py --sentences --clauses
If you already have the sentences .tsv file, you can just run
python3 scripts/preprocess/preprocess_book.py --clauses
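If you still need to produce a paragraphs .tsv in the format described above, here is a minimal sketch with pandas; the blank-line paragraph split, the constant chapter_id, and the filenames are simplifying assumptions to adapt to your book.

```python
# Sketch: build a paragraphs .tsv with the columns described above
# (paragraph_id, chapter_id, paragraphs). The blank-line split and the
# constant chapter_id are simplifying assumptions.
import pandas as pd

with open("inputs/my_book.txt", encoding="utf-8") as f:
    text = f.read()

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
df = pd.DataFrame({
    "paragraph_id": range(len(paragraphs)),
    "chapter_id": [0] * len(paragraphs),  # replace with real chapter boundaries
    "paragraphs": paragraphs,
})
df.to_csv("inputs/my_book_paragraphs.tsv", sep="\t", index=False)
```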
This tool was created in collaboration with the CLARIAH consortium.
Check out CLAUSE-ATLAS, the corpus we constructed using the LANTERN pipeline, and the corresponding analyses in this publication.