An NLP pipeline for extracting information related to characters in fanfiction in English. For each input fanfiction story (or other document), the pipeline produces a list of characters. For each character, the pipeline produces:
- mentions of the character in the text (such as pronouns that refer to the character)
- quotes attributed to that character
- "assertions", or any non-quote span, that contains character mentions. These assertions include description and narration involving that character.
More information on the pipeline is available in the paper cited below. If you use it academically, please cite this work:
Michael Miller Yoder, Sopan Khosla, Qinlan Shen, Aakanksha Naik, Huiming Jin, Hariharan Muralidharan, and Carolyn P Rosé. 2021. FanfictionNLP: A Text Processing Pipeline for Fanfiction. In Proceedings of the 3rd Workshop on Narrative Understanding, pages 13–23.
Contact Michael Miller Yoder <mmyoder [at] pitt.edu> with any questions.
This pipeline processes a directory of fanfiction files and extracts text that is relevant for each character.
The pipeline performs the following steps:
- Character coreference
- Character feature extraction
- Quote attribution
- Assertion attribution (narrative and evaluation about a character)
The pipeline is written in Python 3. Dependencies are listed below. Sorry about there being so many! We are planning on trimming this down.
- scipy
- pandas
- scikit-learn
- nltk
- spacy
- inflect
- pytorch
- HuggingFace transformers. Install with pip, since the conda version may raise a GLibC error when running the pipeline
- benepar (available with pip, not conda)
- protobuf
- pyhocon (available with pip, not conda)
A conda environment file that lists these dependencies with tested version numbers is at `environment.yml`. A new environment with these dependencies can be created with `conda env create -n fanfiction-nlp --file environment.yml` and then activated with `conda activate fanfiction-nlp`.
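As a quick sanity check after creating the environment, a small script along these lines can report which of the listed dependencies cannot be found. This is only a sketch: the import names are assumptions based on how these packages are usually imported (for example, scikit-learn as `sklearn` and pytorch as `torch`), and `find_missing` is a hypothetical helper, not part of the pipeline.

```python
import importlib.util

# Import names for the dependencies listed above (assumed; some differ from
# the package names: scikit-learn -> sklearn, pytorch -> torch,
# protobuf -> google.protobuf).
REQUIRED_MODULES = [
    "scipy", "pandas", "sklearn", "nltk", "spacy", "inflect",
    "torch", "transformers", "benepar", "google.protobuf", "pyhocon",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not verify that the installed versions match those in `environment.yml`.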
Some additional data and model files are also required:
- spacy's `en_core_web_sm` model. This can be downloaded with `python -m spacy download en_core_web_sm`.
- The `wordnet`, `stopwords`, and `punkt` packages from nltk. These can be downloaded with `python -m nltk.downloader wordnet stopwords punkt`.
- An English parsing model for benepar. Download this through a Python interpreter (or see the benepar instructions):
import benepar
benepar.download('benepar_en3')
To run the SpanBERT-based coreference, a 534 MB model file is required, which is unfortunately too big for GitHub's file size limit. That file is available from https://cmu.box.com/s/leg9pkato6gtv9afg6e7tz9auwya2h3n. Please download it and place it in a new directory called `model` in the `spanbert_coref` directory.
To test that everything is set up properly, run `python run.py example.cfg`, which by default will run the pipeline on a test story in the `example_fandom` directory.
This will take ~2 GB of RAM to run.
The output should be placed in a new directory, `output/example_fandom`. This output should be the same as that provided in `output_test/example_fandom`.
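One way to check that the produced output matches the provided reference is to compare the two directories file by file. The sketch below uses Python's standard `filecmp` module; `differing_files` is a hypothetical helper, and it assumes the output is expected to match exactly, which may not hold if library versions differ from the tested ones.

```python
import filecmp
from pathlib import Path


def differing_files(produced, expected, prefix=""):
    """Recursively list relative paths that differ or exist on only one side."""
    cmp = filecmp.dircmp(str(produced), str(expected))
    names = cmp.diff_files + cmp.left_only + cmp.right_only + cmp.funny_files
    problems = [prefix + name for name in names]
    for sub in cmp.common_dirs:
        problems += differing_files(
            Path(produced) / sub, Path(expected) / sub, prefix + sub + "/"
        )
    return sorted(problems)


if __name__ == "__main__":
    # Directory names taken from the test instructions above.
    diffs = differing_files("output/example_fandom", "output_test/example_fandom")
    print("Outputs match" if not diffs else "Differences: " + ", ".join(diffs))
```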
The input is a path to a directory of fanfiction story CSV files.
If your input is raw text, you'll need to format it like the examples in the `example_fandom` directory. Eventually we'll support raw text file input.
Columns needed in the input are: `fic_id`, `chapter_id`, `para_id`, `text`, `text_tokenized`
Please tokenize text (split into words) before running it through the pipeline and include this as a final column, `text_tokenized`. We are working on including this as an option.
A script, `tokenize_fics.py`, is included for convenience, though this will require modification to work with your input.
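To make the expected input shape concrete, here is a minimal sketch that turns raw chapter text into rows with the five required columns. Several things here are assumptions rather than the pipeline's own behavior: paragraphs are split on blank lines, ids start at 1, and the regex tokenizer is a naive stand-in (it splits contractions such as "don't" into three tokens) and is not what `tokenize_fics.py` does. Check the files in `example_fandom` for the authoritative format.

```python
import csv
import re

# Column names from the list above.
COLUMNS = ["fic_id", "chapter_id", "para_id", "text", "text_tokenized"]


def tokenize(text):
    """Naive tokenizer (assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def story_to_rows(fic_id, chapters):
    """Turn a list of chapter strings into one row per paragraph."""
    rows = []
    for chapter_id, chapter in enumerate(chapters, start=1):
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", chapter) if p.strip()]
        for para_id, para in enumerate(paragraphs, start=1):
            rows.append({
                "fic_id": fic_id,
                "chapter_id": chapter_id,
                "para_id": para_id,
                "text": para,
                # Tokens joined by single spaces.
                "text_tokenized": " ".join(tokenize(para)),
            })
    return rows


def write_story_csv(path, rows):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_story_csv("my_fandom/1.csv", story_to_rows(1, [chapter_text]))` would write one story to a hypothetical `my_fandom` input directory.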
The pipeline uses quite a bit of RAM, mostly depending on the length of the input. Running it on stories longer than 5000 words is not recommended; a 5000-word story can use ~20 GB of RAM.
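Before a long run, it may help to flag stories that exceed the recommended length. The sketch below counts whitespace-separated tokens in the `text_tokenized` column; it assumes each CSV file in the input directory holds a single story, and `story_word_count` and `long_stories` are hypothetical helpers.

```python
import csv
from pathlib import Path

MAX_WORDS = 5000  # recommended limit stated above


def story_word_count(csv_path):
    """Count whitespace-separated tokens in the text_tokenized column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(len(row["text_tokenized"].split()) for row in csv.DictReader(f))


def long_stories(input_dir, max_words=MAX_WORDS):
    """Map file name to word count for every CSV above the limit."""
    counts = {p.name: story_word_count(p) for p in sorted(Path(input_dir).glob("*.csv"))}
    return {name: n for name, n in counts.items() if n > max_words}
```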
- Character coreference: a directory with a JSON file for each processed fic. The JSON file has the following keys:
  - `document`: string of tokenized text (space between each token)
  - `clusters`: a list of character clusters, each with the fields:
    - `name`
    - `mentions`: a list of mentions, each a dictionary with keys:
      - `position`: [start_token_id, end_token_id+1]. The position of the start token of the mention in the `document` list (inclusive), and the position 1 after the last token in the mention in the `document` list.
      - `text`: The text of the mention
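The coreference output described above can be read back with a few lines of Python. In this sketch, the example data is made up for illustration, and it is an assumption that splitting `document` on whitespace yields the token list that `position` indexes into; `load_coref` and `mention_spans` are hypothetical helpers.

```python
import json


def load_coref(path):
    """Load one coreference JSON file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def mention_spans(coref):
    """Yield (cluster name, tokens sliced by position, stored mention text)."""
    tokens = coref["document"].split()
    for cluster in coref["clusters"]:
        for mention in cluster["mentions"]:
            start, end = mention["position"]  # end is exclusive
            yield cluster["name"], " ".join(tokens[start:end]), mention["text"]


# Made-up example in the documented shape.
example = {
    "document": "Harry smiled . He waved .",
    "clusters": [
        {
            "name": "Harry",
            "mentions": [
                {"position": [0, 1], "text": "Harry"},
                {"position": [3, 4], "text": "He"},
            ],
        }
    ],
}

for name, sliced, text in mention_spans(example):
    print(name, sliced, text)
```

Because the end position is one past the last token, the pair can be used directly as a Python slice.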
- Quote attribution: a directory with a JSON file for each processed fic. Each JSON file has cluster-level character names as keys and a list of dictionaries as values, one for each quote spoken by the character:
  - `position`: [start_token_id, end_token_id+1]. The position of the start token of the quote in the coreference `document` list (inclusive), and the position 1 after the last token in the quote.
  - `text`: The text of the quote
- Assertion attribution: a directory with a JSON file for each processed fic. Each JSON file has cluster-level character names as keys and a list of dictionaries as values, one for each assertion (narrative or evaluation) about the character:
  - `position`: [start_token_id, end_token_id+1]. The position of the start token of the assertion in the `document` list (inclusive), and the position 1 after the last token in the assertion in the `document` list.
  - `text`: The text of the assertion

  Assertions are any text other than quotes that is relevant to seeing how a character is portrayed.
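Since the quote and assertion files share the same shape (character name mapped to a list of `position`/`text` dictionaries), they can be combined into a per-character view. The following is a sketch with made-up data; `load_attributions` and `character_profile` are hypothetical helpers, and file paths are left to the caller.

```python
import json


def load_attributions(path):
    """Load a quote or assertion JSON file: {character name: [span dicts]}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def character_profile(quotes, assertions):
    """Collect the quote and assertion texts for every character in either file."""
    names = sorted(set(quotes) | set(assertions))
    return {
        name: {
            "quotes": [q["text"] for q in quotes.get(name, [])],
            "assertions": [a["text"] for a in assertions.get(name, [])],
        }
        for name in names
    }


# Made-up data in the documented shape.
quotes = {"Harry": [{"position": [5, 9], "text": '" Hello ! "'}]}
assertions = {
    "Harry": [{"position": [0, 3], "text": "Harry smiled ."}],
    "Ron": [{"position": [9, 12], "text": "Ron waved ."}],
}

profile = character_profile(quotes, assertions)
print(profile["Ron"])
```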
The pipeline takes settings and input/output filepaths in a configuration file. An example config file is `example.cfg`. Descriptions of each configuration setting by section are as follows:
- `collection_name`: the name of the dataset (user-defined)
- `input_path`: path to the directory of input files
- `output_path`: path to the directory where processed files will be stored
- `run_coref`: (True or False) Whether to run character coreference
- `n_threads`: (integer) The number of threads (actually processes) to use for coreference
- `run_quote_attribution`: (True or False) Whether to run quote attribution
- `n_threads`: (integer) The number of threads (actually processes) to use for quote attribution
- `run_assertion_extraction`: (True or False) Whether to run assertion extraction
- `n_threads`: (integer) The number of threads (actually processes) to use for assertion extraction
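To illustrate how these settings fit together, the sketch below parses an INI-style configuration with Python's standard `configparser`. The setting names come from the list above, but the rest is assumed: that the file is INI-style, the section names (`io`, `coref`, `quotes`, `assertions`), and the `n_threads` values are all placeholders. Consult `example.cfg` for the real layout.

```python
import configparser

# Hypothetical config text: section names are placeholders, not the real ones.
EXAMPLE_CFG = """
[io]
collection_name = example_fandom
input_path = example_fandom
output_path = output/example_fandom

[coref]
run_coref = True
n_threads = 1

[quotes]
run_quote_attribution = True
n_threads = 1

[assertions]
run_assertion_extraction = False
n_threads = 1
"""


def read_settings(cfg_text):
    """Parse the config text into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "collection_name": parser.get("io", "collection_name"),
        "input_path": parser.get("io", "input_path"),
        "output_path": parser.get("io", "output_path"),
        "run_coref": parser.getboolean("coref", "run_coref"),
        "coref_processes": parser.getint("coref", "n_threads"),
        "run_quote_attribution": parser.getboolean("quotes", "run_quote_attribution"),
        "quote_processes": parser.getint("quotes", "n_threads"),
        "run_assertion_extraction": parser.getboolean(
            "assertions", "run_assertion_extraction"
        ),
        "assertion_processes": parser.getint("assertions", "n_threads"),
    }


settings = read_settings(EXAMPLE_CFG)
print(settings["output_path"], settings["run_coref"])
```

Because `n_threads` appears once per stage, each occurrence has to live in its own section so the values do not overwrite one another.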
Run the pipeline with `python run.py <config_file_path>`.
This pipeline was inspired by David Bamman's BookNLP.