CANarEx pipeline

Runs on Linux and macOS using Python 3.9.5

Factiva and Hansard 'First Nations' dataset

  • CaNarEx environment

     cd CaNarEx
     python3 -m venv venv_canarex
     source venv_canarex/bin/activate
     pip install -r requirements.txt
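
Before running the steps below, it can help to confirm that the virtual environment is active and that the interpreter matches the Python version stated above. The following is a minimal sketch using only the standard library; the function names `in_virtualenv` and `matches_expected_python` are hypothetical helpers and are not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def matches_expected_python(expected=(3, 9)) -> bool:
    """Return True when the major.minor version equals `expected`."""
    return sys.version_info[:2] == tuple(expected)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("running Python 3.9:", matches_expected_python())
```

Running this after `source venv_canarex/bin/activate` should report an active virtual environment; a different minor version is only a warning sign, since the README states 3.9.5 as the tested version.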

Step1: Split data into sentences

  • Use CaNarEx environment
  • Run 1.split_sentences_trf.py (data already provided):
        python 1.split_sentences_trf.py
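
To illustrate the input and output shape of this step, here is a rough sketch of sentence splitting. It is a stand-in only: it uses a simple regular expression rather than whatever model 1.split_sentences_trf.py actually uses, and the function name `split_sentences` is an assumption made for this example.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an uppercase
# letter or an opening quote. This is a heuristic stand-in, not the
# method used by the pipeline script.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    """Return a list of sentence strings found in `text`."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("The bill passed. Was it fair?"):
        print(sentence)
```

A heuristic like this splits incorrectly after abbreviations such as "Mr.", which is one reason a trained sentence segmenter is usually preferred for real data.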

Step2: Coreference resolution

Using SpanBERT

  • Download https://github.com/mandarjoshi90/coref and follow the installation instructions from "Jonathan K. Kummerfeld's notebook" (using the 'spanbert_base' model), installing into the coref_env environment
  • Install the following packages into coref_env:
        pip install tokenization
        pip install sacremoses
  • Run coreference resolution:
        python 2.coref_bert.py
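
As an illustration of what coreference resolution contributes to later steps, the sketch below rewrites a token list so that every later mention in a cluster is replaced by the cluster's first mention. The cluster format (lists of inclusive `(start, end)` token spans) and the replace-with-first-mention rule are assumptions made for this example; 2.coref_bert.py may represent and apply clusters differently.

```python
def resolve_coreferences(tokens, clusters):
    """Replace later mentions in each cluster with the first mention.

    tokens:   list of token strings
    clusters: list of clusters, each a list of (start, end) token spans
              with inclusive end indices (assumed format)
    """
    replacements = {}
    for cluster in clusters:
        mentions = sorted(cluster)
        head_start, head_end = mentions[0]
        head = tokens[head_start:head_end + 1]
        for start, end in mentions[1:]:
            replacements[start] = (end, head)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)  # substitute the representative mention
            i = end + 1            # skip the tokens of the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


if __name__ == "__main__":
    tokens = "The minister spoke . She promised funding .".split()
    print(" ".join(resolve_coreferences(tokens, [[(0, 1), (4, 4)]])))
```

This sketch does not handle nested or overlapping mentions; those would need an explicit policy, such as preferring the longest span.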

Step3: SRL extraction

  • Use CaNarEx environment
  • Run 3.run_canarex.py:
        python 3.run_canarex.py
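
To make the idea of SRL-based extraction concrete, the sketch below groups BIO-tagged tokens into semantic roles and builds an (agent, verb, patient) triple from them. The tag names (`ARG0`, `V`, `ARG1`), the triple layout, and the function names are assumptions chosen for illustration; the output format of 3.run_canarex.py is not described here and may differ.

```python
def roles_from_bio(tokens, tags):
    """Group BIO-tagged tokens into a {role: text} mapping.

    A role that appears more than once keeps only its last span.
    """
    roles = {}
    current = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            roles[current] = [token]
        elif tag.startswith("I-") and current == tag[2:]:
            roles[current].append(token)
        else:
            current = None  # an "O" tag or an inconsistent I- tag ends the span
    return {role: " ".join(words) for role, words in roles.items()}


def to_narrative(roles):
    """Return an (agent, verb, patient) triple, or None without a verb."""
    if "V" not in roles:
        return None
    return (roles.get("ARG0"), roles["V"], roles.get("ARG1"))


if __name__ == "__main__":
    tokens = ["The", "government", "signed", "the", "treaty"]
    tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1"]
    print(to_narrative(roles_from_bio(tokens, tags)))
```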

Step4: Filtering narratives

TopN clustering (document level clustering) and Textrank clustering

    python 4.clustering.py
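
For readers unfamiliar with TextRank, the sketch below ranks short narrative strings by running a PageRank-style iteration over a similarity graph and keeps the top N. It is a generic illustration only: Jaccard word overlap is used as a stand-in similarity, and the function names and parameters are hypothetical. How 4.clustering.py combines TopN and TextRank is not specified here.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def textrank(items, damping=0.85, iterations=50):
    """Return one centrality score per item via power iteration."""
    n = len(items)
    if n == 0:
        return []
    weights = [
        [jaccard(items[i], items[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score that is
            # proportional to the weight of the edge j -> i.
            rank = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def top_n(items, n):
    """Return the n highest-scoring items, best first."""
    scores = textrank(items)
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in ranked[:n]]
```

Items that share vocabulary with many others rise to the top, while isolated items receive only the baseline score, which is the property that makes this kind of ranking useful for filtering.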

Evaluation

The evaluation folder uses a Jupyter notebook to generate synthetic test data for narrative time-series clustering.
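
As a rough idea of what synthetic test data for time-series clustering can look like, the sketch below draws noisy copies of a few base patterns and labels each copy with the index of its pattern, so a clustering result can be checked against known labels. This is an assumption-laden illustration, not the notebook's method; the function name, parameters, and noise model are all hypothetical.

```python
import random


def make_synthetic_series(patterns, per_pattern=3, noise=0.1, seed=0):
    """Return (label, series) pairs: noisy copies of each base pattern.

    patterns:    list of base time series (lists of numbers)
    per_pattern: number of noisy copies generated per pattern
    noise:       standard deviation of the added Gaussian noise
    seed:        seed for a private random generator, for reproducibility
    """
    rng = random.Random(seed)
    data = []
    for label, pattern in enumerate(patterns):
        for _ in range(per_pattern):
            # Clip at zero so the series can be read as counts or frequencies.
            series = [max(0.0, value + rng.gauss(0.0, noise)) for value in pattern]
            data.append((label, series))
    return data


if __name__ == "__main__":
    rising, falling = [0, 1, 2, 3], [3, 2, 1, 0]
    for label, series in make_synthetic_series([rising, falling], per_pattern=2):
        print(label, [round(value, 2) for value in series])
```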

Reference (Baseline: Relatio)

    python 5.run_relatio.py

References