CANarEx pipeline

Runs on Linux and macOS with Python 3.9.5.

Factiva and Hansard 'First Nations' dataset

CANarEx environment

    cd CANarEx
    python3 -m venv venv_canarex
    source venv_canarex/bin/activate
    pip install -r requirements.txt
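Before running the steps, it can help to confirm that the interpreter is recent enough and that the virtual environment is actually active. The following is a minimal sketch and not part of the repository: the helper names `in_virtualenv` and `check_python` are hypothetical, and the (3, 9) minimum is an assumption based on the Python version stated above.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix points at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def check_python(minimum=(3, 9)) -> bool:
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python version OK:", check_python())
    print("Inside virtual environment:", in_virtualenv())
```

If the second line prints `False`, the `source venv_canarex/bin/activate` step was probably skipped in the current shell.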

Step 1: Split data into sentences

Activate the CANarEx environment, then run 1.split_sentences_trf.py (the input data is already provided):
```
    python 1.split_sentences_trf.py
```
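To illustrate what this step produces, here is a minimal sketch of sentence splitting. The script name suggests a transformer-based splitter, but the source does not say which one, so the sketch swaps in a naive regular-expression splitter from the standard library. The function name `split_sentences` and the output fields `doc_id`, `sent_id`, and `sentence` are assumptions, not the script's actual schema.

```python
import re

# Naive boundary rule (an assumption, not the pipeline's model): split after
# '.', '!' or '?' when followed by whitespace and an uppercase letter or quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def split_sentences(doc_id, text):
    """Split one document into sentence records that keep the document id."""
    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
    return [
        {"doc_id": doc_id, "sent_id": i, "sentence": s}
        for i, s in enumerate(sentences)
    ]


if __name__ == "__main__":
    text = "The bill passed. Members debated it for hours! Was it fair?"
    for record in split_sentences("d1", text):
        print(record)
```

Keeping the document id on every sentence record lets later steps trace each extracted narrative back to its source document. A regex like this mis-splits abbreviations such as "Dr. Smith", which is one reason a trained model is the usual choice.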

Step 2: Coreference resolution

Coreference resolution uses SpanBERT.

Download https://github.com/mandarjoshi90/coref and follow the installation instructions from "Jonathan K. Kummerfeld's notebook" ('spanbert_base'), installing into a coref_env environment.

Install the following packages into coref_env:

    pip install tokenization
    pip install sacremoses

Run coreference resolution:

    python 2.coref_bert.py
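Coreference resolution typically rewrites later mentions (such as pronouns) with the entity they refer to, so that downstream extraction sees explicit entities. The sketch below shows that substitution on already-predicted clusters. It is an illustration under stated assumptions, not the logic of 2.coref_bert.py: the function name `resolve_coreferences` is hypothetical, clusters are assumed to be lists of inclusive `[start, end]` token spans, and overlapping or nested mentions are not handled.

```python
def resolve_coreferences(tokens, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.

    tokens:   list of word tokens
    clusters: list of clusters; each cluster is a list of [start, end]
              token spans with inclusive ends (assumed format)
    """
    replacements = {}  # mention start index -> (mention end index, head tokens)
    for cluster in clusters:
        mentions = sorted(tuple(m) for m in cluster)
        first_start, first_end = mentions[0]
        head = tokens[first_start:first_end + 1]
        for start, end in mentions[1:]:
            replacements[start] = (end, head)

    resolved, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)  # substitute the first mention's tokens
            i = end + 1            # skip past the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


if __name__ == "__main__":
    tokens = "The minister said she would review the policy .".split()
    clusters = [[[0, 1], [3, 3]]]  # "The minister" <- "she"
    print(" ".join(resolve_coreferences(tokens, clusters)))
```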

Step 3: SRL extraction

Activate the CANarEx environment, then run 3.run_canarex.py:
```
python 3.run_canarex.py
```
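Semantic role labeling (SRL) tags each word with the role it plays for a given verb, and narratives are commonly built as (agent, verb, patient) triples from those roles. The sketch below converts BIO-style role tags into such a triple. It assumes PropBank-style labels (`ARG0`, `V`, `ARG1`) in BIO format; the function names `roles_from_bio` and `to_triple` are hypothetical, and the actual output format of 3.run_canarex.py may differ.

```python
def roles_from_bio(words, tags):
    """Group BIO-tagged words into a {role: text} mapping."""
    spans, open_role = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            open_role = tag[2:]
            spans.append((open_role, [word]))
        elif tag.startswith("I-") and tag[2:] == open_role:
            spans[-1][1].append(word)
        else:
            open_role = None  # "O" or a stray "I-" tag closes the current span

    roles = {}
    for role, span_words in spans:
        roles.setdefault(role, " ".join(span_words))  # keep the first span per role
    return roles


def to_triple(roles):
    """Return (ARG0, V, ARG1), or None when any of the three is missing."""
    if all(key in roles for key in ("ARG0", "V", "ARG1")):
        return roles["ARG0"], roles["V"], roles["ARG1"]
    return None


if __name__ == "__main__":
    words = ["The", "government", "cut", "funding", "for", "schools"]
    tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "I-ARG1"]
    print(to_triple(roles_from_bio(words, tags)))
```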

Step 4: Filtering narratives

TopN clustering (document-level clustering) and TextRank clustering

    python 4.clustering.py
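As background for the TextRank part of this step, the sketch below ranks short narrative strings with a PageRank-style iteration over a similarity graph and keeps the highest-ranked ones. This is a generic TextRank illustration, not the code in 4.clustering.py: the Jaccard word-overlap similarity, the damping factor of 0.85, the fixed iteration count, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def textrank(items, damping=0.85, iterations=50):
    """Score items by PageRank over a weighted similarity graph."""
    n = len(items)
    if n == 0:
        return []
    weights = [
        [jaccard(items[i], items[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def top_n(items, n, **kwargs):
    """Return the n highest-scoring items."""
    scores = textrank(items, **kwargs)
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in ranked[:n]]


if __name__ == "__main__":
    narratives = [
        "government cut funding",
        "government cut school funding",
        "minister visited community",
        "weather was sunny",
    ]
    print(top_n(narratives, 2))
```

Narratives that share wording with many others accumulate score, while isolated ones stay near the baseline `(1 - damping) / n`, so filtering by rank tends to keep recurring narratives.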

Evaluation

The evaluation folder uses a Jupyter notebook to generate synthetic test data for narrative time-series clustering.
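To make the idea of synthetic test data concrete, the sketch below generates noisy time series from a few known patterns, so that a clustering method can be checked against the true labels. It is a hypothetical illustration, not the notebook's code: the three pattern shapes, the parameter defaults, and the function name `make_synthetic_series` are assumptions.

```python
import math
import random


def make_synthetic_series(n_per_cluster=5, length=24, noise=0.1, seed=0):
    """Generate labelled time series from three assumed ground-truth patterns.

    `length` must be at least 2. Returns (series, labels), where series[i]
    is a list of floats and labels[i] names the pattern that produced it.
    """
    rng = random.Random(seed)  # local generator keeps results reproducible
    patterns = {
        "rising": lambda t: t / (length - 1),
        "falling": lambda t: 1 - t / (length - 1),
        "seasonal": lambda t: 0.5 + 0.5 * math.sin(2 * math.pi * t / 12),
    }
    series, labels = [], []
    for label, pattern in patterns.items():
        for _ in range(n_per_cluster):
            series.append([pattern(t) + rng.gauss(0, noise) for t in range(length)])
            labels.append(label)
    return series, labels


if __name__ == "__main__":
    series, labels = make_synthetic_series()
    print(len(series), "series;", sorted(set(labels)))
```

Because the true pattern of every series is known, the cluster assignments produced by a method under test can be compared directly with `labels`.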

Reference (Baseline: Relatio)

Environment: follow the setup steps from Relatio: https://github.com/relatio-nlp/relatio
A copy of Relatio is provided in the relatio folder, modified to add document IDs to the generated output.

    python 5.run_relatio.py
