This repository contains spaCy's rule-based matching & noun-chunk extraction scripts.
| Colab | Description |
|---|---|
| make_dataset | Transform a document-based dataset into a paragraph/sentence-based one |
| noun_chunks | Extract noun phrases using spaCy's `noun_chunks` attribute |
| dep_matcher | Match documents/sentences on the dependency tree |
To use or contribute to this repository, first check out the code, then create a new virtual environment:
Windows

```console
$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python -m venv env
$ . env/Scripts/activate
$ pip install -r requirements.txt
```
MacOS / Linux

```console
$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python3 -m venv env
$ . env/bin/activate
$ pip install -r requirements.txt
```
Because in some cases we want to keep several 'versions' (document-, paragraph-, and sentence-based) of our corpora, scripts/make_dataset.py transforms document-based datasets into paragraph- or sentence-based ones, and scripts/process.py handles text preprocessing.
Data transformation
To prepare a dataset, run `python scripts/make_dataset.py`:
```
Usage: make_dataset.py [OPTIONS] INPUT_TABLE OUTPUT_TABLE

  Typer app that processes datasets.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_TABLE  [required]

Options:
  --lang [en|ru]                  sentencizer's base model  [default:
                                  Languages.EN]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --paragraph / --sentence        [default: sentence]
  --text TEXT                     [default: fulltext]
  --uuid TEXT                     [default: uuid]
  --lemmatize / --no-lemmatize    [default: no-lemmatize]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
```
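The sentence-splitting step behind this transform can be sketched with spaCy's rule-based sentencizer. This is a minimal illustration of the technique, not the script itself, and the example text is made up:

```python
import spacy

# Lightweight pipeline with only a rule-based sentence splitter --
# no pretrained model download needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Split one "document" into sentence-level records.
doc = nlp("Climate change influences migration. Policy responses vary.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

In the real script, each sentence would be written out alongside the document's `uuid` so rows stay traceable to their source document.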
We've developed two approaches to extracting noun phrases:
- Our first approach was to use `Doc`'s `noun_chunks` attribute (we iterate over noun_chunks and keep those that fit our criteria). This approach isn't perfect, though, and doesn't work for ru models.
- We then moved to rule-based matching, which is more flexible as long as you write accurate patterns (and works for both en and ru models).
Noun_chunks
To extract phrases using the noun_chunks approach, run `python scripts/noun_chunks.py`:
```
Usage: noun_chunks.py [OPTIONS] INPUT_TABLE OUTPUT_JSONL

  Extract noun phrases using spaCy.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_JSONL  [required]

Options:
  --model TEXT               [default: en_core_web_sm]
  --docs-max-length INTEGER  [default: 2000000]
  --batch-size INTEGER       [default: 50]
  --text-field TEXT          [default: fulltext]
  --uuid-field TEXT          [default: uuid]
  --pattern TEXT             [default: influenc]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                             Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                             Show completion for the specified shell, to
                             copy it or customize the installation.
  --help                     Show this message and exit.
```
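At its core, this approach just iterates over `doc.noun_chunks`. A minimal sketch of the technique follows; it uses a hand-annotated `Doc` (POS tags, heads, dependency labels) so it runs without downloading a pretrained model, whereas the script loads a full pipeline such as en_core_web_sm and filters chunks by the `--pattern` string:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated Doc: noun_chunks needs POS tags and a dependency parse.
words = ["Climate", "change", "influences", "migration", "patterns"]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["NOUN", "NOUN", "VERB", "NOUN", "NOUN"],
    heads=[1, 2, 2, 4, 2],  # absolute index of each token's head
    deps=["compound", "nsubj", "ROOT", "compound", "dobj"],
)

# English's syntax iterator yields base noun phrases.
chunks = [chunk.text for chunk in doc.noun_chunks]
print(chunks)
```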
Dependency Matcher
To extract phrases using the Dependency Matcher approach, run `python scripts/dep_matcher.py`:
```
Usage: dep_matcher.py [OPTIONS] INPUT_TABLE PATTERNS OUTPUT_JSONL

  Match dependencies using spaCy's dependency matcher.

Arguments:
  INPUT_TABLE   Input table containing text & metadata  [required]
  PATTERNS      Directory or a single pattern file with rules  [required]
  OUTPUT_JSONL  Output JSONLines file where matches will be stored  [required]

Options:
  --model TEXT                    SpaCy model's name  [default:
                                  en_core_web_sm]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --batch-size INTEGER            [default: 50]
  --context-depth INTEGER
  --merge-entities / --no-merge-entities
                                  [default: no-merge-entities]
  --merge-noun-chunks / --no-merge-noun-chunks
                                  [default: no-merge-noun-chunks]
  --keep-sentence / --no-keep-sentence
                                  [default: no-keep-sentence]
  --keep-fulltext / --no-keep-fulltext
                                  [default: no-keep-fulltext]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
```
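The mechanism underneath is spaCy's `DependencyMatcher`: each pattern anchors on one token and describes relations to others via `REL_OP` operators. Below is a minimal sketch with a hypothetical pattern (a verb lemmatized as "influence" with a direct object) and a hand-annotated `Doc`; the script instead loads patterns from the PATTERNS file or directory and a full pretrained model:

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical pattern: anchor on a verb with lemma "influence",
# then require a child token with the "dobj" dependency label.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "influence"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",  # "verb" is the immediate head of "object"
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("influence", [pattern])

# Hand-annotated Doc so the example runs without a pretrained model.
doc = Doc(
    nlp.vocab,
    words=["Climate", "change", "influences", "migration", "patterns"],
    lemmas=["climate", "change", "influence", "migration", "pattern"],
    heads=[1, 2, 2, 4, 2],
    deps=["compound", "nsubj", "ROOT", "compound", "dobj"],
)

# Each match is (match_id, token_ids), token_ids in pattern order.
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])
```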
Once phrases/matches are extracted, you can transform them into a usable format and/or count their frequencies:
- to extract phrases from matches (i.e. process the rule-based matching output), see notebooks/count-matcher-phrases.ipynb
- to count extracted phrases, see notebooks/count-noun-chunk-phrases.ipynb
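Frequency counting itself amounts to a `Counter` over the extracted phrase strings. A sketch with made-up phrases (the notebooks read the JSONL output instead):

```python
from collections import Counter

# Made-up phrases standing in for the extracted JSONL output.
phrases = ["Climate change", "migration patterns", "climate change"]

# Lowercase before counting so surface variants collapse together.
counts = Counter(phrase.lower() for phrase in phrases)
print(counts.most_common(1))
```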