
spacy-phrases

This repository contains rule-based matching and noun-chunk extraction scripts built on spaCy.

Google Colabs

Colab         Description
make_dataset  Transform a document-based dataset into a paragraph/sentence-based one
noun_chunks   Extract noun phrases using spaCy's noun_chunks attribute
dep_matcher   Match documents/sentences on the dependency tree

Installation (locally)

To use or contribute to this repository, first check out the code. Then create a new virtual environment:

Windows

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python -m venv env 
$ . env/Scripts/activate
$ pip install -r requirements.txt

MacOS / Linux

$ git clone https://github.com/hcss-utils/spacy-phrases.git
$ cd spacy-phrases
$ python3 -m venv env 
$ . env/bin/activate
$ pip install -r requirements.txt

Usage

Data transformation

Because in some cases we want several 'versions' (document-, paragraph-, and sentence-based) of our corpora, there are two helper scripts: scripts/make_dataset.py, which transforms a document-based dataset into a paragraph- or sentence-based one, and scripts/process.py, which handles text preprocessing.


To prepare a dataset, run python scripts/make_dataset.py:

Usage: make_dataset.py [OPTIONS] INPUT_TABLE OUTPUT_TABLE

  Typer app that processes datasets.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_TABLE  [required]

Options:
  --lang [en|ru]                  sentencizer's base model  [default:
                                  Languages.EN]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --paragraph / --sentence        [default: sentence]
  --text TEXT                     [default: fulltext]
  --uuid TEXT                     [default: uuid]
  --lemmatize / --no-lemmatize    [default: no-lemmatize]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
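
As a rough illustration of what this transformation does, here is a minimal Python sketch (not the actual script, which adds table I/O, language selection, and optional lemmatization; the sample record is hypothetical):

import spacy

# Minimal sketch of the document -> sentence transformation, assuming a
# record with the default "uuid" and "fulltext" fields.
nlp = spacy.blank("en")            # corresponds to --lang en
nlp.add_pipe("sentencizer")        # lightweight rule-based sentence splitter
nlp.max_length = 2_000_000         # corresponds to --docs-max-length

document = {"uuid": "doc-1", "fulltext": "First sentence. Second sentence."}
rows = [
    {"uuid": document["uuid"], "text": sent.text}
    for sent in nlp(document["fulltext"]).sents
]
print(rows)  # one row per sentence, keyed by the parent document's uuid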

Matching phrases

We've developed two different approaches to extracting noun phrases:

  • our first guess was to use Doc's noun_chunks attribute (we iterate over noun_chunks and keep those that fit our criteria), but this approach isn't perfect and doesn't work for ru models.
  • we then moved to rule-based matching, which is more flexible as long as you write accurate patterns (and works for both en and ru models).
Noun_chunks

To extract phrases using the noun_chunks approach, run python scripts/noun_chunks.py:

Usage: noun_chunks.py [OPTIONS] INPUT_TABLE OUTPUT_JSONL

  Extract noun phrases using spaCy.

Arguments:
  INPUT_TABLE   [required]
  OUTPUT_JSONL  [required]

Options:
  --model TEXT                    [default: en_core_web_sm]
  --docs-max-length INTEGER       [default: 2000000]
  --batch-size INTEGER            [default: 50]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --pattern TEXT                  [default: influenc]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
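
Conceptually, the approach boils down to something like the sketch below (a simplified illustration, not the script itself; the sample sentence and the substring filter on the chunk's root lemma mirror the --pattern default of influenc):

import spacy

# Simplified sketch of the noun_chunks approach: iterate over Doc.noun_chunks
# and keep the chunks whose root lemma matches a substring pattern.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Russia's growing influence in the region worries analysts.")

pattern = "influenc"  # the script's --pattern default
for chunk in doc.noun_chunks:
    if pattern in chunk.root.lemma_.lower():
        print(chunk.text, "->", chunk.root.dep_)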

Dependency Matcher

To extract phrases using the Dependency Matcher approach, run python scripts/dep_matcher.py:

Usage: dep_matcher.py [OPTIONS] INPUT_TABLE PATTERNS OUTPUT_JSONL

  Match dependencies using spaCy's dependency matcher.

Arguments:
  INPUT_TABLE   Input table containing text & metadata  [required]
  PATTERNS      Directory or a single pattern file with rules  [required]
  OUTPUT_JSONL  Output JSONLines file where matches will be stored  [required]

Options:
  --model TEXT                    SpaCy model's name  [default:
                                  en_core_web_sm]
  --docs-max-length INTEGER       Doc's max length.  [default: 2000000]
  --text-field TEXT               [default: fulltext]
  --uuid-field TEXT               [default: uuid]
  --batch-size INTEGER            [default: 50]
  --context-depth INTEGER
  --merge-entities / --no-merge-entities
                                  [default: no-merge-entities]
  --merge-noun-chunks / --no-merge-noun-chunks
                                  [default: no-merge-noun-chunks]
  --keep-sentence / --no-keep-sentence
                                  [default: no-keep-sentence]
  --keep-fulltext / --no-keep-fulltext
                                  [default: no-keep-fulltext]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.
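
The rule files presumably follow spaCy's DependencyMatcher pattern format. As a rough sketch of how such matching works (the pattern below is hypothetical, not one of the repository's rule files):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Hypothetical rule: the verb "influence" together with its direct object.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "influence", "POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("influence_dobj", [pattern])

doc = nlp("The disinformation campaign influenced public opinion.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['influenced', 'opinion']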

Counting

Once phrases/matches are extracted, you can transform them into a usable format and/or count their frequencies.
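
For example, a simple frequency count over the emitted JSONL could look like this (the file path and the "match" field name are assumptions about the output schema, not necessarily what the scripts emit):

import json
from collections import Counter

# Hypothetical post-processing: count how often each matched phrase occurs.
# The "match" field name is an assumption about the output schema.
counts = Counter()
with open("output/matches.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        counts[record["match"].lower()] += 1

for phrase, frequency in counts.most_common(10):
    print(frequency, phrase)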
