hcss-utils/Relevant_topic_modeling

Relevant topic modeling

This repository contains scripts for similarity search and topic modeling.

Preparation

Creating and activating a virtual environment (optional)

Creating the virtual environment

python -m venv venv

Activating the virtual environment

Windows

venv\Scripts\activate.bat

Linux

source venv/bin/activate

Requirements installation

pip install -r requirenments.txt

Full process

Example:

python scripts/process.py example/test.csv example/queries.txt -o data/

process.py usage

usage: process.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] [-t THRESHOLD] [-sm SPACY_MODEL] [-gpt GPT_MODEL]
                  input queries

positional arguments:
  input                 path to input file
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
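
process.py accepts the same language, model, threshold and spaCy options as the individual scripts documented in the following sections, which suggests that the full process runs those stages in order: split, similarity, topic modeling. The sketch below only illustrates that assumption. The function names `build_pipeline_commands` and `run_pipeline` are hypothetical, and process.py itself may be implemented differently.

```python
import subprocess
import sys


def build_pipeline_commands(input_csv, queries, data_dir="data/"):
    """Return one command per stage, in the order the stages would run.

    Assumption: the full process is equivalent to running the three
    per-stage scripts against the same data directory.
    """
    python = sys.executable  # the interpreter running this sketch
    return [
        [python, "scripts/split.py", input_csv, "-o", data_dir],
        [python, "scripts/similarity.py", queries, "-i", data_dir, "-o", data_dir],
        [python, "scripts/topic_modeling.py", "-i", data_dir, "-o", data_dir],
    ]


def run_pipeline(input_csv, queries, data_dir="data/"):
    """Run each stage and stop at the first one that fails."""
    for command in build_pipeline_commands(input_csv, queries, data_dir):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("example/test.csv", "example/queries.txt")` from the repository root would then mirror the example command above, provided the assumption holds.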

Splitting into paragraphs and sentences

Example:

python scripts/split.py example/test.csv -o data/

split.py usage

usage: split.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] input

positional arguments:
  input                 path to input file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for smart paragraphisation (default: sentence-transformers/sentence-t5-xl)
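
The README does not describe how split.py divides documents, nor what the smart paragraphisation behind `-s` does. As a rough illustration of the general idea only, the sketch below splits text into paragraphs on blank lines and into sentences on end-of-sentence punctuation. Both rules and both function names are assumptions, not the repository's method.

```python
import re


def split_paragraphs(text):
    """Split on one or more blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


document = "First sentence. Second one!\n\nNew paragraph here."
for paragraph in split_paragraphs(document):
    print(split_sentences(paragraph))
```

A rule like this breaks on abbreviations such as "e.g.", which is one reason a language-aware splitter (selected here with `-l`) is usually preferred.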

Computing similarity scores

Example:

python scripts/similarity.py example/queries.txt -i data/ -o data/

similarity.py usage

usage: similarity.py [-h] [-i INPUT] [-o OUTPUT] [-e EMBEDDINGS] [-m MODEL] queries

positional arguments:
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with paragraphs.csv and sentences.csv (default: ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -e EMBEDDINGS, --embeddings EMBEDDINGS
                        is there embeddings
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
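
The README does not state how the similarity score is computed. A common choice for embedding models is cosine similarity between a sentence vector and each query vector, keeping the best score per sentence. The sketch below shows that calculation on toy vectors in plain Python. The choice of cosine similarity, the max-over-queries rule, and the function names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def best_query_score(sentence_vec, query_vecs):
    """Highest similarity between one sentence and any of the queries."""
    return max(cosine_similarity(sentence_vec, q) for q in query_vecs)


# Toy 2-dimensional "embeddings"; real ones would come from the model given by -m.
queries = [[1.0, 0.0], [0.0, 1.0]]
print(best_query_score([2.0, 0.0], queries))
```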

Topic modeling

Example:

python scripts/topic_modeling.py -i data/ -o data/

topic_modeling.py usage

usage: topic_modeling.py [-h] [-i INPUT] [-o OUTPUT] [-t THRESHOLD] [-sm SPACY_MODEL] [-m MODEL] [-gpt GPT_MODEL]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with sentences_sim.csv, optionaly with sentences_embeddings.npy, documents.csv (default:
                        ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
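
The `-t`/`--threshold` option (default 0.5) decides which sentences count as relevant before topics are modeled. The sketch below shows one way such a filter could work on rows shaped like a CSV file. The column names `sentence` and `similarity`, the inclusive comparison, and the function name are hypothetical, since the README does not document the layout of sentences_sim.csv.

```python
import csv
import io


def filter_relevant(rows, threshold=0.5, score_column="similarity"):
    """Keep rows whose score is at or above the threshold (assumed inclusive)."""
    return [row for row in rows if float(row[score_column]) >= threshold]


# In-memory stand-in for a scored sentences file; the columns are made up.
sample = io.StringIO(
    "sentence,similarity\n"
    "Sentence about the topic.,0.82\n"
    "Unrelated sentence.,0.31\n"
    "Borderline sentence.,0.50\n"
)
relevant = filter_relevant(csv.DictReader(sample), threshold=0.5)
for row in relevant:
    print(row["sentence"], row["similarity"])
```

Raising the threshold trades recall for precision: fewer sentences reach the topic model, but more of them match the queries closely.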
