Relevant topic modeling

This repository contains scripts for similarity search and topic modeling.

Preparation

Creating and activating virtual environment (optional)

Creating virtual environment

python -m venv venv

Activating virtual environment

Windows

venv\Scripts\activate.bat

Linux

source venv/bin/activate

Requirements installation

pip install -r requirements.txt

Full process

Example:

python scripts/process.py example/test.csv example/queries.txt -o data/

process.py usage

usage: process.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] [-t THRESHOLD] [-sm SPACY_MODEL] [-gpt GPT_MODEL]
                  input queries

positional arguments:
  input                 path to input file
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
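The stage scripts documented in the following sections (split.py, similarity.py, topic_modeling.py) can also be run one after another. The sketch below only builds the three command lines, using flags that appear in the usage texts. That process.py is equivalent to this sequence is an assumption, and build_pipeline is a hypothetical helper name, not part of the repository.

```python
import shlex


def build_pipeline(input_csv, queries, data_dir="data/"):
    """Return command lines for the split, similarity and topic modeling stages.

    Hypothetical helper: it uses only flags listed in the usage texts and
    assumes each stage reads what the previous one wrote into data_dir.
    """
    return [
        ["python", "scripts/split.py", input_csv, "-o", data_dir],
        ["python", "scripts/similarity.py", queries, "-i", data_dir, "-o", data_dir],
        ["python", "scripts/topic_modeling.py", "-i", data_dir, "-o", data_dir],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("example/test.csv", "example/queries.txt"):
        # Printed rather than executed; pass each list to
        # subprocess.run(cmd, check=True) to actually run the stage.
        print(shlex.join(cmd))
```

Keeping each command as a list avoids shell quoting problems if a path contains spaces.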

Paragraphs and sentences split process

Example:

python scripts/split.py example/test.csv -o data/

split.py usage

usage: split.py [-h] [-o OUTPUT] [-l LANG] [-s] [-m MODEL] input

positional arguments:
  input                 path to input file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to directory where output files will be stored (default: ../data/)
  -l LANG, --lang LANG  language of documents (default: en)
  -s, --smart           use smart paragraphisation
  -m MODEL, --model MODEL
                        model for smart paragraphisation (default: sentence-transformers/sentence-t5-xl)
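To give a rough idea of what a paragraph and sentence split produces, here is a minimal sketch using only regular expressions. It is not the logic of split.py, which is not shown in this README and may rely on a language-aware tokenizer or on the embedding model when -s is set. The function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split after '.', '!' or '?' followed by whitespace.

    Naive on purpose: abbreviations such as "e.g." will be split too.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


doc = "First point. Second point!\n\nNew paragraph here?"
paragraphs = split_paragraphs(doc)
sentences = [split_sentences(p) for p in paragraphs]
```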

Similarity score computing

Example:

python scripts/similarity.py example/queries.txt -i data/ -o data/

similarity.py usage

usage: similarity.py [-h] [-i INPUT] [-o OUTPUT] [-e EMBEDDINGS] [-m MODEL] queries

positional arguments:
  queries               path to file with regex queries for relevant sentences search

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with paragraphs.csv and sentences.csv (default: ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -e EMBEDDINGS, --embeddings EMBEDDINGS
                        is there embeddings
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
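The usage text does not say how the similarity score is computed. A common choice with sentence embeddings is cosine similarity, so the sketch below assumes that, and also assumes that a sentence's score is its best match over all queries. Toy 3-dimensional vectors stand in for real embeddings, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_query_score(sentence_vec, query_vecs):
    """Highest similarity between one sentence and any query (assumed aggregation)."""
    return max(cosine_similarity(sentence_vec, q) for q in query_vecs)


# Toy vectors standing in for embeddings produced by the model given with -m.
queries = [[1.0, 0.0, 0.0]]
sentences = {"s1": [2.0, 0.0, 0.0], "s2": [0.0, 1.0, 0.0], "s3": [1.0, 1.0, 0.0]}
scores = {name: best_query_score(vec, queries) for name, vec in sentences.items()}
```

Because cosine similarity ignores vector length, s1 scores the same as a unit vector in the query direction would.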

Topic modeling

Example:

python scripts/topic_modeling.py -i data/ -o data/

topic_modeling.py usage

usage: topic_modeling.py [-h] [-i INPUT] [-o OUTPUT] [-t THRESHOLD] [-sm SPACY_MODEL] [-m MODEL] [-gpt GPT_MODEL]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to directory with sentences_sim.csv, optionaly with sentences_embeddings.npy, documents.csv (default:
                        ../data/)
  -o OUTPUT, --output OUTPUT
                        path to directory where files will be stored (default: ../data/)
  -t THRESHOLD, --threshold THRESHOLD
                        threshold to determine relevant sentences (default: 0.5)
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spacy model for lemmatization (default: en_core_web_lg)
  -m MODEL, --model MODEL
                        model for embedding (default: sentence-transformers/sentence-t5-xl)
  -gpt GPT_MODEL, --gpt_model GPT_MODEL
                        model for topic representation and summary (default: None)
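The -t option keeps only sentences whose similarity score reaches the threshold before topics are built. A minimal sketch of that filtering step follows. The column names sentence and similarity are hypothetical, since the layout of sentences_sim.csv is not documented here, and treating the threshold as inclusive is also an assumption.

```python
import csv
import io

# Hypothetical stand-in for sentences_sim.csv; real column names may differ.
SAMPLE = """sentence,similarity
Budget cuts hit the navy.,0.82
The weather was mild.,0.12
New frigates were ordered.,0.50
"""


def relevant_sentences(csv_text, threshold=0.5):
    """Return sentences whose score is at least `threshold` (inclusive by assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["sentence"] for row in reader if float(row["similarity"]) >= threshold]


kept = relevant_sentences(SAMPLE)
```

Raising the threshold trades coverage for precision: fewer sentences reach the topic model, but they are closer to the queries.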
