This project implements an end-to-end data pipeline in Snowflake to enable semantic search and topic modeling on documents. It empowers users to efficiently retrieve and analyze documents using keywords, and to uncover hidden themes and topics within the document corpus.
The project consists of a few main components:
- Data Ingestion and Transformation: Implemented using Snowflake's SQL and Snowpark worksheets for Extract, Load, and Transform (ELT) processes.
- Semantic Search: Uses the pre-trained language model "all-MiniLM-L6-v2" to embed text and compute cosine-similarity scores between a query and document chunks.
- Topic Modeling: Uses the BERTopic framework to identify and extract hidden themes and topics from the document corpus.
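The semantic-search component boils down to scoring a query embedding against each chunk embedding by cosine similarity and ranking the results. A minimal sketch of that scoring step, using toy vectors rather than real MiniLM embeddings (the function names here are illustrative, not the project's actual UDF names):

```python
# Illustrative sketch of the scoring step: rank pre-computed chunk
# embeddings against a query embedding by cosine similarity.
# The vectors below are toy values, not real MiniLM embeddings.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_id, score) pairs sorted by descending similarity."""
    scores = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

query = [1.0, 0.0, 1.0]
chunks = {"chunk_a": [1.0, 0.1, 0.9], "chunk_b": [0.0, 1.0, 0.0]}
ranking = rank_chunks(query, chunks)
```

In the actual pipeline the same computation is done inside Snowflake via the scikit-learn-based UDF, operating on embeddings stored in tables.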
The Stored_Procedures directory contains the stored procedures used in the Snowflake environment.

The following stored procedures relate to loading and refreshing the database:
- SP1.1_load_to_table_from_stage.txt: Script to load data into a Snowflake table from the staging area.
- SP1.2_snowpark_transformations.txt: Script for Snowpark transformations.
- SP1.3_chunking_text.txt: Script for chunking text.
- SP1.4_embedding_chunks.txt: Script for embedding chunks.
- SP1.5_tokenizing_chunks.txt: Script for tokenizing chunks.
- SP1.6_embedding_tokens.txt: Script for embedding tokens.
- SP1.7_dimension_reduction.txt: Script for reducing vector dimensions.
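The chunking step (SP1.3) splits each document into smaller pieces before embedding. A hedged sketch of that idea in plain Python, assuming paragraph-first splitting with a sentence-level fallback for long paragraphs (the actual splitting rules in udf_CHUNK_TEXT may differ):

```python
# Minimal sketch of the chunking idea behind SP1.3 / udf_CHUNK_TEXT:
# split a document on blank lines into paragraphs, then fall back to
# sentence-level pieces when a paragraph exceeds max_chars.
# The regex-based sentence splitter is an assumption for illustration.
import re

def chunk_text(text, max_chars=200):
    """Split text into paragraph chunks, or sentences if too long."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
        else:
            # Fall back to sentence-level pieces for long paragraphs.
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    chunks.append(sent.strip())
    return chunks
```

Each resulting chunk would then be embedded (SP1.4) and tokenized (SP1.5) downstream.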
The following stored procedures relate to running similarity searches:
- SP2.1_INSERT_QUERY.txt: Script to insert queries into a Snowflake table.
- SP2.2_GET_SIMILARITY.txt: Script to get the similarity score between the query and every chunk.
- SP2.3_GET_TOKEN_SIMILARITY.txt: Script to get the token similarity score between the query and every token.
- SP2.4_GET_TOP_5_TOKENS.txt: Script to get the top 5 tokens for each chunk.
- SP2.5_VIEW_CREATION.txt: Script to create a temporary table for the subsequent topic modeling step.
- SP2.6_ENTITY_EXTRACTION.txt: Script to extract only the organization entities from the NER field.
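The top-5-tokens step (SP2.4) reduces per-token similarity scores down to the five best tokens for each chunk. A sketch of that selection in plain Python, assuming the scores arrive as (chunk_id, token, score) rows (the actual table shape in Snowflake is an assumption here):

```python
# Hedged sketch of the SP2.4 idea: given per-token similarity scores
# grouped by chunk, keep the five highest-scoring tokens per chunk.
# The (chunk_id, token, score) row format is an assumption.
from collections import defaultdict
import heapq

def top_5_tokens(rows):
    """rows: iterable of (chunk_id, token, score) tuples.
    Returns {chunk_id: [(token, score), ...]} with at most five
    entries per chunk, best score first."""
    by_chunk = defaultdict(list)
    for chunk_id, token, score in rows:
        by_chunk[chunk_id].append((token, score))
    return {
        cid: heapq.nlargest(5, toks, key=lambda t: t[1])
        for cid, toks in by_chunk.items()
    }
```

In SQL this would typically be expressed with a window function such as ROW_NUMBER() partitioned by chunk and ordered by score.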
- main_LOAD_AND_REFRESH_DATABASE.txt: Main script that runs the loading sub-procedures sequentially to load and refresh the database.
- main_RUN_SIMILARITY_SEARCH.txt: Main script that runs the similarity-search sub-procedures sequentially.
- main_BERTopic_Modelling.txt: Main script for performing topic modeling using the BERTopic framework.
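The main scripts are thin orchestrators: each one calls its sub-procedures in order and stops at the first failure. The pattern can be sketched as a simple sequential runner (the step names below are hypothetical; in Snowflake each step would be a CALL to the corresponding sub-procedure):

```python
# Illustrative orchestration pattern, not the actual stored procedure:
# run each sub-step in order; an exception in any step aborts the rest.
def run_pipeline(steps):
    """steps: list of (name, callable). Returns the names of the
    steps completed, in execution order."""
    completed = []
    for name, step in steps:
        step()  # in Snowflake, this would be CALL <sub_procedure>()
        completed.append(name)
    return completed
```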
The User_Defined_Functions directory contains the user-defined functions used in the Snowflake environment.
- udf_CHUNK_TEXT.txt: Function to chunk text into smaller paragraphs/sentences.
- udf_MiniLM_L6_v2_embedding.txt: Function to embed text into vectors using MiniLM-L6-v2.
- udf_py_spacy.txt: Function using Python's spaCy library for text processing and tokenizing.
- udf_sklearn_cosine_similarity.txt: Function to compute cosine similarity scores using scikit-learn.
- udf_py_spacy_NER.txt: Function using Python's spaCy library for Named Entity Recognition.
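Downstream of the NER UDF, SP2.6 keeps only the organization entities. A sketch of that filtering step, assuming the NER output is available as (text, label) pairs; spaCy itself labels organizations with the "ORG" tag, but the exact field format used by the pipeline is an assumption:

```python
# Sketch of the SP2.6 filtering step applied to udf_py_spacy_NER output:
# keep only organization entities. The (text, label) pair format is an
# assumption; spaCy's label for organizations is "ORG".
def extract_org_entities(entities):
    """entities: iterable of (text, label) pairs from an NER pass.
    Returns the entity texts labeled as organizations."""
    return [text for text, label in entities if label == "ORG"]
```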
- Set up Environment: Before executing the scripts, ensure you have access to a Snowflake environment with the necessary permissions.
- Execution Steps:
  - Open a Snowflake SQL or Snowpark worksheet.
  - Copy the contents of each script and execute them sequentially to create the sub-procedures, functions, and main procedures.
  - To run the main procedures, use the CALL command:
    - CALL LOAD_AND_REFRESH_DATABASE()
    - CALL RUN_SIMILARITY_SEARCH('QUERY')
    - CALL TOPIC_MODEL('limit')