This project implements an end-to-end data pipeline in Snowflake to enable semantic search and topic modeling on documents. It empowers users to efficiently retrieve and analyze documents using keywords, and to uncover hidden themes and topics within the document corpus.
The project consists of a few main components:
- Data Ingestion and Transformation: Implemented using Snowflake's SQL and Snowpark worksheets for Extract, Load, and Transform (ELT) processes.
- Semantic Search: Uses the pre-trained language model "all-MiniLM-L6-v2" to embed text and compute cosine-similarity scores between a query and document chunks.
- Topic Modeling: Uses the BERTopic framework to identify and extract hidden themes and topics from the document corpus.
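The semantic-search component boils down to scoring a query embedding against each chunk embedding by cosine similarity and ranking the results. A minimal sketch of that scoring step, using toy vectors rather than real MiniLM embeddings (the function names here are illustrative, not the project's actual UDF names):

```python
# Illustrative sketch of the scoring step: rank pre-computed chunk
# embeddings against a query embedding by cosine similarity.
# The vectors below are toy values, not real MiniLM embeddings.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_id, score) pairs sorted by descending similarity."""
    scores = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

query = [1.0, 0.0, 1.0]
chunks = {"chunk_a": [1.0, 0.1, 0.9], "chunk_b": [0.0, 1.0, 0.0]}
ranking = rank_chunks(query, chunks)
```

In the actual pipeline the same computation is done inside Snowflake via the scikit-learn-based UDF, operating on embeddings stored in tables.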
The Stored_Procedures directory contains the stored procedures used in the Snowflake environment.

The following stored procedures relate to loading and refreshing the database:
- SP1.1_load_to_table_from_stage.txt: Script to load data into a Snowflake table from the staging area.
- SP1.2_snowpark_transformations.txt: Script for Snowpark transformations.
- SP1.3_chunking_text.txt: Script for chunking text.
- SP1.4_embedding_chunks.txt: Script for embedding chunks.
- SP1.5_tokenizing_chunks.txt: Script for tokenizing chunks.
- SP1.6_embedding_tokens.txt: Script for embedding tokens.
- SP1.7_dimension_reduction.txt: Script for reducing vector dimensions.
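The chunking step (SP1.3) splits each document into smaller pieces before embedding. A hedged sketch of that idea in plain Python, assuming paragraph-first splitting with a sentence-level fallback for long paragraphs (the actual splitting rules in udf_CHUNK_TEXT may differ):

```python
# Minimal sketch of the chunking idea behind SP1.3 / udf_CHUNK_TEXT:
# split a document on blank lines into paragraphs, then fall back to
# sentence-level pieces when a paragraph exceeds max_chars.
# The regex-based sentence splitter is an assumption for illustration.
import re

def chunk_text(text, max_chars=200):
    """Split text into paragraph chunks, or sentences if too long."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
        else:
            # Fall back to sentence-level pieces for long paragraphs.
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    chunks.append(sent.strip())
    return chunks
```

Each resulting chunk would then be embedded (SP1.4) and tokenized (SP1.5) downstream.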
The following stored procedures relate to running similarity searches:
- SP2.1_INSERT_QUERY.txt: Script to insert queries into a Snowflake table.
- SP2.2_GET_SIMILARITY.txt: Script to get the similarity score between the query and every chunk.
- SP2.3_GET_TOKEN_SIMILARITY.txt: Script to get the token similarity score between the query and every token.
- SP2.4_GET_TOP_5_TOKENS.txt: Script to get the top 5 tokens for each chunk.
- SP2.5_VIEW_CREATION.txt: Script to create a temporary table for the subsequent topic modeling step.
- SP2.6_ENTITY_EXTRACTION.txt: Script to extract only the organization entities from the NER field.
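The top-5-tokens step (SP2.4) reduces per-token similarity scores down to the five best tokens for each chunk. A sketch of that selection in plain Python, assuming the scores arrive as (chunk_id, token, score) rows (the actual table shape in Snowflake is an assumption here):

```python
# Hedged sketch of the SP2.4 idea: given per-token similarity scores
# grouped by chunk, keep the five highest-scoring tokens per chunk.
# The (chunk_id, token, score) row format is an assumption.
from collections import defaultdict
import heapq

def top_5_tokens(rows):
    """rows: iterable of (chunk_id, token, score) tuples.
    Returns {chunk_id: [(token, score), ...]} with at most five
    entries per chunk, best score first."""
    by_chunk = defaultdict(list)
    for chunk_id, token, score in rows:
        by_chunk[chunk_id].append((token, score))
    return {
        cid: heapq.nlargest(5, toks, key=lambda t: t[1])
        for cid, toks in by_chunk.items()
    }
```

In SQL this would typically be expressed with a window function such as ROW_NUMBER() partitioned by chunk and ordered by score.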
- main_LOAD_AND_REFRESH_DATABASE.txt: Main script that runs the loading sub-procedures sequentially to load and refresh the database.
- main_RUN_SIMILARITY_SEARCH.txt: Main script that runs the similarity-search sub-procedures sequentially.
- main_BERTopic_Modelling.txt: Main script for performing topic modeling using the BERTopic framework.
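The main scripts are thin orchestrators: each one calls its sub-procedures in order and stops at the first failure. The pattern can be sketched as a simple sequential runner (the step names below are hypothetical; in Snowflake each step would be a CALL to the corresponding sub-procedure):

```python
# Illustrative orchestration pattern, not the actual stored procedure:
# run each sub-step in order; an exception in any step aborts the rest.
def run_pipeline(steps):
    """steps: list of (name, callable). Returns the names of the
    steps completed, in execution order."""
    completed = []
    for name, step in steps:
        step()  # in Snowflake, this would be CALL <sub_procedure>()
        completed.append(name)
    return completed
```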
The User_Defined_Functions directory contains the user-defined functions used in the Snowflake environment.
- udf_CHUNK_TEXT.txt: Function to chunk text into smaller paragraphs/sentences.
- udf_MiniLM_L6_v2_embedding.txt: Function to embed text into vectors using MiniLM-L6-v2.
- udf_py_spacy.txt: Function using Python's spaCy library for text processing and tokenizing.
- udf_sklearn_cosine_similarity.txt: Function to compute cosine similarity scores using scikit-learn.
- udf_py_spacy_NER.txt: Function using Python's spaCy library for Named Entity Recognition.
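Downstream of the NER UDF, SP2.6 keeps only the organization entities. A sketch of that filtering step, assuming the NER output is available as (text, label) pairs; spaCy itself labels organizations with the "ORG" tag, but the exact field format used by the pipeline is an assumption:

```python
# Sketch of the SP2.6 filtering step applied to udf_py_spacy_NER output:
# keep only organization entities. The (text, label) pair format is an
# assumption; spaCy's label for organizations is "ORG".
def extract_org_entities(entities):
    """entities: iterable of (text, label) pairs from an NER pass.
    Returns the entity texts labeled as organizations."""
    return [text for text, label in entities if label == "ORG"]
```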
- Set up Environment: Before executing the scripts, ensure you have access to a Snowflake environment with the necessary permissions.
- Execution Steps:
  - Open a Snowflake SQL or Snowpark worksheet.
  - Copy the contents of each script and execute them sequentially to create the sub-procedures, functions, and main procedures.
  - To run the main procedures, use the CALL command:
    - CALL LOAD_AND_REFRESH_DATABASE()
    - CALL RUN_SIMILARITY_SEARCH('QUERY')
    - CALL TOPIC_MODEL('limit')