Skip to content

yahaha-233/CSCI596-Final-Project

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CSCI596-Final-Project

This is the repository for the final project of CSCI596 Scientific Computing & Visualization at USC. The project is mainly about parallelizing the Locality Sensitive Hashing (LSH) algorithm.

Introduction to Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing (LSH) is a popular technique used for efficient approximate similarity serach in high-dimensional space. It empowers vector databases that align with Language Model Models (LLMs). Even when comparing a single query to billions of samples, the achieved complexity remains at best O(n). The core idea is to hash data points such that similar items are mapped to the same hash buckets with high probability. There are 3 main steps in tradition LSH:

  1. Shingling: Convert documents to sets of strings of length k. A k-shingle is a sequence of k tokens appears in the document. After we get shingles for all the documents, we union them to create a big vocabulary set. Then we create a sparse vector for each documents. For each shingle in the vocabulary set, we set the corresponding position to 1 if the document has the shingle and set to 0 if not, just like one-hot encoding.
  2. MinHashing: The sparse vectors are too large to compare. In this step we will convert them to short signatures, which are dense integer vectors that represent sets and reflect their similarity. Each number in the signature corresponds to a minhash function. The signature values are generated by first picking a randomly permuted count vector of the rows (from 1 to len(vocab)+1) and finds the minimum number of the row in which is 1 in our sparse vector. minhash example
  3. Banded LSH function: In order to find signature pairs that share even some similarity and further improve the efficiency, we segment the signatures into bands b (each has r rows), and pass each band through a hash function. Mark the ones that are hashed to the same bucket as candidate pairs. Repeat the process for each band. band lsh example

Implementation Methods

We will parallelize the LSH using MPI and OpenMP target. Both the MinHashing and Banded LSH functions are possible to be parallelized.

  • MinHashing: There are n documents which need to generate the signatures. For each document, there are x hash functions to be applied to y rows to get the signature with x numbers. We will split them to 2 MPI nodes, while each node has an OpenMP master thread with league of teams. Each team will deal with one document, with OpenMP pragma directive to parallelize the nested loops regarding the hash functions. Below is the architecture design: minhash architecture

  • Banded LSH function:We will get candidate pairs from b bands. For each band, we need to check for each document if its signature pieces are the same as others. If so, they are the candidate pairs. We will split b bands to 2 MPI nodes, while each node has an OpenMP master thread with league of teams. Each team will deal with one band, with OpenMP pragma directive to parallelize the nested loops for the comparisons between every pair of signature pieces. Below is the architecture design: Banded LSH architecture

Expected Results

We will test the runtime and efficiency of the algorithm with different numbers of nodes and threads by running the program on the CARC clusters. We are expected to see a similar efficiency pattern with strong scaling with more nodes and threads, while the runtime should decrease.

Reference

[1] https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing/

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 100.0%