In this repository we have implemented a B+ Tree to execute range and exact match queries and a Locality Sensitive Hashing (LSH) algorithm, using the MinHash method for finding similar documents as measured by Jaccard similarity.
The documents returned from B+ tree are then processed by the LSH algorithm. The LSH algorithm consists of these steps:
- Shingling of Documents:
we convert each document into a set of characters of length k (also known as k-shingles). Then we convert every k-shingle into a 32-bit number via a hash function.
- Input-matrix creation:
the columns of the input-matrix correspond to the documents, and the rows correspond to the hashed shingles. There is an 1 in position (r,c), where r is a row and c a column of the matrix, if the document of column c contains the hashed shingle of row r. Otherwise the value in position (r, c) is 0.
- Minhashing:
in this step we create a new matrix, the signature matrix. Initially, this matrix consists of very large values. For every row of the input-matrix, we construct 100 universal hash functions ( ha,b(x) = (ax + b) mod p ). For every row r of the input matrix, we iterate every column. If column c has 1 in row r, then for each hash function, if hash_function(hf) is less than signature_matrix(r, c), set siganture_matrix(r, c) to the smaller of the current value of siganture_matrix(r, c) and hf(r).
- Locality-sensitive hashing
- Band partition: we separate the signature_matrix rows to b bands each of which consists of r rows (b*r=n). For each band, we hash its portion of each column, using the same universal hash function, to a hash table with k buckets.
- Find candidate pairs: we consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.
In this final step, we calculate the Jaccard similarity of all candidate pairs.
Execute:
git clone https://github.com/rkapsalis/Range-and-similarity-queries.git
to clone the repositorycd Range-and-similarity-queries
to access the project directorypip install -r requirements.txt
to install all dependenciespython main.py