This README provides instructions for using different indexing techniques for table search in data lakes. Below you'll find the main files for each technique and additional guidance on their setup and usage.
Here are the primary scripts for each indexing technique:
- `test_hnsw_search.py`: implements HNSW indexing.
- `test_diskann_search.py`: implements DiskANN indexing.
- `test_lsh_search.py`: implements LSH indexing.
Each script requires specific input parameters that determine how the corresponding indexing technique behaves. For LSH and HNSW, we used the parameters specified in the Starmie project, which you can review here. For DiskANN, we used some HNSW parameters together with default values from the original DiskANN project.
Note: Our code and study focus on using multiple columns and union table search as a use case, following the approaches detailed in the Starmie project.
The code files in this repository are primarily based on the Starmie project. Some files have been replicated directly with no changes, e.g., `lsh.py`, while others have been modified or expanded to better suit our specific needs, like `test_hnsw_search.py`. Additionally, completely new files have been created to complement the existing functionalities and address new use cases, such as `diskann.py`.
We have added new functionalities to both HNSW and DiskANN methods to assist with their implementation:
- HNSW Enhancements: Find the modifications at this GitHub pull request.
- DiskANN Enhancements: Instructions to add helper functionalities to check the internal structure of the index graph:
  1. Navigate to `src/index.cpp` in the DiskANN source tree.
  2. Locate the `save_graph` function and modify it as follows:

     ```cpp
     // iTaha Code starts here
     float degree_sum = 0;
     int count_all = 0;
     float distance_sum = 0;
     for (int i = 0; i < _nd; i++) {
         degree_sum += _graph_store->get_neighbours(i).size();
         for (location_t num : _graph_store->get_neighbours(i)) {
             distance_sum += _data_store->get_distance(i, num);
             count_all++;
         }
     }
     std::ofstream file("file_name", std::ios::app);
     file << degree_sum / (float)_nd << std::endl;
     file << distance_sum / (float)count_all << std::endl;
     diskann::cout << "_start:" << _start << std::endl;
     // iTaha Code ends here
     return _graph_store->store(graph_file, _nd + _num_frozen_pts, _num_frozen_pts, _start);
     ```

  3. Replace `"file_name"` with your desired output file path, for example, `"/diskann_internal_structure.txt"`.
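The `save_graph` instrumentation above appends two lines per index build: the average out-degree, then the average edge distance (the `_start` line goes to stdout, not the file). A hypothetical helper for reading that file back could look like this; the function name and the two-lines-per-build layout are assumptions based on the snippet, not part of the original code:

```python
def parse_internal_structure(path):
    """Return one (avg_degree, avg_distance) tuple per index build."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    # Pair consecutive lines: even positions are degrees, odd are distances.
    return list(zip(values[0::2], values[1::2]))
```

Because the instrumentation opens the file with `std::ios::app`, repeated builds accumulate in the same file, so the helper returns one tuple per build in order.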
The `requirements.txt` file contains the necessary libraries for all the indexing techniques. To run `run_pretrain.py` and `extractVectors.py` with the best performance, install the latest versions of these libraries.
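A typical way to install the dependencies is inside a fresh virtual environment, for example:

```shell
# Create an isolated environment and install the pinned dependencies.
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
```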
If you are using the code in this repo, please cite the following paper:
```bibtex
@INPROCEEDINGS{10475618,
  author={Taha, Ibraheem and Lissandrini, Matteo and Simitsis, Alkis and Ioannidis, Yannis},
  title={A Study on Efficient Indexing for Table Search in Data Lakes},
  booktitle={2024 IEEE 18th International Conference on Semantic Computing (ICSC)},
  year={2024},
  pages={245-252},
  doi={10.1109/ICSC59802.2024.00046}
}
```
Feel free to explore the techniques and enhancements described, and adjust them according to your project needs!