This README provides instructions for using different indexing techniques for table search in data lakes. Below you'll find the main files for each technique and additional guidance on their setup and usage.
Here are the primary scripts for each indexing technique:
- `test_hnsw_search.py`: implements HNSW indexing.
- `test_diskann_search.py`: implements DiskANN indexing.
- `test_lsh_search.py`: implements LSH indexing.
Each script requires specific input parameters that determine how the corresponding indexing technique behaves. For LSH and HNSW, we used the parameters specified in the Starmie project, which you can review here. For DiskANN, we used some HNSW parameters together with default values from the original DiskANN project.
Note: Our code and study focus on using multiple columns and union table search as a use case, following the approaches detailed in the Starmie project.
The code files in this repository are primarily based on the Starmie project. Some files have been replicated directly with no changes, e.g., `lsh.py`, while others have been modified or expanded to better suit our specific needs, like `test_hnsw_search.py`. Additionally, completely new files have been created to complement the existing functionalities and address new use cases, such as `diskann.py`.
We have added new functionalities to both HNSW and DiskANN methods to assist with their implementation:
- HNSW Enhancements: Find the modifications at this GitHub pull request.
- DiskANN Enhancements: Instructions to add helper functionalities to check the internal structure of the index graph:
  1. Navigate to `src/index.cpp` in the DiskANN source tree.
  2. Locate the `save_graph` function and modify it as follows:

     ```cpp
     // iTaha Code starts here
     float degree_sum = 0;
     int count_all = 0;
     float distance_sum = 0;
     for (int i = 0; i < _nd; i++) {
         degree_sum += _graph_store->get_neighbours(i).size();
         for (location_t num : _graph_store->get_neighbours(i)) {
             distance_sum += _data_store->get_distance(i, num);
             count_all++;
         }
     }
     std::ofstream file("file_name", std::ios::app);
     file << degree_sum / (float)_nd << std::endl;
     file << distance_sum / (float)count_all << std::endl;
     diskann::cout << "_start:" << _start << std::endl;
     // iTaha Code ends here
     return _graph_store->store(graph_file, _nd + _num_frozen_pts, _num_frozen_pts, _start);
     ```

  3. Replace `"file_name"` with your desired output file path, for example, `"/diskann_internal_structure.txt"`.
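The `save_graph` instrumentation above appends two lines per index build: the average out-degree, then the average edge distance (the `_start` line goes to stdout, not the file). A hypothetical helper for reading that file back could look like this; the function name and the two-lines-per-build layout are assumptions based on the snippet, not part of the original code:

```python
def parse_internal_structure(path):
    """Return one (avg_degree, avg_distance) tuple per index build."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    # Pair consecutive lines: even positions are degrees, odd are distances.
    return list(zip(values[0::2], values[1::2]))
```

Because the instrumentation opens the file with `std::ios::app`, repeated builds accumulate in the same file, so the helper returns one tuple per build in order.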
The `requirements.txt` file contains the necessary libraries for all the indexing techniques. To run `run_pretrain.py` and `extractVectors.py` with the best performance, install the latest versions of these libraries.
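A typical way to install the dependencies is inside a fresh virtual environment, for example:

```shell
# Create an isolated environment and install the pinned dependencies.
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
```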
If you are using the code in this repo, please cite the following paper:
```bibtex
@INPROCEEDINGS{10475618,
  author={Taha, Ibraheem and Lissandrini, Matteo and Simitsis, Alkis and Ioannidis, Yannis},
  title={A Study on Efficient Indexing for Table Search in Data Lakes},
  booktitle={2024 IEEE 18th International Conference on Semantic Computing (ICSC)},
  year={2024},
  pages={245-252},
  doi={10.1109/ICSC59802.2024.00046}
}
```
Feel free to explore the techniques and enhancements described, and adjust them according to your project needs!