A Comprehensive Survey and Experimental Comparison of Graph-based Approximate Nearest Neighbor Search
Approcimate Nearest Neighbor Search (ANNS) is a fundamental building block in various application domains. Recently, graph-based algorithms have emerged as a very effective choice to implement ANNS. Our paper (arXiv link, PDF) provides a comprehensive comparative analysis and experimental evaluation of representative graph-based ANNS algorithms on carefully selected datasets of varying sizes and characteristics.
This project contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper. It is worth noting that we reimplement all algorithms based on exactly the same design pattern, programming language (except for the hash table in IEH) and tricks, and experimental setup, which makes the comparison more fair.
we evaluate thirteen representative graph-based ANNS algorithms, and their papers and the original codes are given in the following table.
ALGORITHM | PAPER | CODE |
---|---|---|
KGraph | WWW'2011 | C++/Python |
FANNG | CVPR'2016 | - |
NSG | VLDB'2019 | C++ |
NSSG | TPAMI'2021 | C++/Python |
DPG | TKDE'2019 | C++ |
Vamana | NeurIPS'2019 | - |
EFANNA | arXiv'2016 | C++/MATLAB |
IEH | IEEE T CYBERNETICS'2014 | - |
NSW | IS'2014 | C++/Python |
HNSW | TPAMI'2018 | C++/Python |
NGT-panng | SISAP'2016 | C++/Python |
NGT-onng | arXiv'2018 | C++/Python |
SPTAG-KDT | ACM MM'2012; CVPR'2012; TPAMI'2014 | C++ |
SPTAG-BKT | ACM MM'2012; CVPR'2012; TPAMI'2014 | C++ |
HCNNG | PR'2019 | - |
KDRG | SIGKDD'2011 | - |
Our experiment involves eight real-world datasets popularly deployed by existing works. All datasets are pre-split into base data and query data and come with groundtruth data in the form of the top 20 or 100 neighbors. Additional twelve synthetic datasets are used to test the scalability of each algorithm to the performance of different datasets.
Note that, all base data and query data are converted to fvecs
format, and groundtruth data is converted to ivecs
format. Please refer here for the description of fvecs
and ivecs
format. All datasets in this format can be downloaded from here.
For the optimal parameters of each algorithm on all experimental datasets, see the parameters page.
- GCC 4.9+ with OpenMP
- CMake 2.8+
- Boost 1.55+
$ git clone https://github.com/Lsyhprum/WEAVESS.git
$ cd WEAVESS/
$ mkdir build && cd build/
$ cmake ..
$ make -j
Before building index, you should set the root directory for the dataset in WEAVESS/test/main.cpp
first. Then, you can run the following instructions for build graph index.
cd WEAVESS/build/test/
./main algorithm_name dataset_name build
After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. You can run the following command in current directory for getting other index information, such as average out-degree, graph quality, and the number of connected components.
./main algorithm_name dataset_name info
With the index built, you can run the following commands to perform the search. Related information about the search such as search time, distance evaluation times, candidate set size, average query path length, memory load can be obtained or calculated according to the output log information.
cd WEAVESS/build/test/
./main algorithm_name dataset_name search
Note that the default dev
branch is the evaluation of the overall performance of all algorithms, and the evaluation of a certain component needs to be carried out under the test
branch. For more details, please see our paper.
See here for details.
Please cite our work in your publications if it helps your research:
@misc{wang2021comprehensive,
title={A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search},
author={Mengzhao Wang and Xiaoliang Xu and Qiang Yue and Yuxiang Wang},
year={2021},
eprint={2101.12631},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
Thanks to everyone who provided references for this project. Special thanks to Dr. Weijie Zhao, Dr. Mingjie Li, and Dr. Cong Fu for their assistance in the necessary implementation of this project.