http://big-ann-benchmarks.com/
The only prerequisites are Python (tested with 3.6) and Docker. The framework also works with newer versions of Python, but this probably requires an updated `requirements.txt` on the host. (Suggestion: copy `requirements.txt` to `requirements${PYTHON_VERSION}.txt` and remove all pinned versions; the original `requirements.txt` has to be kept for the Docker containers.)
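One possible way to carry out this suggestion is sketched below; the file name, Python version, and `==`-style pins are assumptions about your setup:

```bash
# Keep requirements.txt pinned for the Docker images and derive an
# unpinned copy for the host (file name and version are examples only).
cp requirements.txt requirements_py310.txt
sed -i 's/==.*$//' requirements_py310.txt   # drop the "==x.y.z" pins
pip install -r requirements_py310.txt
```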
- Clone the repo.
- Run `pip install -r requirements.txt` (use `requirements_py38.txt` if you have Python 3.8).
- Install Docker by following the official installation instructions. You might also want to follow the post-install steps for running Docker in non-root user mode.
- Run `python install.py` to build all the libraries inside Docker containers.
The framework assumes that all data is stored in `data/`. Please use a symlink if your datasets and indices are supposed to be stored somewhere else, as in the example below. See http://big-ann-benchmarks.com/ for details on the different datasets.
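For instance (the target path is illustrative):

```bash
# Keep the bulky datasets and indices on a larger disk and link them in.
ln -s /mnt/large-disk/big-ann-data data
```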
Before running experiments, datasets have to be downloaded. All preparation can be carried out by calling

```
python create_dataset.py --dataset [bigann-1B | deep-1B | text2image-1B | ssnpp-1B | msturing-1B | msspacev-1B]
```

Note that downloading the datasets can potentially take many hours. For local testing, there exist the smaller random datasets `random-xs` and `random-range-xs`. Furthermore, most datasets have 1M, 10M, and 100M versions; run `python create_dataset.py -h` to get an overview.
Run `python run.py --dataset $DS --algorithm $ALGO`, where `$DS` is the dataset you are running on and `$ALGO` is the name of the algorithm (use `python run.py --list-algorithms` to get an overview). `python run.py -h` provides you with further options. The parameters used by each implementation to build and query the index can be found in `algos.yaml`.
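For illustration, an entry in `algos.yaml` is structured roughly as sketched below. The keys are modeled on the `faiss-t1` entry from memory, so treat them as assumptions and consult the shipped file for the authoritative schema:

```yaml
# Hypothetical sketch of an algos.yaml entry -- names and values are
# assumptions; check the real file before relying on them.
random-xs:                # dataset the entry applies to
  faiss-t1:               # algorithm name as passed to run.py
    docker-tag: billion-scale-benchmark-faissconda  # image built by install.py
    module: benchmark.algorithms.faiss_t1           # wrapper module
    constructor: FaissT1                            # wrapper class
    run-groups:
      base:
        args: ...         # index build parameters
        query-args: ...   # query-time parameter settings to sweep
```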
Run `sudo python plot.py --dataset ...` or `sudo python data_export.py --output res.csv` to plot results or to dump all of them to CSV for further post-processing. To avoid `sudo`, run `sudo chmod -R 777 results/` before invoking these scripts (the result files are written by the Docker containers as root).

To get a table overview of the best recall/AP achieved over a certain threshold, run

```
python3 eval/show_operating_points.py --algorithm $ALGO --threshold $THRESHOLD res.csv
```

where `res.csv` is the file produced by running `data_export.py` as above.
For the track 1 baseline, `python3 eval/show_operating_points.py --algorithm faiss-t1 --threshold 10000 res.csv` produced the following output:

```
                         recall/ap
algorithm dataset
faiss-t1  bigann-1B       0.634510
          deep-1B         0.650280
          msspacev-1B     0.728861
          msturing-1B     0.703611
          ssnpp-1B        0.753780
          text2image-1B   0.069275
```
After running the installation, we can evaluate the baseline as follows:

```bash
for DS in bigann-1B deep-1B text2image-1B ssnpp-1B msturing-1B msspacev-1B; do
    python run.py --dataset $DS --algorithm faiss-t1
done
```

On a 28-core Xeon E5-2690 v4 with a 100 MB/s download link, carrying out the baseline experiments took roughly 7 days.
To evaluate the results, run:

```bash
sudo chmod -R 777 results/
python data_export.py --output res.csv
python3.8 eval/show_operating_points.py --algorithm faiss-t1 --threshold 10000 res.csv
```
- Add your algorithm into `benchmark/algorithms` by providing a small Python wrapper inheriting from `BaseANN`, which is defined in `benchmark/algorithms/base.py`. See `benchmark/algorithms/faiss_t1.py` for an example, and the sketch after this list.
- Add a Dockerfile in `install/`.
- Edit `algos.yaml` with the parameter choices you would like to test.
- (Add an option to download pre-built indexes, as seen in `faiss_t1.py`.)
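A minimal sketch of such a wrapper is shown below. The method names mirror those used by `faiss_t1.py`, but they are reproduced from memory, so treat them as assumptions and check `benchmark/algorithms/base.py` for the exact interface:

```python
# benchmark/algorithms/my_algo.py -- hypothetical wrapper (sketch only).
# The exact interface is defined in benchmark/algorithms/base.py; see
# benchmark/algorithms/faiss_t1.py for a complete, working example.
from benchmark.algorithms.base import BaseANN


class MyAlgo(BaseANN):
    def __init__(self, metric, index_params):
        self._metric = metric              # e.g. "euclidean"
        self._index_params = index_params  # build parameters from algos.yaml
        self._query_args = None

    def fit(self, dataset):
        # Build the index for the named dataset and keep it in memory.
        raise NotImplementedError

    def load_index(self, dataset):
        # Return True if a pre-built index was loaded; returning False
        # makes the framework call fit() instead.
        return False

    def set_query_arguments(self, query_args):
        # Called once for every query-parameter setting listed in algos.yaml.
        self._query_args = query_args

    def query(self, X, k):
        # Answer k-NN queries for the batch X and store the results so the
        # framework can retrieve them afterwards.
        raise NotImplementedError

    def __str__(self):
        # Identifies this parameter setting in the exported results.
        return f"MyAlgo({self._index_params}, {self._query_args})"
```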