
Why does the recall rate drop significantly after normalization? #570

Open
shaozhixue opened this issue Jun 26, 2024 · 2 comments

Comments


shaozhixue commented Jun 26, 2024

Hello,
When testing the recall rate with the bindings_test_recall.py script, I found that it drops significantly after the vectors are L2-normalized. The metric used is inner product ('ip'). In this case, do the vectors need to be normalized, and why does normalization cause such a large drop in recall?
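For reference (this is not part of the original script): on L2-normalized vectors the inner product is exactly the cosine similarity, so ranking by 'ip' over unit vectors is the same as ranking by cosine. A minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(384, dtype=np.float32)
b = rng.random(384, dtype=np.float32)

def l2_normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Inner product of the normalized vectors...
ip_normalized = float(np.dot(l2_normalize(a), l2_normalize(b)))
# ...equals the cosine similarity of the raw vectors (up to float32 rounding).
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(ip_normalized, cosine)
```

So normalizing does not break the 'ip' metric itself; it changes which metric is effectively being measured.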

[screenshots: recall results and the code snippet]

yurymalkov (Member) commented

Hi @shaozhixue,

Whether to normalize your vectors depends on your application and vector representations.

I don't have all the details, but the optimal parameters for HNSW depend on the data. If you are using the same data for brute force and HNSW and the recall drops, it is probably worth increasing ef.


shaozhixue commented Jun 27, 2024


Hi yurymalkov,
Thanks for your reply. Both brute-force search and HNSW use the same data. I have attached the complete code.

import os
import unittest

import hnswlib
import numpy as np


class RandomSelfTestCase(unittest.TestCase):
    def normalize_vector_l2(self, vector):
        norm = np.linalg.norm(vector)
        if norm == 0:
            return vector
        return vector / norm

    def testRandomSelf(self):
        dim = 384
        num_elements = 100000
        k = 50
        num_queries = 100

        recall_threshold = 0.95

        # Generating sample data and normalizing each vector
        data = np.float32(np.random.random((num_elements, dim)))
        for i, one in enumerate(data):
            data[i] = self.normalize_vector_l2(one)

        # Declaring index
        hnsw_index = hnswlib.Index(space='ip', dim=dim)  # possible options are l2, cosine or ip
        bf_index = hnswlib.BFIndex(space='ip', dim=dim)

        # Initializing both hnsw and brute force indices
        # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
        # during insertion of an element.
        # The capacity can be increased by saving/loading the index, see below.
        #
        # hnsw construction params:
        # ef_construction - controls index search speed/build speed tradeoff
        #
        # M - is tightly connected with internal dimensionality of the data. Strongly affects the memory consumption (~M)
        # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
        hnsw_index.init_index(max_elements=num_elements, ef_construction=200, M=32)
        bf_index.init_index(max_elements=num_elements)

        # Controlling the recall for hnsw by setting ef:
        # higher ef leads to better accuracy, but slower search
        hnsw_index.set_ef(200)

        # Set number of threads used during batch search/construction in hnsw
        # By default using all available cores
        hnsw_index.set_num_threads(4)

        print("Adding batch of %d elements" % len(data))
        hnsw_index.add_items(data)
        bf_index.add_items(data)

        print("Indices built")

        # Generating query data (query normalization is commented out)
        query_data = np.float32(np.random.random((num_queries, dim)))
        # for j, one in enumerate(query_data):
        #     query_data[j] = self.normalize_vector_l2(one)

        # Query the elements and measure recall:
        labels_hnsw, distances_hnsw = hnsw_index.knn_query(query_data, k)
        labels_bf, distances_bf = bf_index.knn_query(query_data, k)

        # Measure recall
        correct = 0
        for i in range(num_queries):
            for label in labels_hnsw[i]:
                for correct_label in labels_bf[i]:
                    if label == correct_label:
                        correct += 1
                        break

        recall_before = float(correct) / (k * num_queries)
        print("recall is :", recall_before)
        self.assertGreater(recall_before, recall_threshold)

        # Test serializing the brute force index
        index_path = 'bf_index.bin'
        print("Saving index to '%s'" % index_path)
        bf_index.save_index(index_path)
        del bf_index

        # Re-initializing, loading the index
        bf_index = hnswlib.BFIndex(space='ip', dim=dim)

        print("\nLoading index from '%s'\n" % index_path)
        bf_index.load_index(index_path)

        # Query the brute force index again to verify that we get the same results
        labels_bf, distances_bf = bf_index.knn_query(query_data, k)

        # Measure recall
        correct = 0
        for i in range(num_queries):
            for label in labels_hnsw[i]:
                for correct_label in labels_bf[i]:
                    if label == correct_label:
                        correct += 1
                        break

        recall_after = float(correct) / (k * num_queries)
        print("recall after reloading is :", recall_after)

        self.assertEqual(recall_before, recall_after)

        os.remove(index_path)


if __name__ == '__main__':
    test = RandomSelfTestCase()
    test.testRandomSelf()

Problem: if the normalization part of the code is removed, the recall rate reaches around 96%.
