[BUG] CAGRA recall is very low for moderate to heavy pre-filtering #429

cjnolet · 2024-10-28T18:39:37Z

CAGRA currently supports pre-filtering, but it doesn't support the most practical use-cases for filtering (where the filter itself is > 50% sparsity) because the recall drops so low that the results just aren't usable at all. It is believes this is largely because the hash table fills up during search, causing the algorithm to ignore any additional potential edges as its popagating through the graph. We need to verify this assumption, but also think of ways we might be able to improve the search space to increase recall.

This problem is obviously not unique to CAGRA, but heavy prefiltering is a hard problem in general. When the number of vectors not being filtered out is <=10% of the graph, it can be efficient to use pre-filtered brute-force, but that also won't scale particularly well as we hit indexes into the billions and 10's to 100's of billions. We need to find a good solution for improving the filtered search over the graph (or using another method altogether that is cheap to do on the fly but scalable).

cc @rhdong @lowener @achirkin

cjnolet added the bug Something isn't working label Oct 28, 2024

cjnolet added this to VS/ML/DM Primitives Release Board Oct 28, 2024

cjnolet moved this to In Progress in VS/ML/DM Primitives Release Board Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] CAGRA recall is very low for moderate to heavy pre-filtering #429

[BUG] CAGRA recall is very low for moderate to heavy pre-filtering #429

cjnolet commented Oct 28, 2024 •

edited

Loading

[BUG] CAGRA recall is very low for moderate to heavy pre-filtering #429

[BUG] CAGRA recall is very low for moderate to heavy pre-filtering #429

Comments

cjnolet commented Oct 28, 2024 • edited Loading

cjnolet commented Oct 28, 2024 •

edited

Loading