You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
CAGRA currently supports pre-filtering, but it doesn't support the most practical use-cases for filtering (where the filter itself is > 50% sparsity) because the recall drops so low that the results just aren't usable at all. It is believes this is largely because the hash table fills up during search, causing the algorithm to ignore any additional potential edges as its popagating through the graph. We need to verify this assumption, but also think of ways we might be able to improve the search space to increase recall.
This problem is obviously not unique to CAGRA, but heavy prefiltering is a hard problem in general. When the number of vectors not being filtered out is <=10% of the graph, it can be efficient to use pre-filtered brute-force, but that also won't scale particularly well as we hit indexes into the billions and 10's to 100's of billions. We need to find a good solution for improving the filtered search over the graph (or using another method altogether that is cheap to do on the fly but scalable).
CAGRA currently supports pre-filtering, but it doesn't support the most practical use-cases for filtering (where the filter itself is > 50% sparsity) because the recall drops so low that the results just aren't usable at all. It is believes this is largely because the hash table fills up during search, causing the algorithm to ignore any additional potential edges as its popagating through the graph. We need to verify this assumption, but also think of ways we might be able to improve the search space to increase recall.
This problem is obviously not unique to CAGRA, but heavy prefiltering is a hard problem in general. When the number of vectors not being filtered out is <=10% of the graph, it can be efficient to use pre-filtered brute-force, but that also won't scale particularly well as we hit indexes into the billions and 10's to 100's of billions. We need to find a good solution for improving the filtered search over the graph (or using another method altogether that is cheap to do on the fly but scalable).
cc @rhdong @lowener @achirkin
The text was updated successfully, but these errors were encountered: