
[FEA] Filtered CAGRA to automatically use pre-filtered brute-force when the filter ratio reaches a particular threshold. #252

Open
cjnolet opened this issue Jul 26, 2024 · 12 comments
Assignees
Labels
feature request New feature or request

Comments

@cjnolet
Member

cjnolet commented Jul 26, 2024

While CAGRA offers the capability to specify a pre-filter, heavy filters can end up reducing recall significantly past a certain point. For this reason, we recommend using pre-filtered brute-force in cases where 90%+ of the vectors are being filtered out, since it will only compute distances for the vectors that are not filtered out.

Many users are getting unpleasant surprises when they attempt to filter out 99% of the vectors from a CAGRA index and find that the recall is close to 0. We also have CAGRA integrated in several different places at the moment, and instead of having each integrator manually switch over to pre-filtered brute-force in these cases, we should do the switch in CAGRA itself so everyone benefits from this feature automatically.
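The proposed behavior can be sketched in a few lines. Everything below is illustrative Python/NumPy pseudologic, not the cuVS API: the function names, the bitmap layout (1 bit per row, 1 = keep), and the 90% threshold are assumptions for the sketch.

```python
import numpy as np

# Hypothetical threshold: fall back to brute-force when >= 90% of rows
# are excluded by the pre-filter.
FALLBACK_THRESHOLD = 0.9

def filtered_fraction(filter_bits: np.ndarray) -> float:
    """Fraction of rows excluded by a bitmap filter (1 bit per row, 1 = keep)."""
    bits = np.unpackbits(filter_bits.view(np.uint8))
    return 1.0 - bits.sum() / bits.size

def search_with_fallback(queries, k, filter_bits, cagra_search, brute_force_search):
    """Dispatch to pre-filtered brute-force when the filter is too heavy.

    Both search callables are placeholders standing in for the real
    CAGRA and brute-force search entry points.
    """
    if filtered_fraction(filter_bits) >= FALLBACK_THRESHOLD:
        return brute_force_search(queries, k, filter_bits)
    return cagra_search(queries, k, filter_bits)
```

The key design point is that the dispatch happens inside the search call, so integrators like Milvus or Lucene would not need any code changes.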

@rhdong
Member

rhdong commented Jul 26, 2024

Could the issue mentioned here depend on the specific dataset?

@cjnolet
Member Author

cjnolet commented Jul 26, 2024

No specific dataset, but the core use-case is hybrid search, where Lucene or Milvus might be accepting a filter that has been constructed based on a prior structured search. For example, a user searches for the nearest vectors within a specific geographic region that have certain attributes (e.g., above a certain age, or working at a certain store). The results returned from the structured search are often very small compared to the total number of elements in the index, so the resulting filter only includes maybe 1% of them (sometimes slightly more).

@rhdong
Member

rhdong commented Jul 26, 2024

> No specific dataset, but the core use-case is hybrid search, where Lucene or Milvus might be accepting a filter that has been constructed based on a prior structured search. For example, a user searches for the nearest vectors within a specific geographic region that have certain attributes (e.g., above a certain age, or working at a certain store). The results returned from the structured search are often very small compared to the total number of elements in the index, so the resulting filter only includes maybe 1% of them (sometimes slightly more).

Got it! Thank you for the clarification!

@achirkin
Contributor

Now that we are quite far into implementation, I think more and more that this idea could backfire on us.

  1. At this moment we haven't identified the reason for the recall drop in CAGRA. If the MULTI_CTA version is used, it may be related to a known bug ([BUG] Decreasing recall when increasing threads in ANN benchmark #208, for which a fix is in the works). There's no clear indication that the recall drop at high filtering rates is inherent to the algorithm; making an automatic switch to brute-force may simply be a workaround for a bug.
  2. From the design point of view, this logic goes down a dangerous path. If we start making ad-hoc tweaks that switch some algorithms (against the user's request) to others, it's going to be hard to analyse the performance of a particular algorithm. I think, if users know their data, it should be their responsibility to choose an appropriate index. We could, however, come up with a compromise in the form of a third, "hybrid" index, which would alternate between CAGRA and brute-force according to a pre-defined set of criteria.
  3. From the performance point of view, this may add unwanted overhead for checking the filter "sparsity" in cases where CAGRA is still preferred over brute-force. Consider especially the case of a small batch over a big dataset (n_queries = 1, n_rows >= 10M): overall, the request is fast, but the check may be relatively slow.
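The sparsity check in point (3) is essentially a population count over the filter bitset. One way to avoid paying it per query batch would be to count once when the filter is constructed and cache the result. This is an illustrative sketch only (the class name and layout are assumptions, not the cuVS implementation):

```python
import numpy as np

class CountedBitmapFilter:
    """Bitmap pre-filter (1 bit per row, 1 = keep) that caches its popcount."""

    def __init__(self, bits: np.ndarray):
        self.bits = bits
        byte_view = bits.view(np.uint8)
        # Pay the O(n_rows / 8) popcount once at construction time, so
        # per-query dispatch reduces to a constant-time comparison.
        self.n_kept = int(np.unpackbits(byte_view).sum())
        self.n_rows = byte_view.size * 8

    @property
    def filtered_ratio(self) -> float:
        """Fraction of rows excluded by this filter."""
        return 1.0 - self.n_kept / self.n_rows
```

With the count cached on the filter object, even the n_queries = 1, n_rows >= 10M case pays the scan only when the filter itself changes.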

@rhdong
Member

rhdong commented Nov 18, 2024

> Now that we are quite far into implementation, I think more and more that this idea could backfire on us.
>
>   1. At this moment we haven't identified the reason for the recall drop in CAGRA. If the MULTI_CTA version is used, it may be related to a known bug ([BUG] Decreasing recall when increasing threads in ANN benchmark #208, for which a fix is in the works). There's no clear indication that the recall drop at high filtering rates is inherent to the algorithm; making an automatic switch to brute-force may simply be a workaround for a bug.
>   2. From the design point of view, this logic goes down a dangerous path. If we start making ad-hoc tweaks that switch some algorithms (against the user's request) to others, it's going to be hard to analyse the performance of a particular algorithm. I think, if users know their data, it should be their responsibility to choose an appropriate index. We could, however, come up with a compromise in the form of a third, "hybrid" index, which would alternate between CAGRA and brute-force according to a pre-defined set of criteria.
>   3. From the performance point of view, this may add unwanted overhead for checking the filter "sparsity" in cases where CAGRA is still preferred over brute-force. Consider especially the case of a small batch over a big dataset (n_queries = 1, n_rows >= 10M): overall, the request is fast, but the check may be relatively slow.

Here is the latest benchmark that can be referred to when making this decision: #378 (comment)

@cjnolet
Member Author

cjnolet commented Nov 18, 2024

> There's no clear indication that the recall drop at high filtering rates is a feature of the algorithm.

@achirkin, according to @enp1s0 and @anaruse, significant recall drop at high filtering levels is a fundamental limitation of the algorithm and happens as a result of the hash table filling up during search. I agree that we want to investigate to know with absolute certainty that this is what is happening, but I've been given a fairly strong impression that this is what's happening, and it's not just a limitation of CAGRA itself because many of the CPU-based algorithms struggle with this problem as well.

The most common use-case for filtering is for hybrid search, where 99% of the vectors need to be excluded from the index. Pre-filtered brute-force works well in these cases, and we've been asked by users to fall back to brute-force so that we can return correct results. If it turns out we find a better way to enable CAGRA to do this while maintaining high recall and performance, then I'll be the first to suggest that we remove the automatic fallback to pre-filtered brute-force.

This isn't a specialized or niche use-case. Most users we have encountered who are doing generative AI and RAG in production today need to be able to do hybrid search. Even if this is a stopgap, it's an opportunity to get them using the GPU for search.

@cjnolet cjnolet moved this from Todo to In Progress in VS/ML/DM Primitives Release Board Nov 18, 2024
@achirkin
Contributor

If the filling up of the hash table is really causing a significant recall drop, I think we may be able to expose the hashmap size (bitlen) parameter or improve the heuristic there.
But I'm really not sure about that. Besides #208, which I mentioned above, @rhdong has just found what is likely another bug causing a significant recall drop in some cases with filtering, even with no elements filtered out (@rhdong, please create an issue for this when you have time).

@cjnolet
Member Author

cjnolet commented Nov 18, 2024

> @rhdong has just found what is likely another bug causing a significant recall drop in some cases with filtering, even with no elements filtered out (@rhdong, please create an issue for this when you have time).

Oh, I didn't realize that was with no filtering at all! Yes, that's concerning for sure, and we should definitely create an issue for it.

@achirkin
Contributor

If we really want to compose the search algorithms, maybe we should instead introduce dedicated indices for that?
For example, one could create a bruteforce_enabled_index<upstream>, which would take an upstream index plus a criterion for switching between the upstream index and brute-force (similar to the dynamic_batching index #261).

Or, at the very least, one could add an explicit search parameter in CAGRA (disabled by default), which would allow automatic switching between CAGRA and bruteforce. This would at least leave the control to the user and avoid any surprises.
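The composition idea above can be sketched as a thin wrapper. This is illustrative Python only; the class and method names are assumptions, not an existing cuVS interface:

```python
class BruteForceEnabledIndex:
    """Wrapper index that defers to brute-force when a criterion fires.

    Mirrors the bruteforce_enabled_index<upstream> idea: the user supplies
    the upstream index, the brute-force index, and the switching predicate,
    so the fallback stays explicit and under user control.
    """

    def __init__(self, upstream, brute_force, should_fall_back):
        self.upstream = upstream                  # e.g. a CAGRA index
        self.brute_force = brute_force            # pre-filtered brute-force index
        self.should_fall_back = should_fall_back  # user-supplied predicate

    def search(self, queries, k, prefilter=None):
        # Only consult the criterion when a filter is actually supplied,
        # so unfiltered searches pay no extra cost.
        if prefilter is not None and self.should_fall_back(prefilter):
            return self.brute_force.search(queries, k, prefilter)
        return self.upstream.search(queries, k, prefilter)
```

Because the predicate is injected by the user, benchmarking either backend in isolation stays straightforward: pass `lambda f: False` (or `True`) to pin the dispatch.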

@cjnolet
Member Author

cjnolet commented Nov 18, 2024

The goal here is intentionally to make this work for users without them having to change their code. Users aren't asking for heavy filtering with IVF, mostly because that's nearly impossible to do efficiently with high recall. They are using graph-based algorithms today, and don't want to have to do special things in their code to support the use-cases they need. I don't see this as a surprise in any negative way: they are getting high recall for heavy filters. I have not yet encountered a user who would prefer garbage recall over a loss in performance if it meant getting the correct results.

We will be introducing a tiered index in a future version, and maybe that's ultimately a route for users to have more composability going forward. But Milvus is among the users asking for this today, and automatically switching to bfknn for 99%+ filters is not an unreasonable solution in the meantime (and it will enable a very large number of important users without any code changes to Milvus).

@rhdong
Member

rhdong commented Nov 18, 2024

> @rhdong has just found what is likely another bug causing a significant recall drop in some cases with filtering, even with no elements filtered out (@rhdong, please create an issue for this when you have time).

> Oh, I didn't realize that was with no filtering at all! Yes, that's concerning for sure, and we should definitely create an issue for it.

@achirkin @cjnolet , here it is: #472

@achirkin
Contributor

@cjnolet ok, so to summarize our conversation about my three concerns so far: (2) you're ok with the current solution in terms of design, (3) from the experiments provided by @rhdong the slowdown level is acceptable. Point (1) is largely unanswered in my opinion.
I see you're determined to add this feature in 24.12 as you have a very strong ask for it from the users. I'm ok with that, but I want to outline why I think this should be a temporary solution below.

> I agree that we want to investigate to know with absolute certainty that this is what is happening, but I've been given a fairly strong impression that this is what's happening, and it's not just a limitation of CAGRA itself because many of the CPU-based algorithms struggle with this problem as well.

I assume you mean HNSW here? I suspect it may have a problem with the 'hierarchical' aspect: if the graph is structured so that it resembles a tree, the search process may struggle to move between branches. Much like our IVF methods surely struggle with high filtering ratios because they only go through a fixed number of clusters.
In CAGRA, in contrast, I don't see any fundamental reason for a drop in recall: it starts from random nodes in the graph and goes in arbitrary directions. We have some hypotheses on why recall drops in some cases, but they boil down to issues that can be fixed by tweaking parameters, such as the hash size or the maximum number of search iterations. Issues #208 and #472 also suggest some (if not all) of the observed recall drops come from bugs.
