[FEA] Add filtering option to benchmarks #479

Open
4 tasks
achirkin opened this issue Nov 20, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@achirkin
Contributor

Is your feature request related to a problem? Please describe.
As cuVS algorithms gain more pre-filtering support, we need to be able to benchmark this functionality and compare the algorithms.

Describe the solution you'd like

  • Add a search-only option to the benchmark executable: --filter_ratio x, where x is a float between 0 and 1 giving the proportion of records in the index that pass the filter. The default is 1, which means the legacy behavior (no filtering).
  • Modify the dataset class: add an extra bitset field of the same size as the dataset itself; allow generating it (or loading it from a file? When generating, maybe also expose the random seed parameter?).
    • Note 1: this way, the filter is set up once per whole benchmark; this ensures low overheads and a fair comparison.
    • Note 2: we cannot use raft's bitset here, because the common benchmark headers don't depend on raft.
  • Pass the bitset filter to the algorithms. I think the easiest way would be to add a new API function set_filter(bitset ptr), similar to set_search_parameters. It would be called once per benchmark loop, and only if the filter ratio is lower than 1.
  • Adapt the ground truth calculation. I think the easiest way here would be to replace the total_count in the recall calculation with the number of non-filtered items. This will be noisy (when the count is low) and will not be entirely correct (the recall is averaged across threads/loops), but it's better than nothing.
    • Note: there's a way to improve the quality with fairly low effort when k is smaller than the max_k available in the ground truth file: consider not the first k values in the ground truth, but the first k non-filtered values in the ground truth.
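To make the bitset part of the proposal concrete, here is a minimal sketch of a raft-free bitset that is generated once from a filter ratio and a seed. Everything in it is an assumption for illustration: the names filter_bitset and generate_filter, the 32-bit word layout, and the choice of an independent Bernoulli draw per record are not taken from the cuVS code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the bitset field proposed for the dataset class.
// It depends only on the standard library (no raft).
// Bit i is set when record i passes the filter.
struct filter_bitset {
  std::size_t n_rows;
  std::vector<std::uint32_t> words;

  explicit filter_bitset(std::size_t n) : n_rows(n), words((n + 31) / 32, 0u) {}

  void set(std::size_t i) { words[i / 32] |= (std::uint32_t{1} << (i % 32)); }
  bool test(std::size_t i) const { return ((words[i / 32] >> (i % 32)) & 1u) != 0; }

  // Number of records that pass the filter.
  std::size_t count() const
  {
    std::size_t c = 0;
    for (std::size_t i = 0; i < n_rows; ++i) { c += test(i) ? 1 : 0; }
    return c;
  }
};

// Hypothetical generator: each record passes independently with probability
// filter_ratio, so the realized proportion is only approximately filter_ratio.
// The same (n_rows, filter_ratio, seed) triple reproduces the same bitset
// with a given standard library implementation.
inline filter_bitset generate_filter(std::size_t n_rows, double filter_ratio, std::uint64_t seed)
{
  assert(filter_ratio >= 0.0 && filter_ratio <= 1.0);
  filter_bitset bits(n_rows);
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution passes(filter_ratio);
  for (std::size_t i = 0; i < n_rows; ++i) {
    if (passes(rng)) { bits.set(i); }
  }
  return bits;
}
```

A benchmark could call generate_filter once after loading the dataset and hand a pointer to the result to each algorithm through the proposed set_filter. If an exact proportion is preferred over an approximate one, shuffling the record indices and setting the first round(filter_ratio * n_rows) of them would be an alternative.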
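The recall adjustment from the last note could look roughly like the sketch below for a single query. The helper name filtered_recall_counts, the recall_counts struct, and the use of std::vector<bool> for the filter are hypothetical; the sketch also assumes the ground truth row is sorted by distance and that the returned ids contain no duplicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-query contribution to the recall: hits / total.
struct recall_counts {
  std::size_t hits;   // returned ids found among the expected neighbors
  std::size_t total;  // number of expected neighbors (at most k)
};

// Hypothetical helper, not an existing cuVS function.
//   gt_row : ground-truth neighbor ids for one query, nearest first (max_k entries)
//   found  : ids returned by the algorithm for the same query
//   passes : passes[i] is true when record i passes the filter
//   k      : number of neighbors requested
inline recall_counts filtered_recall_counts(const std::vector<std::uint32_t>& gt_row,
                                            const std::vector<std::uint32_t>& found,
                                            const std::vector<bool>& passes,
                                            std::size_t k)
{
  // Take the first k ground-truth ids that pass the filter,
  // instead of the first k ids overall.
  std::vector<std::uint32_t> expected;
  for (std::uint32_t id : gt_row) {
    if (expected.size() == k) { break; }
    assert(id < passes.size());
    if (passes[id]) { expected.push_back(id); }
  }

  // Count the returned ids that appear among the expected neighbors.
  std::size_t hits = 0;
  const std::size_t n = std::min(k, found.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (std::find(expected.begin(), expected.end(), found[i]) != expected.end()) { ++hits; }
  }

  // total replaces the fixed per-query count in the recall denominator;
  // it is smaller than k when fewer than k ground-truth ids pass the filter.
  return {hits, expected.size()};
}
```

Summing hits and total over all queries and dividing gives the recall. When max_k is much larger than k, most queries still get k expected neighbors, which is where this approach reduces the noise mentioned above.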