
[Feat] CAGRA filtering with BFKNN when sparsity matching threshold #378

Open
wants to merge 29 commits into base: branch-24.12

Conversation

rhdong
Member

@rhdong rhdong commented Oct 2, 2024

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type    Queries  n_rows   Dim   K     Metric         Layout    BruteF/Cagra    Sparsity     Recall rate (%)   Search Time (ms)   Build Time (ms)    Total Time (ms)    Throughput (q/s)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
float   100      1048576  32    64    InnerProduct   row       Brute           0.100        100.000           4.041              15891.638          15895.679          6.291
float   100      1048576  32    64    InnerProduct   row       Cagra           0.100         69.922           0.825              16162.612          16163.438          6.187
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.300        100.000           4.216              16265.309          16269.525          6.146
float   100      1048576  32    64    InnerProduct   row       Cagra           0.300         67.547           0.857              16227.900          16228.758          6.162
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.500        100.000           4.285              16110.632          16114.916          6.205
float   100      1048576  32    64    InnerProduct   row       Cagra           0.500         64.344           0.881              15642.374          15643.255          6.393
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.800        100.000           4.268              16707.514          16711.783          5.984
float   100      1048576  32    64    InnerProduct   row       Cagra           0.800         37.547           1.312              20631.773          20633.085          4.847
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.900        100.000           4.149              16323.327          16327.475          6.125
float   100      1048576  32    64    InnerProduct   row       Cagra           0.900         20.703           0.887              16195.060          16195.947          6.174
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.990        100.000           5.542              20701.861          20707.403          4.829
float   100      1048576  32    64    InnerProduct   row       Cagra           0.990          2.391           0.942              17216.771          17217.714          5.808

@rhdong rhdong added feature request New feature or request non-breaking Introduces a non-breaking change labels Oct 2, 2024
@rhdong rhdong requested a review from a team as a code owner October 2, 2024 19:00
@github-actions github-actions bot added the cpp label Oct 2, 2024
@rhdong rhdong changed the base branch from branch-24.10 to branch-24.12 October 2, 2024 20:03
@@ -140,6 +142,61 @@ void search_main(raft::resources const& res,
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances,
CagraSampleFilterT sample_filter = CagraSampleFilterT())
{
if constexpr (!std::is_same_v<CagraSampleFilterT,
Member

Can you pull this out into a separate function that can be invoked here, please? This search_main function is getting pretty massive.

Member Author

Done


cjnolet commented Oct 28, 2024

@rhdong it looks like the recall rates for CAGRA filtered search are all pretty terrible (and I can recall some other experiments you did recently that seemed to have shown much better recall for the higher filter rates... but i could be mistaken). Was 0.9 just chosen because I mentioned it's a common case or did you choose it based on your benchmarks above?

Also, I would do L2 in addition to inner product. I don't think nn-descent actually supports inner product atm. Did you use ivf-pq here to build the graph or do we have a bug where CAGRA is not throwing an error when inner product is used with nn-descent?


rhdong commented Oct 28, 2024

> [quoted @cjnolet's comment above]

The 0.9 is based on the benchmark I mentioned in Slack, and I used NN-Descent with Euclidean distance, so it sounds like there's no need to re-test them. (Revised on Oct 29)

@rhdong rhdong requested a review from cjnolet October 29, 2024 22:27
@achirkin (Contributor) left a comment

I'd suggest not to add an extra member to the cagra index, to avoid (1) complicating it even more than it is right now, and (2) changing the ABI.

cpp/src/neighbors/detail/cagra/cagra_search.cuh (3 outdated review threads, resolved)
cuvs::neighbors::brute_force::build(res, *brute_force_dataset, index.metric());

auto brute_force_queries = queries;
auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0);
Contributor

Suggested change:

```diff
-auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0);
+// Allocate the padded queries in the workspace resource
+auto padding_queries = raft::make_device_mdarray<T, int64_t>(
+  res,
+  raft::resource::get_workspace_resource(res),
+  raft::make_extents<int64_t>(n_queries, dataset_view.extent(1)));
+// Copy the queries and fill the padded elements with zeros
+raft::linalg::map_offset(res,
+                         padding_queries.view(),
+                         [queries, stride = dataset_view.extent(1)] __device__(int64_t i) {
+                           auto row_ix = i / stride;
+                           auto el_ix  = i % stride;
+                           return el_ix < queries.extent(1) ? queries(row_ix, el_ix) : T{0};
+                         });
```

Member Author

Thank you for the suggestion! I found a way to allocate the queries in the workspace while still reusing copy_with_padding, which helps keep the code clean. If I missed something, please feel free to point it out.

Contributor

I'd prefer to stick to raft primitives and eventually remove copy_with_padding, because the latter is not a commonly used primitive, so one has to go read its code to understand exactly what it does.
But I don't mind keeping it here for now.

cpp/src/neighbors/detail/cagra/cagra_search.cuh (outdated review thread, resolved)

rhdong commented Nov 15, 2024

Hi @achirkin @cjnolet @lowener, per the benchmark results on deep-image-96-inner with 9,990,000 rows on an A6000 32GB, the sparsity-computation overhead can be ignored; the performance differences look like random noise. The total latency is 7.29 ms vs 7.32 ms at n_queries=1, and 18.88 ms vs 18.87 ms at n_queries=1000. The sparsity computation itself takes only about 0.013 ms according to the sparsity (raft::popc) benchmark, which is consistent with the CAGRA-ANN benchmark. What do you think about keeping the current implementation? (Revised Nov 15, 9:39 AM)

The benchmark of sparsity(raft::popc):

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
PopcBenchI64/0/manual_time       0.005 ms        0.041 ms       131086 2#131072#0.4
PopcBenchI64/1/manual_time       0.009 ms        0.045 ms        77488 8#131072#0.5
PopcBenchI64/2/manual_time       0.014 ms        0.050 ms        49516 16#131072#0.2
PopcBenchI64/3/manual_time       0.004 ms        0.040 ms       182993 2#8192#0.4
PopcBenchI64/4/manual_time       0.004 ms        0.040 ms       157537 16#8192#0.5
PopcBenchI64/5/manual_time       0.009 ms        0.045 ms        76827 128#8192#0.2
PopcBenchI64/6/manual_time       0.012 ms        0.047 ms        56139 1024#8192#0.1
PopcBenchI64/7/manual_time       0.013 ms        0.048 ms        55972 1024#8192#0.1
PopcBenchI64/8/manual_time       0.013 ms        0.048 ms        55916 1024#8192#0.1
PopcBenchI64/9/manual_time       0.013 ms        0.048 ms        55888 1024#8192#0.1
PopcBenchI64/10/manual_time      0.013 ms        0.048 ms        55995 1024#8192#0.1
PopcBenchI64/11/manual_time      0.013 ms        0.048 ms        55764 1024#8192#0.1
PopcBenchI64/12/manual_time      0.013 ms        0.048 ms        55728 1024#8192#0.1
PopcBenchI64/13/manual_time      0.013 ms        0.048 ms        55718 1024#8192#0.1
PopcBenchI64/14/manual_time      0.013 ms        0.048 ms        55757 1024#8192#0.4
PopcBenchI64/15/manual_time      0.013 ms        0.048 ms        55731 1024#8192#0.5
PopcBenchI64/16/manual_time      0.013 ms        0.048 ms        55689 1024#8192#0.2
PopcBenchI64/17/manual_time      0.013 ms        0.048 ms        55713 1024#8192#0.4
PopcBenchI64/18/manual_time      0.013 ms        0.048 ms        55787 1024#8192#0.5
PopcBenchI64/19/manual_time      0.013 ms        0.048 ms        55507 1024#8192#0.2
PopcBenchI64/20/manual_time      0.013 ms        0.048 ms        55536 1024#8192#0.4
PopcBenchI64/21/manual_time      0.013 ms        0.048 ms        55734 1024#8192#0.5
PopcBenchI64/22/manual_time      0.013 ms        0.048 ms        55760 1024#8192#0.2
PopcBenchI64/23/manual_time      0.013 ms        0.048 ms        55594 1024#8192#0.4
PopcBenchI64/24/manual_time      0.013 ms        0.048 ms        55671 1024#8192#0.5
PopcBenchI64/25/manual_time      0.013 ms        0.048 ms        55660 1024#8192#0.2
PopcBenchI64/26/manual_time      0.013 ms        0.048 ms        55663 1024#8192#0.5
PopcBenchI64/27/manual_time      0.013 ms        0.048 ms        55699 1024#8192#0.2
PopcBenchI64/28/manual_time      0.013 ms        0.048 ms        55718 1024#8192#0.4
PopcBenchI64/29/manual_time      0.013 ms        0.048 ms        55729 1024#8192#0.5
PopcBenchI64/30/manual_time      0.013 ms        0.048 ms        55663 1024#8192#0.2
PopcBenchI64/31/manual_time      0.013 ms        0.048 ms        55674 1024#8192#0.4
PopcBenchI64/32/manual_time      0.013 ms        0.048 ms        55485 1024#8192#0.5
PopcBenchI64/33/manual_time      0.013 ms        0.048 ms        55638 1024#8192#0.2
PopcBenchI64/34/manual_time      0.206 ms        0.239 ms         3393 1#1073741824#0.5
PopcBenchI64/35/manual_time      0.206 ms        0.239 ms         3394 1#1073741824#0.2
PopcBenchI64/36/manual_time      0.206 ms        0.239 ms         3394 1#1073741824#0.1
PopcBenchI64/37/manual_time      0.206 ms        0.239 ms         3402 1#1073741824#0.01
The benchmark of CAGRA on deep-image-96-inner with the following configuration:
name: cuvs_cagra
constraints:
  build: cuvs_bench.config.algos.constraints.cuvs_cagra_build
  search: cuvs_bench.config.algos.constraints.cuvs_cagra_search
groups:
  base:
    build:
      graph_degree: [64]
      intermediate_graph_degree: [96]
      graph_build_algo: ["NN_DESCENT"]
    search:
      itopk: [256]
      search_width: [1]
dataset: deep-image-96-inner
dim: 96
distance: euclidean
gpu_driver_version: 12.4
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 384
gpu_mem_freq: 8001000000.000000
gpu_mem_global_size: 51033931776
gpu_mem_shared_size: 102400
gpu_name: NVIDIA RTX A6000
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 84
gpu_sm_freq: 1800000000.000000
host_cores_used: 16
host_cpu_freq_max: 5881000000
host_cpu_freq_min: 400000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 134152171520
host_total_swap_size: 66571988992
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                  Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.89 / **no sparsity** computing /  (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.29 ms         7.29 ms           96   7.28165m   7.28635m   0.108229   0.699489        137.243/s        256        100          1            1            96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       18.8 ms         18.8 ms           37  0.0188214  0.0188264   0.108476   0.696578       53.1169k/s        256        100       1000            1           37k

# filter by cagra / 0.89 / **with sparsity** computing  (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.32 ms         7.32 ms           96   7.31196m   7.31662m   0.111042   0.702396        136.675/s        256        100          1            1            96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       18.7 ms         18.7 ms           37  0.0186631  0.0186691   0.109044   0.690756       53.5646k/s        256        100       1000            1           37k

#filter by **brute force** / 0.90 / with sparsity computing  (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       17.9 ms         17.9 ms           39  0.0178896   0.017895   0.102308   0.697904        55.8817/s        256        100          1            1            39
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1        409 ms          409 ms            2   0.409333   0.409342   0.100875   0.818684       2.44295k/s        256        100       1000            1            2k

# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1      0.167 ms        0.167 ms         4180   162.573u   166.937u   0.968471   0.697795       5.99031k/s        256        100          1            1         4.18k
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.86 ms         7.86 ms           89   7.85858m   7.86357m   0.992259   0.699858       127.169k/s        256        100       1000            1           89k

@rhdong rhdong requested a review from achirkin November 15, 2024 03:09

achirkin commented Nov 15, 2024

Hi @rhdong, thanks for the initial benchmarks, but we'd need something more conclusive.

First of all, please check what happens to the recall (a value of 0.1 suggests there's a bug). Let's aim at realistic levels, e.g. 0.95. Please choose an appropriate build/search config; I'd start with the common default graph degrees 48/32 and a search itopk of 256.

Second, with this benchmark we aim to find the cases where the introduced change can hurt performance most. I figure such a case could be a larger dataset (50M-100M records), where the linear sparsity check takes longer while CAGRA is still fast. I think the ideal candidate for this is a larger subset of the deep dataset you used. The most relevant search setting here is n_queries = 1 (again, less time spent in CAGRA search compared to the sparsity check), although checking bigger batch sizes as well wouldn't hurt.

Third, the benchmark setup. There's no need to use the throughput mode here (your changes do not affect multithreading), so I'd stick to the latency mode to save time; but make sure to set the benchmark time a little longer and add warmup iterations for more precise results (e.g. --benchmark_min_time=3s --benchmark_min_warmup_time=0.001).

Finally, to exclude the filtering-vs-no-filtering overheads from the comparison, I'd suggest benchmarking with filtering only: compare the two executables before and after the introduced change. It makes sense to test a few sparsity levels, as you did in the beginning, so that both the cagra and bruteforce codepaths are benchmarked.


rhdong commented Nov 15, 2024

> [quoted @achirkin's comment above]

Hi @achirkin, thank you very much for your suggestions. As we discussed in last night's meeting, I have run the benchmark (the results) as you requested on deep-image-96, and I'm working on doing the same for deep-100M. There is only one difference: the two arguments --benchmark_min_time=3s and --benchmark_min_warmup_time=0.001 are not supported in cuvs_bench.


rhdong commented Nov 18, 2024


Hi @achirkin, @cjnolet, @lowener, here are the benchmark results on deep-100M. The conclusion is unchanged: there is no negative effect on performance; in the worst case, perf decreased by about 1%.

dataset: deep-100M
dim: 96
distance: euclidean
gpu_driver_version: 12.7
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 5120
gpu_mem_freq: 1512000000.000000
gpu_mem_global_size: 85097971712
gpu_mem_shared_size: 167936
gpu_name: NVIDIA A100 80GB PCIe
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 108
gpu_sm_freq: 1410000000.000000
host_cores_used: 64
host_cpu_freq_max: 2250000000
host_cpu_freq_min: 1500000000
host_pagesize: 4096
host_processors_sysconf: 256
host_processors_used: 256
host_total_ram_size: 1081985519616
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

Sparsity = 0.89

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.89 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.999 ms        0.998 ms        13919   986.065u   999.468u    0.10397    13.9116        1000.53/s        128         10          1       13.919k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.21 ms         2.21 ms         6339   2.19542m   2.20986m   0.106152    14.0083        452.518/s        256         10          1        6.339k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.27 ms         2.27 ms         6155   2.25913m   2.27269m    0.10265    13.9884       440.007k/s        128         10       1000        6.155M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.89 ms         4.89 ms         2858   4.87807m   4.89017m    0.10583    13.9761       204.492k/s        256         10       1000        2.858M algo="single_cta"


# filter by cagra / 0.89 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (2 / 999) = 0.2%
raft_cagra.dim32/0/0/process_time/real_time       1.01 ms         1.01 ms        13864   995.968u   1007.31u    0.10311    13.9654         992.74/s        128         10          1       13.864k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.20 ms         2.20 ms         6372   2.18463m    2.1961m   0.105728    13.9936        455.352/s        256         10          1        6.372k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.31 ms         2.31 ms         6043   2.30312m   2.31494m    0.10375    13.9892       431.977k/s        128         10       1000        6.043M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.94 ms         4.94 ms         2835    4.9261m    4.9381m    0.10522    13.9995       202.507k/s        256         10       1000        2.835M algo="single_cta"

Sparsity = 0.5

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.50 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.967 ms        0.967 ms        14480   955.493u    966.77u    0.46733    13.9988        1034.37/s        128         10          1        14.48k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.08 ms         2.08 ms         6720   2.07125m   2.08272m   0.455119    13.9959        480.142/s        256         10          1         6.72k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.22 ms         2.22 ms         6298   2.20983m   2.22182m    0.46974     13.993       450.082k/s        128         10       1000        6.298M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.90 ms         4.90 ms         2863   4.88828m    4.9002m    0.45572    14.0293       204.073k/s        256         10       1000        2.863M algo="single_cta"

# filter by cagra / 0.50 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (9 / 967) = 1%
raft_cagra.dim32/0/0/process_time/real_time      0.976 ms        0.976 ms        14336   965.031u   976.492u    0.47207     13.999        1024.07/s        128         10          1       14.336k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.07 ms         2.07 ms         6756   2.05985m   2.07132m   0.452191    13.9938        482.784/s        256         10          1        6.756k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.26 ms         2.26 ms         6175   2.25309m   2.26489m    0.46716    13.9857       441.524k/s        128         10       1000        6.175M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.94 ms         4.94 ms         2835   4.93155m   4.94368m    0.45646    14.0153       202.279k/s        256         10       1000        2.835M algo="single_cta"

Sparsity = 0.1

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.10 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.934 ms        0.934 ms        14990     922.6u   933.953u    0.84729         14        1070.72/s        128         10          1        14.99k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.95 ms         1.95 ms         7188   1.93549m   1.94687m   0.748637    13.9941        513.646/s        256         10          1        7.188k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.16 ms         2.16 ms         6472   2.14989m   2.16168m    0.84653    13.9904       462.603k/s        128         10       1000        6.472M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.39 ms         4.39 ms         3189   4.38259m   4.39459m     0.7505    14.0144       227.552k/s        256         10       1000        3.189M algo="single_cta"


# filter by cagra / 0.10 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (9 / 934) = 1%
raft_cagra.dim32/0/0/process_time/real_time      0.943 ms        0.943 ms        14845   931.751u    943.14u    0.84642    14.0009        1060.29/s        128         10          1       14.845k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.93 ms         1.93 ms         7229   1.92334m   1.93483m    0.74843    13.9869         516.84/s        256         10          1        7.229k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.21 ms         2.21 ms         6340   2.19379m   2.20577m    0.84641    13.9846       453.357k/s        128         10       1000         6.34M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.44 ms         4.44 ms         3153   4.42647m   4.43852m    0.74817    13.9946       225.301k/s        256         10       1000        3.153M algo="single_cta"

Sparsity = 0.9001 (falls back to brute force with prefilter and sparsity computing)

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by **brute force** / 0.9001 / with sparsity computing  (PR #378)
raft_cagra.dim32/0/0/process_time/real_time       59.6 ms         59.6 ms          236   0.059615  0.0596262   0.113559    14.0718        16.7712/s        128         10          1           236 algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       60.0 ms         60.0 ms          234  0.0600018  0.0600131  0.0987179    14.0431         16.663/s        256         10          1           234 algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2813 ms         2813 ms            5    2.81277    2.81281    0.10188    14.0641        355.516/s        128         10       1000            5k algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       2816 ms         2815 ms            5     2.8155    2.81554    0.09832    14.0777        355.172/s        256         10       1000            5k algo="single_cta"

No prefilter & no sparsity computing, CAGRA

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.815 ms        0.815 ms        17169   804.125u   815.174u    0.93833    13.9957       1.22673k/s        128         10          1       17.169k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.76 ms         1.76 ms         7963   1.74589m   1.75703m   0.973038    13.9912        569.144/s        256         10          1        7.963k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       1.77 ms         1.77 ms         7899   1.75715m    1.7687m    0.93833    13.9709       565.389k/s        128         10       1000        7.899M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       3.70 ms         3.70 ms         3788   3.69044m   3.70208m    0.97275    14.0235       270.118k/s        256         10       1000        3.788M algo="single_cta"

@achirkin (Contributor) left a comment:
Thank you @rhdong for the benchmarks, I think the observed 1-2% overhead is a reasonable compromise.

I have one more request though: make this a user-controlled feature. I think we'd better keep it off by default to avoid any surprises. It also may have some compatibility issues with other features (since the sparsity check runs a kernel, it's likely to interfere with the persistent kernel feature of single-cta algorithm).

raft::device_matrix_view<InternalIdxT, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances,
CagraSampleFilterT& sample_filter,
double threshold_to_bf = 0.9)
@achirkin (Contributor) commented Nov 18, 2024:
Could you please make it possible to disable/enable this feature by the user (see #252 (comment) for the reasoning):

  1. Expose threshold_to_bf as CAGRA search parameter and set it to 1.0 by default there
  2. Add a check here: if threshold_to_bf >= 1.0 then disable further checks and proceed with CAGRA search immediately (i.e. no need to run the sparsity check).
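The early-exit dispatch suggested above can be sketched as follows. This is a minimal illustration, not the actual cuVS implementation: `search_params`, `route`, and `choose_search_path` are hypothetical names, and the sparsity computation is stubbed as a callable so the sketch can show that the kernel is never launched when the default threshold of 1.0 is kept.

```cpp
#include <cassert>

// Hypothetical sketch of the reviewer's suggestion: threshold_to_bf lives in
// the search parameters and defaults to 1.0, which disables the feature.
struct search_params {
  double threshold_to_bf = 1.0;  // 1.0 => never fall back to brute force
};

enum class route { cagra, brute_force };

// sparsity_fn stands in for the (kernel-launching) bitset sparsity
// computation; it is only invoked when the threshold makes it relevant.
template <typename SparsityFn>
route choose_search_path(const search_params& params, SparsityFn&& sparsity_fn)
{
  // Early exit: with the default threshold the sparsity kernel never runs,
  // so the feature is off unless the user opts in.
  if (params.threshold_to_bf >= 1.0) { return route::cagra; }
  double sparsity = sparsity_fn();
  return (sparsity >= params.threshold_to_bf) ? route::brute_force : route::cagra;
}
```

With the default parameters the sparsity callable is never invoked, which also avoids the interference with the persistent-kernel feature mentioned above.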

@cjnolet (Member) commented Nov 18, 2024:
Most users with a filter are going to be specifying the filter in batch, and will know the sparsity of the filter. Instead of turning this feature off by default, I suggest we allow the user-specified filter to know its own nnz unless updated.

Turning this off by default undermines the fundamental benefits of this feature.

A reviewer (Member) replied:
Most users are not specifying a filter, and when they do, it's expected the filter is going to be heavy. This should not impact all users.

@cjnolet (Member) commented Nov 18, 2024:
@artem any user specifying a filter is opting in to this feature. It's already off by default. The users specifying a filter are going to be specifying a heavy filter most of the time. I've asked James to build this feature because it's needed. I don't think it's a bad idea to expose the threshold at which it crosses over to brute force, but we will be setting it to 99% by default.

@rhdong rhdong requested a review from achirkin November 19, 2024 00:19
@achirkin (Contributor) left a comment:
Thanks for the updates! Please move the new search argument to the search_params structure.

A reviewer (Contributor) commented:

Let's perhaps remove this from the benchmark wrapper for now?
The reason I'm suggesting this, is that filter creation should probably be a part of common benchmarking infrastructure rather than specific for CAGRA and, therefore, is a little out of the scope of this PR.

Review comment on cpp/include/cuvs/neighbors/cagra.hpp (outdated, resolved)
@rhdong rhdong requested a review from achirkin November 19, 2024 21:45
@rhdong rhdong requested a review from achirkin November 20, 2024 11:00
@achirkin (Contributor) left a comment:
Thanks for the updates! LGTM on cagra_search.cuh side

auto brute_force_dataset = raft::make_device_matrix_view<const T, int64_t, raft::row_major>(
strided_dataset.view().data_handle(), strided_dataset.n_rows(), strided_dataset.stride());

auto brute_force_idx = cuvs::neighbors::brute_force::build(res, brute_force_dataset, metric);
A reviewer (Member) commented:
This is being called each and every time a user performs a search? There's overhead in this call, and this should cache off the built brute-force index because for many common distances this computes a set of norms.
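The caching pattern being asked for might look like the sketch below. All names here (`brute_force_index`, `cagra_index_state`, `get_or_build_bf`) are illustrative stand-ins, not cuVS types: the point is only that the brute-force build, including its norm precomputation, runs once per index and is reused on subsequent searches.

```cpp
#include <memory>
#include <mutex>

// Stand-in for the brute-force index; building it computes dataset norms.
struct brute_force_index { int norms_computed = 0; };

// Hypothetical per-CAGRA-index state that lazily caches the brute-force index.
struct cagra_index_state {
  std::unique_ptr<brute_force_index> bf_cache;
  std::mutex bf_mutex;

  // Returns the cached brute-force index, building it (and its norms)
  // only on the first call; later searches reuse the same object.
  brute_force_index& get_or_build_bf()
  {
    std::lock_guard<std::mutex> lock(bf_mutex);
    if (!bf_cache) {
      bf_cache = std::make_unique<brute_force_index>();
      bf_cache->norms_computed = 1;  // expensive precomputation happens once
    }
    return *bf_cache;
  }
};
```

A mutex (or a `std::call_once`) guards the lazy build so concurrent searches on the same index do not race to construct it twice.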

auto n_dataset = strided_dataset.n_rows();

auto bitset_filter_view = sample_filter.bitset_view_;
auto sparsity = bitset_filter_view.sparsity(res);
@cjnolet (Member) commented Nov 21, 2024:

Isn't the number of positive bits in the bitmap also needed to compute this? We compute n_elements below anyway, so we should cache it off. We should also consider making the bitmap immutable; that way the sparsity and the number of positive elements can be safely cached. This isn't something users are going to be updating, ever.
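The cache-on-immutable idea could be sketched like this. `cached_bitset` is a hypothetical host-side illustration, not the cuVS/RAFT bitset (whose popcount runs on device): once the container is guaranteed immutable, the popcount runs at most once and both nnz and sparsity come for free afterwards.

```cpp
#include <bitset>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical immutable bitset that memoizes its population count.
class cached_bitset {
 public:
  cached_bitset(std::vector<std::uint64_t> words, std::size_t n_bits)
    : words_(std::move(words)), n_bits_(n_bits) {}

  // Number of set (positive) bits; computed once, then served from cache.
  std::size_t nnz() const
  {
    if (!nnz_) {
      std::size_t count = 0;
      for (auto w : words_) { count += std::bitset<64>(w).count(); }
      nnz_ = count;  // safe to cache: the words are never mutated
    }
    return *nnz_;
  }

  // Fraction of zero bits, matching the convention in this PR's benchmarks.
  double sparsity() const { return 1.0 - double(nnz()) / double(n_bits_); }

 private:
  std::vector<std::uint64_t> words_;
  std::size_t n_bits_;
  mutable std::optional<std::size_t> nnz_;  // memoized popcount
};
```

With this shape, the sparsity check at search time is a cached read rather than a kernel launch.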

auto n_queries = queries.extent(0);
auto n_dataset = strided_dataset.n_rows();

auto bitset_filter_view = sample_filter.bitset_view_;
A reviewer (Member) commented:

What happens here if the 2-d bitmap isn't able to be converted to a 1-d bitset without losing information?
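The lossless-conversion condition behind this question can be stated simply: an (n_queries x n_dataset) bitmap collapses to one 1-d bitset only when every query row is identical. A minimal host-side sketch of that check (illustrative only; the real check would be a device kernel over the packed bitmap words):

```cpp
#include <cstddef>
#include <vector>

// Returns true iff all rows of the bitmap are identical, i.e. the 2-d
// per-query bitmap carries no more information than a single 1-d bitset.
bool rows_uniform(const std::vector<std::vector<bool>>& bitmap)
{
  for (std::size_t r = 1; r < bitmap.size(); ++r) {
    if (bitmap[r] != bitmap[0]) { return false; }
  }
  return true;
}
```

If the rows differ, collapsing to a 1-d view would silently apply one query's filter to all queries, so the code would need to either reject the conversion or fall back to a per-query path.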

Labels: cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change)
Status: In Progress
3 participants