
[Feat] CAGRA filtering with BFKNN when sparsity matching threshold #378

Open
wants to merge 29 commits into base: branch-24.12

Conversation

rhdong
Member

@rhdong rhdong commented Oct 2, 2024

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type    Queries  n_rows   Dim   K     Metric         Layout    BruteF/Cagra    Sparsity     Recall rate (%)   Search Time (ms)   Build Time (ms)    Total Time (ms)    Throughput (q/s)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
float   100      1048576  32    64    InnerProduct   row       Brute           0.100        100.000           4.041              15891.638          15895.679          6.291
float   100      1048576  32    64    InnerProduct   row       Cagra           0.100         69.922           0.825              16162.612          16163.438          6.187
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.300        100.000           4.216              16265.309          16269.525          6.146
float   100      1048576  32    64    InnerProduct   row       Cagra           0.300         67.547           0.857              16227.900          16228.758          6.162
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.500        100.000           4.285              16110.632          16114.916          6.205
float   100      1048576  32    64    InnerProduct   row       Cagra           0.500         64.344           0.881              15642.374          15643.255          6.393
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.800        100.000           4.268              16707.514          16711.783          5.984
float   100      1048576  32    64    InnerProduct   row       Cagra           0.800         37.547           1.312              20631.773          20633.085          4.847
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.900        100.000           4.149              16323.327          16327.475          6.125
float   100      1048576  32    64    InnerProduct   row       Cagra           0.900         20.703           0.887              16195.060          16195.947          6.174
                                                                                                                                                    
float   100      1048576  32    64    InnerProduct   row       Brute           0.990        100.000           5.542              20701.861          20707.403          4.829
float   100      1048576  32    64    InnerProduct   row       Cagra           0.990          2.391           0.942              17216.771          17217.714          5.808

@rhdong rhdong added feature request New feature or request non-breaking Introduces a non-breaking change labels Oct 2, 2024
@rhdong rhdong requested a review from a team as a code owner October 2, 2024 19:00
@github-actions github-actions bot added the cpp label Oct 2, 2024
@rhdong rhdong changed the base branch from branch-24.10 to branch-24.12 October 2, 2024 20:03
@@ -140,6 +142,61 @@ void search_main(raft::resources const& res,
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances,
CagraSampleFilterT sample_filter = CagraSampleFilterT())
{
if constexpr (!std::is_same_v<CagraSampleFilterT,
Member

Can you pull this out into a separate function that can be invoked here, please? This search_main function is getting pretty massive.

Member Author

Done


cjnolet commented Oct 28, 2024

@rhdong it looks like the recall rates for CAGRA filtered search are all pretty terrible (and I can recall some other experiments you did recently that seemed to have shown much better recall for the higher filter rates... but i could be mistaken). Was 0.9 just chosen because I mentioned it's a common case or did you choose it based on your benchmarks above?

Also, I would do L2 in addition to inner product. I don't think nn-descent actually supports inner product atm. Did you use ivf-pq here to build the graph or do we have a bug where CAGRA is not throwing an error when inner product is used with nn-descent?


rhdong commented Oct 28, 2024

> [quoted @cjnolet's comment above]

The 0.9 is based on the benchmark I mentioned in Slack, and I used NN-Descent with Euclidean distance, so it sounds like there's no need to re-test them. (Revised on Oct 29)

@rhdong rhdong requested a review from cjnolet October 29, 2024 22:27
@achirkin (Contributor) left a comment

I'd suggest not to add an extra member to the cagra index, to avoid (1) complicating it even more than it is right now, and (2) changing the ABI.

cpp/src/neighbors/detail/cagra/cagra_search.cuh (3 outdated review threads, resolved)
cuvs::neighbors::brute_force::build(res, *brute_force_dataset, index.metric());

auto brute_force_queries = queries;
auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0);
Contributor

Suggested change:

```diff
-auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0);
+// Allocate the padded queries in the workspace resource
+auto padding_queries = raft::make_device_mdarray<T, int64_t>(
+  res,
+  raft::resource::get_workspace_resource(res),
+  raft::make_extents<int64_t>(n_queries, dataset_view.extent(1)));
+// Copy the queries and fill the padded elements with zeros
+raft::linalg::map_offset(res,
+                         padding_queries.view(),
+                         [queries, stride = dataset_view.extent(1)] __device__(int64_t i) {
+                           auto row_ix = i / stride;
+                           auto el_ix  = i % stride;
+                           return el_ix < queries.extent(1) ? queries(row_ix, el_ix) : T{0};
+                         });
```

Member Author

Thank you for the suggestion! I found a way to allocate the queries in the workspace while still reusing copy_with_padding, which helps keep the code clean. If I missed something, please feel free to point it out.

Contributor

I'd prefer to stick to raft primitives and eventually remove copy_with_padding, because the latter is not a commonly used primitive, so one has to go read its code to understand exactly what it does.
But I don't mind keeping it here for now.

cpp/src/neighbors/detail/cagra/cagra_search.cuh (outdated review thread, resolved)

rhdong commented Nov 15, 2024

Hi @achirkin @cjnolet @lowener, per the benchmark results on deep-image-96-inner with 9,990,000 rows on an A6000 32GB, the sparsity-computation overhead can be ignored; the performance differences look like random noise. The total latency is 7.29 ms vs 7.32 ms at n_queries=1, and 18.88 ms vs 18.87 ms at n_queries=1000. The sparsity computation itself takes only about 0.013 ms according to the sparsity (raft::popc) benchmark, which is consistent with the CAGRA-ANN benchmark. What do you think about keeping the current implementation? (Revised Nov 15, 9:39 AM)

The benchmark of sparsity(raft::popc):

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
PopcBenchI64/0/manual_time       0.005 ms        0.041 ms       131086 2#131072#0.4
PopcBenchI64/1/manual_time       0.009 ms        0.045 ms        77488 8#131072#0.5
PopcBenchI64/2/manual_time       0.014 ms        0.050 ms        49516 16#131072#0.2
PopcBenchI64/3/manual_time       0.004 ms        0.040 ms       182993 2#8192#0.4
PopcBenchI64/4/manual_time       0.004 ms        0.040 ms       157537 16#8192#0.5
PopcBenchI64/5/manual_time       0.009 ms        0.045 ms        76827 128#8192#0.2
PopcBenchI64/6/manual_time       0.012 ms        0.047 ms        56139 1024#8192#0.1
PopcBenchI64/7/manual_time       0.013 ms        0.048 ms        55972 1024#8192#0.1
PopcBenchI64/8/manual_time       0.013 ms        0.048 ms        55916 1024#8192#0.1
PopcBenchI64/9/manual_time       0.013 ms        0.048 ms        55888 1024#8192#0.1
PopcBenchI64/10/manual_time      0.013 ms        0.048 ms        55995 1024#8192#0.1
PopcBenchI64/11/manual_time      0.013 ms        0.048 ms        55764 1024#8192#0.1
PopcBenchI64/12/manual_time      0.013 ms        0.048 ms        55728 1024#8192#0.1
PopcBenchI64/13/manual_time      0.013 ms        0.048 ms        55718 1024#8192#0.1
PopcBenchI64/14/manual_time      0.013 ms        0.048 ms        55757 1024#8192#0.4
PopcBenchI64/15/manual_time      0.013 ms        0.048 ms        55731 1024#8192#0.5
PopcBenchI64/16/manual_time      0.013 ms        0.048 ms        55689 1024#8192#0.2
PopcBenchI64/17/manual_time      0.013 ms        0.048 ms        55713 1024#8192#0.4
PopcBenchI64/18/manual_time      0.013 ms        0.048 ms        55787 1024#8192#0.5
PopcBenchI64/19/manual_time      0.013 ms        0.048 ms        55507 1024#8192#0.2
PopcBenchI64/20/manual_time      0.013 ms        0.048 ms        55536 1024#8192#0.4
PopcBenchI64/21/manual_time      0.013 ms        0.048 ms        55734 1024#8192#0.5
PopcBenchI64/22/manual_time      0.013 ms        0.048 ms        55760 1024#8192#0.2
PopcBenchI64/23/manual_time      0.013 ms        0.048 ms        55594 1024#8192#0.4
PopcBenchI64/24/manual_time      0.013 ms        0.048 ms        55671 1024#8192#0.5
PopcBenchI64/25/manual_time      0.013 ms        0.048 ms        55660 1024#8192#0.2
PopcBenchI64/26/manual_time      0.013 ms        0.048 ms        55663 1024#8192#0.5
PopcBenchI64/27/manual_time      0.013 ms        0.048 ms        55699 1024#8192#0.2
PopcBenchI64/28/manual_time      0.013 ms        0.048 ms        55718 1024#8192#0.4
PopcBenchI64/29/manual_time      0.013 ms        0.048 ms        55729 1024#8192#0.5
PopcBenchI64/30/manual_time      0.013 ms        0.048 ms        55663 1024#8192#0.2
PopcBenchI64/31/manual_time      0.013 ms        0.048 ms        55674 1024#8192#0.4
PopcBenchI64/32/manual_time      0.013 ms        0.048 ms        55485 1024#8192#0.5
PopcBenchI64/33/manual_time      0.013 ms        0.048 ms        55638 1024#8192#0.2
PopcBenchI64/34/manual_time      0.206 ms        0.239 ms         3393 1#1073741824#0.5
PopcBenchI64/35/manual_time      0.206 ms        0.239 ms         3394 1#1073741824#0.2
PopcBenchI64/36/manual_time      0.206 ms        0.239 ms         3394 1#1073741824#0.1
PopcBenchI64/37/manual_time      0.206 ms        0.239 ms         3402 1#1073741824#0.01
The benchmark of CAGRA on deep-image-96-inner with the following configuration:
name: cuvs_cagra
constraints:
  build: cuvs_bench.config.algos.constraints.cuvs_cagra_build
  search: cuvs_bench.config.algos.constraints.cuvs_cagra_search
groups:
  base:
    build:
      graph_degree: [64]
      intermediate_graph_degree: [96]
      graph_build_algo: ["NN_DESCENT"]
    search:
      itopk: [256]
      search_width: [1]
dataset: deep-image-96-inner
dim: 96
distance: euclidean
gpu_driver_version: 12.4
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 384
gpu_mem_freq: 8001000000.000000
gpu_mem_global_size: 51033931776
gpu_mem_shared_size: 102400
gpu_name: NVIDIA RTX A6000
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 84
gpu_sm_freq: 1800000000.000000
host_cores_used: 16
host_cpu_freq_max: 5881000000
host_cpu_freq_min: 400000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 134152171520
host_total_swap_size: 66571988992
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                  Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.89 / **no sparsity** computing /  (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.29 ms         7.29 ms           96   7.28165m   7.28635m   0.108229   0.699489        137.243/s        256        100          1            1            96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       18.8 ms         18.8 ms           37  0.0188214  0.0188264   0.108476   0.696578       53.1169k/s        256        100       1000            1           37k

# filter by cagra / 0.89 / **with sparsity** computing  (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.32 ms         7.32 ms           96   7.31196m   7.31662m   0.111042   0.702396        136.675/s        256        100          1            1            96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       18.7 ms         18.7 ms           37  0.0186631  0.0186691   0.109044   0.690756       53.5646k/s        256        100       1000            1           37k

#filter by **brute force** / 0.90 / with sparsity computing  (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       17.9 ms         17.9 ms           39  0.0178896   0.017895   0.102308   0.697904        55.8817/s        256        100          1            1            39
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1        409 ms          409 ms            2   0.409333   0.409342   0.100875   0.818684       2.44295k/s        256        100       1000            1            2k

# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1      0.167 ms        0.167 ms         4180   162.573u   166.937u   0.968471   0.697795       5.99031k/s        256        100          1            1         4.18k
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1       7.86 ms         7.86 ms           89   7.85858m   7.86357m   0.992259   0.699858       127.169k/s        256        100       1000            1           89k

@rhdong rhdong requested a review from achirkin November 15, 2024 03:09

achirkin commented Nov 15, 2024

Hi @rhdong, thanks for the initial benchmarks, but we'd need something more conclusive.

First of all, please check what happens to the recall (a value of 0.1 suggests there's a bug). Let's aim at realistic levels, e.g. 0.95. Please choose an appropriate build/search config; I'd start with the common default graph degrees 48/32 and a search itopk of 256.

Second, with this benchmark we aim to find the cases where the introduced change can hurt performance most. I figure such a case could be a larger dataset (50M-100M records), where the linear sparsity check takes longer while CAGRA is still fast. I think the ideal candidate for this is a larger subset of the deep dataset you used. The most relevant search setting here is n_queries = 1 (again, less time spent in CAGRA search compared to the sparsity check), although checking bigger batch sizes as well wouldn't hurt.

Third, the benchmark setup. There's no need to use the throughput mode here (your changes do not affect multithreading), so I'd stick to the latency mode to save time; but make sure to set the benchmark time a little longer and add warmup iterations for more precise results (e.g. --benchmark_min_time=3s --benchmark_min_warmup_time=0.001).

Finally, to exclude the filtering-vs-no-filtering overheads from the comparison, I'd suggest benchmarking with filtering only: compare the two executables before and after the introduced change. It makes sense to test a few sparsity levels, as you did in the beginning, so that both the cagra and bruteforce codepaths are benchmarked.


rhdong commented Nov 15, 2024

> [quoted @achirkin's comment above]

Hi @achirkin, thank you very much for your suggestions. As we discussed in last night's meeting, I have run the benchmark (the results) as you requested on deep-image-96, and I'm working on doing the same for deep-100M. There is only one difference: the two arguments --benchmark_min_time=3s and --benchmark_min_warmup_time=0.001 are not supported in cuvs_bench.


rhdong commented Nov 18, 2024


Hi @achirkin, @cjnolet, @lowener, here are the benchmark results on deep-100M. The conclusion is unchanged: there is no negative effect on performance; in the worst case, perf decreased by about 1%.

dataset: deep-100M
dim: 96
distance: euclidean
gpu_driver_version: 12.7
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 5120
gpu_mem_freq: 1512000000.000000
gpu_mem_global_size: 85097971712
gpu_mem_shared_size: 167936
gpu_name: NVIDIA A100 80GB PCIe
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 108
gpu_sm_freq: 1410000000.000000
host_cores_used: 64
host_cpu_freq_max: 2250000000
host_cpu_freq_min: 1500000000
host_pagesize: 4096
host_processors_sysconf: 256
host_processors_used: 256
host_total_ram_size: 1081985519616
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

Sparsity = 0.89

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.89 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.999 ms        0.998 ms        13919   986.065u   999.468u    0.10397    13.9116        1000.53/s        128         10          1       13.919k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.21 ms         2.21 ms         6339   2.19542m   2.20986m   0.106152    14.0083        452.518/s        256         10          1        6.339k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.27 ms         2.27 ms         6155   2.25913m   2.27269m    0.10265    13.9884       440.007k/s        128         10       1000        6.155M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.89 ms         4.89 ms         2858   4.87807m   4.89017m    0.10583    13.9761       204.492k/s        256         10       1000        2.858M algo="single_cta"


# filter by cagra / 0.89 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (2 / 999) = 0.2%
raft_cagra.dim32/0/0/process_time/real_time       1.01 ms         1.01 ms        13864   995.968u   1007.31u    0.10311    13.9654         992.74/s        128         10          1       13.864k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.20 ms         2.20 ms         6372   2.18463m    2.1961m   0.105728    13.9936        455.352/s        256         10          1        6.372k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.31 ms         2.31 ms         6043   2.30312m   2.31494m    0.10375    13.9892       431.977k/s        128         10       1000        6.043M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.94 ms         4.94 ms         2835    4.9261m    4.9381m    0.10522    13.9995       202.507k/s        256         10       1000        2.835M algo="single_cta"

Sparsity = 0.5

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.50 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.967 ms        0.967 ms        14480   955.493u    966.77u    0.46733    13.9988        1034.37/s        128         10          1        14.48k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.08 ms         2.08 ms         6720   2.07125m   2.08272m   0.455119    13.9959        480.142/s        256         10          1         6.72k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.22 ms         2.22 ms         6298   2.20983m   2.22182m    0.46974     13.993       450.082k/s        128         10       1000        6.298M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.90 ms         4.90 ms         2863   4.88828m    4.9002m    0.45572    14.0293       204.073k/s        256         10       1000        2.863M algo="single_cta"

# filter by cagra / 0.50 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (9 / 967) = 1%
raft_cagra.dim32/0/0/process_time/real_time      0.976 ms        0.976 ms        14336   965.031u   976.492u    0.47207     13.999        1024.07/s        128         10          1       14.336k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       2.07 ms         2.07 ms         6756   2.05985m   2.07132m   0.452191    13.9938        482.784/s        256         10          1        6.756k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.26 ms         2.26 ms         6175   2.25309m   2.26489m    0.46716    13.9857       441.524k/s        128         10       1000        6.175M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.94 ms         4.94 ms         2835   4.93155m   4.94368m    0.45646    14.0153       202.279k/s        256         10       1000        2.835M algo="single_cta"

Sparsity = 0.1

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by cagra / 0.10 / **no sparsity** computing /  (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.934 ms        0.934 ms        14990     922.6u   933.953u    0.84729         14        1070.72/s        128         10          1        14.99k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.95 ms         1.95 ms         7188   1.93549m   1.94687m   0.748637    13.9941        513.646/s        256         10          1        7.188k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.16 ms         2.16 ms         6472   2.14989m   2.16168m    0.84653    13.9904       462.603k/s        128         10       1000        6.472M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.39 ms         4.39 ms         3189   4.38259m   4.39459m     0.7505    14.0144       227.552k/s        256         10       1000        3.189M algo="single_cta"


# filter by cagra / 0.10 / **with sparsity** computing (PR #378) ---- in the worst case, perf decreased by (9 / 934) = 1%
raft_cagra.dim32/0/0/process_time/real_time      0.943 ms        0.943 ms        14845   931.751u    943.14u    0.84642    14.0009        1060.29/s        128         10          1       14.845k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.93 ms         1.93 ms         7229   1.92334m   1.93483m    0.74843    13.9869         516.84/s        256         10          1        7.229k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2.21 ms         2.21 ms         6340   2.19379m   2.20577m    0.84641    13.9846       453.357k/s        128         10       1000         6.34M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       4.44 ms         4.44 ms         3153   4.42647m   4.43852m    0.74817    13.9946       225.301k/s        256         10       1000        3.153M algo="single_cta"

Sparsity = 0.9001 (falls back to brute force with prefilter and sparsity computing)

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# filter by **brute force** / 0.9001 / with sparsity computing  (PR #378)
raft_cagra.dim32/0/0/process_time/real_time       59.6 ms         59.6 ms          236   0.059615  0.0596262   0.113559    14.0718        16.7712/s        128         10          1           236 algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       60.0 ms         60.0 ms          234  0.0600018  0.0600131  0.0987179    14.0431         16.663/s        256         10          1           234 algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       2813 ms         2813 ms            5    2.81277    2.81281    0.10188    14.0641        355.516/s        128         10       1000            5k algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       2816 ms         2815 ms            5     2.8155    2.81554    0.09832    14.0777        355.172/s        256         10       1000            5k algo="single_cta"

No prefilter & no sparsity computing, CAGRA

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time      0.815 ms        0.815 ms        17169   804.125u   815.174u    0.93833    13.9957       1.22673k/s        128         10          1       17.169k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time       1.76 ms         1.76 ms         7963   1.74589m   1.75703m   0.973038    13.9912        569.144/s        256         10          1        7.963k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time       1.77 ms         1.77 ms         7899   1.75715m    1.7687m    0.93833    13.9709       565.389k/s        128         10       1000        7.899M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time       3.70 ms         3.70 ms         3788   3.69044m   3.70208m    0.97275    14.0235       270.118k/s        256         10       1000        3.788M algo="single_cta"

@achirkin (Contributor) left a comment:
Thank you @rhdong for the benchmarks, I think the observed 1-2% overhead is a reasonable compromise.

I have one more request though: make this a user-controlled feature. I think we'd better keep it off by default to avoid any surprises. It also may have some compatibility issues with other features (since the sparsity check runs a kernel, it's likely to interfere with the persistent kernel feature of single-cta algorithm).

raft::device_matrix_view<InternalIdxT, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances,
CagraSampleFilterT& sample_filter,
double threshold_to_bf = 0.9)
@achirkin (Contributor) commented Nov 18, 2024:
Could you please make it possible to disable/enable this feature by the user (see #252 (comment) for the reasoning):

  1. Expose threshold_to_bf as CAGRA search parameter and set it to 1.0 by default there
  2. Add a check here: if threshold_to_bf >= 1.0 then disable further checks and proceed with CAGRA search immediately (i.e. no need to run the sparsity check).
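The early-exit dispatch suggested above can be sketched as follows. This is a minimal illustration, not the actual cuVS implementation: `search_params`, `route`, and `choose_search_path` are hypothetical names, and the sparsity computation is stubbed as a callable so the sketch can show that the kernel is never launched when the default threshold of 1.0 is kept.

```cpp
#include <cassert>

// Hypothetical sketch of the reviewer's suggestion: threshold_to_bf lives in
// the search parameters and defaults to 1.0, which disables the feature.
struct search_params {
  double threshold_to_bf = 1.0;  // 1.0 => never fall back to brute force
};

enum class route { cagra, brute_force };

// sparsity_fn stands in for the (kernel-launching) bitset sparsity
// computation; it is only invoked when the threshold makes it relevant.
template <typename SparsityFn>
route choose_search_path(const search_params& params, SparsityFn&& sparsity_fn)
{
  // Early exit: with the default threshold the sparsity kernel never runs,
  // so the feature is off unless the user opts in.
  if (params.threshold_to_bf >= 1.0) { return route::cagra; }
  double sparsity = sparsity_fn();
  return (sparsity >= params.threshold_to_bf) ? route::brute_force : route::cagra;
}
```

With the default parameters the sparsity callable is never invoked, which also avoids the interference with the persistent-kernel feature mentioned above.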

@cjnolet (Member) commented Nov 18, 2024:
Most users with a filter are going to be specifying the filter in batch, and will know the sparsity of the filter. Instead of turning this feature off by default, I suggest we allow the user-specified filter to know its own nnz unless updated.

Turning this off by default undermines the fundamental benefits of this feature.

A reviewer (Member) replied:
Most users are not specifying a filter, and when they do, it's expected the filter is going to be heavy. This should not impact all users.

@cjnolet (Member) commented Nov 18, 2024:
@artem any user specifying a filter is opting in to this feature. It's already off by default. The users specifying a filter are going to be specifying a heavy filter most of the time. I've asked James to build this feature because it's needed. I don't think it's a bad idea to expose the threshold at which it crosses over to brute force, but we will be setting it to 99% by default.

@rhdong rhdong requested a review from achirkin November 19, 2024 00:19
@achirkin (Contributor) left a comment:
Thanks for the updates! Please move the new search argument to the search_params structure.

A reviewer (Contributor) commented:

Let's perhaps remove this from the benchmark wrapper for now?
The reason I'm suggesting this, is that filter creation should probably be a part of common benchmarking infrastructure rather than specific for CAGRA and, therefore, is a little out of the scope of this PR.

Review comment on cpp/include/cuvs/neighbors/cagra.hpp (outdated, resolved)
@rhdong rhdong requested a review from achirkin November 19, 2024 21:45
@rhdong rhdong requested a review from achirkin November 20, 2024 11:00
@achirkin (Contributor) left a comment:
Thanks for the updates! LGTM on cagra_search.cuh side

auto brute_force_dataset = raft::make_device_matrix_view<const T, int64_t, raft::row_major>(
strided_dataset.view().data_handle(), strided_dataset.n_rows(), strided_dataset.stride());

auto brute_force_idx = cuvs::neighbors::brute_force::build(res, brute_force_dataset, metric);
A reviewer (Member) commented:
This is being called each and every time a user performs a search? There's overhead in this call, and this should cache off the built brute-force index because for many common distances this computes a set of norms.
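The caching pattern being asked for might look like the sketch below. All names here (`brute_force_index`, `cagra_index_state`, `get_or_build_bf`) are illustrative stand-ins, not cuVS types: the point is only that the brute-force build, including its norm precomputation, runs once per index and is reused on subsequent searches.

```cpp
#include <memory>
#include <mutex>

// Stand-in for the brute-force index; building it computes dataset norms.
struct brute_force_index { int norms_computed = 0; };

// Hypothetical per-CAGRA-index state that lazily caches the brute-force index.
struct cagra_index_state {
  std::unique_ptr<brute_force_index> bf_cache;
  std::mutex bf_mutex;

  // Returns the cached brute-force index, building it (and its norms)
  // only on the first call; later searches reuse the same object.
  brute_force_index& get_or_build_bf()
  {
    std::lock_guard<std::mutex> lock(bf_mutex);
    if (!bf_cache) {
      bf_cache = std::make_unique<brute_force_index>();
      bf_cache->norms_computed = 1;  // expensive precomputation happens once
    }
    return *bf_cache;
  }
};
```

A mutex (or a `std::call_once`) guards the lazy build so concurrent searches on the same index do not race to construct it twice.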

auto n_dataset = strided_dataset.n_rows();

auto bitset_filter_view = sample_filter.bitset_view_;
auto sparsity = bitset_filter_view.sparsity(res);
@cjnolet (Member) commented Nov 21, 2024:

Isn't the number of positive bits in the bitmap also needed to compute this? We compute n_elements below anyway, so we should cache it off. We should also consider making the bitmap immutable; that way the sparsity and the number of positive elements can be safely cached. This isn't something users are going to be updating, ever.
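The cache-on-immutable idea could be sketched like this. `cached_bitset` is a hypothetical host-side illustration, not the cuVS/RAFT bitset (whose popcount runs on device): once the container is guaranteed immutable, the popcount runs at most once and both nnz and sparsity come for free afterwards.

```cpp
#include <bitset>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical immutable bitset that memoizes its population count.
class cached_bitset {
 public:
  cached_bitset(std::vector<std::uint64_t> words, std::size_t n_bits)
    : words_(std::move(words)), n_bits_(n_bits) {}

  // Number of set (positive) bits; computed once, then served from cache.
  std::size_t nnz() const
  {
    if (!nnz_) {
      std::size_t count = 0;
      for (auto w : words_) { count += std::bitset<64>(w).count(); }
      nnz_ = count;  // safe to cache: the words are never mutated
    }
    return *nnz_;
  }

  // Fraction of zero bits, matching the convention in this PR's benchmarks.
  double sparsity() const { return 1.0 - double(nnz()) / double(n_bits_); }

 private:
  std::vector<std::uint64_t> words_;
  std::size_t n_bits_;
  mutable std::optional<std::size_t> nnz_;  // memoized popcount
};
```

With this shape, the sparsity check at search time is a cached read rather than a kernel launch.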

auto n_queries = queries.extent(0);
auto n_dataset = strided_dataset.n_rows();

auto bitset_filter_view = sample_filter.bitset_view_;
A reviewer (Member) commented:

What happens here if the 2-d bitmap isn't able to be converted to a 1-d bitset without losing information?
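The lossless-conversion condition behind this question can be stated simply: an (n_queries x n_dataset) bitmap collapses to one 1-d bitset only when every query row is identical. A minimal host-side sketch of that check (illustrative only; the real check would be a device kernel over the packed bitmap words):

```cpp
#include <cstddef>
#include <vector>

// Returns true iff all rows of the bitmap are identical, i.e. the 2-d
// per-query bitmap carries no more information than a single 1-d bitset.
bool rows_uniform(const std::vector<std::vector<bool>>& bitmap)
{
  for (std::size_t r = 1; r < bitmap.size(); ++r) {
    if (bitmap[r] != bitmap[0]) { return false; }
  }
  return true;
}
```

If the rows differ, collapsing to a 1-d view would silently apply one query's filter to all queries, so the code would need to either reject the conversion or fall back to a per-query path.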

Labels: cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change)
Status: In Progress
3 participants