Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup concurrent multi-segment HNSW graph search 2 #12962

Merged

Conversation

mayya-sharipova
Copy link
Contributor

A second implementation of #12794 using Queue instead of MaxScoreAccumulator.

Speedup concurrent multi-segment HNWS graph search by exchanging the global top scores collected so far across segments. These global top scores set the minimum threshold that candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Dec 21, 2023

Experiments:
Available processors: 10; thread pool size: 16
luceneutil tool

Search:

Summary:
As seen from the results below, this PR implementation based on queue:

  • As compared with baseline always delivers higher QPS: 1-300% faster depending on k and number of vectors. But comes at the 3-11% drop in recall compared with the single segment recall.
  • As compared with the implementation based on MaxScoreAccumulator: delivers higher QPS: 1-83% faster. But comes at
    slight drop in recall 1-6%.

I think increase in QPS worth the slight drop in recall. And we should go with this implementation.


Details:

1M vectors of 100 dims

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 980 2336 0.739
Baseline 3 segments concurrent 2627 2857 0.772
Candidate1_with_min_score 2458 2816 0.766
Candidate2_with_queue 2477 2865 0.767

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 6722 430 0.921
Baseline 3 segments concurrent 17595 438 0.949
Candidate1_with_min_score 16386 464 0.947
Candidate2_with_queue 13483 469 0.940

Candidate2_with_queue VS Baseline:

  • ${\color{green}Recall}$ is better than single segment
  • ${\color{green} QPS}$ are 0.3-7% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slighlty worse
  • ${\color{green} QPS}$ are 1-2% better

10M vectors of 100 dims

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 1081 1798 0.634
Baseline 13 segments concurrent 11869 1371 0.680
Candidate1_with_min_score 7660 1845 0.600
Candidate2_with_queue 7942 1776 0.606

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 7213 272 0.824
Baseline 13 segments concurrent 78069 239 0.894
Candidate1_with_min_score 50649 301 0.868
Candidate2_with_queue 33139 361 0.834

k=100, fanout=9900

Avg visited nodes QPS Recall
Baseline Single segment 55168 36 0.916
Baseline 13 segments concurrent 521851 29 0.962
Candidate1_with_min_score 331252 34 0.954
Candidate2_with_queue 193082 55 0.938

Candidate2_with_queue VS Baseline:

  • ${\color{black}Recall}$ is slightly worse or better than single segment
  • ${\color{green} QPS}$ are 30-52% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slighlty worse
  • ${\color{green} QPS}$ is 20-60% better (but is -6% worse on small k)

10M vectors of 768 dims

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 1095 974 0.542
Baseline 19 segments concurrent 18091 569 0.541
Candidate1_with_min_score 10700 824 0.474
Candidate2_with_queue 11007 783 0.479

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 7118 152 0.688
Baseline 19 segments concurrent 120453 88 0.732
Candidate1_with_min_score 65778 139 0.693
Candidate2_with_queue 44775 183 0.650

k=100, fanout=9900

Avg visited nodes QPS Recall
Baseline Single segment 59308 19 0.797
Baseline 19 segments concurrent 818268 11 0.847
Candidate1_with_min_score 463464 18 0.828
Candidate2_with_queue 251612 33 0.779

Candidate2_with_queue VS Baseline:

  • ${\color{red}Recall}$ is slightly worse than than single segment
  • ${\color{green} QPS}$ are 38-300% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slighlty worse (but better on small k)
  • ${\color{green} QPS}$ is 31-83% better (but is 5% worse on small k)

@tveasey @jimczi @benwtrent what do you think?

@mayya-sharipova mayya-sharipova marked this pull request as draft December 22, 2023 14:59
@tveasey
Copy link

tveasey commented Jan 2, 2024

IMO we shouldn't focus too much on recall since the greediness of globally non-competitive search allows us to tune this. My main concern is does contention on the queue updates cause slow down. This aside, I think the queue is strictly better.

The search might wind up visiting fewer vertices for min score sharing, because of earlier decisions might mean it by chance gets transiently better bounds, but this should be low probability particularly when the search has to visit many vertices. And indeed these cases are where we see big wins from using a queue.

There appears to be some evidence of contention. This is suggested by looking at the runtime vs expected runtime from vertices visited, e.g.

scenario QPS(score) / QPS(queue) Visited(queue) / Visited(score)
n=10M, dim=100, k = 100, fo = 900 0.83 0.65
n=10M, dim=768, k = 100, fo = 900 0.76 0.68

Note that the direction of this effect is consistent, but the size is not (fo = 900 shows the largest effect). However, all that said we still get significant wins in performance, so my vote would be to use the queue and work on strategies for reducing contention, there are various ideas we had for ways to achieve this.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 12, 2024

I have also done experiments using Cohere dataset, as as seen below:

  • for 10M docs dataset, the speedups with the proposed approach are 1.7-2.5x times.
  • for 10M docs dataset, where k+fanout = 1000, QPS is close to the QPS of a single segment, while recall is better.

Cohere/wikipedia-22-12-en-embeddings

1M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 2033 1212 0.880
Baseline 8 segments concurrent 13927 807 0.974
Candidate2_with_queue 12241 888 0.963

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 13083 167 0.964
Baseline 8 segments concurrent 81824 124 0.997
Candidate2_with_queue 38929 193 0.985

10M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 2189 984 0.929
Baseline 19 segments concurrent 37656 316 0.951
Candidate2_with_queue 22041 472 0.932

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 15040 122 0.950
Baseline 19 segments concurrent 229945 48 0.990
Candidate2_with_queue 78740 115 0.972

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mayya-sharipova for all the testing. I left some comments.
Some of the recalls for the single segment baseline seem seem quite off (0.477?). Are you sure that there was no issue during the testing?

if (kResultsCollected) {
// as we've collected k results, we can start do periodic updates with the global queue
if (firstKResultsCollected || (visitedCount & interval) == 0) {
cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way of updating the global queue periodically doesn't bring much if k is close to the size of the queue. For instance if k is 1000, we only save 24 updates with this strategy. That's fine I guess especially considering that the interval is not only to save concurrent update in the queue but also to ensure that we are far enough in the walk of the graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Sorry I did not understand this comment, can you please clarify it.

What does it mean k is close to the size of queue? The globalSimilarityQueue is of size k.

Also I am not clear how you derive the value of 24?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant if k is close to the interval (1024). In such case delaying the update to the global queue doesn't save much.

Copy link

@tveasey tveasey Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a subtlety: note that firstKResultsCollected is only true the first time we visit k nodes, so this is saying only refresh every 1024 iterations thereafter. The refresh schedule is k, 1024, 2048, .... (The || threw me initially.)

As per my comment above, 1024 seems infrequent to me. (Of course you may have tested smaller values and determined this to be a good choice.) If we think it is risky sharing information too early, I would be inclined to share on the schedule max(k, c1), max(k, c1) + c2, max(k, c1) + 2 * c2, ... with c1 > c2 and decouple the concerns.

Also, there maybe issues with using information too early, in which case minCompetitiveSimilarity would need a check that visitedCount > max(k, c1) before it starts modifying minSim.

In any case, I can see results being rather sensitive to these choices so if you haven't done a parameter exploration it might be worth trying a few choices.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 15, 2024

@jimczi Thanks for your feedback.

Some of the recalls for the single segment baseline seem seem quite off (0.477?). Are you sure that there was no issue during the testing?

Sorry, I made a mistake. I've updated the results.

I will work on more experiments to address your other comments.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 16, 2024

I have done more experiments with different interval values:

Cohere dataset of 786 dims:

1M vectors, k=10, fanout=90

Interval Avg visited nodes QPS Recall
255 10283 959 0.946
511 11455 912 0.956
1023 12241 888 0.963

1M vectors, k=100, fanout=900

Interval Avg visited nodes QPS Recall
255 28175 286 0.978
511 32410 231 0.982
1023 38929 193 0.985

10M vectors, k=10, fanout=90

Interval Avg visited nodes QPS Recall
255 18325 567 0.906
511 20108 518 0.919
1023 22041 472 0.932

10M vectors, k=100, fanout=900

Interval Avg visited nodes QPS Recall
255 58543 157 0.958
511 66518 134 0.966
1023 78740 115 0.972

Updating from 1023 interval to 255 interval gives us:

  • average 28% decrease in visited nodes and 28% gains in QPS at the expense of lower recall
  • Recall with 255 interval is mostly larger than on recall on a single segment (except a case of 10M docs with k=10)

Dataset of 100 dims:

1M vectors, k=10, fanout=90

Interval Avg visited nodes QPS Recall
255 2244 3021 0.753
511 2398 2898 0.762
1023 2477 2865 0.767

1M vectors, k=100, fanout=900

Interval Avg visited nodes QPS Recall
255 10600 566 0.927
511 11882 520 0.933
1023 13483 469 0.940

10M vectors, k=10, fanout=90

Interval Avg visited nodes QPS Recall
255 6562 1897 0.580
511 7332 1773 0.595
1023 7942 1776 0.606

10M vectors, k=100, fanout=900

Interval Avg visited nodes QPS Recall
255 26801 424 0.821
511 29534 399 0.826
1023 33139 361 0.834

Updating from 1023 interval to 255 interval gives us:

  • average 20% decrease in visited nodes and 13% gains in QPS at the expense of lower recall
  • recall with 255 interval is larger than on recall on a single segment for 1M docs, and slightly lower for 10M docs.

This makes an interval of 255 a reasonable choice. @jimczi @tveasey What do you think?

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 17, 2024

It's also important to check the order of execution. For instance what happens if all segments are executed serially (rather than in parallel), does it changes the recall?

@jimczi The results are below. For sequential run, this change also brings significant speedups:

  • average 2.2x decrease in visited nodes and 2.2x gains in QPS at the expense of lower recall. Gains in QPS with higher k (for k=100, gains reach up to 3x)
  • We usually can achieve recall comparable to the recall on a single segment (except the case of 10M docs k=10)

Sequential run cohere: 768 dims; interval: 255

1M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 2033 1212 0.880
Baseline 8 segments sequential 13927 175 0.974
Candidate_with_queue sequential 9407 269 0.923

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 13083 167 0.964
Baseline 8 segments sequential 81824 28 0.997
Candidate_with_queue sequential 36851 62 0.983

10M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 2189 984 0.929
Baseline 19 segments sequential 37656 59 0.951
Candidate_with_queue sequential 17748 121 0.884

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 15040 122 0.950
Baseline 19 segments sequential 229945 9 0.990
Candidate_with_queue sequential 77910 27 0.964

@tveasey
Copy link

tveasey commented Jan 17, 2024

This makes an interval of 255 a reasonable choice.

I agree. This looks better to me. One thing I would be intrigued to try is the slight change in schedules as per this. Particularly, what happens if we delay using the information in minCompetitiveSimilarity. However, these results are already very good and we could push testing these out to a follow on PR.

One last thing I think we should consider is exploring the variance we get in recall as a result of this change. Specifically, if we were to run with some random waits in the different segment searches what is the variation in the recalls we see?

The danger is we get unlucky in ordering of searches and prune segment searches which contain the true nearest neighbours more aggressively sometimes. On average this shouldn't happen, but if we also see low variation in recall for the 1M realistic vectors in such a test case it would reassuring.

@benwtrent
Copy link
Member

benwtrent commented Jan 26, 2024

I ran my own experiment, which showed some interesting and frustrating results.

My idea was that gathering at least k documents for every segment, no matter their size, doesn't make much sense. So I combined the dynamic k setting from @jimczi, which adjusts how much we explore the graph based on the total number of vectors in the shard vs. how many vectors are in the segment.

float v = (float)Math.log(sumVectorCount / (double) leafSize);
float filterWeightValue = 1/v;

I then only require at a minimum k * filterWeightValue per segment to be explored before reaching to the global queue for non-competitive scores.

Additionally, I adjusted the indexing to randomly commit() on every 500 docs or so. I indexed the first 10M docs of cohere-wiki and used max-inner product over the raw float32. This showed that we have some graph building problems, will include those results as well.

Max Connections: 16
Beamwidth: 250

This resulted in some weird graphs (which we should fix separately), but around 47 segments of non-uniform sizes in various tiers.

Graph

image

The python code & raw data used:

# Add pareto lines
plt.plot([1508, 1784, 1926, 2190, 2822], [0.856, 0.881, 0.886, 0.887, 0.887], marker='o', label='baseline_single')
plt.plot([3297, 3562, 3698, 5797, 14196], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='baseline_multi')
plt.plot([3050, 3345, 3503, 3739, 3963], [0.520, 0.846, 0.858, 0.866, 0.866], marker='o', label='candidate_dynamic_k')
plt.plot([3297, 3562, 3698, 5797, 14191], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='candidate_static_k')
# Add labels and title
plt.xlabel('Avg Visited')
plt.ylabel('Recall')
plt.title("Recall vs. Avg Visited for corpus-wiki-cohere.fvec 10M")
plt.legend()

Graph stats over the multiple segments. This shows the connectedness of layer 0 and the percentiles of the connectedness.

Many many segments are extremely sparse, including some larger segments, which is not very nice :(

Leaf 0 has 5 layers
Leaf 0 has 1728501 documents
Graph size=1728501, Fanout min=1, mean=12.26, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1  16  22  28  32  32
Leaf 1 has 5 layers
Leaf 1 has 1743500 documents
Graph size=1743500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 2 has 5 layers
Leaf 2 has 508000 documents
Graph size=508000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 3 has 5 layers
Leaf 3 has 1147500 documents
Graph size=1147500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 4 has 5 layers
Leaf 4 has 1196000 documents
Graph size=1196000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 5 has 5 layers
Leaf 5 has 506500 documents
Graph size=506500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 6 has 4 layers
Leaf 6 has 200500 documents
Graph size=200500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 7 has 5 layers
Leaf 7 has 521500 documents
Graph size=521500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 8 has 5 layers
Leaf 8 has 1088000 documents
Graph size=1088000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 9 has 5 layers
Leaf 9 has 480500 documents
Graph size=480500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 10 has 4 layers
Leaf 10 has 176000 documents
Graph size=176000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 11 has 4 layers
Leaf 11 has 157500 documents
Graph size=157500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 12 has 4 layers
Leaf 12 has 51500 documents
Graph size=51500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 13 has 3 layers
Leaf 13 has 39000 documents
Graph size=39000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 14 has 3 layers
Leaf 14 has 21500 documents
Graph size=21500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 15 has 3 layers
Leaf 15 has 50000 documents
Graph size=50000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 16 has 3 layers
Leaf 16 has 19000 documents
Graph size=19000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 17 has 4 layers
Leaf 17 has 95500 documents
Graph size=95500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 18 has 3 layers
Leaf 18 has 17000 documents
Graph size=17000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 19 has 3 layers
Leaf 19 has 16500 documents
Graph size=16500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 20 has 3 layers
Leaf 20 has 17500 documents
Graph size=17500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 21 has 3 layers
Leaf 21 has 53000 documents
Graph size=53000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 22 has 4 layers
Leaf 22 has 68500 documents
Graph size=68500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 23 has 3 layers
Leaf 23 has 3500 documents
Graph size=3500, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 24 has 3 layers
Leaf 24 has 10000 documents
Graph size=10000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 25 has 3 layers
Leaf 25 has 15000 documents
Graph size=15000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 26 has 3 layers
Leaf 26 has 11500 documents
Graph size=11500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 27 has 3 layers
Leaf 27 has 10500 documents
Graph size=10500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 28 has 3 layers
Leaf 28 has 5000 documents
Graph size=5000, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 29 has 3 layers
Leaf 29 has 21500 documents
Graph size=21500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 30 has 3 layers
Leaf 30 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 31 has 3 layers
Leaf 31 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 32 has 3 layers
Leaf 32 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 33 has 3 layers
Leaf 33 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 34 has 3 layers
Leaf 34 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 35 has 3 layers
Leaf 35 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 36 has 3 layers
Leaf 36 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 37 has 3 layers
Leaf 37 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 38 has 3 layers
Leaf 38 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 39 has 3 layers
Leaf 39 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 40 has 3 layers
Leaf 40 has 4000 documents
Graph size=4000, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 41 has 3 layers
Leaf 41 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 42 has 3 layers
Leaf 42 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 43 has 3 layers
Leaf 43 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 44 has 3 layers
Leaf 44 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 45 has 3 layers
Leaf 45 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 46 has 3 layers
Leaf 46 has 1499 documents
Graph size=1499, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32

@benwtrent
Copy link
Member

OK, I reran my tests, but over the first 500k docs. Randomly committing & merging with a 500MB buffer. The numbers are much saner. I wonder if I have a bug in my data and start sending garbage once I get above 1M vectors...

This shows that having the dynamic k based on segment size brings us closer to baseline single segment number of visited. We are still in a different magnitude, but both static & dynamic k with the queue are much much better (2-3x) than the baseline multi-segment search.

plt.plot([985, 1225, 1398], [0.993, 1.0, 1.0], marker='o', label='baseline_single')
plt.plot([37505, 57693, 72611], [1.0, 1.0, 1.0], marker='o', label='baseline_multi')
plt.plot([12099, 18411, 24452], [0.999, 1.0, 1.0], marker='o', label='candidate_static_k')
plt.plot([10078, 13369, 16769], [0.992, 0.999, 1.0], marker='o', label='candidate_dynamic_k')

image

@benwtrent
Copy link
Member

benwtrent commented Jan 31, 2024

I fixed my data and ran with 1.5M cohere:

static_k is this PR
dynamic_k is this PR + scaling the k explored by

loat v = (float)Math.log(sumVectorCount / (double) leafSize);
float filterWeightValue = 1/v;

Scaling the k is another nice incremental change. Up and to the left is best. We get better recall with visiting fewer consistently.

plt.plot([2304, 2485, 3189, 4028, 5610, 16165], [0.875, 0.886, 0.916, 0.938, 0.960, 0.992], marker='o', label='baseline_single')
plt.plot([43015, 45946, 56864, 69149, 90772, 210727], [0.980, 0.982, 0.989, 0.992, 0.996, 1.000], marker='o', label='baseline_multi')
plt.plot([22706, 23749, 27142, 30893, 37162, 76032], [0.959, 0.962, 0.970, 0.976, 0.983, 0.996], marker='o', label='candidate_static_k')
plt.plot([16099, 17183, 20788, 24809, 31334, 64921], [0.937, 0.945, 0.962, 0.973, 0.983, 0.996], marker='o', label='candidate_dynamic_k')

image

EDIT: Here is 6.5M English Cohere (I randomly commit every 500 docs, ended up with 20-30 segments, which is hilarious, as baseline multi-segment visits about 24x more docs than baseline single segment, dynamic-k only visits about 7x more docs).

plt.plot([2362, 2557, 3291, 4193, 5925, 18114], [0.845, 0.856, 0.886, 0.910, 0.935, 0.979 ], marker='o', label='baseline_single')
plt.plot([58236, 62131, 76589, 92736, 120898, 272212], [0.942, 0.946, 0.960, 0.970, 0.980, 0.994], marker='o', label='baseline_multi')
plt.plot([24758, 26242, 31826, 37229, 46271, 103811], [0.931, 0.936, 0.950, 0.960, 0.970, 0.988], marker='o', label='candidate_static_k')
plt.plot([16367, 17262, 20630, 24232, 31281, 70344], [0.897, 0.904, 0.926, 0.943, 0.959, 0.987 ], marker='o', label='candidate_dynamic_k')

image

@mayya-sharipova mayya-sharipova force-pushed the multi-search-graph-with-queue branch 2 times, most recently from 12ccd3a to 3ed93df Compare January 31, 2024 16:26
@mayya-sharipova mayya-sharipova force-pushed the multi-search-graph-with-queue branch from 3ed93df to 13ae978 Compare January 31, 2024 16:31
@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 31, 2024

@benwtrent Thanks for running additional tests. Looks like running with dynamic k can speed up searches, but I think we need to run more experiments on smaller dims datasets as well, how about we leave this for the follow up?

@jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now.


The following is left for the follow up work:

  1. Dynamic k based on segment size. Currently we gather at least k documents for every segment before starting synchronizing with the global queue. We need to adjust k based on the segment size.
  2. Improve synchronization schedule with the global queue. Currently we do synchronization with the global queue every 255 visited docs. Can we improve schedule as per comment.

@benwtrent
Copy link
Member

but I think we need to run more experiments on smaller dims datasets as well, how about we leave this for the follow up?

I am 100% fine with this. It was a crazy idea and it only gives us an incremental change. This PR gives a much better improvement as it is already.

@tveasey
Copy link

tveasey commented Jan 31, 2024

@jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now.

I'm happy

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jan 31, 2024

I've re-ran the sets with latest changes on this PR (candidate) and main branch (baseline):

I have also done experiments using Cohere dataset, as as seen below:

  • for 10M docs dataset, the speedups with the proposed approach are 1.7-2.5x times.
  • for 10M docs dataset, where k+fanout = 1000, QPS is close to the QPS of a single segment, while recall is better.

Cohere/wikipedia-22-12-en-embeddings

1M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 0.880
Baseline 8 segments concurrent 13927 815 0.974
Candidate2_with_queue 12670 859 0.964

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 0.964
Baseline 8 segments concurrent 81824 126 0.997
Candidate2_with_queue 62085 165 0.995

10M vectors

k=10, fanout=90

Avg visited nodes QPS Recall
Baseline Single segment 0.929
Baseline 19 segments concurrent 37656 271 0.951
Candidate2_with_queue 21921 443 0.927

k=100, fanout=900

Avg visited nodes QPS Recall
Baseline Single segment 0.950
Baseline 19 segments concurrent 229945 44 0.990
Candidate2_with_queue 101970 91 0.984

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @mayya-sharipova it's getting close. I left more comments. Can you set the pr to ready for review in case others want to chime in.
@benwtrent can you look at the changes in the knn query with the KnnCollectorManager?

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
filterWeight = null;
}

final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slice can have multiple segments so it should rather check the reader's leaves here and maybe call the boolean isMultiSegments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this how we do that for top docs collectors

But you are right, because we see speedups even in sequential run, you suggestions make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we see speedups even in sequential run

Do you mean speed ups without concurrency via sharing information? That is interesting, I wonder why that is.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because later graph searches are faster: they use information gathered in earlier ones. (Note though the speed up is only relative to not sharing information and searching multiple graphs not relative to searching a single graph.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the dynamic k follow up we can also look at the order of the segments. For instance if we start with largest first we can be more aggressive on smaller segments.

@benwtrent benwtrent self-requested a review February 1, 2024 17:38
@mayya-sharipova mayya-sharipova marked this pull request as ready for review February 1, 2024 18:43
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
filterWeight = null;
}

final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we see speedups even in sequential run

Do you mean speed ups without concurrency via sharing information? That is interesting, I wonder why that is.

* @param visitedLimit the maximum number of nodes that the search is allowed to visit
* @param parentBitSet the parent bitset, {@code null} if not applicable
*/
public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
public abstract C newCollector(int visitedLimit, LeafReaderContext context) throws IOException;

Also, I am not even sure visitedLimit should be there. It seems like something the manager should already know about (as in this instance its static) and we just need to know about the context (the context is for DiversifyingChildrenFloatKnnVectorQuery so that its collector manager can create BitSet parentBitSet from its encapsulated BitSetProducer).

I also think this method could return null if collection is not applicable for that given leaf context.

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed except visitedLimit, because visitedLimit changes per segment as different cost of using filter in a segment context

@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
}

@Override
protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
protected KnnCollectorManager<?> getKnnCollectorManager(int k, boolean supportsConcurrency) {
return new DiversifyingNearestChildrenKnnCollectorManager(k);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we adjust the interface, this manager could know about BitSetProducer parentsFilter; and abstract that away from this query.

@mayya-sharipova mayya-sharipova force-pushed the multi-search-graph-with-queue branch from 3e5b0a4 to 9005b04 Compare February 2, 2024 20:57
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TotalHits;
import org.apache.lucene.search.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we want * here

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor thing, but this looks much better.

@mayya-sharipova mayya-sharipova merged commit d095ed0 into apache:main Feb 6, 2024
4 checks passed
@mayya-sharipova mayya-sharipova deleted the multi-search-graph-with-queue branch February 6, 2024 14:16
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Feb 7, 2024
Speedup concurrent multi-segment HNWS graph search by exchanging 
the global top candidated collected so far across segments. These global top 
candidates set the minimum threshold that new candidates need to pass
 to be considered. This allows earlier stopping for segments that don't have 
good candidates.
@jpountz
Copy link
Contributor

jpountz commented Feb 23, 2024

FYI I pushed an annotation, this yielded a major speedup: http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html.

@benwtrent
Copy link
Member

So cool. We are now faster at 768 dimensions than we were on 100 dimensions.

⚡ ⚡ ⚡ ⚡ ⚡

@uschindler
Copy link
Contributor

Is it called HNSW or HNWS? I just noticed the title of this PR and differing changes entries.

@msokolov
Copy link
Contributor

HNSW stands for "hierarchical navigable small world" - that should make it easy to remember :)

@msokolov msokolov changed the title Speedup concurrent multi-segment HNWS graph search 2 Speedup concurrent multi-segment HNSW graph search 2 Feb 25, 2024
@mikemccand mikemccand modified the milestones: 9.10, 9.10.0, 10.0.0 May 17, 2024
@mikemccand
Copy link
Member

I think this was released in 9.10.0? I added a milestone.

@wurui90
Copy link

wurui90 commented Oct 2, 2024

Dear experts: I wonder

  1. Will this PR introduce query-time non determinism? i.e. When we issue the same query multiple times on the same index, will the query results be identical?

  2. Is this PR automatically on? i.e. we still use the IndexSearcher.search(vectorQuery) without extra args for executor or boolean flags, and this feature will be on. (Based on our reading of the codes, the answer is yes.)

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Oct 3, 2024

@wurui90 Answering your questions

  1. Yes, indeed, there is some non-determinism, but I think it would not be much noticed in practice, as the algorithm ensures that each segment still collects some good results before exchanging globally with other segments.
  2. Yes, indeed, it automatically on. If your indexSearcher is using an executor with a single thread, than you will get deterministic behaviour. Also, you can set the greediness to 0 here, which will effectively disable algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants