Speedup concurrent multi-segment HNSW graph search 2 #12962
Experiments: Search:
Summary:
I think the increase in QPS is worth the slight drop in recall, and we should go with this implementation. Details: 1M vectors of 100 dims, k=10, fanout=90
k=100, fanout=900
Candidate2_with_queue VS Baseline:
Candidate2_with_queue VS Candidate1_with_min_score:
10M vectors of 100 dims, k=10, fanout=90
k=100, fanout=900
k=100, fanout=9900
Candidate2_with_queue VS Baseline:
Candidate2_with_queue VS Candidate1_with_min_score:
10M vectors of 768 dims, k=10, fanout=90
k=100, fanout=900
k=100, fanout=9900
Candidate2_with_queue VS Baseline:
Candidate2_with_queue VS Candidate1_with_min_score:
@tveasey @jimczi @benwtrent what do you think?
IMO we shouldn't focus too much on recall since the greediness of globally non-competitive search allows us to tune this. My main concern is whether contention on the queue updates causes a slowdown. That aside, I think the queue is strictly better. The search might wind up visiting fewer vertices with min-score sharing, because earlier decisions might by chance give transiently better bounds, but this should be low probability, particularly when the search has to visit many vertices. And indeed these cases are where we see big wins from using a queue. There appears to be some evidence of contention. This is suggested by looking at the runtime vs. expected runtime from vertices visited, e.g.
Note that the direction of this effect is consistent, but the size is not (fo = 900 shows the largest effect). However, all that said, we still get significant wins in performance, so my vote would be to use the queue and work on strategies for reducing contention; we had various ideas for ways to achieve this.
I have also done experiments using the Cohere dataset, as seen below:
Cohere/wikipedia-22-12-en-embeddings
1M vectors, k=10, fanout=90
k=100, fanout=900
10M vectors, k=10, fanout=90
k=100, fanout=900
Thanks @mayya-sharipova for all the testing. I left some comments.
Some of the recalls for the single-segment baseline seem quite off (0.477?). Are you sure that there was no issue during the testing?
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java
```java
if (kResultsCollected) {
  // as we've collected k results, we can start doing periodic updates with the global queue
  if (firstKResultsCollected || (visitedCount & interval) == 0) {
    cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
```
This way of updating the global queue periodically doesn't bring much if k is close to the size of the queue. For instance, if k is 1000, we only save 24 updates with this strategy. That's fine I guess, especially considering that the interval is not only there to save concurrent updates to the queue but also to ensure that we are far enough along in the walk of the graph.
@jimczi Sorry, I did not understand this comment, can you please clarify it? What does it mean that k is close to the size of the queue? The globalSimilarityQueue is of size k. Also, I am not clear on how you derive the value of 24?
Sorry, I meant if k is close to the interval (1024). In such a case, delaying the update to the global queue doesn't save much.
There is a subtlety: note that firstKResultsCollected is only true the first time we visit k nodes, so this is saying only refresh every 1024 iterations thereafter. The refresh schedule is k, 1024, 2048, .... (The || threw me initially.)
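A minimal sketch of that schedule, assuming interval is a power-of-two-minus-one bit mask (e.g. 0x3ff for 1024); the names mirror the snippet above, but this is an illustration, not the exact Lucene code:
```java
// Illustration only: when this returns true, the local top results are pushed to the
// global queue. With mask = 0x3ff (1023), (visitedCount & mask) == 0 holds exactly at
// multiples of 1024, so updates happen at visitedCount = k (the first time k results
// have been collected) and then at 1024, 2048, 3072, ...
static boolean shouldShareWithGlobalQueue(long visitedCount, int k, int mask) {
  boolean firstKResultsCollected = visitedCount == k; // assumed to be true exactly once
  return firstKResultsCollected || (visitedCount & mask) == 0;
}
```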
As per my comment above, 1024 seems infrequent to me. (Of course, you may have tested smaller values and determined this to be a good choice.) If we think it is risky to share information too early, I would be inclined to share on the schedule max(k, c1), max(k, c1) + c2, max(k, c1) + 2 * c2, ... with c1 > c2 and decouple the concerns (sketched below).
Also, there may be issues with using information too early, in which case minCompetitiveSimilarity would need a check that visitedCount > max(k, c1) before it starts modifying minSim.
In any case, I can see the results being rather sensitive to these choices, so if you haven't done a parameter exploration it might be worth trying a few choices.
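For concreteness, a small sketch of that alternative schedule, with hypothetical constants c1 > c2 (the constant names come from the comment above; the implementation is just an illustration):
```java
// Sketch: start sharing at max(k, c1), then repeat every c2 visits thereafter.
// Schedule: max(k, c1), max(k, c1) + c2, max(k, c1) + 2 * c2, ...
static boolean shouldShare(long visitedCount, int k, int c1, int c2) {
  long start = Math.max(k, c1);
  if (visitedCount < start) {
    return false; // too early: neither publish nor consume global bounds yet
  }
  return (visitedCount - start) % c2 == 0;
}
```
The same visitedCount > max(k, c1) check could also gate minCompetitiveSimilarity, so the collector does not start tightening minSim too early.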
@jimczi Thanks for your feedback.
Sorry, I made a mistake. I've updated the results. I will work on more experiments to address your other comments.
I have done more experiments with a different Cohere dataset of 768 dims: 1M vectors, k=10, fanout=90
1M vectors, k=100, fanout=900
10M vectors, k=10, fanout=90
10M vectors, k=100, fanout=900
Updating from
Dataset of 100 dims: 1M vectors, k=10, fanout=90
1M vectors, k=100, fanout=900
10M vectors, k=10, fanout=90
10M vectors, k=100, fanout=900
Updating from
This makes an interval of
@jimczi The results are below. For a sequential run, this change also brings significant speedups:
Sequential run, Cohere: 768 dims; interval: 255
1M vectors, k=10, fanout=90
k=100, fanout=900
10M vectors, k=10, fanout=90
k=100, fanout=900
I agree. This looks better to me. One thing I would be intrigued to try is the slight change in schedules as per this. Particularly, what happens if we delay using the information in
One last thing I think we should consider is exploring the variance we get in recall as a result of this change. Specifically, if we were to run with some random waits in the different segment searches, what is the variation in the recalls we see? The danger is that we get unlucky in the ordering of searches and sometimes prune segment searches which contain the true nearest neighbours more aggressively. On average this shouldn't happen, but if we also see low variation in recall for the 1M realistic vectors in such a test case it would be reassuring.
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
I ran my own experiment, which showed some interesting and frustrating results. My idea was that gathering at least
I then only require at a minimum
Additionally, I adjusted the indexing to randomly commit() every 500 docs or so. I indexed the first 10M docs of cohere-wiki and used max inner product over the raw float32. This showed that we have some graph-building problems; I will include those results as well.
Max Connections: 16
This resulted in some weird graphs (which we should fix separately), but around 47 segments of non-uniform sizes in various tiers.
The python code & raw data used:
```python
import matplotlib.pyplot as plt

# Add pareto lines
plt.plot([1508, 1784, 1926, 2190, 2822], [0.856, 0.881, 0.886, 0.887, 0.887], marker='o', label='baseline_single')
plt.plot([3297, 3562, 3698, 5797, 14196], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='baseline_multi')
plt.plot([3050, 3345, 3503, 3739, 3963], [0.520, 0.846, 0.858, 0.866, 0.866], marker='o', label='candidate_dynamic_k')
plt.plot([3297, 3562, 3698, 5797, 14191], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='candidate_static_k')

# Add labels and title
plt.xlabel('Avg Visited')
plt.ylabel('Recall')
plt.title("Recall vs. Avg Visited for corpus-wiki-cohere.fvec 10M")
plt.legend()
```
Graph stats over the multiple segments. This shows the connectedness of layer
Many, many segments are extremely sparse, including some larger segments, which is not very nice :(
OK, I reran my tests, but over the first 500k docs. Randomly committing & merging with a 500MB buffer. The numbers are much saner. I wonder if I have a bug in my data and start sending garbage once I get above 1M vectors... This shows that having the dynamic
I fixed my data and ran with 1.5M cohere: static_k is this PR
Scaling the
EDIT: Here is 6.5M English Cohere (I randomly commit every 500 docs, ended up with 20-30 segments, which is hilarious, as baseline multi-segment visits about 24x more docs than baseline single segment, dynamic-k only visits about 7x more docs).
@benwtrent Thanks for running additional tests. Looks like running with dynamic
@jimczi @tveasey I've addressed your comments. Are we ok to merge it as it is now? The following is left for the follow-up work:
I am 100% fine with this. It was a crazy idea and it only gives us an incremental change. This PR gives a much better improvement as it is already.
I've re-run the tests with the latest changes on this PR (candidate) and the main branch (baseline). I have also done experiments using the Cohere dataset, as seen below:
Cohere/wikipedia-22-12-en-embeddings
1M vectors, k=10, fanout=90
k=100, fanout=900
10M vectors, k=10, fanout=90
k=100, fanout=900
Thanks @mayya-sharipova, it's getting close. I left more comments. Can you set the PR to ready-for-review in case others want to chime in?
@benwtrent can you look at the changes in the knn query with the KnnCollectorManager?
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
```java
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
      filterWeight = null;
    }

    final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
```
A slice can have multiple segments, so it should rather check the reader's leaves here, and maybe call the boolean isMultiSegments?
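A minimal sketch of the suggested check, assuming the IndexSearcher is available at that point (names and placement are illustrative, not the final code):
```java
import org.apache.lucene.search.IndexSearcher;

// Sketch: count segments (leaves) directly instead of search slices, since a single
// slice may cover several segments.
static boolean isMultiSegments(IndexSearcher indexSearcher) {
  return indexSearcher.getIndexReader().leaves().size() > 1;
}
```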
I copied this from how we do it for top docs collectors. But you are right; because we see speedups even in a sequential run, your suggestion makes sense.
because we see speedups even in sequential run
Do you mean speedups without concurrency, via sharing information? That is interesting; I wonder why that is.
It is because later graph searches are faster: they use information gathered in earlier ones. (Note, though, that the speedup is only relative to not sharing information and searching multiple graphs, not relative to searching a single graph.)
For the dynamic k follow-up we can also look at the order of the segments. For instance, if we start with the largest first, we can be more aggressive on smaller segments.
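As a rough illustration of that follow-up idea (the scaling formula below is an assumption for illustration, not something this PR implements): visit segments from largest to smallest and request proportionally fewer hits from the smaller ones.
```java
// Hypothetical per-segment k, proportional to the segment's share of all documents and
// clamped to [minK, k]. Searching the largest segment first would then let the later,
// smaller segments be searched more aggressively (with a smaller per-segment k).
static int perSegmentK(int k, int segmentDocCount, int totalDocCount, int minK) {
  int scaled = (int) Math.ceil((double) k * segmentDocCount / totalDocCount);
  return Math.max(minK, Math.min(k, scaled));
}
```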
lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
```java
 * @param visitedLimit the maximum number of nodes that the search is allowed to visit
 * @param parentBitSet the parent bitset, {@code null} if not applicable
 */
public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
```
```diff
- public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
+ public abstract C newCollector(int visitedLimit, LeafReaderContext context) throws IOException;
```
Also, I am not even sure visitedLimit should be there. It seems like something the manager should already know about (as in this instance it's static), and we just need to know about the context (the context is for DiversifyingChildrenFloatKnnVectorQuery, so that its collector manager can create the BitSet parentBitSet from its encapsulated BitSetProducer).
I also think this method could return null if collection is not applicable for that given leaf context.
Addressed, except visitedLimit, because visitedLimit changes per segment due to the different cost of using a filter in each segment context.
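Putting these comments together, the interface shape under discussion looks roughly like the sketch below. The class name is made up and the signature is simplified; visitedLimit is kept as a parameter because, as noted, it varies per segment.
```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.KnnCollector;

// Sketch only: the manager receives the leaf context, so per-leaf state (such as a
// parent BitSet derived from a BitSetProducer) can live inside the manager, while
// visitedLimit stays a parameter because filter cost differs from segment to segment.
public abstract class KnnCollectorManagerSketch<C extends KnnCollector> {
  public abstract C newCollector(int visitedLimit, LeafReaderContext context) throws IOException;
}
```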
```java
@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
  }

  @Override
  protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
  protected KnnCollectorManager<?> getKnnCollectorManager(int k, boolean supportsConcurrency) {
    return new DiversifyingNearestChildrenKnnCollectorManager(k);
```
If we adjust the interface, this manager could know about BitSetProducer parentsFilter and abstract that away from this query.
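A sketch of that suggestion, assuming the manager can encapsulate the BitSetProducer and resolve the per-leaf parent bitset itself (class and method names are illustrative):
```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.util.BitSet;

// Sketch only: the manager holds the parents filter, so the query no longer has to
// build the parent BitSet before asking for a collector.
final class DiversifyingCollectorManagerSketch {
  private final int k;
  private final BitSetProducer parentsFilter;

  DiversifyingCollectorManagerSketch(int k, BitSetProducer parentsFilter) {
    this.k = k;
    this.parentsFilter = parentsFilter;
  }

  int k() {
    return k;
  }

  // Resolve the per-leaf parent bitset from the encapsulated producer.
  BitSet parentBitSet(LeafReaderContext context) throws IOException {
    return parentsFilter.getBitSet(context);
  }
}
```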
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java
```java
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TotalHits;
import org.apache.lucene.search.*;
```
Don't think we want * here.
One minor thing, but this looks much better.
Speedup concurrent multi-segment HNWS graph search by exchanging the global top candidates collected so far across segments. These global top candidates set the minimum threshold that new candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.
FYI I pushed an annotation; this yielded a major speedup: http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html
So cool. We are now faster at 768 dimensions than we were on 100 dimensions. ⚡ ⚡ ⚡ ⚡ ⚡
Is it called HNSW or HNWS? I just noticed the title of this PR and differing changes entries.
HNSW stands for "hierarchical navigable small world" - that should make it easy to remember :)
I think this was released in 9.10.0? I added a milestone.
Dear experts: I wonder
@wurui90 Answering your questions
A second implementation of #12794 using Queue instead of MaxScoreAccumulator.
Speedup concurrent multi-segment HNWS graph search by exchanging the global top scores collected so far across segments. These global top scores set the minimum threshold that candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.
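A conceptual sketch of this mechanism, using a simple synchronized bounded min-heap shared across segment searches (this is an illustration of the idea, not the actual Lucene implementation, and all names are made up):
```java
import java.util.PriorityQueue;

// Sketch: each segment search periodically offers its locally collected similarities;
// the smallest score retained in the shared top-k heap is the global minimum that a new
// candidate must beat, so segments without good candidates can stop exploring earlier.
final class GlobalTopKSketch {
  private final int k;
  private final PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap over the global top k

  GlobalTopKSketch(int k) {
    this.k = k;
  }

  // Merge a batch of local scores and return the current global minimum similarity,
  // or -Infinity while fewer than k results have been collected globally.
  synchronized float offer(float[] localScores) {
    for (float score : localScores) {
      if (topK.size() < k) {
        topK.add(score);
      } else if (score > topK.peek()) {
        topK.poll();
        topK.add(score);
      }
    }
    return topK.size() >= k ? topK.peek() : Float.NEGATIVE_INFINITY;
  }
}
```
Each per-segment collector would then treat the returned value as the minimum similarity a new candidate has to exceed to be worth exploring.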