
Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance (96% recall at 204 qps) #620

Merged: 12 commits into main, Dec 3, 2023

Conversation

alexklibisz (Owner) commented Dec 3, 2023

Related Issues

#611 #617

Changes

This PR copies the implementations of dotProduct and l2Distance from Lucene's PanamaVectorUtilSupport. Lucene calls the latter "squaredDistance" rather than "l2Distance", but it's the same computation.
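For reference, scalar versions of the two operations (a sketch of the semantics only, not the code being copied; it assumes, per the naming note above, that l2Distance returns the sum of squared differences with no square root):

```java
// Scalar reference semantics for the two operations.
static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}

// Lucene's "squaredDistance" and Elastiknn's "l2Distance" both compute
// the sum of squared differences (no square root).
static float l2Distance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```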

The Lucene authors have made some really nice optimizations here, which I'm shamelessly copying into Elastiknn. I also considered using their class and functions directly, but the relevant members are private or package-private and tricky to instantiate, so I copied over the implementations instead.

The rough idea behind this optimization: with Panama vectors we have to iterate over the original array in segments sized to the processor's SIMD registers. We can go faster by unrolling the loop to process four segments per iteration, each into its own accumulator. With a single accumulator, every iteration depends on the previous one's result; four independent accumulators break that loop-carried dependency chain, so the CPU can execute the four chunks in parallel. The JVM does not achieve this when we accumulate one chunk at a time. See the sketch below.
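Here is a simplified sketch of the unrolled loop, not Lucene's exact code: the class name, bound computation, and tail handling are mine, and it assumes JDK 21 with the jdk.incubator.vector module enabled.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class DotProductSketch {

    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Four independent accumulators break the loop-carried dependency chain,
    // letting the CPU keep four fused multiply-adds in flight per iteration.
    static float dotProduct(float[] a, float[] b) {
        FloatVector acc0 = FloatVector.zero(SPECIES);
        FloatVector acc1 = FloatVector.zero(SPECIES);
        FloatVector acc2 = FloatVector.zero(SPECIES);
        FloatVector acc3 = FloatVector.zero(SPECIES);
        final int step = SPECIES.length();
        // Largest multiple of 4 * step that fits in the array.
        final int unrolledBound = (a.length / (4 * step)) * (4 * step);
        int i = 0;
        for (; i < unrolledBound; i += 4 * step) {
            acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
            acc1 = FloatVector.fromArray(SPECIES, a, i + step).fma(FloatVector.fromArray(SPECIES, b, i + step), acc1);
            acc2 = FloatVector.fromArray(SPECIES, a, i + 2 * step).fma(FloatVector.fromArray(SPECIES, b, i + 2 * step), acc2);
            acc3 = FloatVector.fromArray(SPECIES, a, i + 3 * step).fma(FloatVector.fromArray(SPECIES, b, i + 3 * step), acc3);
        }
        // Combine the four partial sums, then reduce the lanes to a scalar.
        float sum = acc0.add(acc1).add(acc2.add(acc3)).reduceLanes(VectorOperators.ADD);
        // Scalar tail for elements that don't fill a full group of four chunks.
        for (; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }
}
```

The l2Distance version follows the same pattern: each iteration computes the lanewise difference and accumulates diff.fma(diff, acc) instead of va.fma(vb, acc).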

In doing this I went ahead and updated from JDK 19 to 21 in the .tool-versions file and in the GitHub workflows. JDK 21 is what the latest Elasticsearch uses; I was just behind in this repo.

I updated the JMH micro benchmarks, and the new implementation (dotProductPanama / l2DistancePanama) is indeed ~4x faster than the previous one (dotProductPanamaOriginal / l2DistancePanamaOriginal):

```
[info] Benchmark                                          Mode  Cnt         Score        Error  Units
[info] FloatVectorOpsBenchmark.dotProductDefault         thrpt    5    889558.936 ±    661.452  ops/s
[info] FloatVectorOpsBenchmark.dotProductLucene          thrpt    5  13655547.975 ±  65406.816  ops/s
[info] FloatVectorOpsBenchmark.dotProductPanama          thrpt    5  11208095.266 ± 986906.771  ops/s
[info] FloatVectorOpsBenchmark.dotProductPanamaOriginal  thrpt    5   2474870.573 ± 172489.562  ops/s
[info] Benchmark                                          Mode  Cnt         Score       Error  Units
[info] FloatVectorOpsBenchmark.l2DistanceDefault         thrpt    5    858090.918 ±  3095.643  ops/s
[info] FloatVectorOpsBenchmark.l2DistanceLucene          thrpt    5  10886417.277 ± 55081.371  ops/s
[info] FloatVectorOpsBenchmark.l2DistancePanama          thrpt    5  10590012.970 ± 99390.590  ops/s
[info] FloatVectorOpsBenchmark.l2DistancePanamaOriginal  thrpt    5   2223699.798 ±  7011.267  ops/s
```
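For context, a minimal sketch of how such a JMH throughput benchmark is structured. The real FloatVectorOpsBenchmark lives in this repo and may differ; the vector length, seed, and class name here are illustrative, and it reuses the dotProduct sketch from above.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class FloatVectorOpsBenchmarkSketch {

    float[] a, b;

    @Setup(Level.Trial)
    public void setup() {
        // Hypothetical dimensionality and fixed seed for reproducibility.
        Random rng = new Random(0);
        a = new float[1024];
        b = new float[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = rng.nextFloat();
            b[i] = rng.nextFloat();
        }
    }

    @Benchmark
    public float dotProductPanama() {
        // Returning the result prevents dead-code elimination.
        return DotProductSketch.dotProduct(a, b);
    }
}
```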

The ann-benchmarks also got faster:

[image: ann-benchmarks results chart, 96% recall at 204 qps]

There was some variability in the results. They went up as high as 96% recall at 211 qps in f7d082f:

[image: ann-benchmarks results chart, 96% recall at 211 qps]

Testing and Validation

Standard CI and benchmarking (see above)

@alexklibisz changed the title from "Performance: use four parallel accumulators in PanamaFloatVectorOps.dotProduct" to "Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance" Dec 3, 2023
@alexklibisz changed the title from "Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance" to "Performance: use parallel accumulators to speed up PanamaFloatVectorOps dotProduct and l2Distance (96% recall at 204 qps)" Dec 3, 2023
@alexklibisz marked this pull request as ready for review December 3, 2023 18:55
@alexklibisz merged commit ac3fed8 into main Dec 3, 2023
5 checks passed
@alexklibisz deleted the dotproduct-parallel-accumulators branch December 3, 2023 18:55