
Add CompressionLevel Calculation for PQ #2200

Merged

Conversation

jmazanec15
Member

Description

Currently, for product quantization, we set the calculated compression level to NOT_CONFIGURED. The main issue with this is that if a user sets up a disk-based index with PQ, no re-scoring will happen by default.

This change adds the calculation so that the proper re-scoring will happen. The formula is fairly straightforward:
actual compression = (d * 32) / (m * code_size). Then, we round to the nearest compression level (because we only support discrete compression levels).

One small issue with this is that if PQ is configured to have compression > 32x, the value will be 32x. Functionally, the only issue will be that we may not be as aggressive on oversampling for on disk mode.
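
For illustration, here is a minimal, self-contained sketch of the rounding described above. The class name, the list of discrete levels, and the example values are assumptions for illustration only, not the plugin's actual code:

// Illustrative sketch of the PQ compression-level rounding; not the plugin's implementation.
public final class PqCompressionExample {

    // Assumed set of discrete compression levels supported for on disk mode (1x through 32x).
    private static final int[] SUPPORTED_LEVELS = {1, 2, 4, 8, 16, 32};

    // actual compression = (d * 32) / (m * code_size), rounded to the nearest supported level.
    // Anything above 32x lands on 32x, which is why oversampling may be less aggressive there.
    static int nearestCompressionLevel(int dimension, int m, int codeSize) {
        double actual = (dimension * 32.0) / (m * codeSize);
        int best = SUPPORTED_LEVELS[0];
        double bestDistance = Double.MAX_VALUE;
        for (int level : SUPPORTED_LEVELS) {
            double distance = Math.abs(level - actual);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = level;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Example: d = 768, m = 96, code_size = 8 -> (768 * 32) / (96 * 8) = 32x
        System.out.println(nearestCompressionLevel(768, 96, 8));
    }
}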

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added the Enhancements (Increases software capabilities beyond original client specifications) and backport 2.x labels Oct 9, 2024
@jmazanec15 jmazanec15 force-pushed the pq-compression-level-fix branch 4 times, most recently from f2d7c89 to 907f1ec on October 10, 2024 17:01
@navneet1v
Collaborator

One small issue with this is that if PQ is configured to have compression > 32x, the value will be 32x. Functionally, the only issue will be that we may not be as aggressive on oversampling for on disk mode.

should we allow more compression levels?

@@ -29,6 +29,8 @@ public enum CompressionLevel {
x16(16, "16x", new RescoreContext(3.0f, false), Set.of(Mode.ON_DISK)),
x32(32, "32x", new RescoreContext(3.0f, false), Set.of(Mode.ON_DISK));

public static final CompressionLevel MAX_COMPRESSION_LEVEL = CompressionLevel.x32;
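
For context, a hedged sketch of how such a constant could be used to cap a calculated level follows; the getValue() accessor and the variable names here are assumptions, not necessarily the PR's actual call site:

// Hypothetical clamp against the new constant; the accessor name is an assumption.
CompressionLevel capped = calculated.getValue() > CompressionLevel.MAX_COMPRESSION_LEVEL.getValue()
        ? CompressionLevel.MAX_COMPRESSION_LEVEL
        : calculated;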
Collaborator


should we have 64 as a max compression level? I don't have a solid point to have 64 but I think have 1 more extra compression is always good.

Member Author


I think that makes sense. I guess for 64x, default for all dimensions should probably be 5x.
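
If a 64x level were added along those lines, it would presumably mirror the existing entries, roughly like the following sketch; the 5.0f oversample factor follows the comment above, and the exact RescoreContext arguments are an assumption:

x64(64, "64x", new RescoreContext(5.0f, false), Set.of(Mode.ON_DISK)); // hypothetical new max level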

Collaborator

@navneet1v navneet1v left a comment


Overall looks good to me. Just one minor comment.

Collaborator

@heemin32 heemin32 left a comment


LGTM. Do we want to add an integ test for it? Recall should be higher with rescoring than the baseline.

heemin32 previously approved these changes Oct 10, 2024
Currently, for product quantization, we set the calculated compression
level to NOT_CONFIGURED. The main issue with this is that if a user sets
up a disk-based index with PQ, no re-scoring will happen by default.

This change adds the calculation so that the proper re-scoring will
happen. The formula is fairly straightforward:
actual compression = (d * 32) / (m * code_size). Then, we round to the
nearest compression level (because we only support discrete compression
levels).

One small issue with this is that if PQ is configured to have
compression > 32x, the value will be 32x. Functionally, the only issue
will be that we may not be as aggressive on oversampling for on disk
mode.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 merged commit 228aead into opensearch-project:main Oct 16, 2024
31 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 16, 2024
Currently, for product quantization, we set the calculated compression
level to NOT_CONFIGURED. The main issue with this is that if a user sets
up a disk-based index with PQ, no re-scoring will happen by default.

This change adds the calculation so that the proper re-scoring will
happen. The formula is fairly straightforward:
actual compression = (d * 32) / (m * code_size). Then, we round to the
nearest compression level (because we only support discrete compression
levels).

One small issue with this is that if PQ is configured to have
compression > 64x, the value will be 64x. Functionally, the only issue
will be that we may not be as aggressive on oversampling for on disk
mode.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 228aead)
@navneet1v
Collaborator

@jmazanec15 did we do any benchmarks that suggest how much improvement we will get with PQ-based rescoring?

@jmazanec15
Member Author

@navneet1v we did here: #1779 (comment)

@navneet1v
Collaborator

@navneet1v we did here: #1779 (comment)

Thanks. I completely forgot about that.
