Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add max token score for SparseEncodingQueryBuilder and do renaming #348

Merged
merged 7 commits into from
Oct 1, 2023

Conversation

zhichao-aws
Copy link
Member

@zhichao-aws zhichao-aws commented Sep 27, 2023

Description

For sparse semantic retrieval in neural search, we use lucene FeatureField for storage and use lucene FeatureQuery to search. The feature queries of input tokens are wrapped by lucene BooleanQuery, which use WAND algorithm to accelerate the execution. The WAND algorithm leverage the score upper bound of sub-queries to skip non-competitive tokens. However, origin lucene FeatureQuery use Float.MAX_VALUE as the score upper bound, and this invalidates WAND.

To mitigate this issue, we rewrite the FeatureQuery to BoundedLinearFeatureQuery. The caller can set the token score upperbound of this query. And according to our use case, we use LinearFunction as the score function.

We have conducted several end to end benchmark experiments with this optimization. Using a doc-only SPLADE like model, this optimization reduce the query latency from P90 40ms to P90 26ms (1 million docs), and reduce the query latency from P90 231ms to P90 80ms (8.8 million docs).

After lucene version 9.8, the FeatureQuery are rewritten, and lucene optimize the speed for BooleanQuery. Then this optimization is no longer needed. However, we're most likely not upgrade to lucene 9.8 in 2.11 release opensearch-project/OpenSearch#8668. So we create this PR for 2.x only, and in main, we'll create another PR to deperacate the max_token_score parameter in sparse query clause.

Issues Resolved

#230

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
private final String featureName;
private final Float scoreUpperBound;

public BoundedLinearFeatureQuery(String fieldName, String featureName, Float scoreUpperBound) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we comment on the lines which we have added on top Lucene code so that it is easy to debug and fix later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can treat this a brand new class, it combines the FeatureQuery class and some methods from FeatureField class in lucene, we'll add some comments on top of the core function that makes our new parameter work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes lets add comments over lines which are copied so that we are aware of from where the code is coming from

@@ -99,6 +104,7 @@ protected void doXContent(XContentBuilder xContentBuilder, Params params) throws
xContentBuilder.startObject(fieldName);
xContentBuilder.field(QUERY_TEXT_FIELD.getPreferredName(), queryText);
xContentBuilder.field(MODEL_ID_FIELD.getPreferredName(), modelId);
if (null != maxTokenScore) xContentBuilder.field(MAX_TOKEN_SCORE_FIELD.getPreferredName(), maxTokenScore);
printBoostAndQueryName(xContentBuilder);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to print ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The print is not logging, but output to XContent. The boost and query name are also parameters and should be included in the XContent

}

// the field and decodeFeatureValue are modified from FeatureField.decodeFeatureValue
static final int MAX_FREQ = Float.floatToIntBits(Float.MAX_VALUE) >>> 15;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the value of this MAX_FREQ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from https://github.com/apache/lucene/blob/6d764c3397d00f93bd4273bd8d1c9e51d6e104e6/lucene/core/src/java/org/apache/lucene/document/FeatureField.java#L207, including the below method is from FeatureField but we changed it to make our new parameter scoreUpperBound work.

@navneet1v
Copy link
Collaborator

Minor comments on the code.

* Address code review comments

Signed-off-by: zane-neo <[email protected]>

* Change lower case neural_sparse to upper case

Signed-off-by: zane-neo <[email protected]>

* Change back processor type name to sparse_encoding

Signed-off-by: zane-neo <[email protected]>

* Change names

Signed-off-by: zane-neo <[email protected]>

* Format code

Signed-off-by: zane-neo <[email protected]>

---------

Signed-off-by: zane-neo <[email protected]>
@zhichao-aws zhichao-aws changed the title Add max token score for SparseEncodingQueryBuilder Add max token score for SparseEncodingQueryBuilder and do renaming Sep 28, 2023
@zhichao-aws
Copy link
Member Author

Also include renaming changes in this PR. The renaming changes in main: #353

1 similar comment
@zhichao-aws
Copy link
Member Author

Also include renaming changes in this PR. The renaming changes in main: #353

@codecov
Copy link

codecov bot commented Sep 28, 2023

Codecov Report

Merging #348 (ddb5d69) into 2.x (415082e) will decrease coverage by 4.26%.
The diff coverage is 14.11%.

@@             Coverage Diff              @@
##                2.x     #348      +/-   ##
============================================
- Coverage     84.56%   80.30%   -4.26%     
- Complexity      427      429       +2     
============================================
  Files            35       36       +1     
  Lines          1289     1366      +77     
  Branches        189      200      +11     
============================================
+ Hits           1090     1097       +7     
- Misses          118      186      +68     
- Partials         81       83       +2     
Files Coverage Δ
...rch/neuralsearch/processor/InferenceProcessor.java 92.71% <100.00%> (ø)
...euralsearch/processor/SparseEncodingProcessor.java 100.00% <ø> (ø)
...neuralsearch/processor/TextEmbeddingProcessor.java 100.00% <ø> (ø)
.../opensearch/neuralsearch/util/TokenWeightUtil.java 86.36% <ø> (ø)
...g/opensearch/neuralsearch/plugin/NeuralSearch.java 73.68% <0.00%> (ø)
...h/neuralsearch/query/NeuralSparseQueryBuilder.java 65.60% <61.11%> (ø)
...a/org/apache/lucene/BoundedLinearFeatureQuery.java 0.00% <0.00%> (ø)

@navneet1v
Copy link
Collaborator

@zane-neo @zhichao-aws the unit tests have not covered some lines leading to failures of GH workflow, can we fix it

return searcher.rewrite(tq).createWeight(searcher, scoreMode, boost);
}

return new Weight(this) {
Copy link
Collaborator

@navneet1v navneet1v Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this weight class as inner class or a separate class? This will ensure that we can properly test it and abstract this weight class.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from lucene FeatureQuery, I think we can save the UT effort since this feature will only live a very short time(one version) and the added method is already covered.


@Override
public boolean isCacheable(LeafReaderContext ctx) {
return false;
Copy link
Collaborator

@navneet1v navneet1v Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not cacheable? can you add comments around this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also copied code.


@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
Terms terms = Terms.getTerms(context.reader(), fieldName);
Copy link
Collaborator

@navneet1v navneet1v Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen in the case when the field with fieldName is not present

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When creating this new object, the fieldName is required non null.

@@ -51,7 +51,7 @@ public abstract class NLPProcessor extends AbstractProcessor {

private final Environment environment;

public NLPProcessor(
public InferenceProcessor(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check comments related to this in the main branch PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please checkout the comments in 353.

@zane-neo zane-neo merged commit 7ae48df into opensearch-project:2.x Oct 1, 2023
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants