Add max token score for SparseEncodingQueryBuilder and do renaming #348

zhichao-aws · 2023-09-27T13:05:01Z

Description

For sparse semantic retrieval in neural search, we use Lucene FeatureField for storage and Lucene FeatureQuery for search. The feature queries for the input tokens are wrapped in a Lucene BooleanQuery, which uses the WAND algorithm to accelerate execution. The WAND algorithm leverages the score upper bounds of the sub-queries to skip non-competitive tokens. However, the original Lucene FeatureQuery uses Float.MAX_VALUE as the score upper bound, which invalidates WAND.
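
To illustrate why a Float.MAX_VALUE bound defeats the pruning, here is a minimal, simplified sketch of a WAND-style check. It is not Lucene's implementation; the class `WandBoundSketch` and the method `canSkip` are hypothetical names used only for illustration. The idea is that a candidate can be skipped without scoring when the sum of the upper bounds of its matching sub-queries is below the current minimum competitive score.

```java
import java.util.List;

// Simplified illustration of upper-bound pruning; not Lucene code.
final class WandBoundSketch {

    /**
     * Returns true when the best possible score (sum of per-term upper bounds)
     * is below the minimum competitive score, so the candidate can be skipped.
     */
    static boolean canSkip(List<Float> upperBoundsOfMatchingTerms, float minCompetitiveScore) {
        double bestPossible = 0;
        for (float upperBound : upperBoundsOfMatchingTerms) {
            bestPossible += upperBound;
        }
        return bestPossible < minCompetitiveScore;
    }

    public static void main(String[] args) {
        float minCompetitiveScore = 5.0f;
        // Tight bounds: best possible score is 3.5, so the candidate is skipped.
        System.out.println(canSkip(List.of(1.5f, 2.0f), minCompetitiveScore));
        // Float.MAX_VALUE bounds: the sum is never below the threshold, so nothing is skipped.
        System.out.println(canSkip(List.of(Float.MAX_VALUE, Float.MAX_VALUE), minCompetitiveScore));
    }
}
```

With realistic bounds the first call returns true, while the second call shows that an unbounded sub-query can never be pruned.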

To mitigate this issue, we rewrite FeatureQuery as BoundedLinearFeatureQuery. The caller can set the token score upper bound of this query. For our use case, we use LinearFunction as the score function.
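
As a rough sketch of the idea (not the actual BoundedLinearFeatureQuery code), a linear score function with a caller-supplied bound could look like the following. The class `BoundedLinearSketch` and its members are hypothetical; the sketch assumes the score is `weight * tokenWeight` and that the caller guarantees no stored token weight exceeds `scoreUpperBound`.

```java
// Hypothetical sketch of a linear score function with a caller-supplied upper bound.
final class BoundedLinearSketch {
    private final float weight;          // query-time weight (boost) of the token
    private final float scoreUpperBound; // assumed maximum stored token weight

    BoundedLinearSketch(float weight, float scoreUpperBound) {
        if (weight <= 0 || scoreUpperBound <= 0) {
            throw new IllegalArgumentException("weight and scoreUpperBound must be positive");
        }
        this.weight = weight;
        this.scoreUpperBound = scoreUpperBound;
    }

    /** Linear score: proportional to the stored token weight. */
    float score(float tokenWeight) {
        return weight * tokenWeight;
    }

    /** Finite maximum score that a WAND-style collector can use for pruning. */
    float maxScore() {
        return weight * scoreUpperBound;
    }

    public static void main(String[] args) {
        BoundedLinearSketch query = new BoundedLinearSketch(2.0f, 3.5f);
        System.out.println(query.score(1.25f)); // 2.5
        System.out.println(query.maxScore());   // 7.0
    }
}
```

The bound is only useful if it is both finite and valid: if a stored weight were larger than the configured bound, upper-bound pruning could skip a document that is actually competitive.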

We have conducted several end-to-end benchmark experiments with this optimization. Using a doc-only SPLADE-like model, this optimization reduces P90 query latency from 40ms to 26ms (1 million docs) and from 231ms to 80ms (8.8 million docs).

Starting with Lucene 9.8, FeatureQuery has been rewritten and Lucene optimizes the speed of BooleanQuery, so this optimization is no longer needed. However, we will most likely not upgrade to Lucene 9.8 in the 2.11 release (opensearch-project/OpenSearch#8668). So we are creating this PR for 2.x only; in main, we'll create another PR to deprecate the max_token_score parameter in the sparse query clause.

Issues Resolved

#230

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>

zhichao-aws · 2023-09-27T13:18:16Z

The changes we made to origin FeatureQuery:
https://github.com/opensearch-project/neural-search/pull/348/files/9d4a591e3bc62715efd635380cb2cea38ecc54fd..df99d0b4ab43e3eb8648e8384f3b804c309ea73c#diff-ff541c86a7b171bd13284d781b39e31ce83861938b6acea54894c0e5d6702bba

navneet1v · 2023-09-27T18:47:47Z

src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java

+    private final String featureName;
+    private final Float scoreUpperBound;
+
+    public BoundedLinearFeatureQuery(String fieldName, String featureName, Float scoreUpperBound) {


Can we comment the lines that we have added on top of the Lucene code, so that it is easy to debug and fix later?

We can treat this as a brand-new class: it combines the FeatureQuery class and some methods from the FeatureField class in Lucene. We'll add comments on top of the core function that makes our new parameter work.

Yes, let's add comments over the lines that are copied, so that we know where the code comes from.

navneet1v · 2023-09-27T18:51:31Z

src/main/java/org/opensearch/neuralsearch/query/SparseEncodingQueryBuilder.java

@@ -99,6 +104,7 @@ protected void doXContent(XContentBuilder xContentBuilder, Params params) throws
        xContentBuilder.startObject(fieldName);
        xContentBuilder.field(QUERY_TEXT_FIELD.getPreferredName(), queryText);
        xContentBuilder.field(MODEL_ID_FIELD.getPreferredName(), modelId);
+        if (null != maxTokenScore) xContentBuilder.field(MAX_TOKEN_SCORE_FIELD.getPreferredName(), maxTokenScore);
        printBoostAndQueryName(xContentBuilder);


Why do we need to print?

This is similar to the existing code: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/query/NeuralQueryBuilder.java#L120. I think this is for debugging purposes.

The print is not logging; it writes to XContent. The boost and query name are also parameters and should be included in the XContent.

navneet1v · 2023-09-27T18:53:44Z

src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java

+    }
+
+    // the field and decodeFeatureValue are modified from FeatureField.decodeFeatureValue
+    static final int MAX_FREQ = Float.floatToIntBits(Float.MAX_VALUE) >>> 15;


What is the value of this MAX_FREQ?

This is copied from https://github.com/apache/lucene/blob/6d764c3397d00f93bd4273bd8d1c9e51d6e104e6/lucene/core/src/java/org/apache/lucene/document/FeatureField.java#L207. The method below it is also from FeatureField, but we changed it to make our new parameter scoreUpperBound work.
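
For reference, the constant can be evaluated directly. The sketch below is self-contained and does not depend on Lucene; the class `FeatureValueCodecSketch` and the helpers `encode` and `decode` are hypothetical names that mirror how FeatureField stores a feature value as a term frequency (the float's bits shifted right by 15). It does not reproduce the PR's modified method.

```java
// Standalone illustration of the feature value encoding; names are hypothetical.
final class FeatureValueCodecSketch {
    // Same expression as in the diff above: the largest encodable frequency.
    static final int MAX_FREQ = Float.floatToIntBits(Float.MAX_VALUE) >>> 15;

    /** Keeps the sign bit, exponent, and top 8 mantissa bits of a positive float. */
    static int encode(float featureValue) {
        return Float.floatToIntBits(featureValue) >>> 15;
    }

    /** Restores the float from the stored frequency; the dropped mantissa bits become zero. */
    static float decode(int freq) {
        return Float.intBitsToFloat(freq << 15);
    }

    public static void main(String[] args) {
        System.out.println(MAX_FREQ);             // 65279 (0xFEFF)
        System.out.println(decode(encode(1.0f))); // 1.0, exactly representable
        // Values that need more mantissa bits are rounded down slightly.
        System.out.println(decode(encode(0.1f)) <= 0.1f);
    }
}
```

So MAX_FREQ is 65279, and any frequency above it cannot correspond to a finite float feature value.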

navneet1v · 2023-09-27T18:55:29Z

Minor comments on the code.

* Address code review comments
  Signed-off-by: zane-neo <[email protected]>
* Change lower case neural_sparse to upper case
  Signed-off-by: zane-neo <[email protected]>
* Change back processor type name to sparse_encoding
  Signed-off-by: zane-neo <[email protected]>
* Change names
  Signed-off-by: zane-neo <[email protected]>
* Format code
  Signed-off-by: zane-neo <[email protected]>

---------
Signed-off-by: zane-neo <[email protected]>

zhichao-aws · 2023-09-28T06:16:03Z

Also include renaming changes in this PR. The renaming changes in main: #353

codecov · 2023-09-28T06:19:05Z

Codecov Report

Merging #348 (ddb5d69) into 2.x (415082e) will decrease coverage by 4.26%.
The diff coverage is 14.11%.

@@             Coverage Diff              @@
##                2.x     #348      +/-   ##
============================================
- Coverage     84.56%   80.30%   -4.26%     
- Complexity      427      429       +2     
============================================
  Files            35       36       +1     
  Lines          1289     1366      +77     
  Branches        189      200      +11     
============================================
+ Hits           1090     1097       +7     
- Misses          118      186      +68     
- Partials         81       83       +2

Files	Coverage Δ
...rch/neuralsearch/processor/InferenceProcessor.java	`92.71% <100.00%> (ø)`
...euralsearch/processor/SparseEncodingProcessor.java	`100.00% <ø> (ø)`
...neuralsearch/processor/TextEmbeddingProcessor.java	`100.00% <ø> (ø)`
.../opensearch/neuralsearch/util/TokenWeightUtil.java	`86.36% <ø> (ø)`
...g/opensearch/neuralsearch/plugin/NeuralSearch.java	`73.68% <0.00%> (ø)`
...h/neuralsearch/query/NeuralSparseQueryBuilder.java	`65.60% <61.11%> (ø)`
...a/org/apache/lucene/BoundedLinearFeatureQuery.java	`0.00% <0.00%> (ø)`

navneet1v · 2023-09-28T16:25:23Z

@zane-neo @zhichao-aws the unit tests do not cover some lines, which is causing the GH workflow to fail. Can we fix it?

navneet1v · 2023-09-28T16:38:12Z

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

+            return searcher.rewrite(tq).createWeight(searcher, scoreMode, boost);
+        }
+
+        return new Weight(this) {


Can we move this Weight class to an inner class or a separate class? That will ensure that we can test it properly and abstract it.

This is copied from Lucene FeatureQuery. I think we can save the UT effort, since this feature will only live for a very short time (one version) and the added method is already covered.

navneet1v · 2023-09-28T16:39:06Z

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

+
+            @Override
+            public boolean isCacheable(LeafReaderContext ctx) {
+                return false;


Why is this not cacheable? Can you add comments around this?

This is also copied code.

navneet1v · 2023-09-28T16:40:37Z

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

+
+            @Override
+            public Scorer scorer(LeafReaderContext context) throws IOException {
+                Terms terms = Terms.getTerms(context.reader(), fieldName);


What will happen when the field named fieldName is not present?

When this new object is created, fieldName is required to be non-null.

navneet1v · 2023-09-28T16:43:05Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

@@ -51,7 +51,7 @@ public abstract class NLPProcessor extends AbstractProcessor {

    private final Environment environment;

-    public NLPProcessor(
+    public InferenceProcessor(


Check the comments related to this in the main branch PR.

Please check out the comments in #353.

zhichao-aws added 5 commits September 27, 2023 18:13

add lucene FeatureQuery

9d4a591

Signed-off-by: zhichao-aws <[email protected]>

add max token score

4da8ff2

Signed-off-by: zhichao-aws <[email protected]>

add comments

df99d0b

Signed-off-by: zhichao-aws <[email protected]>

add check and test

ee9d27d

Signed-off-by: zhichao-aws <[email protected]>

add doc

f39b63e

Signed-off-by: zhichao-aws <[email protected]>

zhichao-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, wujunshen, zane-neo, ylwu-amzn and jngz-es as code owners September 27, 2023 13:05

add change log

86b0d9c

Signed-off-by: zhichao-aws <[email protected]>

navneet1v reviewed Sep 27, 2023

src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java (outdated)

navneet1v reviewed Sep 27, 2023

src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java (outdated)

navneet1v reviewed Sep 27, 2023

zhichao-aws changed the title ~~Add max token score for SparseEncodingQueryBuilder~~ Add max token score for SparseEncodingQueryBuilder and do renaming Sep 28, 2023

zane-neo approved these changes Sep 28, 2023

navneet1v reviewed Sep 28, 2023

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

navneet1v reviewed Sep 28, 2023

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

navneet1v reviewed Sep 28, 2023

jmazanec15 reviewed Sep 28, 2023

src/main/java/org/apache/lucene/BoundedLinearFeatureQuery.java

model-collapse approved these changes Oct 1, 2023

zane-neo merged commit 7ae48df into opensearch-project:2.x Oct 1, 2023
13 of 15 checks passed

zhichao-aws mentioned this pull request Apr 18, 2024

[BUG FIX] Fix bwc failure in neural sparse search #696

Merged

5 tasks
