-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add max token score for SparseEncodingQueryBuilder and do renaming #348
Add max token score for SparseEncodingQueryBuilder and do renaming #348
Conversation
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/query/BoundedLinearFeatureQuery.java
Outdated
Show resolved
Hide resolved
private final String featureName; | ||
private final Float scoreUpperBound; | ||
|
||
public BoundedLinearFeatureQuery(String fieldName, String featureName, Float scoreUpperBound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we comment on the lines which we have added on top Lucene code so that it is easy to debug and fix later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can treat this a brand new class, it combines the FeatureQuery class and some methods from FeatureField class in lucene, we'll add some comments on top of the core function that makes our new parameter work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes lets add comments over lines which are copied so that we are aware of from where the code is coming from
src/main/java/org/opensearch/neuralsearch/query/SparseEncodingQueryBuilder.java
Outdated
Show resolved
Hide resolved
@@ -99,6 +104,7 @@ protected void doXContent(XContentBuilder xContentBuilder, Params params) throws | |||
xContentBuilder.startObject(fieldName); | |||
xContentBuilder.field(QUERY_TEXT_FIELD.getPreferredName(), queryText); | |||
xContentBuilder.field(MODEL_ID_FIELD.getPreferredName(), modelId); | |||
if (null != maxTokenScore) xContentBuilder.field(MAX_TOKEN_SCORE_FIELD.getPreferredName(), maxTokenScore); | |||
printBoostAndQueryName(xContentBuilder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to print ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to exist code: https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/query/NeuralQueryBuilder.java#L120, I think this is for debugging purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The print is not logging, but output to XContent. The boost and query name are also parameters and should be included in the XContent
src/main/java/org/opensearch/neuralsearch/query/SparseEncodingQueryBuilder.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
// the field and decodeFeatureValue are modified from FeatureField.decodeFeatureValue | ||
static final int MAX_FREQ = Float.floatToIntBits(Float.MAX_VALUE) >>> 15; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the value of this MAX_FREQ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied from https://github.com/apache/lucene/blob/6d764c3397d00f93bd4273bd8d1c9e51d6e104e6/lucene/core/src/java/org/apache/lucene/document/FeatureField.java#L207, including the below method is from FeatureField
but we changed it to make our new parameter scoreUpperBound work.
Minor comments on the code. |
* Address code review comments Signed-off-by: zane-neo <[email protected]> * Change lower case neural_sparse to upper case Signed-off-by: zane-neo <[email protected]> * Change back processor type name to sparse_encoding Signed-off-by: zane-neo <[email protected]> * Change names Signed-off-by: zane-neo <[email protected]> * Format code Signed-off-by: zane-neo <[email protected]> --------- Signed-off-by: zane-neo <[email protected]>
Also include renaming changes in this PR. The renaming changes in main: #353 |
1 similar comment
Also include renaming changes in this PR. The renaming changes in main: #353 |
Codecov Report
@@ Coverage Diff @@
## 2.x #348 +/- ##
============================================
- Coverage 84.56% 80.30% -4.26%
- Complexity 427 429 +2
============================================
Files 35 36 +1
Lines 1289 1366 +77
Branches 189 200 +11
============================================
+ Hits 1090 1097 +7
- Misses 118 186 +68
- Partials 81 83 +2
|
@zane-neo @zhichao-aws the unit tests have not covered some lines leading to failures of GH workflow, can we fix it |
return searcher.rewrite(tq).createWeight(searcher, scoreMode, boost); | ||
} | ||
|
||
return new Weight(this) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this weight class as inner class or a separate class? This will ensure that we can properly test it and abstract this weight class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied from lucene FeatureQuery, I think we can save the UT effort since this feature will only live a very short time(one version) and the added method is already covered.
|
||
@Override | ||
public boolean isCacheable(LeafReaderContext ctx) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this is not cacheable? can you add comments around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also copied code.
|
||
@Override | ||
public Scorer scorer(LeafReaderContext context) throws IOException { | ||
Terms terms = Terms.getTerms(context.reader(), fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what will happen in the case when the field with fieldName
is not present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When creating this new object, the fieldName is required non null.
@@ -51,7 +51,7 @@ public abstract class NLPProcessor extends AbstractProcessor { | |||
|
|||
private final Environment environment; | |||
|
|||
public NLPProcessor( | |||
public InferenceProcessor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check comments related to this in the main branch PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please checkout the comments in 353.
Description
For sparse semantic retrieval in neural search, we use lucene FeatureField for storage and use lucene FeatureQuery to search. The feature queries of input tokens are wrapped by lucene BooleanQuery, which use WAND algorithm to accelerate the execution. The WAND algorithm leverage the score upper bound of sub-queries to skip non-competitive tokens. However, origin lucene FeatureQuery use Float.MAX_VALUE as the score upper bound, and this invalidates WAND.
To mitigate this issue, we rewrite the FeatureQuery to BoundedLinearFeatureQuery. The caller can set the token score upperbound of this query. And according to our use case, we use LinearFunction as the score function.
We have conducted several end to end benchmark experiments with this optimization. Using a doc-only SPLADE like model, this optimization reduce the query latency from P90 40ms to P90 26ms (1 million docs), and reduce the query latency from P90 231ms to P90 80ms (8.8 million docs).
After lucene version 9.8, the FeatureQuery are rewritten, and lucene optimize the speed for BooleanQuery. Then this optimization is no longer needed. However, we're most likely not upgrade to lucene 9.8 in 2.11 release opensearch-project/OpenSearch#8668. So we create this PR for 2.x only, and in main, we'll create another PR to deperacate the
max_token_score
parameter in sparse query clause.Issues Resolved
#230
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.