-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid negative scores returned from multi_match
query with cross_fields
#13829
Avoid negative scores returned from multi_match
query with cross_fields
#13829
Conversation
a50898c
to
2353a42
Compare
❌ Gradle check result for a50898c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java
Outdated
Show resolved
Hide resolved
For some context, I came up with this fix after talking myself through the logic of the (previously) failing test in https://github.com/opensearch-project/OpenSearch/pull/13627/files#r1614240922 |
@msfroh mind please backport to 2.x manually? thank you |
…lds` (opensearch-project#13829) Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula. Specifically, the IDF is calculated as: ``` log(1 + (N - n + 0.5) / (n + 0.5)) ``` where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`. Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields. This change finds the (nonzero) value of `N` for each field and uses that as an upper bound for the new value of `n`. Signed-off-by: Michael Froh <[email protected]> --------- Signed-off-by: Michael Froh <[email protected]> (cherry picked from commit fffd101)
…lds` (opensearch-project#13829) Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula. Specifically, the IDF is calculated as: ``` log(1 + (N - n + 0.5) / (n + 0.5)) ``` where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`. Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields. This change finds the (nonzero) value of `N` for each field and uses that as an upper bound for the new value of `n`. Signed-off-by: Michael Froh <[email protected]> --------- Signed-off-by: Michael Froh <[email protected]> (cherry picked from commit fffd101)
Backport PR is ready: #13983 |
…lds` (opensearch-project#13829) Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula. Specifically, the IDF is calculated as: ``` log(1 + (N - n + 0.5) / (n + 0.5)) ``` where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`. Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields. This change finds the (nonzero) value of `N` for each field and uses that as an upper bound for the new value of `n`. Signed-off-by: Michael Froh <[email protected]> --------- Signed-off-by: Michael Froh <[email protected]>
…lds` (opensearch-project#13829) (opensearch-project#13983) Signed-off-by: kkewwei <[email protected]>
…lds` (opensearch-project#13829) Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula. Specifically, the IDF is calculated as: ``` log(1 + (N - n + 0.5) / (n + 0.5)) ``` where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`. Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields. This change finds the (nonzero) value of `N` for each field and uses that as an upper bound for the new value of `n`. Signed-off-by: Michael Froh <[email protected]> --------- Signed-off-by: Michael Froh <[email protected]>
Description
Under specific circumstances, when using
cross_fields
scoring on amulti_match
query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula.Specifically, the IDF is calculated as:
where
N
is the number of documents containing the field andn
is the number of documents containing the given term in the field. Obviously,n
should always be less than or equal toN
.Unfortunately,
cross_fields
makes up a new value forn
and tries to use it across all fields.This change finds the minimum (nonzero) value of
N
and uses that as an upper bound for the new value ofn
.Related Issues
Resolves #7860
Check List
New functionality has been documented.New functionality has javadoc addedAPI changes companion pull request created.Public documentation issue/PR createdBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.