Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds various Query overrides to Keyword Field to enable doc_values #10425

Merged
merged 1 commit into from
Oct 31, 2023

Conversation

harshavamsi
Copy link
Contributor

@harshavamsi harshavamsi commented Oct 5, 2023

Description

The keywordfield mapper provides access to various query types, e.g. the termsQuery, fuzzyQuery. These are inherited as is from the StringType. But we do not take into account the fact that keyword fields can have doc_values enabled. This PR adds the ability for various queries to first check if doc_values are enabled and if so outsource the work to lucene to decide if it's better to use index values or doc_values when running queries.

Related Issues

Resolves #7057

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@harshavamsi harshavamsi changed the title Adds various Query overrides to Keyword Field to enabled doc_values Adds various Query overrides to Keyword Field to enable doc_values Oct 5, 2023
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Performance This is for any performance related enhancements or bugs Search:Performance v2.12.0 Issues and PRs related to version 2.12.0 labels Oct 5, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Compatibility status:

Checks if related components are compatible with change a80478b

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]

@noCharger
Copy link
Contributor

REF https://build.ci.opensearch.org/job/gradle-check/27203/console

REPRODUCE WITH: ./gradlew ':qa:mixed-cluster:v2.11.0#mixedClusterTest' --tests "org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT" -Dtests.method="test {p0=search.highlight/40_keyword_ignore/Plain Highligher should skip highlighting ignored keyword values}" -Dtests.seed=66EE8CD404E0EC5B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=bg -Dtests.timezone=America/Guatemala -Druntime.java=20

org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT > test {p0=search.highlight/40_keyword_ignore/Plain Highligher should skip highlighting ignored keyword values} FAILED
java.lang.AssertionError: Failure at [search.highlight/40_keyword_ignore:42]: field [hits.hits.0.highlight.k2.0] is null
at __randomizedtesting.SeedInfo.seed([66EE8CD404E0EC5B:EEBAB30EAA1C81A3]:0)
at org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.executeSection(OpenSearchClientYamlSuiteTestCase.java:460)
at org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.test(OpenSearchClientYamlSuiteTestCase.java:433)
at java./jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.
/java.lang.reflect.Method.invoke(Method.java:578)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)

@harshavamsi
Copy link
Contributor Author

harshavamsi commented Oct 10, 2023

The issue here seems to be that the change I introduced seems to not take into consideration the highlight object in a query

 "highlight": {
        "require_field_match": false,
        "fields": {
        "k2": {
            "type": "plain"
        }
        }
    }

this above query fails to return the highlight field.

@harshavamsi
Copy link
Contributor Author

https://github.com/opensearch-project/opensearch/blob/e3542011cd9584c354405ba6cbf9a31ced7bc5ce/server/src/main/java/org/opensearch/search/fetch/subphase/highlight/PlainHighlighter.java#L197-L204

This here seems to be the issue.

Debugging through the code seems to indicate that noMatchSize is set to 0 when the textsToHighlight is present.

image

image

@harshavamsi
Copy link
Contributor Author

@msfroh need some help understanding the highlighter subphase here. The use of IndexorDocValue here is impacting the way the highlighting is happening.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@msfroh
Copy link
Collaborator

msfroh commented Oct 13, 2023

Met with @harshavamsi today to discuss the failing highlighting.

IMO, we should open a Lucene issue to unwrap the IndexOrDocValuesQuery in https://github.com/apache/lucene/blob/main/lucene/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java#L128

For now, we can handle it in

protected void extract(Query query, float boost, Map<String, WeightedSpanTerm> terms) throws IOException {

@msfroh
Copy link
Collaborator

msfroh commented Oct 13, 2023

@harshavamsi, can you please add YAML integ tests to cover the docvalue-only cases?

I'm sure that we don't have tests for those because they wouldn't have worked until this commit. (Yay! We're making searches work that wouldn't have previously been possible.)

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

Merging #10425 (29f79fe) into main (448635f) will decrease coverage by 0.04%.
Report is 3 commits behind head on main.
The diff coverage is 69.07%.

❗ Current head 29f79fe differs from pull request most recent head a80478b. Consider uploading reports for the commit a80478b to get more accurate results

@@             Coverage Diff              @@
##               main   #10425      +/-   ##
============================================
- Coverage     71.26%   71.23%   -0.04%     
+ Complexity    58722    58429     -293     
============================================
  Files          4870     4845      -25     
  Lines        276608   275430    -1178     
  Branches      40207    40117      -90     
============================================
- Hits         197137   196202     -935     
+ Misses        63060    62816     -244     
- Partials      16411    16412       +1     
Files Coverage Δ
...a/org/opensearch/index/mapper/MappedFieldType.java 74.28% <66.66%> (-0.47%) ⬇️
...ch/fetch/subphase/highlight/CustomQueryScorer.java 29.03% <0.00%> (-2.01%) ⬇️
...rg/opensearch/index/mapper/KeywordFieldMapper.java 81.46% <70.78%> (-5.17%) ⬇️

... and 604 files with indirect coverage changes

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@harshavamsi harshavamsi force-pushed the range_over_keyword branch 2 times, most recently from 5907130 to 7e1de58 Compare October 17, 2023 16:55
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@harshavamsi
Copy link
Contributor Author

Gradle failure is un-related to my change:

java.lang.RuntimeException: Failure at [pit/10_basic:140]: Unexpected end-of-input: was expecting closing '"' for name
 at [Source: (org.opensearch.core.common.io.stream.InputStreamStreamInput); line: 1, column: 12]
	at __randomizedtesting.SeedInfo.seed([84E4CFE1B1B6F129:CB0F03B1F4A9CD1]:0)
	at org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.executeSection(OpenSearchClientYamlSuiteTestCase.java:462)
	at org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.test(OpenSearchClientYamlSuiteTestCase.java:433)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: was expecting closing '"' for name
 at [Source: (org.opensearch.core.common.io.stream.InputStreamStreamInput); line: 1, column: 12]
	at app//com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:697)
	at app//com.fasterxml.jackson.core.json.UTF8StreamJsonParser.slowParseName(UTF8StreamJsonParser.java:2003)
	at app//com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1798)
	at app//com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
	at app//org.opensearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:66)
	at app//org.opensearch.core.xcontent.AbstractXContentParser.readValueUnsafe(AbstractXContentParser.java:415)
	at app//org.opensearch.core.xcontent.AbstractXContentParser.readMapEntries(AbstractXContentParser.java:336)
	at app//org.opensearch.core.xcontent.AbstractXContentParser.readMapSafe(AbstractXContentParser.java:322)
	at app//org.opensearch.core.xcontent.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:277)
	at app//org.opensearch.test.rest.yaml.ObjectPath.createFromXContent(ObjectPath.java:75)
	at app//org.opensearch.test.rest.yaml.ClientYamlTestResponse.<init>(ClientYamlTestResponse.java:75)
	at app//org.opensearch.test.rest.yaml.ClientYamlTestClient.callApi(ClientYamlTestClient.java:227)
	at app//org.opensearch.test.rest.yaml.ClientYamlTestExecutionContext.callApiInternal(ClientYamlTestExecutionContext.java:201)
	at app//org.opensearch.test.rest.yaml.ClientYamlTestExecutionContext.callApi(ClientYamlTestExecutionContext.java:124)
	at app//org.opensearch.test.rest.yaml.section.DoSection.execute(DoSection.java:305)
	at app//org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.executeSection(OpenSearchClientYamlSuiteTestCase.java:449)
	... 39 more

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@msfroh
Copy link
Collaborator

msfroh commented Oct 30, 2023

@harshavamsi -- code looks good to me.

I'm retrying Gradle check. If it still fails, you might need to rebase.

Have you raised a documentation issue? We probably need to update the documentation for keyword fields at least to cover the fact that queries can still run on a non-indexed keyword field if it has doc values.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

The keywordfield mapper provides access to various query types, e.g. the termsQuery, fuzzyQuery. These are inherited as is from the StringType. But we do not take into account the fact that keyword fields can have doc_values enabled. This PR adds the ability for various queries to first check if doc_values are enabled and if so out-source the work to lucene to decide if it's better to use index values or doc_values when running queries.

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimaryMultipleReplicaShards

@harshavamsi
Copy link
Contributor Author

Failing test is flaky and unrelated to my change.

@reta @nknize @rishabhmaurya @andrross can you take a look?

@harshavamsi
Copy link
Contributor Author

@harshavamsi -- code looks good to me.

I'm retrying Gradle check. If it still fails, you might need to rebase.

Have you raised a documentation issue? We probably need to update the documentation for keyword fields at least to cover the fact that queries can still run on a non-indexed keyword field if it has doc values.

Raised opensearch-project/documentation-website#5453

@msfroh
Copy link
Collaborator

msfroh commented Oct 31, 2023

Failing test is flaky and unrelated to my change.

An "unstable" result is acceptable -- it means that a test failed on some runs but passed on others.

@msfroh msfroh merged commit 63aff16 into opensearch-project:main Oct 31, 2023
18 checks passed
@harshavamsi harshavamsi added the backport 2.x Backport to 2.x branch label Oct 31, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10425-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 63aff16c0d08bac44e1d1a6158ee3f838a043074
# Push it to GitHub
git push --set-upstream origin backport/backport-10425-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10425-to-2.x.

harshavamsi added a commit to harshavamsi/OpenSearch that referenced this pull request Oct 31, 2023
The keywordfield mapper provides access to various query types, e.g. the termsQuery, fuzzyQuery. These are inherited as is from the StringType. But we do not take into account the fact that keyword fields can have doc_values enabled. This PR adds the ability for various queries to first check if doc_values are enabled and if so out-source the work to lucene to decide if it's better to use index values or doc_values when running queries.

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
(cherry picked from commit 63aff16)
msfroh added a commit that referenced this pull request Nov 1, 2023
…#11031)

* Adds various Query overrides to Keyword Field (#10425)

The keywordfield mapper provides access to various query types, e.g. the termsQuery, fuzzyQuery. These are inherited as is from the StringType. But we do not take into account the fact that keyword fields can have doc_values enabled. This PR adds the ability for various queries to first check if doc_values are enabled and if so out-source the work to lucene to decide if it's better to use index values or doc_values when running queries.

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
(cherry picked from commit 63aff16)

* Update CHANGELOG.md

---------

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
Co-authored-by: Michael Froh <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
The keywordfield mapper provides access to various query types, e.g. the termsQuery, fuzzyQuery. These are inherited as is from the StringType. But we do not take into account the fact that keyword fields can have doc_values enabled. This PR adds the ability for various queries to first check if doc_values are enabled and if so out-source the work to lucene to decide if it's better to use index values or doc_values when running queries.

Signed-off-by: Harsha Vamsi Kalluri <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed enhancement Enhancement or improvement to existing feature or request Performance This is for any performance related enhancements or bugs Search:Performance v2.12.0 Issues and PRs related to version 2.12.0
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Optimize MultiTermQueries over keyword field using doc values query
3 participants