Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index level setting to unmap fields beyond total fields limit #14939

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

gaobinlong
Copy link
Collaborator

@gaobinlong gaobinlong commented Jul 24, 2024

Description

This PR introduces a new index level setting index.mapping.total_fields.unmap_fields_beyond_limit, the default value is false, if it's set to true, then all the new detected fields beyond the index.mapping.total_fields.limit will not be added to the mapping, not indexed, not searchable but still be stored in the _source field.

Example usage:

  1. Create an index
PUT index1
{
  "settings": {
    "index.mapping.total_fields.limit":3,
    "index.mapping.total_fields.unmap_fields_beyond_limit":true
  }
}
  1. Index a document
POST index1/_doc/1
{
  "field1": "field1",
  "field2": 10000,
  "field3": [
    1,
    2,
    3
  ],
  "field4": {
    "field5": "field5"
  }
}
  1. Check the document
GET index1/_doc/1
{
  "_index": "index1",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "field1": "field1",
    "field2": 10000,
    "field3": [
      1,
      2,
      3
    ],
    "field4": {
      "field5": "field5"
    }
  }
}
  1. Check the mapping
GET index1/_mapping
{
  "index1": {
    "mappings": {
      "properties": {
        "field1": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "field2": {
          "type": "long"
        }
      }
    }
  }
}

, we can see that the fields field3, field4 and field5 are not in the mapping because the total fields limit is 3.

Related Issues

#14031

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gao Binlong <[email protected]>
Copy link
Contributor

✅ Gradle check result for 1307156: SUCCESS

Copy link
Contributor

✅ Gradle check result for e13f314: SUCCESS

Copy link

codecov bot commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.

Project coverage is 71.95%. Comparing base (d442f7c) to head (9495dae).
Report is 1 commits behind head on main.

Files Patch % Lines
...va/org/opensearch/index/mapper/DocumentParser.java 90.90% 1 Missing and 1 partial ⚠️
...ava/org/opensearch/index/mapper/MappingLookup.java 93.33% 0 Missing and 1 partial ⚠️
...java/org/opensearch/index/mapper/ParseContext.java 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14939      +/-   ##
============================================
- Coverage     71.95%   71.95%   -0.01%     
+ Complexity    63128    63092      -36     
============================================
  Files          5196     5196              
  Lines        295311   295357      +46     
  Branches      42677    42683       +6     
============================================
+ Hits         212496   212528      +32     
+ Misses        65436    65380      -56     
- Partials      17379    17449      +70     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gaobinlong gaobinlong added backport 2.x Backport to 2.x branch Indexing Indexing, Bulk Indexing and anything related to indexing enhancement Enhancement or improvement to existing feature or request labels Jul 24, 2024
Signed-off-by: Gao Binlong <[email protected]>
Signed-off-by: Gao Binlong <[email protected]>
Copy link
Contributor

✅ Gradle check result for 52d051d: SUCCESS

Copy link
Contributor

✅ Gradle check result for 814f715: SUCCESS

Copy link
Contributor

❌ Gradle check result for 553f304: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❕ Gradle check result for 456e8ba: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Gao Binlong <[email protected]>
Copy link
Contributor

❌ Gradle check result for 9495dae: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for 9495dae: SUCCESS

Signed-off-by: Gao Binlong <[email protected]>
Copy link
Contributor

❕ Gradle check result for 9f15262: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Sep 13, 2024
@gaobinlong
Copy link
Collaborator Author

@vikasvb90 @andrross @reta ,could you help to review this PR? Thank you!

@reta
Copy link
Collaborator

reta commented Oct 28, 2024

@vikasvb90 @andrross @reta ,could you help to review this PR? Thank you!

Thanks @gaobinlong , I would strongly advocate to use the existing means (primarily processors or pre-processing before sending for ingestion) instead of adding a new setting. I think this setting is confusing and also defeats the purpose of these limits in first place:

  • even if we just store _source, it will be returned in search results and could easily blow up the consumer
  • in combination with dynamic mapping, the documents shape could be unpredictable and highly depend on the order of the fields

In my opinion, we should be very explicit of what to drop or keep, and it should be always driven by user / client intent. We have knobs / instruments for that.

@gaobinlong
Copy link
Collaborator Author

Thanks @reta , we cannot achieve that in the client side or by ingest processor, because what the users want is to keep the fields in the document but not index them, rather than dropping any fields.

even if we just store _source, it will be returned in search results and could easily blow up the consumer

We already have this feature, when set dynamic to false, the behavior is that the new detected fields are still stored in _source but doesn't show in the mapping, and not searchable.

in combination with dynamic mapping, the documents shape could be unpredictable and highly depend on the order of the fields

We don't drop any fields, only unmap the fields which cause the total fields count in the mapping exceeds the limit, the document shape doesn't change.

@reta
Copy link
Collaborator

reta commented Oct 29, 2024

Thanks @gaobinlong

We already have this feature, when set dynamic to false, the behavior is that the new detected fields are still stored in _source but doesn't show in the mapping, and not searchable.

Oh, let me look at it closely, but to double check: we store new detected fields in _source even if they violate the depth constraints?

We don't drop any fields, only unmap the fields which cause the total fields count in the mapping exceeds the limit, the document shape doesn't change.

It does, depending which document shape is going to come first to form the mapping. For example, let us assume we have a 1000 limit and the first document have this order:

  • <f1, f2, f3, ... f1100>

But what if the order it reversed?

  • <f1100, f1099, f1098, ... f1>

The problem I am trying to convey is that this setting introduces unpredictable behaviour (in some important cases) where the decision what to take in or out is not controllable.

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Oct 31, 2024
@gaobinlong
Copy link
Collaborator Author

Oh, let me look at it closely, but to double check: we store new detected fields in _source even if they violate the depth constraints?

Yes, if dynamic is false, any new detected fields will still be stored in _source even they violate the total limit or depth limit constraints.

The problem I am trying to convey is that this setting introduces unpredictable behaviour (in some important cases) where the decision what to take in or out is not controllable.

The document comes first will form the mapping, that is the existing behavior, if the total fields limit is 1000, then for the document with fields <f1, f2, f3, ... f1100>, the field f1001 and the following will be rejected with error Limit of total fields [1000] has been exceeded(dynamic parameter is true), and then for the document with fields <f1100, f1099, f1098, ... f1>, the field f1001 and the following will also be rejected because f1...f1000 already exists in the mapping, so nothing changed. The new setting index.mapping.total_fields.unmap_fields_beyond_limit doesn't change any existing behavior, but only control the behavior when total fields in the mapping exceed the limit, when it's true, we don't reject the fields f1001...f1100 , no error will be thrown, the indexing operation will succeed.

@reta
Copy link
Collaborator

reta commented Nov 6, 2024

Yes, if dynamic is false, any new detected fields will still be stored in _source even they violate the total limit or depth limit constraints.

👍 got it, thanks!

The document comes first will form the mapping

This is exactly the problem. The issue is not about succeeding or not, the issue is what forms the mapping. The current behaviour is clear - either all fields are in the mapping or document is rejected. With this change - what ever comes first is in the mapping, everything beyond the limit is just ignored.

@andrross @msfroh wondering if you folks have an opinion on this change.

@andrross
Copy link
Member

andrross commented Nov 6, 2024

This is exactly the problem. The issue is not about succeeding or not, the issue is what forms the mapping.

@reta I agree this is a concern. Rolling over indexes is a super common pattern in log analytics. With this behavior, then a user may see different field mappings on every index rollover because it may be more or less random which 1000 unique fields are first observed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants