[Derived Field] Dynamic FieldType inference based on random sampling of documents #13592

rishabhmaurya · 2024-05-07T19:17:20Z

Description

This class performs type inference by analyzing _source documents. It will be useful for inferring the type of a nested derived field of object type. See #13143 for more details on the requirement.

It uses a random sample of documents to infer the field type, similar to the dynamic mapping type-guessing logic. Unlike guessing based on the first document, where the field could be missing, this method draws a random sample to make a more accurate inference. This approach is especially useful for handling missing fields, which can be common in nested fields within derived fields of object type.
The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present. However, it is essential to strike a balance, because a large sample size can cause performance issues: each sampled document's _source field is loaded and examined until the field is found.
Determining the sample size (S) is akin to deciding how many balls to draw from a bin to ensure a high probability (>= P) of drawing at least one green ball (a document with the field) from a mixture of R red balls (documents without the field) and G green balls (documents with the field):

 1 - C(R, S) / C(R + G, S) >= P

Here, C() represents the binomial coefficient. For a high confidence level, we aim for P >= 0.95. For example, with 10^7 documents where the field is present in 2% of them, the sample size S should be around 149 to achieve a probability of 0.95.

Here is the small Python script I used to calculate the above:
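The script itself is not reproduced in this description. As a stand-in, here is a minimal sketch of the same calculation, not the original script. The function name `min_sample_size` and its parameters are hypothetical. It assumes sampling without replacement and evaluates C(R, S) / C(R + G, S) as a running product to avoid huge binomial coefficients.

```python
def min_sample_size(total_docs: int, field_fraction: float, confidence: float = 0.95) -> int:
    """Smallest S such that 1 - C(R, S) / C(R + G, S) >= confidence.

    Hypothetical helper for illustration only.
    total_docs      -- R + G, total number of documents
    field_fraction  -- G / (R + G), fraction of documents containing the field
    """
    green = round(total_docs * field_fraction)  # documents with the field
    red = total_docs - green                    # documents without the field

    # C(R, S) / C(R + G, S) equals the product over i in [0, S) of
    # (R - i) / (R + G - i): the chance that every one of S draws is red.
    miss = 1.0
    for s in range(1, total_docs + 1):
        i = s - 1
        miss *= (red - i) / (total_docs - i)
        if 1.0 - miss >= confidence:
            return s
    raise ValueError("confidence cannot be reached (is the field present at all?)")


if __name__ == "__main__":
    # The example from the description: 10^7 documents, field present in 2%.
    print(min_sample_size(10**7, 0.02, 0.95))
```

For large indices the result depends almost entirely on the fraction of documents containing the field rather than on the total document count, since the miss probability is close to (1 - fraction)^S.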

Related Issues

Resolves #13143

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
Commits are signed per the DCO using --signoff
~~[ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)~~
~~[ ] Public documentation issue/PR created~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-05-07T19:26:13Z

❌ Gradle check result for 31d2152: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-05-07T19:34:09Z

❌ Gradle check result for 5ed477e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-05-07T19:47:03Z

❌ Gradle check result for 540ff72: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

rishabhmaurya · 2024-05-24T16:44:22Z

@msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast, whereas if we start with a bigger segment, the odds of finding the field are higher, but that comes at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.

github-actions · 2024-05-24T20:20:57Z

❌ Gradle check result for b0050b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-05-24T23:51:22Z

✅ Gradle check result for 5e276cb: SUCCESS

msfroh · 2024-05-25T00:02:20Z

> @msfroh @harshavamsi what do you think about the order in which we should scan the segments here? If we start with the smaller segments and the field is found, it would be pretty fast, whereas if we start with a bigger segment, the odds of finding the field are higher, but that comes at the cost of loading a bigger segment. So for rare fields the latter performs better, whereas for common fields the former performs better.

I think I would optimize more for the common fields.

I appreciate that an advantage of this feature is that it's another way of handling a mix of different document types, similar to flat_object fields -- just pushing the hard work to search time, rather than flattening at indexing time. But at the same time, I feel like it makes more sense to assume that you want to search on "relatively" common fields (i.e. fields present in at least 5-10% of documents).

rishabhmaurya · 2024-05-27T23:16:11Z

@msfroh Looking at the holistic picture, I agree that optimizing for common fields is the wiser choice. If you think this isn't super critical, I can take it up in a subsequent PR.

github-actions · 2024-06-03T19:49:30Z

❕ Gradle check result for c546323: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Performance labels May 7, 2024

rishabhmaurya marked this pull request as ready for review May 7, 2024 19:20

rishabhmaurya requested review from anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, dbwiddis, dreamer-89, gbbafna, kotwanikunal, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami, tlfeng and VachaShah as code owners May 7, 2024 19:20

rishabhmaurya force-pushed the derived-field-type-inference branch from 31d2152 to 5ed477e Compare May 7, 2024 19:23

rishabhmaurya force-pushed the derived-field-type-inference branch from 5ed477e to 540ff72 Compare May 7, 2024 19:29

rishabhmaurya force-pushed the derived-field-type-inference branch from ca050d0 to 16c2071 Compare May 7, 2024 19:56

rishabhmaurya force-pushed the derived-field-type-inference branch from f6ed5e1 to b0050b9 Compare May 24, 2024 19:28

rishabhmaurya force-pushed the derived-field-type-inference branch from b0050b9 to 5e276cb Compare May 24, 2024 23:01

msfroh approved these changes May 30, 2024

rishabhmaurya added 3 commits June 3, 2024 11:55

Dynamic FieldType inference based on random sampling of documents

17eaae1

Signed-off-by: Rishabh Maurya <[email protected]>

Use Randomness#get()

de0fea6

Signed-off-by: Rishabh Maurya <[email protected]>

Address PR comments

c546323

Signed-off-by: Rishabh Maurya <[email protected]>

rishabhmaurya force-pushed the derived-field-type-inference branch from 5e276cb to c546323 Compare June 3, 2024 18:55

msfroh approved these changes Jun 3, 2024

msfroh added the backport 2.x Backport to 2.x branch label Jun 3, 2024

msfroh merged commit 6c1896b into opensearch-project:main Jun 3, 2024
33 checks passed

opensearch-trigger-bot bot mentioned this pull request Jun 3, 2024

[Backport 2.x] [Derived Field] Dynamic FieldType inference based on random sampling of documents #13953

Merged

rishabhmaurya mentioned this pull request Jun 10, 2024

[META] Derived Fields #12281

Open

6 tasks

opensearch-ci-bot mentioned this pull request Jun 13, 2024

[AUTOCUT] Gradle Check Flaky Test Report for ClusterRerouteIT #14298

Open

finnegancarroll mentioned this pull request Jun 14, 2024

Remove composeBuild fixture tasks when docker support not found #14357

Merged

3 tasks

prudhvigodithi mentioned this pull request Jun 18, 2024

[AUTOCUT] Gradle Check Flaky Test Report for IndicesRequestCacheIT prudhvigodithi/OpenSearch#27

Closed

This was referenced Jun 27, 2024

[AUTOCUT] Gradle Check Flaky Test Report for IndicesRequestCacheIT #14288

Open

[AUTOCUT] Gradle Check Flaky Test Report for IndexServiceTests #14407

Open
