[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution #168757

qn895 · 2023-10-12T23:14:13Z

Summary

The KS test p-value for pdays on the bank-marketing dataset is very small, although the reference and comparison distributions are identical. I guess we have a numerical issue here that we need to investigate.

This is actually an edge case. It happens because we currently assume a uniform distribution: that every range holds 5% of the overall data. If that is true, then when performing the ks_test aggregation we don't need to specify the fractions parameter, since all fractions should be 0.05 by design and this is the default value (uniform distribution).
In the case of pdays, an overwhelming share of values (70% or so) are -1. It's probably a default value for missing data or something like that. When we run the percentile aggregation, we get a bunch of [-1,-1] intervals, since the aggregation is not able to split the values into distinct 5% intervals. This breaks our assumption of uniformly distributed fractions in the ks_test.
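To make the failure mode concrete, here is a minimal sketch on synthetic data (not the bank-marketing dataset). The `percentileBoundaries` helper is hypothetical and uses a simple nearest-rank rule, which only approximates what the Elasticsearch percentiles aggregation computes.

```typescript
// Hypothetical helper: nearest-rank percentile boundaries at fixed steps (5%, 10%, ..., 100%).
function percentileBoundaries(sorted: number[], step = 5): number[] {
  const bounds: number[] = [];
  for (let p = step; p <= 100; p += step) {
    // Integer arithmetic before the division avoids floating point surprises.
    const rank = Math.ceil((p * sorted.length) / 100);
    const idx = Math.min(sorted.length - 1, Math.max(0, rank - 1));
    bounds.push(sorted[idx]);
  }
  return bounds;
}

// Synthetic data (assumption): 70 of 100 values are the placeholder -1.
const sampleValues: number[] = [
  ...Array.from({ length: 70 }, () => -1),
  ...Array.from({ length: 30 }, (_, i) => i + 1),
];
const sortedValues = [...sampleValues].sort((a, b) => a - b);

const bounds = percentileBoundaries(sortedValues);
const collapsed = bounds.filter((b) => b === -1).length;
const distinct = new Set(bounds).size;

console.log({ total: bounds.length, collapsed, distinct });
```

With this input most of the boundaries coincide at -1, so the intervals between them are degenerate and cannot each contain 5% of the documents.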

So, to fix this in the general case, we need to run an additional range aggregation to get the doc count for each range derived from the percentiles aggregation. With the doc counts we can compute the fractions explicitly and then pass them as a list to the ks_test.
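The fraction computation described above can be sketched as follows. The `RangeBucket` type and `computeFractions` name are hypothetical and are not taken from the PR; only the `doc_count` field mirrors what a range aggregation bucket returns.

```typescript
// Only the field used here; real range aggregation buckets carry more properties.
interface RangeBucket {
  doc_count: number;
}

// Hypothetical helper: convert per-range doc counts into fractions that sum to 1.
function computeFractions(buckets: RangeBucket[]): number[] {
  const total = buckets.reduce((sum, b) => sum + b.doc_count, 0);
  // No documents at all: there is no meaningful distribution to report.
  if (total === 0) return [];
  return buckets.map((b) => b.doc_count / total);
}

// Illustrative counts (made up): one range holds most of the documents.
const exampleFractions = computeFractions([
  { doc_count: 700 },
  { doc_count: 100 },
  { doc_count: 200 },
]);
console.log(exampleFractions);
```

Returning an empty list for an empty input is one possible convention; a caller could equally treat that case as an error.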

This PR adds that additional range aggregation, uses the doc_count to compute the fractions, and passes them to the ks_test aggregation.
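A rough sketch of what such a request body could look like is shown here. The `buildKsTestAggs` helper, the aggregation names `comparison_ranges` and `ks_test`, and the choice of `two_sided` are illustrative assumptions, not the code in this PR. The `range` and `bucket_count_ks_test` aggregations and the `buckets_path`, `alternative`, and `fractions` parameters are part of Elasticsearch.

```typescript
interface NumericRange {
  from?: number; // inclusive lower bound
  to?: number; // exclusive upper bound
}

// Hypothetical helper: range aggregation plus a sibling KS test that compares
// the bucket counts against explicitly supplied fractions.
function buildKsTestAggs(field: string, ranges: NumericRange[], fractions: number[]) {
  if (ranges.length !== fractions.length) {
    throw new Error('Expected one fraction per range');
  }
  return {
    comparison_ranges: {
      range: { field, ranges },
    },
    ks_test: {
      bucket_count_ks_test: {
        buckets_path: 'comparison_ranges>_count',
        alternative: ['two_sided'],
        // Without this parameter the test assumes uniformly distributed buckets.
        fractions,
      },
    },
  };
}

// Illustrative ranges and fractions (made up for this sketch).
const exampleAggs = buildKsTestAggs(
  'pdays',
  [{ to: 0 }, { from: 0, to: 100 }, { from: 100 }],
  [0.7, 0.1, 0.2]
);
console.log(JSON.stringify(exampleAggs, null, 2));
```

The length check reflects the assumption that each bucket needs exactly one matching fraction.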

After

NOTE: Logging is temporarily added to give more visibility into the ES queries.

In addition, this PR also fixes:

Clicking on the comparison chart will no longer generate any additional brushes
If either data set is empty, the KS test aggregation request is skipped and the result defaults to Drift detected = yes
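The empty-data guard from the second item could look roughly like this. All names are hypothetical. The placeholder p-value follows the "Default to <0.000001" wording in the commit message, but the exact value used in code is an assumption.

```typescript
interface DriftResult {
  pValue: number;
  driftDetected: boolean;
}

// Assumed placeholder, based on the "<0.000001" wording in the commit message.
const DEFAULT_P_VALUE = 0.000001;

function totalDocCount(buckets: Array<{ doc_count: number }>): number {
  return buckets.reduce((sum, b) => sum + b.doc_count, 0);
}

// Hypothetical helper: returns a default result when either data set is empty,
// or undefined when the caller should go on to run the KS test request.
function defaultDriftIfEmpty(
  referenceBuckets: Array<{ doc_count: number }>,
  comparisonBuckets: Array<{ doc_count: number }>
): DriftResult | undefined {
  if (totalDocCount(referenceBuckets) === 0 || totalDocCount(comparisonBuckets) === 0) {
    return { pValue: DEFAULT_P_VALUE, driftDetected: true };
  }
  return undefined;
}

const emptyCase = defaultDriftIfEmpty([{ doc_count: 0 }], [{ doc_count: 12 }]);
const normalCase = defaultDriftIfEmpty([{ doc_count: 5 }], [{ doc_count: 12 }]);
console.log(emptyCase, normalCase);
```

Returning `undefined` for the non-empty case keeps the guard separate from the code that issues the request.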

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk	Probability	Severity	Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space.	Low	High	Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks.	High	Low	Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled.	Medium	High	Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

This was checked for breaking API changes and was labeled appropriately

alvarezmelissa87 · 2023-10-17T17:32:45Z

x-pack/plugins/data_visualizer/public/application/data_drift/data_drift_page.tsx

@@ -359,7 +359,7 @@ export const DataDriftPage: FC<Props> = ({ initialSettings }) => {
                label={comparisonIndexPatternLabel}
                randomSampler={randomSamplerProd}
                reload={forceRefresh}
-                brushSelectionUpdateHandler={brushSelectionUpdate}
+                brushSelectionUpdateHandler={undefined}


I think you can just omit 'brushSelectionUpdateHandler' completely

alvarezmelissa87

Code LGTM ⚡

elasticmachine · 2023-10-17T17:56:24Z

Pinging @elastic/ml-ui (:ml)

kibana-ci · 2023-10-17T18:41:29Z

💛 Build succeeded, but was flaky

Failed CI Steps

FTR Configs #10

Test Failures

[job] [logs] FTR Configs #10 / lens app - group 3 lens inline editing tests should reset changes made to the previous state

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`dataVisualizer`	613.8KB	614.3KB	+466.0B

History

💚 Build #168714 succeeded 7f4176b
💔 Build #168419 failed def15de
💔 Build #167946 failed 368c27c
💔 Build #167873 failed eeeef8e
💔 Build #167559 failed 9cb611f

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

kibanamachine · 2023-10-17T19:40:58Z

💚 All backports created successfully

Status	Branch	Result
✅	8.11

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

…is not of uniform distribution (#168757) (#169168)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757)](#168757)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Quynh Nguyen (Quinn) <[email protected]>

qn895 added 2 commits October 12, 2023 17:57

Add fractions and logging

6a4243f

Add fractions and logging

9cb611f

qn895 added the ci:cloud-deploy Create or update a Cloud deployment label Oct 12, 2023

qn895 added 2 commits October 13, 2023 11:05

Fix ranges baseline logic

6e8f474

Merge remote-tracking branch 'upstream/main' into ml-fix-range

eeeef8e

qn895 self-assigned this Oct 13, 2023

qn895 requested review from valeriy42 and removed request for valeriy42 October 13, 2023 16:15

qn895 added 5 commits October 13, 2023 12:33

Disable click on comparison chart

5d17ea0

Don't even make request for ks test if sum of doc_count = 0 from rang…

368c27c

…es request. Default to <0.000001 and data drifted = yes.

Remove console.log

1a6ab82

Merge remote-tracking branch 'upstream/main' into ml-fix-range

def15de

Update type

7f4176b

qn895 requested a review from alvarezmelissa87 October 17, 2023 17:25

qn895 marked this pull request as ready for review October 17, 2023 17:27

qn895 requested a review from a team as a code owner October 17, 2023 17:27

alvarezmelissa87 reviewed Oct 17, 2023

alvarezmelissa87 approved these changes Oct 17, 2023

Remove prop completely

a0c433e

qn895 added bug Fixes for quality problems that affect the customer experience :ml release_note:skip Skip the PR/issue when compiling release notes v8.11.0 labels Oct 17, 2023

qn895 enabled auto-merge (squash) October 17, 2023 18:59

qn895 merged commit 6d06dc3 into elastic:main Oct 17, 2023

kibanamachine added the v8.12.0 label Oct 17, 2023

kibanamachine mentioned this pull request Oct 17, 2023

[8.11] [ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757) #169168

Merged

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 17, 2023

[ML] Fix data drift calculating inaccurate p value when range is not …

7d01c4f

…of uniform distribution (elastic#168757) (cherry picked from commit 6d06dc3)

hop-dev pushed a commit to hop-dev/kibana that referenced this pull request Oct 18, 2023

[ML] Fix data drift calculating inaccurate p value when range is not …

f3facd7

…of uniform distribution (elastic#168757)

qn895 mentioned this pull request Oct 19, 2023

[ML] Data Drift View bugs #168090

Closed
