-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution #168757
Conversation
…es request. Default to <0.000001 and data drifted = yes.
@@ -359,7 +359,7 @@ export const DataDriftPage: FC<Props> = ({ initialSettings }) => { | |||
label={comparisonIndexPatternLabel} | |||
randomSampler={randomSamplerProd} | |||
reload={forceRefresh} | |||
brushSelectionUpdateHandler={brushSelectionUpdate} | |||
brushSelectionUpdateHandler={undefined} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just omit 'brushSelectionUpdateHandler' completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM ⚡
Pinging @elastic/ml-ui (:ml) |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]Async chunks
History
To update your PR or re-run it, just comment with: cc @qn895 |
…of uniform distribution (elastic#168757) (cherry picked from commit 6d06dc3)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation |
…is not of uniform distribution (#168757) (#169168) # Backport This will backport the following commits from `main` to `8.11`: - [[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757)](#168757) <!--- Backport version: 8.9.7 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Quynh Nguyen (Quinn)","email":"[email protected]"},"sourceCommit":{"committedDate":"2023-10-17T19:34:12Z","message":"[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757)","sha":"6d06dc3d2d2fd9440ce474c9f8fdfc45b720fc59","branchLabelMapping":{"^v8.12.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["bug",":ml","release_note:skip","ci:cloud-deploy","v8.11.0","v8.12.0"],"number":168757,"url":"https://github.com/elastic/kibana/pull/168757","mergeCommit":{"message":"[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757)","sha":"6d06dc3d2d2fd9440ce474c9f8fdfc45b720fc59"}},"sourceBranch":"main","suggestedTargetBranches":["8.11"],"targetPullRequestStates":[{"branch":"8.11","label":"v8.11.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.12.0","labelRegex":"^v8.12.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/168757","number":168757,"mergeCommit":{"message":"[ML] Fix data drift calculating inaccurate p value when range is not of uniform distribution (#168757)","sha":"6d06dc3d2d2fd9440ce474c9f8fdfc45b720fc59"}}]}] BACKPORT--> Co-authored-by: Quynh Nguyen (Quinn) <[email protected]>
…of uniform distribution (elastic#168757)
Summary
Addresses #168090
This is actually an edge case. This we because we currently assume uniform distribution - that in every range we will have 5% of the overall data. If this is true, then performing the ks_test aggregation, we don’t need to specify the fractions parameter, since all fractions should be 0.05 by design and this is the default value (uniform distribution).
Now, in the case with pdays an overwhelming amount of values (70% or so) have the value -1. It’s probably a default value for missing data or something like this. Now when we run the percentile aggregation, we get a bunch of [-1,-1] intervals, since aggregation is not able to split the values into distinct 5% intervals. And this breaks our assumption of uniform distributed fractions in the
ks_test
.So, to fix this in the general case, we need to run an additional ranges agg to get the doc count for the ranges that we get from the percentiles aggregation. Having the doc counts we can compute fractions explicitly and then pass the values as a list to the
ks_test
.This PR adds that additional ranges agg, use the doc_count to compute the fractions, and pass that to the ks_test agg.
After
NOTE: Logging is temporarily added to have more visibility to the ES queries.
In addition, this PR also fixes:
Checklist
Delete any items that are not applicable to this PR.
Risk Matrix
Delete this section if it is not applicable to this PR.
Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.
When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:
For maintainers