[Synthetics] Fix heatmap on monitor detail/history page for very large doc counts #184177
Conversation
Force-pushed from cf180b8 to fe14c4f
/oblt-deploy
Force-pushed from 78fa636 to b4a4ada
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
Force-pushed from b4a4ada to af444ef
.../public/apps/synthetics/components/monitor_details/monitor_status/use_monitor_status_data.ts (outdated, resolved)
```ts
// @ts-expect-error strings work
gte: from,
// @ts-expect-error strings work
lte: to,
```
Would this work?
```diff
- // @ts-expect-error strings work
- gte: from,
- // @ts-expect-error strings work
- lte: to,
+ gte: new Date(from).toISOString(),
+ lte: new Date(to).toISOString(),
+ format: 'strict_date_optional_time',
```
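For context, here is a minimal sketch of how the suggested bounds could sit inside a complete range filter. The `@timestamp` field name and the `buildTimestampRangeFilter` helper are assumptions for illustration, not the actual Kibana code.

```ts
// Hypothetical helper showing the suggested change in context: convert the
// incoming bounds to ISO strings and tell Elasticsearch how to parse them.
function buildTimestampRangeFilter(from: string | number, to: string | number) {
  return {
    range: {
      '@timestamp': {
        gte: new Date(from).toISOString(),
        lte: new Date(to).toISOString(),
        // strict_date_optional_time tells Elasticsearch to parse the bounds as ISO 8601.
        format: 'strict_date_optional_time',
      },
    },
  };
}
```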
x-pack/plugins/observability_solution/synthetics/server/routes/pings/ping_heatmap.ts (outdated, resolved)
.../public/apps/synthetics/components/monitor_details/monitor_status/use_monitor_status_data.ts (outdated, resolved)
```ts
const { heatmap: dateHistogram, loading } = useSelector(selectHeatmap);

useEffect(() => {
  if (binsAvailableByWidth === null && initSize) {
```
Since `initSize` comes from a ref, is there a chance that this value will be outdated by the time this `useEffect` runs? I'm curious because if the element resizes as child components finish loading, the `initValue` could be wrong.
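A hypothetical reduction of the pattern the question is about (the hook and variable names are illustrative, not the actual `use_monitor_status_data.ts` code):

```ts
import { useEffect, useRef, useState } from 'react';

// The width is read from the ref once, on mount; if the element resizes as
// child components finish loading, this value is never updated, which is
// the staleness concern raised above.
function useInitialWidth() {
  const wrapperRef = useRef<HTMLDivElement | null>(null);
  const [initSize, setInitSize] = useState<number | null>(null);

  useEffect(() => {
    if (wrapperRef.current) {
      setInitSize(wrapperRef.current.getBoundingClientRect().width);
    }
  }, []);

  return { wrapperRef, initSize };
}
```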
Force-pushed from 992fd02 to 6c9ee1f
/oblt-deploy
Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.
Friendly reminder: Looks like this PR hasn’t been backported yet.
Starting backport for target branches: 8.15, 8.x
💔 All backports failed
Manual backport: to create the backport manually, run:

Questions? Please refer to the Backport tool documentation.
Friendly reminder: Looks like this PR hasn’t been backported yet.
## Summary

We addressed #180076 recently with these two PRs:

- #184177
- #192508

We were seeing a strange error that was difficult to repro, so we put in a best-effort patch that was still ineffective. The reason this issue happens is that the code can divide by 0, which yields a value of `Infinity`, which at some point causes the interval value supplied to the server route to be an empty string. This patch makes it so that we never pass a value of 0 into the calculation of bucket sizes in this hook.
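A minimal sketch of the kind of guard described here, with illustrative names rather than the actual hook code:

```ts
// Never let the divisor reach 0 when computing bucket sizes: dividing by 0
// yields Infinity, which downstream ends up producing an empty interval
// string in the request to the server route.
function calculateBinDurationMs(totalRangeMs: number, binCount: number): number {
  const safeBinCount = Math.max(binCount, 1); // guard against a 0 divisor
  return Math.floor(totalRangeMs / safeBinCount);
}
```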
Friendly reminder: Looks like this PR hasn’t been backported yet.

3 similar comments
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
[Synthetics] Fix heatmap on monitor detail/history page for very large doc counts (elastic#184177)

Co-authored-by: shahzad31 <[email protected]>

(cherry picked from commit 50f8bd6)

# Conflicts:
#	x-pack/plugins/observability_solution/synthetics/common/constants/synthetics/rest_api.ts
#	x-pack/plugins/observability_solution/synthetics/public/apps/synthetics/state/root_effect.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/index.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/pings/get_ping_statuses.ts
I have checked, and the necessary changes are already present in the 8.x branch; attempts to run the backport tool against that branch find no diff. I'm canceling the 8.15 backport, as these changes have already caused enough issues getting backported, and IMO it's acceptable to require an upgrade to fix this issue.
Summary
Resolves #180076.
User-facing updates
We can now display the heatmap for "very large" document counts
This PR

_note: The first bucket in this image is empty because that's when my monitor actually started._

<img width="2083" alt="image" src="https://github.com/elastic/kibana/assets/18429259/72daf378-586e-4c7d-a1b0-a05bf95fb09a">

Previously

<img width="1420" alt="image" src="https://github.com/elastic/kibana/assets/18429259/94f46913-debc-4bce-9794-600241fc4d6f">
We display appropriate bucketed data
This is a weird behavior I only noticed once I had made the new query. For some reason, the chart didn't display all the data it should have. This may be related to the initial issue, but I was seeing this behavior even in ranges of fewer than 10k docs. I manually checked this by querying for the bucket spans and ensuring the right number of docs showed up, which is what I'd expect, because it's highly unlikely there would be a bug like this in the date histogram agg.
This PR

<img width="1420" alt="image" src="https://github.com/elastic/kibana/assets/18429259/8620f709-dd4f-4a52-b39d-a3ff778ff7b4">

Previously

<img width="1420" alt="image" src="https://github.com/elastic/kibana/assets/18429259/745fa36d-db35-46a4-be1d-330bed799884">
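As a side note, the kind of spot check described above can be reproduced with a plain count query per bucket span. The index pattern, field names, and monitor id below are assumptions for illustration, not the actual query I ran.

```ts
import { Client } from '@elastic/elasticsearch';

// Rough sketch: count the documents that fall inside one bucket span and
// compare the number to what the chart renders for that bucket.
const client = new Client({ node: 'http://localhost:9200' });

async function countDocsInBucket(bucketStart: string, bucketEnd: string): Promise<number> {
  const response = await client.count({
    index: 'synthetics-*',
    query: {
      bool: {
        filter: [
          { term: { 'monitor.id': 'my-monitor' } }, // hypothetical monitor id
          { range: { '@timestamp': { gte: bucketStart, lt: bucketEnd } } },
        ],
      },
    },
  });
  return response.count;
}
```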
Why are you doing this?
We have noticed in a few SDH that in cases where users are querying very large document tallies, like a month's worth of data for a monitor that runs every minute, they're not seeing their status heatmap fill up.
~This patch would introduce a multi-stage query mechanism to the pings query endpoint. The PR is in draft state, so the procedure performs the query in this manner all the time, but we may want to add a new param to the endpoint that allows the client to opt into this behavior, or simply cap at the default 10k documents like it does today.~

~I intend to do a good amount of benchmark testing on this fix locally and in Cloud before I try to merge it. We may also need to explore imposing a hard upper limit on the number of stages we are willing to execute. Right now the query is unbounded and a user could try to query datasets that would result in Kibana retrieving and trying to respond with hundreds of MB of data.~

What changed
This PR modifies the backend query to rely on a Date Histogram agg, rather than querying the documents directly for this purpose. This agg uses the same interval that the client calculates based on the width of the page. I have kept the existing time bin logic on the client side, and modified the function that builds the data structure to pour the buckets from the histogram into the bins that we serve the chart that the user sees.
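A simplified sketch of the shape such a query could take; the index fields, aggregation names, and the `intervalMs` parameter below are assumptions for illustration, not the exact implementation in `ping_heatmap.ts`.

```ts
// Hypothetical builder for the heatmap search body: a fixed_interval
// date_histogram over the query window, with per-status sub-aggregations.
function buildHeatmapSearchBody(from: string, to: string, intervalMs: number) {
  return {
    size: 0, // we only need the aggregation buckets, not the hits
    query: {
      bool: {
        filter: [{ range: { '@timestamp': { gte: from, lte: to } } }],
      },
    },
    aggs: {
      pingHistogram: {
        date_histogram: {
          field: '@timestamp',
          // Same interval the client derives from the chart width.
          fixed_interval: `${intervalMs}ms`,
        },
        aggs: {
          up: { filter: { term: { 'monitor.status': 'up' } } },
          down: { filter: { term: { 'monitor.status': 'down' } } },
        },
      },
    },
  };
}
```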
One drawback is that, because the buckets are computed by Elasticsearch, we need to make new API calls to retrieve more data on resize, whereas previously all the data was already present and would get re-bucketed on the client. I think this is acceptable because of the ability to handle larger datasets, and given that the query should still be very fast.
Additionally:
- I have made additional modifications to the frontend logic to prevent unnecessary API calls. Before, the component would always make at least two calls on an initial load. I have also switched from `lodash` `throttle` to `useDebounce` from `react-use`, as this seems to more properly drop intermediate resize events where we would want to skip hitting the API (see the sketch after this list).
- I have also changed the render logic. We used to use a default value to build out an initial set of time bins. Unfortunately, with all these changes in place, this results in a weird behavior where we show an empty scale while waiting for data, and the bins then snap to the proper size once the element's width has been determined and the bins get scaled out properly. Instead of this, I skip rendering the chart until the initial width is available via a `ref` applied to the wrapper element. At that point, we send the init width to the hook and use that value for our initial query and bin calculations, so we only ever show bins on the chart that will correspond to the bucketed data we eventually display.
- I added a progress indicator to the wrapper element so the user receives an indication that the data is being refreshed.
- I added a quiet fetch similar to how the overview page works, so that after an initial render the chart will not have its data dropped until we have new data to replace it with. Along those lines, the hook now watches the page location and if it changes (i.e. from History to Overview), state management will destroy the heatmap data so we don't have a flicker of other data during a new fetch.
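A hypothetical sketch of the `useDebounce` idea referenced in the first bullet (hook name, 500 ms delay, and callback are illustrative, not the actual hook code):

```ts
import { useState } from 'react';
import { useDebounce } from 'react-use';

// Only react once the width has been stable for the debounce window, so
// intermediate resize events never trigger an API call.
function useStableWidth(width: number | null, onStableWidth: (w: number) => void) {
  const [stableWidth, setStableWidth] = useState<number | null>(null);

  useDebounce(
    () => {
      if (width !== null && width !== stableWidth) {
        setStableWidth(width);
        onStableWidth(width); // e.g. kick off the heatmap fetch for this width
      }
    },
    500,
    [width]
  );

  return stableWidth;
}
```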
Testing this PR
Unfortunately, to truly test this PR you'd need a monitor with over 10k documents, as that's the criterion specified in the initial issue. I did this by running a monitor for about 10 days on a 1-run-per-minute schedule. You could create a cloud deployment from this PR, or create dummy documents in some way (though this can be hard to verify). I am pretty confident this works and fixes the issue based on my real-world testing.