[RCA] Create investigation detail page (V1) #187286

jasonrhodes · 2024-07-01T16:32:09Z

Prerequisites

Before we can start implementing the investigation detail page according to the design mockup, the following PRs need to be merged:

Acceptance Criteria

We are aiming to have a v1 investigation DETAIL page that has the following components.

Header

Each detail page will have a header with the following design.

Initially, we'll delay the implementation of the "Escalate" button and the "three dots" more menu, and instead opt for a more basic design where the top right button is only "Close investigation". Later, the "Close investigation" action will move into the "three dots" menu and the primary button will read "Escalate" or some related wording.

When the "Escalate" button is added (final wording on that button name TBD), clicking it will reveal a menu of connectors that have previously been set up, along with the ability to add a new connector. Note: For the most part, this mirrors how it already works today in the Cases UI.

Adding a new connector will open a flyout that should be available from the Response Ops team, since they manage the connectors flow currently.

Related Events

We need to continue to refine how this part of the UI will work, but there will be some concept of "related events" represented for each investigation. This feature can be displayed in "timeline view":

Or in "list view":

The events on the timeline can be optionally filtered:

Observations Stream

Observations can be added to the primary stream on the page, where they are then displayed.

To start, we'll prioritize the ability to add visualizations of 3 different types:

Any existing embeddable visualization ("from the library")
ES|QL interface, which allows adding a data table, single metric, or Lens visualization from its query results
Some subset of existing visualizations that exist in the observability app (to begin, this will mostly be visualizations that appear on the alert details pages -- when an investigation is created from an alert, some of the charts from that alert detail page should be automatically added to the investigation when it's created)

Important caveats to the above mockup:

Language should always be "observation" and not "observation chart", based on early feedback from users
The option to "Import from > Inventory/Entity" is not planned for V1

Notes

There will be a sidebar displaying collaborative notes, which can be added by any Kibana user who visits this investigation.

Outstanding questions

What text should be allowed for V1? Is it plain text only, or is some amount of markup allowed? The mockup seems to show the ability to bold and make new paragraphs, at the very least.
Are hyperlinks allowed in V1?
Are uploaded images planned for V1? If so, can they also be hyperlinked?
Links to external ticketing systems: should these be possible? If arbitrary hyperlinks are allowed, that would solve this, but if not, are we able to do this?
Is there a concept of a linked runbook, possibly if it came from the alert the investigation was started from?
Notes should be deletable, but only by the user who created the note.
V1 notes may or may not be editable.
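
The delete rule in the list above (only the author may delete a note) could be expressed as a small guard. This is a minimal sketch: the `InvestigationNote` shape and the `canDeleteNote` helper are hypothetical names used for illustration, not existing Kibana types.

```typescript
// Hypothetical note shape for illustration; not an actual Kibana type.
interface InvestigationNote {
  id: string;
  content: string;
  createdBy: string; // identifier of the user who wrote the note
}

// A note may be deleted only by the user who created it.
function canDeleteNote(note: InvestigationNote, currentUser: string): boolean {
  return note.createdBy === currentUser;
}

const sampleNote: InvestigationNote = { id: "n1", content: "CPU spike at 14:02", createdBy: "alice" };
const authorCanDelete = canDeleteNote(sampleNote, "alice");
const otherCanDelete = canDeleteNote(sampleNote, "bob");
```

Whether the same check should also gate editing depends on the open question about V1 notes being editable.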

jasonrhodes · 2024-07-03T17:02:01Z

@mgiota I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.

benakansara · 2024-07-04T21:15:52Z

First iteration of Investigation detail page

Main components on Investigation detail page when starting a new investigation:

Rule related charts in merged state
Related events from underlying dataview/index pattern
Suggested observations (depending on rule type - app related visualizations and/or ML visualizations)
Ability to add ES|QL visualization
Ability to import visualizations from dashboards
Investigation timeline
- Runbook selected by user in rule form
- Hypothesis panel with ability to add notes and screenshots

I'll put details of each component in comments below.

Note: The initial time range would be the same as the time range used for the main chart on the alert details page.

Future iterations

Invite other users
Jira / Github action / other integrations for escalation
Add more charts automatically
Add ability to remove charts, adjust query of charts
Detect more events to show in event timeline
- look for events in other index patterns for same source/entities
Ability to add more runbooks to an investigation, with an option to also update the runbooks in the rule so that they show up in future alerts and in investigations of other existing alerts
Share investigation with other users
Lock visualizations to compare same visualization with different filters
AI assistant to auto-suggest observations/visualizations based on user activity / what's on screen
AI assistant to auto-suggest hypothesis

benakansara · 2024-07-04T21:16:17Z

Rule related charts in merged state

In the merged state, the y-axis is not shown. All y-values are normalized so that the different charts can be correlated.
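
One possible way to normalize the y-values is min-max scaling of each series to the range 0 to 1. The comment above does not say which normalization is intended, so the method here is an assumption, and `Point` and `normalizeSeries` are hypothetical names.

```typescript
// Min-max scaling is an assumption; the issue does not specify the method.
type Point = { x: number; y: number };

function normalizeSeries(points: Point[]): Point[] {
  if (points.length === 0) return [];
  const ys = points.map((p) => p.y);
  const min = Math.min(...ys);
  const max = Math.max(...ys);
  const range = max - min;
  // A flat series has no range, so place it at the midpoint.
  return points.map((p) => ({ x: p.x, y: range === 0 ? 0.5 : (p.y - min) / range }));
}

// Two series with very different units end up on the same 0..1 scale.
const latencyMs = normalizeSeries([{ x: 0, y: 100 }, { x: 1, y: 300 }, { x: 2, y: 200 }]);
const errorRate = normalizeSeries([{ x: 0, y: 0.01 }, { x: 1, y: 0.05 }, { x: 2, y: 0.03 }]);
```

After scaling, both series peak at `x = 1` with a value of 1, which is what makes the shapes comparable once the y-axis is hidden.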

Custom threshold rule/Metric threshold/Log threshold with single or multiple conditions

All condition charts

APM Latency/Error count/Failed transaction/Anomaly rules

Latency chart
Error distribution chart
Failed transaction rate chart

SLO Burn rate rule

Burn rate chart

benakansara · 2024-07-04T21:16:41Z

Related events from underlying dataview / index pattern

Compact view in alert details page

Expanded view in investigation detail page

Common for all use cases

1. Log rate ⬆️ / ⬇️ (based on all log documents)
To find log rate changes, we need two time blocks to compare and an internal threshold that defines a significant increase or decrease in log rate. My suggestion is to divide the time range into windows (based on the rule lookback window?), compare each window with the next to see whether logs increased or decreased, and calculate the rate of that change. If it is significant enough, e.g. 1.5x or 2.0x, this would be an event.

When rule is not log based, we can check *log* index pattern filtered with source/entities and time range to find relevant logs.

2. Error rate ⬆️ / ⬇️ (based on documents with log.level: error)
Same logic as "Log rate" to find Error rate events

There could be other fields that contribute to error rate:

http.response.status_code
...

3. Related alerts ([3] annotation in event timeline)
Show the number of alerts triggered for the same source/entities in compact view. In detail view, show a short reason message for each alert, for example "Latency threshold breached" or simply "Latency increased".

4. SLO burn rate alert
If there are SLO burn rate alerts for the same source/entities, show them as events on the event timeline.
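
The window comparison proposed for the log rate and error rate events above can be sketched as follows. It assumes the per-window document counts have already been fetched; `TimeBlock`, `RateEvent`, and `detectRateEvents` are hypothetical names, and the default threshold of 1.5 is just the example value from the proposal.

```typescript
// Hypothetical types; counts per window are assumed to be fetched already.
interface TimeBlock {
  start: number; // window start, epoch millis
  count: number; // number of matching documents in the window
}

interface RateEvent {
  at: number;
  direction: "increase" | "decrease";
  factor: number;
}

function detectRateEvents(blocks: TimeBlock[], threshold = 1.5): RateEvent[] {
  const found: RateEvent[] = [];
  for (let i = 1; i < blocks.length; i++) {
    const prev = blocks[i - 1].count;
    const curr = blocks[i].count;
    // Skip pairs with an empty window: the ratio would be undefined or infinite.
    if (prev === 0 || curr === 0) continue;
    if (curr / prev >= threshold) {
      found.push({ at: blocks[i].start, direction: "increase", factor: curr / prev });
    } else if (prev / curr >= threshold) {
      found.push({ at: blocks[i].start, direction: "decrease", factor: prev / curr });
    }
  }
  return found;
}

const rateEvents = detectRateEvents([
  { start: 0, count: 100 },
  { start: 60000, count: 110 },
  { start: 120000, count: 250 },
  { start: 180000, count: 100 },
]);
```

In this sample the step from 110 to 250 is reported as an increase and the step from 250 to 100 as a decrease, while the step from 100 to 110 stays below the threshold. How to treat empty windows is a design choice this sketch sidesteps by skipping them.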

Use case specific events

Divide the time range into blocks and, depending on the use case, check each block for a set of fields to add to the event timeline.

Use case: Log (Custom threshold / Log threshold) alert on kubernetes.pod.uid

kubernetes.event.reason: Unhealthy
kubernetes.container.status.restarts > 5? (Container restarts count)
kubernetes.container.status.phase: terminated/waiting
kubernetes.container.status.reason: Error, OOMKilled state
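
The field checks listed above for the kubernetes.pod.uid use case could be applied per time block roughly like this. The flattened document shape, the labels, and the function names are hypothetical; the restart cut-off of 5 is still marked as an open question in the list.

```typescript
// Flattened document shape using only the fields listed above (hypothetical type).
type K8sDoc = {
  "kubernetes.event.reason"?: string;
  "kubernetes.container.status.restarts"?: number;
  "kubernetes.container.status.phase"?: string;
  "kubernetes.container.status.reason"?: string;
};

// Each check returns a timeline label when the document qualifies as an event.
type K8sCheck = (doc: K8sDoc) => string | undefined;

const k8sChecks: K8sCheck[] = [
  (d) => (d["kubernetes.event.reason"] === "Unhealthy" ? "Pod unhealthy" : undefined),
  // The "> 5" cut-off is tentative in the proposal.
  (d) => ((d["kubernetes.container.status.restarts"] ?? 0) > 5 ? "Container restarts" : undefined),
  (d) => {
    const phase = d["kubernetes.container.status.phase"];
    return phase === "terminated" || phase === "waiting" ? "Container " + phase : undefined;
  },
  (d) => {
    const reason = d["kubernetes.container.status.reason"];
    return reason === "Error" || reason === "OOMKilled" ? "Container " + reason : undefined;
  },
];

// Returns the distinct event labels found in one time block, in first-seen order.
function k8sEventsInBlock(docs: K8sDoc[]): string[] {
  const labels: string[] = [];
  for (const doc of docs) {
    for (const check of k8sChecks) {
      const label = check(doc);
      if (label !== undefined && labels.indexOf(label) === -1) labels.push(label);
    }
  }
  return labels;
}

const blockLabels = k8sEventsInBlock([
  { "kubernetes.container.status.restarts": 7 },
  { "kubernetes.container.status.phase": "waiting" },
  { "kubernetes.container.status.restarts": 2 },
  { "kubernetes.container.status.reason": "OOMKilled" },
]);
```

Keeping the checks in a list makes it easy to add a separate list per use case (service.name, container.id, host.name) without changing the block scan.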

Use case: Log (Error count / Custom threshold) alert on service.name

service.version change - need to compare each block to find if there is any version upgrade/downgrade history
service.state: failed (need to confirm what are possible values)
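
The block-by-block comparison for service.version could look like the sketch below. It assumes each block already carries the distinct service.version values seen in it; `VersionBlock` and `detectVersionChanges` are hypothetical names. It only reports that a new version appeared and does not try to classify the change as an upgrade or a downgrade, since that would need a version ordering the issue does not define.

```typescript
// Hypothetical shape: distinct service.version values observed per time block.
interface VersionBlock {
  start: number; // block start, epoch millis
  versions: string[];
}

interface VersionChange {
  at: number;
  from: string[];
  to: string[];
}

function detectVersionChanges(blocks: VersionBlock[]): VersionChange[] {
  const changes: VersionChange[] = [];
  for (let i = 1; i < blocks.length; i++) {
    const prev = blocks[i - 1].versions;
    const curr = blocks[i].versions;
    // Without data on both sides there is nothing to compare.
    if (prev.length === 0 || curr.length === 0) continue;
    const added = curr.filter((v) => prev.indexOf(v) === -1);
    if (added.length > 0) {
      changes.push({ at: blocks[i].start, from: prev, to: curr });
    }
  }
  return changes;
}

const versionChanges = detectVersionChanges([
  { start: 0, versions: ["1.4.0"] },
  { start: 60000, versions: ["1.4.0"] },
  { start: 120000, versions: ["1.4.0", "1.5.0"] }, // rollout in progress
  { start: 180000, versions: ["1.5.0"] },
]);
```

Only the block at 120000 is reported here, because that is where a previously unseen version first appears; the last block merely drops the old version.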

Use case: Log (Custom threshold / Log threshold) alert on container.id

...

Use case: Log (Custom threshold / Log threshold) alert on host.name

...

mgiota · 2024-07-04T22:33:04Z

Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.

benakansara · 2024-07-05T10:16:48Z

Suggested observations

These can be app specific charts - from infra, APM, synthetics - and ML visualizations.

Alert type	common	source: host	source: k8s pod	source: container
Log	Log rate analysis, Log pattern analysis	-	-	-
Infra Metric	Change point detection	Memory usage, CPU usage, Disk usage, Network traffic	Memory usage, CPU usage, Network traffic	Memory usage, CPU usage
APM	Throughput, Time spent by span, Transactions table, Error occurrences, Errors table, Service map	-	-	-
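
The table above could be held as a simple lookup keyed by alert type and source. This is only a sketch of one possible data structure: the type names, the `SUGGESTED` constant, and `suggestedObservations` are hypothetical, and the entries are copied from the table.

```typescript
type AlertType = "Log" | "Infra Metric" | "APM";
type Source = "host" | "k8s pod" | "container";

interface Suggestions {
  common: string[];
  bySource: Partial<Record<Source, string[]>>;
}

// Entries copied from the table; the structure itself is hypothetical.
const SUGGESTED: Record<AlertType, Suggestions> = {
  Log: { common: ["Log rate analysis", "Log pattern analysis"], bySource: {} },
  "Infra Metric": {
    common: ["Change point detection"],
    bySource: {
      host: ["Memory usage", "CPU usage", "Disk usage", "Network traffic"],
      "k8s pod": ["Memory usage", "CPU usage", "Network traffic"],
      container: ["Memory usage", "CPU usage"],
    },
  },
  APM: {
    common: [
      "Throughput",
      "Time spent by span",
      "Transactions table",
      "Error occurrences",
      "Errors table",
      "Service map",
    ],
    bySource: {},
  },
};

// Common suggestions first, then any source-specific ones.
function suggestedObservations(alertType: AlertType, source?: Source): string[] {
  const entry = SUGGESTED[alertType];
  const specific = source ? entry.bySource[source] ?? [] : [];
  return entry.common.concat(specific);
}

const containerSuggestions = suggestedObservations("Infra Metric", "container");
```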

benakansara · 2024-07-05T10:16:56Z

Add new observation

ES|QL - Allow users to write their own ES|QL query, see the query results as a chart, a table, or both, and add the resulting visualizations to the investigation
Import existing visualizations

benakansara · 2024-07-05T10:17:00Z

Investigation timeline

Add ability to link runbooks in rule form so that users can add runbook links when creating rules. In investigation detail page, the runbook link is shown at the top under Investigation timeline.

Users can create a new hypothesis and start adding notes/screenshots to it. Multiple hypotheses can be created.

benakansara · 2024-07-05T11:22:16Z

Dashboards

As per the design, for some of the events users can go to a relevant dashboard. For this, we can allow users to link dashboards in the rule form. If we detect an event related to an entity (for example, container restart, node failure), we can show all dashboard links that users added while creating the rule. This assumes that the linked dashboards relate to the entities being monitored.

Alternatively, we can create a section on investigation detail page to show dashboard links that users added in rule form without attaching them to any particular event in event timeline.

benakansara · 2024-07-05T13:40:57Z

> Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.

I have updated this comment to add a future iterations section. I added the points you mentioned plus some other topics.

mgiota · 2024-07-05T18:50:33Z

@benakansara I think you nailed it! I suggest we add a few more charts for SLO burn rate rule, for example error budget consumption, historical SLI, good & bad events, basically what we currently have in the SLO detail page. Unless we think some of these charts don't bring that value in the investigation process.

jasonrhodes · 2024-07-08T12:00:19Z

This is great, thanks so much, @benakansara !

botelastic bot added the needs-team ("Issues missing a team label") label Jul 1, 2024

jasonrhodes added the Team:obs-ux-management ("Observability Management User Experience Team") label Jul 2, 2024

botelastic bot removed the needs-team ("Issues missing a team label") label Jul 2, 2024

maryam-saeidi assigned mgiota Jul 3, 2024

jasonrhodes assigned kdelemme and benakansara Jul 3, 2024

benakansara mentioned this issue Jul 8, 2024

[RCA] [POC] Create Events API to find related events #187787

Closed

This was referenced Jul 17, 2024

chore(investigate): Add investigate-app plugin from poc #188122

Merged

Investigate UI cleanup #188639

Closed

jasonrhodes unassigned kdelemme, mgiota and benakansara Aug 5, 2024

jasonrhodes added the Meta label Aug 5, 2024

jasonrhodes changed the title from "[RCA] Create investigation detail page" to "[RCA] Create investigation detail page (V1)" Aug 26, 2024
