[RCA] Create investigation detail page (V1) #187286

Open
2 tasks done
jasonrhodes opened this issue Jul 1, 2024 · 12 comments
Labels
Meta Team:obs-ux-management Observability Management User Experience Team

Comments

@jasonrhodes
Member

jasonrhodes commented Jul 1, 2024

Prerequisites

In order to start implementing the investigation detail page according to the design mockup, we first need to have the following PRs merged:

Acceptance Criteria

We are aiming to have a v1 investigation DETAIL page that has the following components.

Header

Each detail page will have a header with the following design.

Image

Initially, we'll delay the implementation of the "Escalate" button and the "three dots" more menu, and instead opt for a more basic design where the top right button is only "Close investigation". Later, the "Close investigation" action will move into the 3 dot menu and the primary button will have some sort of "Escalate" or related verbiage.

Image

When the "Escalate" button is added (final wording on that button name TBD), clicking it will reveal a menu of connectors that have previously been set up, along with the ability to add a new connector. Note: You can see how this works by looking at how it already works today in the Cases UI, for the most part.

Image

Adding a new connector will open a flyout that should be available from the Response Ops team, since they manage the connectors flow currently.

Image

Related Events

We need to continue to refine how this part of the UI will work, but there will be some concept of "related events" represented for each investigation. This feature can be displayed in "timeline view":

Image

Or in "list view":

Image

The events on the timeline can be optionally filtered:

Image

Observations Stream

Observations can be added to the primary stream on the page, where they are then displayed.

Image

To start, we'll prioritize the ability to add visualizations of 3 different types:

  • Any existing embeddable visualization ("from the library")
  • ES|QL interface, which allows adding a data table, single metric, or Lens visualization from its query results
  • Some subset of existing visualizations that exist in the observability app (to begin, this will mostly be visualizations that appear on the alert details pages -- when an investigation is created from an alert, some of the charts from that alert detail page should be automatically added to the investigation when it's created)

Image

Important caveats to the above mockup:

  • Language should always be "observation" and not "observation chart", based on early feedback from users
  • The option to "Import from > Inventory/Entity" is not planned for V1

Notes

There will be a sidebar displaying collaborative notes, which can be added by any Kibana user who visits this investigation.

Image

Outstanding questions

  • What text should be allowed for V1? Is it plain text only, or is some amount of markup allowed? The mockup seems to show the ability to bold and make new paragraphs, at the very least.
  • Are hyperlinks allowed in V1?
  • Are uploaded images planned for V1? If so, can they also be hyperlinked?
  • Links to external ticketing systems: should these be possible? If arbitrary hyperlinks are allowed, that would solve this, but if not, are we able to do this?
  • Is there a concept of a linked runbook, possibly if it came from the alert the investigation was started from?
  • Notes should be deletable, but only by the user who created the note.
  • V1 notes may or may not be editable
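
The deletion rule in the list above (only the author of a note can delete it) could be sketched as follows. This is a minimal illustration, not the planned implementation: the `Note` shape, `canDeleteNote`, and `deleteNote` are hypothetical names, and a real version would enforce the check on the server.

```typescript
// Hypothetical note shape; field names are assumptions, not the real schema.
interface Note {
  id: string;
  createdBy: string; // username of the Kibana user who wrote the note
  content: string;
}

// A note may be deleted only by the user who created it.
function canDeleteNote(note: Note, currentUser: string): boolean {
  return note.createdBy === currentUser;
}

// Returns the notes that remain after a delete attempt.
// The list is unchanged when the user is not the note's author.
function deleteNote(notes: Note[], noteId: string, currentUser: string): Note[] {
  return notes.filter(
    (note) => note.id !== noteId || !canDeleteNote(note, currentUser)
  );
}
```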
@botelastic botelastic bot added the needs-team Issues missing a team label label Jul 1, 2024
@jasonrhodes jasonrhodes added the Team:obs-ux-management Observability Management User Experience Team label Jul 2, 2024
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jul 2, 2024
@jasonrhodes
Member Author

@mgiota I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.

@benakansara
Contributor

benakansara commented Jul 4, 2024

First iteration of Investigation detail page

Main components on Investigation detail page when starting a new investigation:

  • Rule related charts in merged state
  • Related events from underlying dataview/index pattern
  • Suggested observations (depending on rule type - app related visualizations and/or ML visualizations)
  • Ability to add ES|QL visualization
  • Ability to import visualizations from dashboards
  • Investigation timeline
    • Runbook selected by user in rule form
    • Hypothesis panel with ability to add notes and screenshots

I'll put details of each component in comments below.

Note: The initial time range would be the same as the time range used for the main chart on the alert details page.

Future iterations

  • Invite other users
  • Jira / Github action / other integrations for escalation
  • Add more charts automatically
  • Add ability to remove charts, adjust query of charts
  • Detect more events to show in event timeline
    • look for events in other index patterns for same source/entities
  • Ability to add more runbooks to an investigation, with an option to also update the runbooks in the rule so that they show up in future alerts and in investigations of other existing alerts
  • Share investigation with other users
  • Lock visualizations to compare same visualization with different filters
  • AI assistant to auto-suggest observations/visualizations based on user activity / what's on screen
  • AI assistant to auto-suggest hypothesis

@benakansara
Contributor

Rule related charts in merged state

In the merged state, the y-axis is not shown. All y-values are normalized so that different charts can be correlated.
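
One way to picture the normalization is min-max scaling, which maps every series onto the same 0 to 1 range so that series with different units can share one chart. The comment does not say which normalization is intended, so the choice of min-max scaling, the `Series` shape, and `normalizeSeries` are all assumptions made for this sketch.

```typescript
// Hypothetical shape for one chart series; not an actual Kibana type.
interface Series {
  name: string;
  values: number[];
}

// Min-max scaling (an assumed technique): maps each series onto [0, 1].
function normalizeSeries(series: Series): Series {
  const min = Math.min(...series.values);
  const max = Math.max(...series.values);
  const range = max - min;
  return {
    name: series.name,
    // A flat series has no range; map it to 0 to avoid dividing by zero.
    values: series.values.map((v) => (range === 0 ? 0 : (v - min) / range)),
  };
}

// Two series with very different units end up on the same scale.
const merged = [
  { name: 'latency_ms', values: [100, 200, 400] },
  { name: 'error_rate', values: [0.01, 0.02, 0.05] },
].map(normalizeSeries);
```

After scaling, both series start at 0 and peak at 1, which is why the y-axis can be hidden in the merged view.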

Screenshot 2024-07-04 at 20 42 44

Custom threshold rule/Metric threshold/Log threshold with single or multiple conditions

  • All condition charts

APM Latency/Error count/Failed transaction/Anomaly rules

  • Latency chart
  • Error distribution chart
  • Failed transaction rate chart

SLO Burn rate rule

  • Burn rate chart
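
The mapping above from rule type to pre-populated charts could be expressed as a simple lookup. This is only a sketch: the `RuleType` identifiers and the `chartsForRule` helper are hypothetical, and the chart labels are copied from the list above rather than from real component names.

```typescript
// Hypothetical rule type identifiers; the real rule type IDs may differ.
type RuleType =
  | 'custom_threshold'
  | 'metric_threshold'
  | 'log_threshold'
  | 'apm_latency'
  | 'apm_error_count'
  | 'apm_failed_transaction'
  | 'apm_anomaly'
  | 'slo_burn_rate';

const APM_CHARTS = [
  'Latency chart',
  'Error distribution chart',
  'Failed transaction rate chart',
];

// Returns the charts to add when an investigation starts from a rule.
// Threshold rules get one chart per rule condition.
function chartsForRule(ruleType: RuleType, conditionCount = 1): string[] {
  switch (ruleType) {
    case 'custom_threshold':
    case 'metric_threshold':
    case 'log_threshold':
      return Array.from(
        { length: conditionCount },
        (_, i) => `Condition ${i + 1} chart`
      );
    case 'apm_latency':
    case 'apm_error_count':
    case 'apm_failed_transaction':
    case 'apm_anomaly':
      return [...APM_CHARTS];
    case 'slo_burn_rate':
      return ['Burn rate chart'];
  }
}
```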

@benakansara
Contributor

benakansara commented Jul 4, 2024

Related events from underlying dataview / index pattern

Compact view in alert details page
Screenshot 2024-07-04 at 22 20 45

Expanded view in investigation detail page
Screenshot 2024-07-04 at 22 20 13

Common for all use cases

1. Log rate ⬆️ / ⬇️ (based on all log documents)
To find the log rate, we need two time blocks to compare and an internal threshold that indicates a significant increase or decrease in log rate. My suggestion is to divide the time range into blocks of time windows (based on the rule lookback window?), compare each window with the next to find whether logs increased or decreased, and calculate the rate of change. If the change is significant enough, e.g. 1.5x or 2.0x, this would be an event.

When the rule is not log based, we can check the *log* index pattern, filtered by source/entities and time range, to find relevant logs.

2. Error rate ⬆️ / ⬇️ (based on documents with log.level: error)
Same logic as "Log rate" to find Error rate events

There could be other fields that contribute to error rate:

  • http.response.status_code
  • ...

3. Related alerts ([3] annotation in event timeline)
Show the number of alerts triggered for the same source/entities in the compact view. In the detail view, show a short reason message for each of the alerts, for example "Latency threshold breached" or simply "Latency increased".

4. SLO burn rate alert
If there are SLO burn rate alerts for the same source/entities, show them as events on the event timeline.
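
The window comparison proposed for the log rate (item 1), which item 2 reuses for the error rate, could be sketched like this. The input is assumed to be a document count per time window, already fetched; `detectRateEvents`, `RateEvent`, and the default factor of 1.5 are illustrative choices, with 1.5 taken from the example values in the suggestion.

```typescript
// Hypothetical event shape for a significant rate change between two windows.
interface RateEvent {
  windowIndex: number; // index of the later window in the compared pair
  direction: 'increase' | 'decrease';
  factor: number; // how many times the count grew or shrank
}

// Compares each window's document count with the next one and reports
// pairs whose ratio crosses the threshold in either direction.
function detectRateEvents(counts: number[], threshold = 1.5): RateEvent[] {
  const events: RateEvent[] = [];
  for (let i = 0; i + 1 < counts.length; i++) {
    const current = counts[i];
    const next = counts[i + 1];
    // The ratio is undefined for empty windows; this sketch skips them.
    if (current === 0 || next === 0) continue;
    const ratio = next / current;
    if (ratio >= threshold) {
      events.push({ windowIndex: i + 1, direction: 'increase', factor: ratio });
    } else if (ratio <= 1 / threshold) {
      events.push({ windowIndex: i + 1, direction: 'decrease', factor: 1 / ratio });
    }
  }
  return events;
}

// Counts per window: a jump in the third window, then a drop in the fourth.
const rateEvents = detectRateEvents([100, 110, 220, 100]);
```

How empty windows should be treated (skipped here) is an open design question that the suggestion does not address.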

Use case specific events

Divide the time range into blocks and check each block for a set of fields to add to the event timeline, depending on the use case.

Use case: Log (Custom threshold / Log threshold) alert on kubernetes.pod.uid

  • kubernetes.event.reason: Unhealthy
  • kubernetes.container.status.restarts > 5? (Container restarts count)
  • kubernetes.container.status.phase: terminated/waiting
  • kubernetes.container.status.reason: Error, OOMKilled state
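
As a rough illustration, the field checks listed above could be applied to a single document like this. The field names come from the list; the flat `Doc` shape, the `kubernetesEvents` helper, the event labels, and the restart threshold of 5 (marked as tentative in the list) are assumptions.

```typescript
// Hypothetical flattened document keyed by dotted field names.
type Doc = Record<string, string | number | undefined>;

// Returns a label for every Kubernetes condition the document matches.
function kubernetesEvents(doc: Doc, restartThreshold = 5): string[] {
  const events: string[] = [];

  if (doc['kubernetes.event.reason'] === 'Unhealthy') {
    events.push('Unhealthy');
  }

  const restarts = doc['kubernetes.container.status.restarts'];
  if (typeof restarts === 'number' && restarts > restartThreshold) {
    events.push('Container restarts');
  }

  const phase = doc['kubernetes.container.status.phase'];
  if (phase === 'terminated' || phase === 'waiting') {
    events.push(`Container ${phase}`);
  }

  const reason = doc['kubernetes.container.status.reason'];
  if (reason === 'Error' || reason === 'OOMKilled') {
    events.push(`Container ${reason}`);
  }

  return events;
}
```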

Use case: Log (Error count / Custom threshold) alert on service.name

  • service.version change - need to compare each block to find if there is any version upgrade/downgrade history
  • service.state: failed (need to confirm the possible values)
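
The block-by-block comparison for service.version could look roughly like the following. It assumes each time block has already been reduced to the single version observed in it (or none, if the block had no data); `VersionBlock`, `VersionChange`, and `findVersionChanges` are hypothetical names.

```typescript
// Hypothetical summary of one time block: the service.version seen in it.
interface VersionBlock {
  start: string; // block start time, e.g. an ISO timestamp
  version?: string; // undefined when the block has no documents
}

interface VersionChange {
  at: string;
  from: string;
  to: string;
}

// Walks the blocks in order and reports every point where the observed
// version differs from the last known one (an upgrade or a downgrade).
function findVersionChanges(blocks: VersionBlock[]): VersionChange[] {
  const changes: VersionChange[] = [];
  let last: string | undefined;
  for (const block of blocks) {
    if (block.version === undefined) continue; // no data in this block
    if (last !== undefined && block.version !== last) {
      changes.push({ at: block.start, from: last, to: block.version });
    }
    last = block.version;
  }
  return changes;
}
```

Telling an upgrade from a downgrade would need a version comparison on top of this, which the sketch leaves out.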

Use case: Log (Custom threshold / Log threshold) alert on container.id

  • ...

Use case: Log (Custom threshold / Log threshold) alert on host.name

  • ...

@mgiota
Contributor

mgiota commented Jul 4, 2024

Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.

@benakansara
Contributor

Suggested observations

Screenshot 2024-07-05 at 11 50 55

These can be app specific charts - from infra, APM, synthetics - and ML visualizations.

Suggested observations by alert type and source (common, source: host, source: k8s pod, source: container):

Log
  • common: Log rate analysis, Log pattern analysis
Infra Metric
  • common: Change point detection
  • source: host: Memory usage, CPU usage, Disk usage, Network traffic
  • source: k8s pod: Memory usage, CPU usage, Network traffic
  • source: container: Memory usage, CPU usage
APM
  • common: Throughput, Time spent by span, Transactions table, Error occurrences, Errors table, Service map

@benakansara
Contributor

Add new observation

Screenshot 2024-07-05 at 12 06 52
  • ES|QL - Allow users to write their own ES|QL query, see the results of the query as a chart, a table, or both, and add the resulting visualizations to the investigation
  • Import existing visualizations

@benakansara
Contributor

benakansara commented Jul 5, 2024

Investigation timeline

Add the ability to link runbooks in the rule form so that users can add runbook links when creating rules. On the investigation detail page, the runbook link is shown at the top under Investigation timeline.

Users can create a new hypothesis and start adding notes/screenshots to it. Multiple hypotheses can be created.

Screenshot 2024-07-05 at 12 13 04

@benakansara
Contributor

Dashboards

As per the design, for some of the events, users can go to a relevant dashboard. For this, we can allow users to link dashboards in the rule form. If we detect an event related to an entity (for example, a container restart or node failure), we can show all dashboard links that users added while creating the rule. This assumes that users linked dashboards related to the monitored entities.

Alternatively, we can create a section on the investigation detail page that shows the dashboard links users added in the rule form, without attaching them to any particular event in the event timeline.

Screenshot 2024-07-04 at 22 20 13

@benakansara
Contributor

Great work! I think you covered most of the main parts. Two small details that are missing, are the invited members and the escalated integrations (Jira ticket, Github actions). These don't have to be part of "first version" of course, but still I would add them to the list, and when we create the subtickets we prioritize accordingly.

I have updated this comment to add a future iterations section. I added the points you mentioned plus some other topics.

@mgiota
Contributor

mgiota commented Jul 5, 2024

@benakansara I think you nailed it! I suggest we add a few more charts for the SLO burn rate rule, for example error budget consumption, historical SLI, and good & bad events, basically what we currently have on the SLO detail page. Unless we think some of these charts don't bring much value to the investigation process.

@jasonrhodes
Member Author

This is great, thanks so much, @benakansara !

@jasonrhodes jasonrhodes changed the title [RCA] Create investigation detail page [RCA] Create investigation detail page (V1) Aug 26, 2024

4 participants