[RCA] Define new "investigation" CRUD API #187284

Closed
Tracked by #190060
jasonrhodes opened this issue Jul 1, 2024 · 14 comments · Fixed by #190094
Labels
Team:obs-ux-management Observability Management User Experience Team

Comments

@jasonrhodes
Member

jasonrhodes commented Jul 1, 2024

Acceptance Criteria

  • Initial version of saved object storage is in place
  • REST API exists in Kibana back end with following initial endpoints
    • Create investigation
    • List one investigation
    • List all investigations
@botelastic botelastic bot added the needs-team label (Issues missing a team label) Jul 1, 2024
@jasonrhodes jasonrhodes added the Team:obs-ux-management label (Observability Management User Experience Team) Jul 2, 2024
@botelastic botelastic bot removed the needs-team label (Issues missing a team label) Jul 2, 2024
@maryam-saeidi maryam-saeidi self-assigned this Jul 3, 2024
@jasonrhodes
Member Author

@maryam-saeidi I just synced with @kdelemme and @benakansara and they've got some good context now from the other POC, so they're going to jump in on these investigation UI side tickets. Feel free to continue to be involved as we refine these (asking questions, syncing between the entry point flow and this flow). Thanks all.

@kdelemme
Contributor

kdelemme commented Jul 4, 2024

Design: https://www.figma.com/design/YJPufJ9KJjvBY9pGvNDRSR/RCA-workflows?node-id=0-1&t=IgH5U7bAwPV2PxFS-1

First attempt at providing an overview of what needs to be done in order to bring the aforementioned design to life.
We should focus only on the scenario of log alerts with kubernetes data.

Design elements

We have a date range picker that seems to be global to all the widgets on the page.
The investigation is linked to an alert, and maybe to a rule as well so we can link this investigation to similar alerts.
We have different widgets, such as charts (which ones?) and recent events (we need to define which events, and how we find them). We can also add hypotheses, which can be text and/or an image.

Recent Events

For every event shown in the design, we need to work out how we get the data. It can come from an existing API, if one exists, or from a query against the index used by the rule.

  • New version release: TBD
  • Node resource failure: TBD
  • Container failure start: TBD
  • Latency increase: TBD
  • Error rate increase: TBD
  • Log rate increase: TBD
  • Elasticsearch upgrade: TBD

Model

Investigation: {
 id: uuid;
 title: string;
 createdAt: Date;
 createdBy: user;
 tags: string[];
 status: "ongoing" | "closed";
 relatedRuleId: string;
 relatedAlertId: string;
 widgets: Widget[];
 links: Link[];
 hypotheses: Hypothesis[];
}


Widget: {
 id: uuid;
 title: string;
 type: "esql" | "embeddable" | "recentEvents" | "chart";
 parameters: any;
 layout: Layout;
}


Link: {
 text: string;
 link: string;
}


Hypothesis: {
 id: uuid;
 text: string;
 attachments: Attachment[];
 createdAt: Date;
 createdBy: user;
}


Attachment: {
 id: uuid;
 type: "image";
 url: string;
}
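
For reference, here is a rough sketch of how the model above could look as actual TypeScript types. The field names are taken from the proposal; `UserRef`, the shape of `Layout`, and the `newInvestigation` helper are assumptions made only for illustration, since the proposal leaves them undefined.

```typescript
// Sketch of the proposed model as TypeScript types. Field names come from the
// proposal; `UserRef`, `Layout`, and `newInvestigation` are assumptions.

type UserRef = string; // assumption: the proposal leaves the `user` type open

// Assumption: the proposal names `Layout` but does not define it.
interface Layout {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Widget {
  id: string; // uuid
  title: string;
  type: "esql" | "embeddable" | "recentEvents" | "chart";
  parameters: Record<string, unknown>; // `any` in the proposal
  layout: Layout;
}

interface Link {
  text: string;
  link: string;
}

interface Attachment {
  id: string; // uuid
  type: "image";
  url: string;
}

interface Hypothesis {
  id: string; // uuid
  text: string;
  attachments: Attachment[];
  createdAt: Date;
  createdBy: UserRef;
}

interface Investigation {
  id: string; // uuid
  title: string;
  createdAt: Date;
  createdBy: UserRef;
  tags: string[];
  status: "ongoing" | "closed";
  relatedRuleId: string;
  relatedAlertId: string;
  widgets: Widget[];
  links: Link[];
  hypotheses: Hypothesis[];
}

// Hypothetical helper: builds an empty, ongoing investigation for an alert.
function newInvestigation(
  id: string,
  title: string,
  createdBy: UserRef,
  relatedRuleId: string,
  relatedAlertId: string
): Investigation {
  return {
    id,
    title,
    createdBy,
    relatedRuleId,
    relatedAlertId,
    createdAt: new Date(),
    tags: [],
    status: "ongoing",
    widgets: [],
    links: [],
    hypotheses: [],
  };
}
```

Typing `parameters` as `Record<string, unknown>` instead of `any` is a choice of this sketch: it forces each widget type to narrow its own parameters before use.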

API

GET /investigations

POST /investigations

GET /investigations/:id

POST /investigations/:id/widgets
PUT /investigations/:id/widgets/:widgetId
DELETE /investigations/:id/widgets/:widgetId

POST /investigations/:id/hypotheses
PUT /investigations/:id/hypotheses/:hypothesisId
DELETE /investigations/:id/hypotheses/:hypothesisId
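
To make the intended behaviour of these endpoints concrete, here is a minimal in-memory sketch of the operations behind them. It is not the Kibana implementation: saved object persistence, auth, and HTTP wiring are left out, and `InvestigationStore` and its method names are hypothetical. The hypotheses endpoints would presumably follow the same pattern as the widget ones.

```typescript
// In-memory sketch of the operations behind the proposed endpoints.
// All names are hypothetical; the types are trimmed down for brevity.

interface Widget {
  id: string;
  title: string;
  type: string;
}

interface Investigation {
  id: string;
  title: string;
  widgets: Widget[];
}

class InvestigationStore {
  private items = new Map<string, Investigation>();
  private nextId = 1;

  // Simple counter-based ids stand in for the uuids of the proposal.
  private newId(prefix: string): string {
    return `${prefix}-${this.nextId++}`;
  }

  // POST /investigations
  create(title: string): Investigation {
    const investigation: Investigation = { id: this.newId("inv"), title, widgets: [] };
    this.items.set(investigation.id, investigation);
    return investigation;
  }

  // GET /investigations
  list(): Investigation[] {
    return Array.from(this.items.values());
  }

  // GET /investigations/:id
  get(id: string): Investigation {
    const found = this.items.get(id);
    if (!found) throw new Error(`Investigation ${id} not found`);
    return found;
  }

  // POST /investigations/:id/widgets
  addWidget(id: string, widget: Omit<Widget, "id">): Widget {
    const created: Widget = { id: this.newId("widget"), ...widget };
    this.get(id).widgets.push(created);
    return created;
  }

  // PUT /investigations/:id/widgets/:widgetId
  updateWidget(id: string, widgetId: string, changes: Partial<Omit<Widget, "id">>): Widget {
    const widget = this.get(id).widgets.find((w) => w.id === widgetId);
    if (!widget) throw new Error(`Widget ${widgetId} not found`);
    Object.assign(widget, changes);
    return widget;
  }

  // DELETE /investigations/:id/widgets/:widgetId
  deleteWidget(id: string, widgetId: string): void {
    const investigation = this.get(id);
    investigation.widgets = investigation.widgets.filter((w) => w.id !== widgetId);
  }
}
```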

Kibana plugins

To avoid cyclic dependencies, Dario's initial POC used two plugins. The first is responsible for the widget registry: other plugins (e.g. APM, SLO, Synthetics) depend on it to register their widgets.

The second is the main plugin, which depends on the registry plugin and on the other plugins such as SLO and APM (e.g. for their clients), and contains the investigation UI and API.
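
A rough illustration of why the split breaks the cycle: the registry depends on nothing, contributing plugins depend only on the registry, and the main plugin reads from it. This is not the POC's code; `WidgetRegistry`, `setupSloPlugin`, and `renderWidget` are hypothetical names, and rendering to a string is a stand-in for whatever the UI actually needs.

```typescript
// Sketch of the two-plugin split. All names are hypothetical.

interface WidgetDefinition {
  type: string;
  description: string;
  // Stand-in for real rendering: returns a string describing the widget.
  render: (parameters: Record<string, unknown>) => string;
}

// "Registry" plugin: has no dependencies, so any plugin can depend on it.
class WidgetRegistry {
  private definitions = new Map<string, WidgetDefinition>();

  register(definition: WidgetDefinition): void {
    if (this.definitions.has(definition.type)) {
      throw new Error(`Widget type "${definition.type}" is already registered`);
    }
    this.definitions.set(definition.type, definition);
  }

  get(type: string): WidgetDefinition | undefined {
    return this.definitions.get(type);
  }

  types(): string[] {
    return Array.from(this.definitions.keys());
  }
}

// A contributing plugin (e.g. SLO) only needs the registry to add its widget.
function setupSloPlugin(registry: WidgetRegistry): void {
  registry.register({
    type: "slo",
    description: "SLO burn rate",
    render: (parameters) => `SLO widget for ${String(parameters["sloId"])}`,
  });
}

// The main investigation plugin looks widgets up in the registry to render them.
function renderWidget(
  registry: WidgetRegistry,
  type: string,
  parameters: Record<string, unknown>
): string {
  const definition = registry.get(type);
  if (!definition) throw new Error(`Unknown widget type "${type}"`);
  return definition.render(parameters);
}
```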

Concerns

Because we are focusing on one particular alert, we risk making this investigation UI a fixed set of elements, or simply a different alert details page. We need to keep in mind that the user knows best and should be able to construct the investigation blocks as they want.

@mgiota
Contributor

mgiota commented Jul 4, 2024

@kdelemme Great points! In the Model section I suggest we add the concept of User, Escalation/Integration as well.

User: {
  id: uuid;
  username: string;
  password: string;
}

# example Jira, Github issue etc
Integration: {
  id: uuid;
  title: string;
  description: string;
}

The Investigation model needs to be adapted as well to include the list of Integrations.

Since there is the concept of Escalation and inviting more users to the investigation, we should add the list of invited users as well.

Investigation: {
 id: uuid;
 title: string;
 createdAt: Date;
 createdBy: user;
 invitedUsers: User[];
 tags: string[];
 status: "ongoing" | "closed";
 relatedRuleId: string;
 relatedAlertId: string;
 widgets: Widget[];
 links: Link[];
 hypotheses: Hypothesis[];
 integrations: Integration[];
}

@mgiota
Contributor

mgiota commented Jul 4, 2024

@kdelemme Regarding the status field of the Investigation: the design shows an "acknowledged" status, so let's use "acknowledged" | "closed";

@mgiota
Contributor

mgiota commented Jul 4, 2024

we can link this investigation to similar alerts

In the design there is the concept of Related investigations. What do we consider similar alerts? Alerts that are linked to the same rule type? We need to define what relevant investigations are.

@maryam-saeidi
Member

I feel like we are doing the same thing as cases with additional components like widgets and hypotheses 🙈

Putting that aside and only focusing on the proposal, do we also need to keep a field related to the latest update? (like updatedAt, updatedBy, for Investigation) I assume hypotheses are not editable in this model, right?
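
As a sketch of what tracking the latest update could look like, assuming `updatedAt` and `updatedBy` were added to the model (they are only a suggestion at this point, and `applyUpdate` is a hypothetical helper, with the Investigation type trimmed down for brevity):

```typescript
// Hypothetical sketch: `updatedAt`/`updatedBy` are the suggested fields,
// not part of the original proposal.
interface Audited {
  createdAt: Date;
  createdBy: string;
  updatedAt: Date;
  updatedBy: string;
}

interface Investigation extends Audited {
  id: string;
  title: string;
  status: "ongoing" | "closed";
}

// Only these fields may be changed through an update in this sketch.
type EditableFields = Partial<Pick<Investigation, "title" | "status">>;

// Returns a new object with the changes applied and the update stamp refreshed.
// The input object is left untouched.
function applyUpdate(
  current: Investigation,
  changes: EditableFields,
  user: string,
  now: Date = new Date()
): Investigation {
  return { ...current, ...changes, updatedAt: now, updatedBy: user };
}
```

Whether hypotheses need the same stamp depends on the answer to the editability question.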

@mgiota For integrations, how is it different from the Link that Kevin mentioned?

@chrisdistasio

New version release: TBD
Node resource failure: TBD
Container failure start: TBD
Latency increase: TBD
Error rate increase: TBD
Log rate increase: TBD
Elasticsearch upgrade: TBD

Is the above a full set of events that need to be captured?

@jasonrhodes
Member Author

New version release: TBD
Node resource failure: TBD
Container failure start: TBD
Latency increase: TBD
Error rate increase: TBD
Log rate increase: TBD
Elasticsearch upgrade: TBD

Is the above a full set of events that need to be captured?

Just the ones that appear in the design. As the current approach requires us to manually extract each event using specific logic, we'll need to understand what the intended universe of events is, to start. A question for @drewpost I think.

@chrisdistasio

thanks, @jasonrhodes. Do you have a sense of how you will capture and compute some of these? Do you expect to have these attached to an entity?

@jasonrhodes
Member Author

Do you have a sense of how you will capture and compute some of these? Do you expect to have these attached to an entity?

I don't have good ideas yet. If they were available via the entity system, that would be great, but I don't want to block this work on that one so we will look at alternative ways of computing some of these, as well, and will have to punt on the ones that aren't possible (at least until entities are available).

@michaelolo24
Contributor

Hey all! Just wanted to drop a heads up that security is also going to have a concept of investigations that, at the moment, will only serve as a navigation item within security, but will most likely expand to be more in the future. Given that, it would be great to have any APIs/SOs scoped to observability if possible, i.e. /api/observability/investigation or /api/obs-investigation, to prevent any collisions in the future.
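
For illustration, the scoping could be kept in one place with a few path helpers. The base path below is just the first of the two options mentioned, not a confirmed route, and the saved object type name and function names are made up for this sketch.

```typescript
// Hypothetical path helpers. The base path is one of the two suggested
// options; neither is confirmed as the final route.
const OBS_INVESTIGATION_BASE = "/api/observability/investigation";

// Hypothetical saved object type name, prefixed so it cannot clash with a
// future security "investigation" type.
const OBS_INVESTIGATION_SO_TYPE = "observability-investigation";

// Path of the collection (list and create).
function investigationsPath(): string {
  return OBS_INVESTIGATION_BASE;
}

// Path of a single investigation; the id is escaped for use in a URL.
function investigationPath(id: string): string {
  return `${OBS_INVESTIGATION_BASE}/${encodeURIComponent(id)}`;
}

// Path of the widgets that belong to an investigation.
function investigationWidgetsPath(id: string): string {
  return `${investigationPath(id)}/widgets`;
}
```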

Is there additional documentation on this feature that we may be able to read up on?

@jasonrhodes
Member Author

Ping @drewpost re: this security overlap in the "investigations" concept ^^

@michaelolo24, who would be the product person for Drew to sync up with here?

@michaelolo24
Contributor

@jasonrhodes => @paulewing would be the person for him to speak with, thanks!

@jasonrhodes jasonrhodes changed the title from [RCA] Define new "investigation" system object to [RCA] Define new "investigation" CRUD API on Aug 12, 2024
@jasonrhodes
Member Author

Closed by #190094
