Adding SLOs troubleshooting document (attempt 2) #4486

Merged
merged 33 commits into from
Nov 28, 2024

Conversation

eedugon
Contributor

@eedugon eedugon commented Nov 6, 2024

Description

New document Troubleshoot SLOs created, per #4237 request.
SLO pre-requisites updated in all pages
SLO Beta instructions moved to troubleshooting doc

Preview:

Before moving forward I'd like @lucabelluccini to take a look at the current structure and decide if all suggested parts are useful.

After agreeing on the main structure and the areas to cover, we can discuss the exact content further.

Pending actions

  • Analyze and decide if serverless documentation needs to be updated.

Documentation sets edited in this PR

Check all that apply.

  • Stateful (docs/en/observability/*)
  • Serverless (docs/en/serverless/*)
  • Integrations Developer Guide (docs/en/integrations/*)
  • None of the above

Related issue

Closes #4237

Checklist

  • Product/Engineering Review
  • Writer Review

Follow-up tasks

Select one.

  • This PR does not need to be ported to another doc set because:
    • The concepts in this PR only apply to one doc set (serverless or stateful)
    • The PR contains edits to both doc sets (serverless and stateful)
  • This PR needs to be ported to another doc set:
    • Port to stateful docs: <link to PR or tracking issue>
    • Port to serverless docs: <link to PR or tracking issue>

@eedugon eedugon requested a review from a team as a code owner November 6, 2024 11:11
Contributor

github-actions bot commented Nov 6, 2024

A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Nov 6, 2024
@eedugon eedugon added backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify needs-product-review and removed backport-skip Skip notification from the automated backport with mergify labels Nov 6, 2024
@elastic elastic deleted a comment from mergify bot Nov 6, 2024
@eedugon
Contributor Author

eedugon commented Nov 7, 2024

@dedemorton : This is the PR to focus on for your review. I closed the other one because it got corrupted after the major changes we made to the serverless docs :)

Contributor

@dedemorton dedemorton left a comment

This PR adds a nice level of technical depth that's often missing from the docs we write, so kudos for that! I've made some editorial suggestions (hope that's what you were looking for) and pointed out some things I thought might be confusing to users. I hope this is helpful.

@eedugon
Contributor Author

eedugon commented Nov 19, 2024

@lucabelluccini : would you please comment on the current status of the PR and the remaining topics?
Thanks in advance!

@lucabelluccini
Contributor

General disclaimers:

  • I think a developer should be involved in the review. We need to understand how much we want to disclose and what is subject to major changes in the medium to long term.
  • Depending on the above, the docs will become a "contract" that must be updated whenever the architecture changes.

For what it's worth, this is the dependency graph of the assets I've deduced from the behavior of SLO

image

Source here

Suggestions:

Let's add a WARNING/INFO banner at the top

We should add a top banner in Troubleshoot SLOs page:

  • Do not edit or delete any "internal" asset mentioned in the page (e.g. transforms or ingest pipelines created by the SLO application).
  • Do not build upon or tamper with the "internal" assets mentioned in this page (e.g. users should not expect to programmatically inject or edit transforms or ingest pipelines named like the ones SLO expects).
  • Those assets are implementation details and are subject to change.
  • Do not attempt to edit the .slo-observability.* indices (by overriding index templates or editing the settings/mappings).
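
A minimal sketch of how a maintenance script could refuse to touch these indices. It assumes the `.slo-observability.*` pattern above is the only one that matters; the function name and pattern tuple are hypothetical and not part of any Elastic tooling.

```python
from fnmatch import fnmatchcase

# Pattern taken from the bullet above. Treating it as the complete set of
# internal SLO index patterns is an assumption made for this sketch.
INTERNAL_SLO_INDEX_PATTERNS = (".slo-observability.*",)


def is_internal_slo_index(index_name: str) -> bool:
    """Return True when the index name matches an internal SLO pattern."""
    return any(
        fnmatchcase(index_name, pattern)
        for pattern in INTERNAL_SLO_INDEX_PATTERNS
    )


def assert_safe_to_modify(index_name: str) -> None:
    """Raise before a script edits settings or mappings of an internal index."""
    if is_internal_slo_index(index_name):
        raise ValueError(f"{index_name} is an internal SLO index; do not edit it")
```

A script would call `assert_safe_to_modify(name)` before any settings or mappings update and skip the index if it raises.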

Understanding SLOs section

"Understanding SLOs" should be renamed, as it might be read as "how do I understand/interpret a Service Level Objective".
I do not know what the name should be, but something like "Elastic SLO internals".

I would avoid mentioning the v3 in .slo-observability.sli-v3 and .slo-observability.summary-v3. Those versions are likely to change over time, so I would replace them with {internal version} or something similar.
As an example, we're at v3.3 on 8.16.

The part:

The rollup documents are stored in an index named .slo-observability.sli-v3 (index split per month through an ingest pipeline) while summary documents are stored in .slo-observability.summary-v3.
Each time an SLO is updated, a new transform is created using the latest SLO definition. The transform ID is generated by combining the SLO ID and the SLO revision, following the format: slo-{slo.id}-{slo.revision}.

Maybe something like this would be better (apart from the formatting/list):

image
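
The transform ID format quoted above, slo-{slo.id}-{slo.revision}, can be sketched as a small helper. The function name and the sample ID are hypothetical; only the format string comes from the quoted text.

```python
def slo_transform_id(slo_id: str, revision: int) -> str:
    """Build a transform ID following the quoted slo-{slo.id}-{slo.revision} format."""
    return f"slo-{slo_id}-{revision}"


# Each SLO update creates a new transform, so every revision maps to its own ID.
example_ids = [slo_transform_id("my-slo", rev) for rev in (1, 2, 3)]
```

Comparing the expected ID for the current revision against the transforms that actually exist is one way to spot a missing transform.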

Common problems section

  • No transform or ingest nodes 👍
  • Unhealthy transforms 👍
    • Maybe rename it "Unhealthy or missing transforms"
    • Maybe add "check the Health tab of the Transform" (TBH this should be covered in the Transforms troubleshooting)
  • Missing Ingest Pipelines
    • I think the way to recreate the ingest pipelines is to edit/save the SLO or do a reset (to check with devs)
  • Missing Templates
    • I do not know how to trigger the recreation of those assets. I don't know if restarting Kibana is enough or if there's an API we can use (to check with devs). This would be great to have.

SLO troubleshooting actions section

  • Inspect SLO 👍
  • Reset SLO 👍

Upgrade from beta to GA section

👍 I think this deserves a dedicated page, OR we should have a TOC at the beginning of the Troubleshooting page. Otherwise it can easily be missed. I don't know if that is possible.

Using API calls to retrieve SLO details section

I would recommend pointing only to the SLO API docs: https://www.elastic.co/docs/api/doc/kibana/v8/operation/operation-findslosop

Accessing system or internal indices is not useful to the end user, but we have given enough information to run a sanity check, if needed, on the assets that SLOs require in order to work.


Hope it helps

@eedugon
Contributor Author

eedugon commented Nov 20, 2024

Awesome @lucabelluccini ! I'll work on a new version of the PR and as soon as it looks ok to you we will have to find a dev or PM reviewer here.

About the name of the detailed section, what about Understanding SLO internals? :) That should solve the confusion you mentioned.

Contributor

@lucabelluccini lucabelluccini left a comment

@colleenmcginnis colleenmcginnis added the backport-8.17 Automated backport with mergify label Nov 22, 2024
Contributor

@lucabelluccini lucabelluccini left a comment

I think we're 99.9% done. I've left a few comments.

Comment on lines 209 to 210
* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (`/api/observability/slos/_definitions`)
* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L368-L410[SLO Reset API] (`/api/observability/slos/${id}/_reset`)
Contributor

Is there a way to point to the new Kibana SLO API docs instead of the OpenAPI source link?
https://www.elastic.co/docs/api/doc/kibana/v8/operation/operation-findslosop

Contributor Author

I'll take a look, thanks!
That content wasn't really mine; we moved it from the main SLO page. I'll check the links anyway, since we have now added a link to the SLO API docs at the end of this doc.

Contributor Author

@lucabelluccini , this is interesting! I have updated (fixed) the second link (SLO RESET) because that one exists in our API docs: https://www.elastic.co/docs/api/doc/kibana/v8/operation/operation-resetsloop

But in the first one we have two problems:

  • First of all, there's a bug in the text: the API is not api/observability/slos/_definitions, but internal/observability/slos/_definitions (based on the link).
  • Secondly that API call is not in the public API docs.
  • Thirdly (:-D) that link doesn't work for the main branch, so it all looks weird.
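
To make the path discrepancy concrete, here is a minimal sketch that compares the two `_definitions` paths and builds the reset path quoted in the doc. The helper names are hypothetical, and treating the `/api/` prefix as "public" and `/internal/` as "not public" is the working assumption of this thread, not a statement from the API docs.

```python
from urllib.parse import quote


def is_public_kibana_path(path: str) -> bool:
    """Assumption: paths under /api/ are public, paths under /internal/ are not."""
    return path.startswith("/api/")


def slo_reset_path(slo_id: str) -> str:
    """Build the reset path quoted in the doc: /api/observability/slos/${id}/_reset."""
    return f"/api/observability/slos/{quote(slo_id, safe='')}/_reset"


# Path as written in the doc text versus the path implied by the linked source.
documented_path = "/api/observability/slos/_definitions"
linked_path = "/internal/observability/slos/_definitions"
```

Under that assumption the documented path looks public while the linked one does not, which is the mismatch described above.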

What would you suggest? Should we check with devs?

Contributor Author

@kdelemme : would you be able to share anything on this topic? This isn't really related to the goal of this PR, but it's something we found along the way.

@eedugon
Contributor Author

eedugon commented Nov 26, 2024

Notes for PM / Dev review:

This PR has been created as a result of @lucabelluccini's request in #4237.

We have created an SLO troubleshooting doc that adds details about the SLO implementation, which should help users understand SLOs better and troubleshoot them.

We would like your review on this, and your input on the level and amount of information that we want to disclose on this subject.

Besides the overall review of the content of the new file, we also need to clarify the link provided in the Upgrade from beta to GA section:

https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (/api/observability/slos/_definitions)

Is it OK to provide such a link in that section? Also, the path /api/observability/... appears to be wrong: based on the referenced code, the path should be /internal/observability/....

Contributor

@kdelemme kdelemme left a comment

Just a comment about updates. But otherwise I think this document is very useful! Thanks for making this.

@eedugon
Contributor Author

eedugon commented Nov 27, 2024

@lucabelluccini and @kdelemme : one final question.
Should we add this doc to serverless? I would say so based on @lucabelluccini's initial request at #4237.

Anyway I will add it in a port-PR.

Also feel free to approve this as soon as you believe it's ready ;)

Contributor

@dedemorton dedemorton left a comment

LGTM. Nice work!!

@eedugon eedugon merged commit 64b89f9 into elastic:main Nov 28, 2024
3 checks passed
@eedugon eedugon deleted the slo_troubleshoot_2 branch November 28, 2024 13:19
mergify bot pushed a commit that referenced this pull request Nov 28, 2024
* slo troubleshooting doc added

(cherry picked from commit 64b89f9)
eedugon added a commit that referenced this pull request Nov 28, 2024
* slo troubleshooting doc added

(cherry picked from commit 64b89f9)

Co-authored-by: Edu González de la Herrán <[email protected]>
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify needs-product-review
Development

Successfully merging this pull request may close these issues.

[Request]: Document SLO Transforms and troubleshooting steps
5 participants