[WIP] Adding SLOs troubleshooting document #4419

Closed
wants to merge 17 commits into from


eedugon
Contributor

@eedugon eedugon commented Oct 22, 2024

Description

Adds a new document, Troubleshoot SLOs, as requested in #4237.

Preview:

Before moving forward I'd like @lucabelluccini to take a look at the current structure and decide if all suggested parts are useful.

Once we agree on the main structure and the areas to cover, we can discuss the exact content further.

Pending actions

  • Analyze and decide if serverless documentation needs to be updated.

Documentation sets edited in this PR

Check all that apply.

  • Stateful (docs/en/observability/*)
  • Serverless (docs/en/serverless/*)
  • Integrations Developer Guide (docs/en/integrations/*)
  • None of the above

Related issue

Closes #4237

Checklist

  • Product/Engineering Review
  • Writer Review

Follow-up tasks

Select one.

  • This PR does not need to be ported to another doc set because:
    • The concepts in this PR only apply to one doc set (serverless or stateful)
    • The PR contains edits to both doc sets (serverless and stateful)
  • This PR needs to be ported to another doc set:
    • Port to stateful docs: <link to PR or tracking issue>
    • Port to serverless docs: <link to PR or tracking issue>

@eedugon eedugon added needs-writer-review backport-8.x Automated backport to the 8.x branch with mergify needs-dev-review labels Oct 22, 2024
@eedugon eedugon requested a review from a team as a code owner October 22, 2024 10:35
Contributor

A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

@bmorelli25 bmorelli25 added the backport-8.16 Automated backport with mergify label Oct 24, 2024
@eedugon
Contributor Author

eedugon commented Oct 29, 2024

Summary of notes after meeting with @lucabelluccini :

(done)- We need a prerequisites section, because the current note about the required license is not enough. The prerequisites should cover license, node roles, and user permissions. It's unclear where exactly to add this information (it's not long). The options are:

  • In the main page (probably my preference), with a link to the Configure SLOs page for the privileges
  • In the Configure SLOs page, as that page shows the privileges / setup needed for SLOs
  • In a dedicated page.

^^ @bmorelli25 : I'd like to discuss that point with you to determine where to place the prerequisites information.

(done)- Upgrade from Beta to GA in main page --> we will move it to the troubleshooting doc and leave only a reference to it in the main page. The current info on this topic in the troubleshooting doc should also be enriched with the data that was in the main page.

(done, section renamed)- Troubleshooting page: we need to move the SLO Overview (the detailed description of the SLOs showing the relation with transforms and pipelines) somewhere else, as that's not really troubleshooting. I will think of a better way to include this.
^^ @bmorelli25 , maybe you can also help with that.

(pending)- APIs: we will keep the APIs at the very end of the troubleshooting doc, with a warning to use them only for information retrieval and not to update the SLOs or other objects (for example, when directly accessing Kibana saved objects).

(done)- The following content of the intro should be moved to Common problems / Unhealthy transforms:

One of the common issues with SLOs arises when there are underlying problems in the cluster, such as unavailable shards or failed transforms. Since SLOs rely on transforms to aggregate and process data, any failure or misconfiguration in these components can lead to inaccurate or incomplete SLO calculations. Additionally, unavailable shards can affect the data retrieval process, further complicating the reliability of SLO metrics.

  • Add a hint in the overview about needing a healthy, correctly sized cluster.
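
The cluster-level concern in the quoted intro can be checked before looking at any individual SLO. A minimal sketch, assuming the response body of the Elasticsearch cluster health API (`GET _cluster/health`) has already been fetched and parsed into a dict. The function name `cluster_blockers` and the sample payload are hypothetical and are not part of this PR:

```python
def cluster_blockers(health: dict) -> list[str]:
    """List cluster-level conditions that could distort SLO calculations."""
    problems = []
    # "status" is green, yellow, or red in a cluster health response.
    status = health.get("status", "unknown")
    if status != "green":
        problems.append(f"cluster status is {status}")
    # Unassigned shards can make source data partially unavailable.
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shard(s)")
    return problems


# Hypothetical sample payload, trimmed to the two fields used above.
sample = {"status": "yellow", "unassigned_shards": 3}
print(cluster_blockers(sample))
```

An empty list means that neither of the two conditions was found. It says nothing about whether the cluster is correctly sized.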

(done)- Inspect SLOs --> we have to guide users to use the SLO inspect feature to analyze the content of the SLO, the transforms, etc. We can probably show in the doc the details of what inspect offers.

(done)- Explain to users how they can jump to the transforms page directly from some of the links in the SLOs.

(done)- Action --> Reset: we have to suggest this, instead of editing and saving, to fix possible issues.

(done)- General troubleshooting guidance (to include somehow, if possible):

  • inspect SLO assets
    • ensure transforms exist and are healthy / working (use the direct link to transforms here)
      • check transform stats
    • ensure associated pipeline exists.
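
The checklist above could be sketched roughly as follows. This assumes the responses of the Elasticsearch get transform stats API (`GET _transform/<transform_id>/_stats`) and get pipeline API (`GET _ingest/pipeline/<pipeline_id>`) have already been fetched and parsed into dicts. The function name `check_slo_assets` and every ID in the sample data are hypothetical placeholders, not the real names of SLO assets:

```python
def check_slo_assets(transform_stats: dict, pipelines: dict, pipeline_id: str) -> list[str]:
    """Report problems with the transforms and ingest pipeline behind an SLO."""
    findings = []
    transforms = transform_stats.get("transforms", [])
    if not transforms:
        findings.append("no matching transform found")
    for t in transforms:
        tid = t.get("id", "<unknown>")
        state = t.get("state", "unknown")
        if state == "failed":
            findings.append(f"{tid}: failed ({t.get('reason', 'no reason reported')})")
        elif state in ("stopped", "stopping", "aborting"):
            findings.append(f"{tid}: not running (state={state})")
        # The health object may be absent in some versions; only flag explicit problems.
        health = t.get("health", {}).get("status")
        if health in ("yellow", "red"):
            findings.append(f"{tid}: health is {health}")
    # The get pipeline response is keyed by pipeline ID.
    if pipeline_id not in pipelines:
        findings.append(f"ingest pipeline {pipeline_id} not found")
    return findings


# Hypothetical sample data, trimmed to the fields used above.
sample_stats = {
    "count": 2,
    "transforms": [
        {"id": "slo-example-1", "state": "started", "health": {"status": "green"}},
        {"id": "slo-summary-example-1", "state": "failed",
         "reason": "example failure reason", "health": {"status": "red"}},
    ],
}
for finding in check_slo_assets(sample_stats, {}, "example-pipeline"):
    print(finding)
```

This only reads data, in line with the note above about using the APIs for information retrieval rather than for updating SLO objects.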

@bmorelli25 bmorelli25 requested a review from dedemorton October 30, 2024 21:13
@eedugon
Contributor Author

eedugon commented Oct 31, 2024

Updating status after getting some suggestions from @bmorelli25. This is what we are planning to do before reviewing this again with @lucabelluccini and @dedemorton :

  • All three prerequisites will be shown in the main page, with a link to the Configure SLOs page, as the permissions requirement really deserves a dedicated page. The other pages will also include a reminder about the prerequisites.

  • In the important concepts section of the main page we will also add a brief introduction to the relation between SLOs and transforms (and maybe other system components), and we will point to the section where we provide the lower-level details of the SLO implementation.

  • In the troubleshooting doc we will keep the detailed view, but under a different section name, something like Understanding SLOs.

    • In this section we will add a TIP at the beginning telling users to go directly to the common problems section if they already know how SLOs work and how they relate to transforms, pipelines, etc.
  • We will also remove the jump list in the troubleshooting doc and only add something back if it is really worthwhile.

We'll also follow the items mentioned at #4419 (comment)

colleenmcginnis and others added 13 commits November 6, 2024 11:11
…#4327)

* initial draft

* address feedback

* add feedback

---------

Co-authored-by: Brandon Morelli <[email protected]>
Co-authored-by: Brandon Morelli <[email protected]>
* Add network metrics to list of docker container metrics

* Fix build error

* Fix typo
* document synthetics monitor status rule

* document how to move from the uptime rule to the synthetics rule

* add code comments

* address feedback from @paulb-elastic

* add info on using slos for availability

* add link

* delete table

* port to serverless
… apm-data plugin (elastic#4333)

* update getting started docs

* update upgrade guide

* fix build

* audit mentions of the apm integration

* audit mentions of data streams

* audit mentions of index templates

* audit mentions of index mappings

* address feedback from @carsonip

* update diagram

* address more feedback from @carsonip

* update ilm guide

* address more feedback from @lahsivjar

* address more feedback from @lahsivjar

* add link

* fix table formatting

* address more feedback from @lahsivjar
* split apm alerts and format like other rule docs

* fill out individual apm rule pages

* address feedback

* apm ui -> applications ui

---------

Co-authored-by: bmorelli25 <[email protected]>
…4385)

* Add Osquery tab to host monitoring page

* Fix var tag syntax

* Add missing image

* Add role requirements
…c#4438)

* Add new page on inventory

* Update docs/en/observability/apm/view-and-analyze/inventory.asciidoc

Co-authored-by: DeDe Morton <[email protected]>

* Crop images to remove nav pane

---------

Co-authored-by: DeDe Morton <[email protected]>
Contributor

mergify bot commented Nov 6, 2024

This pull request is now in conflict. Could you fix it @eedugon? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b slo_troubleshooting upstream/slo_troubleshooting
git merge upstream/main
git push upstream slo_troubleshooting

@eedugon eedugon force-pushed the slo_troubleshooting branch from 8dafce2 to f87fd53 Compare November 6, 2024 10:59
@eedugon
Contributor Author

eedugon commented Nov 6, 2024

@dedemorton , I have made the changes I wanted, feel free to review it.

On the other hand, I don't know what I have done with the branch and git, but it looks like this PR is going to update 13 files, when in theory I have modified only the SLOs and the main index.

@bmorelli25 , would you help me to figure out what's going on?

@eedugon
Contributor Author

eedugon commented Nov 6, 2024

I have created a new PR #4486 as I don't understand this mess :(
cc: @dedemorton , feel free to review the new one. I will keep this one open to try to understand what happened, and I'll close it afterwards.

@eedugon eedugon closed this Nov 6, 2024
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify needs-dev-review needs-writer-review
Development

Successfully merging this pull request may close these issues.

[Request]: Document SLO Transforms and troubleshooting steps
6 participants