Skip to content

Commit

Permalink
Adding SLOs troubleshooting document (attempt 2) (#4486)
Browse files Browse the repository at this point in the history
* slo troubleshooting doc added
  • Loading branch information
eedugon authored Nov 28, 2024
1 parent 083cf27 commit 64b89f9
Show file tree
Hide file tree
Showing 5 changed files with 271 additions and 57 deletions.
2 changes: 2 additions & 0 deletions docs/en/observability/index.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -241,6 +241,8 @@ include::slo-privileges.asciidoc[leveloffset=+3]

include::slo-create.asciidoc[leveloffset=+3]

include::slo-troubleshoot.asciidoc[leveloffset=+3]

//Data Set Quality
include::logs-monitor-datasets.asciidoc[leveloffset=+1]

Expand Down
6 changes: 6 additions & 0 deletions docs/en/observability/slo-create.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,12 @@ From here, complete the following steps:
. <<set-slo>>.
. <<slo-describe>>.

[NOTE]
====
For SLOs to function, the cluster must include one or more nodes with both `ingest` and `transform` {ref}/modules-node.html#node-roles[roles]. The roles can exist on the same node or be distributed across separate nodes.
On ESS deployments (Elastic Cloud), this is handled by the hot nodes, which serve as both `ingest` and `transform` nodes.
====

[discrete]
[[define-sli]]
= Define your SLI
Expand Down
60 changes: 4 additions & 56 deletions docs/en/observability/slo-overview.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
// tag::slo-license[]
[IMPORTANT]
====
To create and manage SLOs, you need an {subscriptions}[appropriate license] and <<slo-privileges,SLO access>> must be configured.
To create and manage SLOs, you need an {subscriptions}[appropriate license], an {es} cluster with both `transform` and `ingest` {ref}/modules-node.html#node-roles[node roles] present, and <<slo-privileges,SLO access>> must be configured.
====
// end::slo-license[]

Expand All @@ -25,6 +25,8 @@ SLO:: The target you set for your SLI. It specifies th
Error budget:: The amount of time that your SLI can not meet the SLO target before it violates your SLO.
Burn rate:: The rate at which your service consumes your error budget.

In addition to these key concepts related to SLO functionality, see <<slo-understanding-slos>> for more information on how SLOs work and their relationship with other system components, such as {ref}/transforms.html[{es} Transforms].

[discrete]
[[slo-in-elastic]]
== SLO overview
Expand Down Expand Up @@ -90,61 +92,7 @@ Starting in version 8.12.0, SLOs are generally available (GA).
If you're upgrading from a beta version of SLOs (available in 8.11.0 and earlier),
you must migrate your SLO definitions to a new format.

[%collapsible]
.Migrate your SLO definitions
====
To migrate your SLO definitions, open the SLO overview.
A banner will display the number of outdated SLOs detected.
For each outdated SLO, click **Reset**. If you no longer need the SLO, select **Delete**.
If you have a large number of SLO definitions, it is possible to automate this process.
To do this, you'll need to use two Elastic APIs:
* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (`/api/observability/slos/_definitions`)
* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L368-L410[SLO Reset API] (`/api/observability/slos/${id}/_reset`)
Pass in `includeOutdatedOnly=1` as a query parameter to the Definitions Find API.
This will display your outdated SLO definitions.
Loop through this list, one by one, calling the Reset API on each outdated SLO definition.
The Reset API loads the outdated SLO definition and resets it to the new format required for GA.
Once an SLO is reset, it will start to regenerate SLIs and summary data.
====

[%collapsible]
.Remove legacy summary transforms
====
After migrating to 8.12 or later, you might have some legacy SLO summary transforms running.
You can safely delete the following legacy summary transforms:
[source,sh]
----------------------------------
# Stop all legacy summary transforms
POST _transform/slo-summary-occurrences-30d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-7d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-90d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-monthly-aligned/_stop?force=true
POST _transform/slo-summary-occurrences-weekly-aligned/_stop?force=true
POST _transform/slo-summary-timeslices-30d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-7d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-90d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-monthly-aligned/_stop?force=true
POST _transform/slo-summary-timeslices-weekly-aligned/_stop?force=true
# Delete all legacy summary transforms
DELETE _transform/slo-summary-occurrences-30d-rolling?force=true
DELETE _transform/slo-summary-occurrences-7d-rolling?force=true
DELETE _transform/slo-summary-occurrences-90d-rolling?force=true
DELETE _transform/slo-summary-occurrences-monthly-aligned?force=true
DELETE _transform/slo-summary-occurrences-weekly-aligned?force=true
DELETE _transform/slo-summary-timeslices-30d-rolling?force=true
DELETE _transform/slo-summary-timeslices-7d-rolling?force=true
DELETE _transform/slo-summary-timeslices-90d-rolling?force=true
DELETE _transform/slo-summary-timeslices-monthly-aligned?force=true
DELETE _transform/slo-summary-timeslices-weekly-aligned?force=true
----------------------------------
Do not delete any new summary transforms used by your migrated SLOs.
====
Refer to <<slo-troubleshoot-beta>> for more details on how to proceed.

[discrete]
[[slo-overview-next-steps]]
Expand Down
2 changes: 1 addition & 1 deletion docs/en/observability/slo-privileges.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
<titleabbrev>Configure SLO access</titleabbrev>
++++

IMPORTANT: To create and manage SLOs, you need an {subscriptions}[appropriate license].
IMPORTANT: To create and manage SLOs, you need an {subscriptions}[appropriate license] and an {es} cluster with both `transform` and `ingest` {ref}/modules-node.html#node-roles[node roles] present.

You can enable access to SLOs in two different ways:

Expand Down
258 changes: 258 additions & 0 deletions docs/en/observability/slo-troubleshoot.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
[[slo-troubleshoot-slos]]
= Troubleshoot service-level objectives (SLOs)

++++
<titleabbrev>Troubleshoot SLOs</titleabbrev>
++++

include::slo-overview.asciidoc[tag=slo-license]

[WARNING]
====
Do not edit, delete, or tamper with any "internal" assets mentioned in this document, such as the transforms or ingest pipelines created by the SLO application.
Do not attempt to edit the `.slo-observability.*` indices mentioned in this document by overriding index templates or editing the settings/mappings.
The implementation details described here are subject to change.
====

This document provides an overview of common issues encountered when working with service-level objectives (SLOs). It explores the relationships between SLOs and other core functionalities within the stack, such as {ref}/transforms.html[transforms] and {ref}/ingest.html[ingest pipelines], highlighting how these integrations can impact the functionality of SLOs.

* <<slo-understanding-slos>>
* <<slo-common-problems>>
* <<slo-troubleshoot-actions>>
* <<slo-troubleshoot-beta>>

[discrete]
[[slo-understanding-slos]]
== Understanding SLO internals

[TIP]
====
If you’re already familiar with how SLOs work and their relationship with other system components, such as transforms and ingest pipelines, you can jump directly to <<slo-common-problems>>.
====

An SLO is represented by several system resources:

* *SLO Definition*: Stored as a Kibana Saved Object.
* *Transforms*: For each SLO, {kib} creates two transforms:
** *Rolling-up transform*: `slo-{slo.id}-{slo.revision}`, rolls up the data into a smaller set of documents. The source indices of this transform are defined by the SLO. The target index will be `.slo-observability.sli-v{slo.internal-version}-{monthly date}`.
** *Rolling-up ingest pipeline*: `slo-observability.sli.pipeline-{slo.id}-{slo.revision}`, used by the rolling-up transform.
** *Summarizing transform*: `slo-summary-{slo.id}-{slo.revision}`, updates the latest values, such as the observed SLI or remaining error budget, for efficient searching and filtering of SLOs. The source of this transform is `.slo-observability.sli-v{slo.internal-version}*`. The target index is `.slo-observability.summary-v{slo.internal-version}`.
** *Summarizing ingest pipeline*: `slo-observability.summary.pipeline-{slo.id}-{slo.revision}`, used by the summarizing transform.

* *Additional resources*: {kib} also installs and manages shared resources to the SLOs, including index templates, indices, and ingest pipelines, among others.

When an **SLO update** changes any of the `SLI parameters`, the `SLO objective`, or the `time window`, a revision bump (`{slo.revision}`) and a full reinstallation of the associated assets (transforms and ingest pipelines) occur. In addition, the revision bump deletes any previously aggregated data for that SLO. Updates to fields like `name`, `description`, or `tags` do not trigger a revision bump or asset reinstallation.

Ensuring that transforms are functioning correctly and that the cluster is healthy is crucial for maintaining accurate and reliable SLOs.

[discrete]
[[slo-common-problems]]
== Common problems

It's common for SLO problems to arise when there are underlying problems in the cluster, such as unavailable shards or failed transforms. Because SLOs rely on transforms to aggregate and process data, any failure or misconfiguration in these components can lead to inaccurate or incomplete SLO calculations. Additionally, unavailable shards can affect the data retrieval process, further complicating the reliability of SLO metrics.

[discrete]
[[slo-no-transform-ingest-node]]
=== No transform or ingest nodes

Because SLOs depend on both {ref}/ingest.html[ingest pipelines] and {ref}/transforms.html[transforms] to process the data, it's essential to ensure that the cluster has nodes with the appropriate {ref}/modules-node.html#node-roles[roles].

Ensure the cluster includes one or more nodes with both `ingest` and `transform` roles to support the data processing and transformations required for SLOs to function properly. The roles can exist on the same node or be distributed across separate nodes.

[discrete]
[[slo-transform-unhealthy]]
=== Unhealthy or missing transforms

When working with SLOs, it is crucial to ensure that the associated transforms function correctly. Transforms are responsible for generating the data needed for SLOs, and two transforms are created for each SLO. If you notice that your SLOs are not displaying the expected data, it's time to check the health of these associated transforms.

{kib} shows the following message when any of the associated transforms is in an unexpected state:

* `"The following transform is an unhealthy state"`, followed by a list of transforms.

For detailed guidance on diagnosing and resolving transform-related issues, refer to the {ref}/transform-troubleshooting.html[troubleshooting transforms] documentation .

It's also recommended that you perform the following transform checks:

* Ensure the transforms needed for the SLOs haven't been deleted or stopped.
+
If a transform has been deleted, the easiest way to recreate it is using the <<slo-troubleshoot-reset>> action, forcing the recreation of the transforms.
If a transform was stopped, try to start it, and then check the `health tab` of the transform.

* <<slo-troubleshoot-inspect>> to analyze the SLO definition and all associated resources.
+
Use the direct links offered by the **Inspect UI** and check that all referenced resources exist, as that's not verified by the inspect functionality.
+
Use the `query composite` content to verify if the queries performed by the transforms are valid and return the expected data.

* Check the source data and queries of the SLO.
+
The most common cause of legitimate transform failures is issues with the source data, such as timestamp parsing errors or incorrect query structures. The following is an example of an unparsable timestamp causing a transform to fail:
+
[source,bash]
----
"reason": """Failed to index documents into destination index due to permanent error:
[org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable
[unable to parse date [1702842480000]]. Other failures:
[IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException:
java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""",
"issue": "Transform task state is [failed]"
----

* As a last resort, consider <<slo-troubleshoot-reset, resetting the SLO>>.

[discrete]
[[slo-missing-pipeline]]
=== Missing Ingest Pipelines

If any of the needed ingest pipelines are missing, try the <<slo-troubleshoot-reset>> action.

[discrete]
[[slo-missing-template]]
=== Stack-related problems

As mentioned, maintaining a healthy cluster is crucial for SLOs to function correctly. The following examples show issues *unrelated to SLOs* that can still disrupt their proper operation. While troubleshooting these issues is outside the scope of this document, they are included for illustrative purposes.

* Problems accessing the source data, causing the transform to fail:
+
[source,bash]
----
Failed to execute phase [can_match], start; org.elasticsearch.action.search.SearchPhaseExecutionException:
Search rejected due to missing shards [[index_name_1][1], [index_name_2][1], [index_name_3][1]].
----

* Remote cluster not available, if for example an SLO is fetching data from a remote cluster called `remote-metrics`:
+
[source,sh]
----
Validation Failed: 1: no such remote cluster: [remote-metrics]
----

* {ref}/circuit-breaker-errors.html[Circuit breaker exceptions] due to nodes being under memory pressure.

[discrete]
[[slo-troubleshoot-actions]]
== SLO troubleshooting actions

[discrete]
[[slo-troubleshoot-inspect]]
=== Inspect SLO assets

To be able to inspect SLOs you have to activate the corresponding feature in {kib}:

. Open **Advanced Settings**, by finding **Stack Management** in the main menu or using the {kibana-ref}/introduction.html#kibana-navigation-search[global search field].
. Enable `observability:enableInspectEsQueries` setting.

Afterwards visit the *SLO edit page* and click *SLO Inspect*.

The *SLO Inspect* option provides a detailed report of an SLO, including:

* SLO configuration
* Rollup transform configuration
* Summary transform configuration
* Rollup ingest pipeline
* Summary ingest pipeline
* Temporary document
* Rollup transform query composite
* Summary transform query composite

These resources are very useful for tasks such as trying out the queries performed by the transforms and checking the IDs of all associated resources. The view also includes direct links to transforms and ingest pipelines sections in {kib}.

[discrete]
[[slo-troubleshoot-reset]]
=== Reset SLO

Resetting an SLO forces the deletion of all SLI data, summary data, and transforms, and then reinstalls and processes the data. Essentially, it recreates the SLO as if it had been deleted and re-created by the user.

[NOTE]
====
While resetting an SLO can help resolve certain issues, it may not always address the root cause of errors. Most errors related to transforms typically arise from improperly structured source data, such as unparsable timestamps, which prevent the transform from progressing. Additionally, incorrect formatted SLO queries, and consequently transform queries, can also lead to failures.
Before resetting the SLO, verify that the source data and queries are correctly formatted and validated. Resetting should only be used as a last resort when all other troubleshooting steps have been exhausted.
====

Follow these steps to reset an SLO:

. Find **SLOs** in the main menu or use the {kibana-ref}/introduction.html#kibana-navigation-search[global search field].
. Click on the SLO to reset.
. Select *Actions* → *Reset*.

Alternatively you can use {kib} API for the reset action:

[source,console]
----
POST kbn:/api/observability/slos/{sloId}∫/_reset
----

Where `sloId` can be obtained from the <<slo-troubleshoot-inspect>> action.

[discrete]
[[slo-api-calls]]
=== Using API calls to retrieve SLO details

Refer to https://www.elastic.co/docs/api/doc/kibana/v8/operation/operation-findslosop[SLO API calls] as an alternative to <<slo-troubleshoot-inspect,using SLO Inspect>>.

[discrete]
[[slo-troubleshoot-beta]]
== Upgrade from beta to GA

Starting in version 8.12.0, SLOs are generally available (GA).
If you're upgrading from a beta version of SLOs (available in 8.11.0 and earlier),
you must migrate your SLO definitions to a new format. Otherwise SLOs won't show up.

[%collapsible]
.Migrate your SLO definitions
====
To migrate your SLO definitions, open the SLO overview.
A banner will display the number of outdated SLOs detected.
For each outdated SLO, click **Reset**. If you no longer need the SLO, select **Delete**.
If you have a large number of SLO definitions, it is possible to automate this process.
To do this, you'll need to use two Elastic APIs:
* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (`/api/observability/slos/_definitions`)
* https://www.elastic.co/docs/api/doc/kibana/v8/operation/operation-resetsloop[SLO Reset API]
Pass in `includeOutdatedOnly=1` as a query parameter to the Definitions Find API.
This will display your outdated SLO definitions.
Loop through this list, one by one, calling the Reset API on each outdated SLO definition.
The Reset API loads the outdated SLO definition and resets it to the new format required for GA.
Once an SLO is reset, it will start to regenerate SLIs and summary data.
====

[%collapsible]
.Remove legacy summary transforms
====
After migrating to 8.12 or later, you might have some legacy SLO summary transforms running.
You can safely delete the following legacy summary transforms:
[source,sh]
----------------------------------
# Stop all legacy summary transforms
POST _transform/slo-summary-occurrences-30d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-7d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-90d-rolling/_stop?force=true
POST _transform/slo-summary-occurrences-monthly-aligned/_stop?force=true
POST _transform/slo-summary-occurrences-weekly-aligned/_stop?force=true
POST _transform/slo-summary-timeslices-30d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-7d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-90d-rolling/_stop?force=true
POST _transform/slo-summary-timeslices-monthly-aligned/_stop?force=true
POST _transform/slo-summary-timeslices-weekly-aligned/_stop?force=true
# Delete all legacy summary transforms
DELETE _transform/slo-summary-occurrences-30d-rolling?force=true
DELETE _transform/slo-summary-occurrences-7d-rolling?force=true
DELETE _transform/slo-summary-occurrences-90d-rolling?force=true
DELETE _transform/slo-summary-occurrences-monthly-aligned?force=true
DELETE _transform/slo-summary-occurrences-weekly-aligned?force=true
DELETE _transform/slo-summary-timeslices-30d-rolling?force=true
DELETE _transform/slo-summary-timeslices-7d-rolling?force=true
DELETE _transform/slo-summary-timeslices-90d-rolling?force=true
DELETE _transform/slo-summary-timeslices-monthly-aligned?force=true
DELETE _transform/slo-summary-timeslices-weekly-aligned?force=true
----------------------------------
Do not delete any new summary transforms used by your migrated SLOs.
====

0 comments on commit 64b89f9

Please sign in to comment.