diff --git a/docs/en/observability/apm/known-issues.asciidoc b/docs/en/observability/apm/known-issues.asciidoc index 72f8990b39..adc97e3c4b 100644 --- a/docs/en/observability/apm/known-issues.asciidoc +++ b/docs/en/observability/apm/known-issues.asciidoc @@ -24,10 +24,17 @@ _Versions: XX.XX.XX, YY.YY.YY, ZZ.ZZ.ZZ_ [discrete] == Upgrading to v8.15.x may cause ingestion to fail _Elastic Stack versions: 8.15.0+_ // The conditions in which this issue occurs The issue only occurs when _upgrading_ the {stack} from 8.12.2 or lower directly to any 8.15.x version. The issue does _not_ occur when creating a _new_ cluster using any 8.15.x version, or when upgrading from 8.12.2 to 8.13.x or 8.14.x and then to 8.15.x. @@ -42,7 +49,11 @@ related to https://github.com/elastic/elasticsearch/issues/112781[lazy rollover If the deployment is running 8.15.0, upgrade the deployment to 8.15.1 or above. A manual rollover of all APM data streams is required to pick up the new index templates and remove the faulty ingest pipeline version check.
Perform the following requests to Elasticsearch (these assume the `default` namespace is used; adjust if necessary): [source,txt] ---- POST /traces-apm-default/_rollover diff --git a/docs/en/observability/index.asciidoc b/docs/en/observability/index.asciidoc index 532198bf85..71d69cdd00 100644 --- a/docs/en/observability/index.asciidoc +++ b/docs/en/observability/index.asciidoc @@ -236,6 +236,8 @@ include::slo-privileges.asciidoc[leveloffset=+3] include::slo-create.asciidoc[leveloffset=+3] +include::slo-troubleshoot.asciidoc[leveloffset=+3] + //Data Set Quality include::logs-monitor-datasets.asciidoc[leveloffset=+1] diff --git a/docs/en/observability/slo-create.asciidoc b/docs/en/observability/slo-create.asciidoc index e1da49a152..c6b45cf70a 100644 --- a/docs/en/observability/slo-create.asciidoc +++ b/docs/en/observability/slo-create.asciidoc @@ -18,6 +18,12 @@ From here, complete the following steps: . <>. . <>. +[NOTE] +==== +For SLOs to function, the cluster must include one or more nodes with both `ingest` and `transform` {ref}/modules-node.html#node-roles[roles] (they can co-exist or be distributed across separate nodes). +On ESS deployments (Elastic Cloud), this is handled by the hot nodes, which serve as both `ingest` and `transform` nodes. +==== + [discrete] [[define-sli]] = Define your SLI diff --git a/docs/en/observability/slo-overview.asciidoc b/docs/en/observability/slo-overview.asciidoc index a9a9832e9f..54a8b2f4b3 100644 --- a/docs/en/observability/slo-overview.asciidoc +++ b/docs/en/observability/slo-overview.asciidoc @@ -8,7 +8,7 @@ // tag::slo-license[] [IMPORTANT] ==== -To create and manage SLOs, you need an {subscriptions}[appropriate license] and <> must be configured. +To create and manage SLOs, you need an {subscriptions}[appropriate license], an {es} cluster with both `transform` and `ingest` {ref}/modules-node.html#node-roles[node roles] present, and <> must be configured.
==== // end::slo-license[] @@ -29,6 +29,8 @@ SLO:: The target you set for your SLI. It specifies th Error budget:: The amount of time that your SLI can not meet the SLO target before it violates your SLO. Burn rate:: The rate at which your service consumes your error budget. +In addition to these key concepts related to SLO functionality, see <> for more information on how SLOs work and their relationship with other system components, such as {ref}/transforms.html[{es} Transforms]. + [discrete] [[slo-in-elastic]] == SLO overview @@ -94,61 +96,7 @@ Starting in version 8.12.0, SLOs are generally available (GA). If you're upgrading from a beta version of SLOs (available in 8.11.0 and earlier), you must migrate your SLO definitions to a new format. -[%collapsible] -.Migrate your SLO definitions -==== -To migrate your SLO definitions, open the SLO overview. -A banner will display the number of outdated SLOs detected. -For each outdated SLO, click **Reset**. If you no longer need the SLO, select **Delete**. - -If you have a large number of SLO definitions, it is possible to automate this process. -To do this, you'll need to use two Elastic APIs: - -* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (`/api/observability/slos/_definitions`) -* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L368-L410[SLO Reset API] (`/api/observability/slos/${id}/_reset`) - -Pass in `includeOutdatedOnly=1` as a query parameter to the Definitions Find API. -This will display your outdated SLO definitions. -Loop through this list, one by one, calling the Reset API on each outdated SLO definition. -The Reset API loads the outdated SLO definition and resets it to the new format required for GA. -Once an SLO is reset, it will start to regenerate SLIs and summary data. 
-==== - -[%collapsible] -.Remove legacy summary transforms -==== -After migrating to 8.12 or later, you might have some legacy SLO summary transforms running. -You can safely delete the following legacy summary transforms: - -[source,sh] ----------------------------------- -# Stop all legacy summary transforms -POST _transform/slo-summary-occurrences-30d-rolling/_stop?force=true -POST _transform/slo-summary-occurrences-7d-rolling/_stop?force=true -POST _transform/slo-summary-occurrences-90d-rolling/_stop?force=true -POST _transform/slo-summary-occurrences-monthly-aligned/_stop?force=true -POST _transform/slo-summary-occurrences-weekly-aligned/_stop?force=true -POST _transform/slo-summary-timeslices-30d-rolling/_stop?force=true -POST _transform/slo-summary-timeslices-7d-rolling/_stop?force=true -POST _transform/slo-summary-timeslices-90d-rolling/_stop?force=true -POST _transform/slo-summary-timeslices-monthly-aligned/_stop?force=true -POST _transform/slo-summary-timeslices-weekly-aligned/_stop?force=true - -# Delete all legacy summary transforms -DELETE _transform/slo-summary-occurrences-30d-rolling?force=true -DELETE _transform/slo-summary-occurrences-7d-rolling?force=true -DELETE _transform/slo-summary-occurrences-90d-rolling?force=true -DELETE _transform/slo-summary-occurrences-monthly-aligned?force=true -DELETE _transform/slo-summary-occurrences-weekly-aligned?force=true -DELETE _transform/slo-summary-timeslices-30d-rolling?force=true -DELETE _transform/slo-summary-timeslices-7d-rolling?force=true -DELETE _transform/slo-summary-timeslices-90d-rolling?force=true -DELETE _transform/slo-summary-timeslices-monthly-aligned?force=true -DELETE _transform/slo-summary-timeslices-weekly-aligned?force=true ----------------------------------- - -Do not delete any new summary transforms used by your migrated SLOs. -==== +Refer to <> for more details on how to proceed. 
[discrete] [[slo-overview-next-steps]] diff --git a/docs/en/observability/slo-privileges.asciidoc b/docs/en/observability/slo-privileges.asciidoc index 86b6c6df7b..2068bd4521 100644 --- a/docs/en/observability/slo-privileges.asciidoc +++ b/docs/en/observability/slo-privileges.asciidoc @@ -5,7 +5,7 @@ Configure SLO access ++++ -IMPORTANT: To create and manage SLOs, you need an {subscriptions}[appropriate license]. +IMPORTANT: To create and manage SLOs, you need an {subscriptions}[appropriate license] and an {es} cluster with both `transform` and `ingest` {ref}/modules-node.html#node-roles[node roles] present. You can enable access to SLOs in two different ways: diff --git a/docs/en/observability/slo-troubleshoot.asciidoc b/docs/en/observability/slo-troubleshoot.asciidoc new file mode 100644 index 0000000000..ad2584e253 --- /dev/null +++ b/docs/en/observability/slo-troubleshoot.asciidoc @@ -0,0 +1,404 @@ +[[slo-troubleshoot-slos]] += Troubleshoot service-level objectives (SLOs) + +++++ +Troubleshoot SLOs +++++ + +include::slo-overview.asciidoc[tag=slo-license] + +This document provides an overview of common issues encountered when working with service-level objectives (SLOs). It explores the relationships between SLOs and other core functionalities within the stack, such as {ref}/transforms.html[transforms] and {ref}/ingest.html[ingest pipelines], highlighting how these integrations can impact the functionality of SLOs. + +* <> +* <> +* <> + +[discrete] +[[slo-understanding-slos]] +== Understanding SLOs + +[TIP] +==== +If you’re already familiar with how SLOs work and their relationship with other system components, such as transforms and ingest pipelines, you can jump directly to <>. +==== + +An SLO is represented by several system resources: + +* *Definition*: Stored as a Kibana Saved Object. +* *Transforms*: For each SLO, {kib} creates two transforms: + * *Rolling-up transform*: Rolls up the data into a smaller set of documents. 
+ * *Summarizing transform*: Updates the latest values, such as the observed SLI or remaining error budget, for efficient searching and filtering of SLOs. +* *Additional resources*: {kib} also installs and manages shared resources for SLOs, including index templates, indices, and ingest pipelines, among others. + +The rollup documents are stored in an index named `.slo-observability.sli-v3` (the index is split per month by an ingest pipeline), while summary documents are stored in `.slo-observability.summary-v3`. + +Each time an SLO is updated, a new transform is created using the latest definition. The transform ID is generated by combining the SLO ID and the SLO revision, following the format: `slo-{slo.id}-{slo.revision}`. + +Ensuring that transforms are functioning correctly and that the cluster is healthy is crucial for maintaining accurate and reliable SLOs. + +(TBD: explain also the main pipelines associated with SLOs and their objectives?) +(TBD: anything to add about index templates or other indices being used?) + +[discrete] +[[slo-common-problems]] +== Common problems + +One of the most common issues with SLOs arises when there are underlying problems in the cluster, such as unavailable shards or failed transforms. Since SLOs rely on transforms to aggregate and process data, any failure or misconfiguration in these components can lead to inaccurate or incomplete SLO calculations. Additionally, unavailable shards can affect the data retrieval process, further complicating the reliability of SLO metrics. + +[discrete] +[[slo-no-transform-ingest-node]] +=== No transform or ingest nodes + +Since SLOs depend on both {ref}/ingest.html[ingest pipelines] and {ref}/transforms.html[transforms] to process the data, it's essential to ensure that the cluster has nodes with the appropriate {ref}/modules-node.html#node-roles[roles].
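A quick way to verify the role layout is `GET _cat/nodes?h=name,node.role`, which prints each node's role abbreviations (`i` = ingest, `t` = transform). The following Python sketch (hypothetical node names, offered as an illustration rather than part of the official tooling) shows how such output can be checked:

```python
def has_slo_node_roles(cat_nodes_output: str) -> bool:
    """Check `GET _cat/nodes?h=name,node.role` output for the roles SLOs need.

    Returns True when at least one node has the ingest role ('i') and at
    least one node has the transform role ('t'); they may be the same node.
    """
    has_ingest = has_transform = False
    for line in cat_nodes_output.strip().splitlines():
        _name, roles = line.split()
        has_ingest = has_ingest or "i" in roles
        has_transform = has_transform or "t" in roles
    return has_ingest and has_transform

# Hypothetical _cat output: node-1 carries both roles, node-2 holds data only.
sample = """node-1 himrst
node-2 cdfw"""
print(has_slo_node_roles(sample))  # True
```

If this check fails, add the missing role to an existing node or provision a new node before creating SLOs.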
+ +Ensure the cluster includes one or more nodes with both `ingest` and `transform` roles (they can co-exist or be distributed across separate nodes) to support the data processing and transformations required for SLOs to function properly. + +[discrete] +[[slo-transform-unhealthy]] +=== Unhealthy transforms + +When working with SLOs, ensuring that the associated transforms function correctly is crucial. Transforms are responsible for generating the data needed for SLOs, and typically, two transforms are created for each SLO. If you notice that your SLOs are not displaying the expected data, check the health of the associated transforms. + +{kib} shows the following message when any of the associated transforms is in an unexpected state: + +* `"The following transform is an unhealthy state"`, followed by a list of transforms, as shown in the picture: +(TBD: add screenshot of what the unhealthy transform report looks like, the warning introduced in 8.15) + +In the case of reported **unhealthy transforms**, refer to the {ref}/transform-troubleshooting.html[troubleshooting transforms] documentation for detailed guidance on diagnosing and resolving transform-related issues. + +Additionally, it is recommended to perform the following **transform checks**: + +* Ensure the transforms needed for the SLOs haven't been deleted or stopped. ++ +If a transform has been deleted, the easiest way to recreate it is to use the <> action, which forces the recreation of the transforms. +If a transform was stopped, try starting it again. + +* <> to analyze the SLO definition and all associated resources. ++ +Use the direct links offered by the **inspect UI** and check that all referenced resources exist, as that's not verified by the inspect functionality. ++ +Use the `query composite` content to verify that the queries performed by the transforms are valid and return the expected data.
+ +* Check the source data and queries of the SLO. ++ +The most common cause of legitimate transform failures is issues with the source data, such as timestamp parsing errors or incorrect query structures. + +* As a last resort, consider <>. + +[discrete] +[[slo-missing-pipeline]] +=== Missing ingest pipelines + +(decide what to do here) + +[discrete] +[[slo-missing-template]] +=== Missing index templates + +(decide what to do here) + +[discrete] +[[slo-missing-indices]] +=== Missing indices or shards + +(decide what to do here. I'm sharing error examples I have collected to see if it makes sense to offer some background and context for issues that are not really related to SLO logic but to other parts of the stack). + +Other examples: +> Failed to execute phase [can_match], start; org.elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.ds-metrics-apm.internal-default-2024.06.08-000030][1], [.ds-metrics-apm.service_transaction.1m-default-2024.06.07-000023][1], [.ds-metrics-apm.transaction.1m-default-2024.06.07-000024][1]]. Consider using `allow_partial_search_results` setting to bypass this error. + +another (unavailable remote cluster (CCS)) +> Validation Failed: 1: no such remote cluster: [metrics];2: no such remote cluster: [metrics]; + +> Some transform failures can be totally unrelated to SLO/O11y logic and instead caused by the platform (example: circuit breaker exceptions due to low memory on the ES side). + +[source,bash] +---- + "reason": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]].
Other failures: +[IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""", + + "issue": "Transform task state is [failed]", + "details": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]]. Other failures: +[IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""", + "count": 1 +---- + +[discrete] +[[slo-troubleshoot-actions]] +== SLO troubleshooting actions + +[discrete] +[[slo-troubleshoot-inspect]] +=== Inspect SLO assets + +To be able to inspect SLOs, activate the following advanced setting in {kib}: + +. Open {kib}'s *Stack Management* -> *Advanced Settings*. +. Enable `observability:enableInspectEsQueries`. + +Afterwards, visit the *SLO edit page* and click *SLO Inspect* at the bottom of the page. + +The *SLO Inspect* option provides a detailed report of an SLO, including: + +* SLO configuration +* Rollup transform configuration +* Summary transform configuration +* Rollup ingest pipeline +* Summary ingest pipeline +* Temporary document +* Rollup transform query composite +* Summary transform query composite + +These resources are useful, for example, for trying out the queries performed by the transforms and checking the `ids` of all associated resources. The view also includes direct links to the transforms and ingest pipelines sections of {kib}. + +(TBD: should we add a screenshot here?)
+ +[discrete] +[[slo-troubleshoot-reset]] +=== Reset SLO + +Resetting an SLO forces the deletion of all SLI data, summary data, and transforms, and then reinstalls and processes the data. Essentially, it recreates the SLO as if it had been deleted and re-created by the user. + +[NOTE] +==== +While resetting an SLO can help resolve certain issues, it may not always address the root cause of errors. Most errors related to transforms typically arise from improperly structured source data, such as unparseable timestamps, which prevent the transform from progressing. Additionally, incorrectly formatted SLO queries, and consequently transform queries, can also lead to failures. + +Therefore, before resetting the SLO, verify that the source data and queries are correctly formatted and validated. Resetting should only be used as a last resort when all other troubleshooting steps have been exhausted. +==== + +Follow these steps to reset an SLO: + +. Open *Observability* -> *SLOs*. +. Click the SLO to reset. +. Select *Actions* -> *Reset*. + +Alternatively, you can use the {kib} API for the reset action: + +[source,console] +---- +POST kbn:/api/observability/slos/{sloId}/_reset +---- + +Where `sloId` can be obtained from the <> action. + +[discrete] +[[slo-troubleshoot-beta]] +== Upgrade from beta to GA + +Starting in version 8.12.0, SLOs are generally available (GA). +If you're upgrading from a beta version of SLOs (available in 8.11.0 and earlier), +you must migrate your SLO definitions to a new format. Otherwise, SLOs won't show up. + +[%collapsible] +.Migrate your SLO definitions +==== +To migrate your SLO definitions, open the SLO overview. +A banner will display the number of outdated SLOs detected. +For each outdated SLO, click **Reset**. If you no longer need the SLO, select **Delete**. + +If you have a large number of SLO definitions, it is possible to automate this process.
+To do this, you'll need to use two Elastic APIs: + +* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L453-L514[SLO Definitions Find API] (`/api/observability/slos/_definitions`) +* https://github.com/elastic/kibana/blob/9cb830fe9a021cda1d091effbe3e0cd300220969/x-pack/plugins/observability/docs/openapi/slo/bundled.yaml#L368-L410[SLO Reset API] (`/api/observability/slos/${id}/_reset`) + +Pass in `includeOutdatedOnly=1` as a query parameter to the Definitions Find API. +This will display your outdated SLO definitions. +Loop through this list, one by one, calling the Reset API on each outdated SLO definition. +The Reset API loads the outdated SLO definition and resets it to the new format required for GA. +Once an SLO is reset, it will start to regenerate SLIs and summary data. +==== + +[%collapsible] +.Remove legacy summary transforms +==== +After migrating to 8.12 or later, you might have some legacy SLO summary transforms running. 
+You can safely delete the following legacy summary transforms: + +[source,sh] +---------------------------------- +# Stop all legacy summary transforms +POST _transform/slo-summary-occurrences-30d-rolling/_stop?force=true +POST _transform/slo-summary-occurrences-7d-rolling/_stop?force=true +POST _transform/slo-summary-occurrences-90d-rolling/_stop?force=true +POST _transform/slo-summary-occurrences-monthly-aligned/_stop?force=true +POST _transform/slo-summary-occurrences-weekly-aligned/_stop?force=true +POST _transform/slo-summary-timeslices-30d-rolling/_stop?force=true +POST _transform/slo-summary-timeslices-7d-rolling/_stop?force=true +POST _transform/slo-summary-timeslices-90d-rolling/_stop?force=true +POST _transform/slo-summary-timeslices-monthly-aligned/_stop?force=true +POST _transform/slo-summary-timeslices-weekly-aligned/_stop?force=true + +# Delete all legacy summary transforms +DELETE _transform/slo-summary-occurrences-30d-rolling?force=true +DELETE _transform/slo-summary-occurrences-7d-rolling?force=true +DELETE _transform/slo-summary-occurrences-90d-rolling?force=true +DELETE _transform/slo-summary-occurrences-monthly-aligned?force=true +DELETE _transform/slo-summary-occurrences-weekly-aligned?force=true +DELETE _transform/slo-summary-timeslices-30d-rolling?force=true +DELETE _transform/slo-summary-timeslices-7d-rolling?force=true +DELETE _transform/slo-summary-timeslices-90d-rolling?force=true +DELETE _transform/slo-summary-timeslices-monthly-aligned?force=true +DELETE _transform/slo-summary-timeslices-weekly-aligned?force=true +---------------------------------- + +Do not delete any new summary transforms used by your migrated SLOs. +==== + +[discrete] +[[slo-api-calls]] +== Using API calls to retrieve SLO details + +TBD: determine if we need this section or not. I think it's NOT needed, as SLO Inspect offers all details already. 
+ +The following {kib} API calls are useful to retrieve different levels of detail about the SLOs and surrounding components. + +[discrete] +[[slo-api-find]] +=== Find SLO definitions + +You can achieve this in multiple ways: + +* From Saved Objects + +The following query returns the stored SLO definitions. SLOs, and therefore this API, are space aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/saved_objects/_find?type=slo +---------------------------------- + +* Through the _definitions API + +The following internal API returns the SLO definitions. It is space aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/observability/slos/_definitions +---------------------------------- + +* Through the slos API + +The following public API returns the total number of SLOs, including the group-by instances. It is space aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/observability/slos +---------------------------------- + +* Through the UI + +Users can also get the total number of SLOs through the SLO UI: the SLO Overview page displays the total number of SLOs. + +* Via the raw Kibana index + +[source,console] +---------------------------------- +GET .kibana*/_search +{ + "size": 10, // adjust this + "query": { + "term": { + "type": { + "value": "slo" + } + } + } +} +---------------------------------- + + +[discrete] +[[slo-api-find-specific]] +=== Find the definition for a specific SLO + +The following internal API returns the SLO definition for a specific SLO, filtered by the name of the SLO: + +[source,console] +---------------------------------- +GET kbn:/api/observability/slos/_definitions?search=Some SLO +---------------------------------- + + + +[discrete] +[[slo-api-find-rollup]] +=== Find rollup SLO transforms + +Each SLO creates a rollup transform, and every time you update the SLO a new transform is created with the latest definition.
+ +The transform ID is built from the SLO ID and the SLO revision as `slo-{slo.id}-{slo.revision}`. + +Fetch a specific transform for a given SLO using this call: + +[source,console] +---------------------------------- +GET _transform/slo-{id}-{revision} +---------------------------------- + +You can also fetch all SLO transforms using: + +[source,console] +---------------------------------- +GET _transform/slo-* +---------------------------------- + +[discrete] +[[slo-api-rollup-documents]] +=== Search the rollup documents for an SLO + +It can be useful to fetch the latest rollup document for a given SLO ID and, optionally, an instance ID, when investigating why an SLO has shown no data for too long. + +[source,console] +---------------------------------- +POST .slo-observability.sli-v3*/_search +{ + "sort": [ + { + "event.ingested": { + "order": "desc" + } + } + ], + "query": { + "bool": { + "filter": [ + { + "term": { + "slo.id": "id" + } + }, + { + "term": { + "slo.instanceId": "instanceId" + } + } + ] + } + } +} +---------------------------------- + +[discrete] +[[slo-api-summary-documents]] +=== Search the summary documents for an SLO + +It can be useful to fetch the latest summary document for a given SLO ID and, optionally, an instance ID: + +[source,console] +---------------------------------- +POST .slo-observability.summary-v3*/_search +{ + "query": { + "bool": { + "filter": [ + { + "term": { + "slo.id": "id" + } + }, + { + "term": { + "slo.instanceId": "instanceId" + } + } + ] + } + } +} +---------------------------------- \ No newline at end of file diff --git a/docs/en/serverless/alerting/create-custom-threshold-alert-rule.mdx b/docs/en/serverless/alerting/create-custom-threshold-alert-rule.mdx new file mode 100644 index 0000000000..fdd3d70a2b --- /dev/null +++ b/docs/en/serverless/alerting/create-custom-threshold-alert-rule.mdx @@ -0,0 +1,235 @@ +--- +slug: /serverless/observability/create-custom-threshold-alert-rule +title: Create a custom threshold rule +description: Get alerts when an Observability data type reaches a given value.
+tags: [ 'serverless', 'observability', 'how-to', 'alerting' ] +--- + +

+ +import Connectors from './alerting-connectors.mdx' + +import Roles from '../partials/roles.mdx' + + + +Create a custom threshold rule to trigger an alert when an ((observability)) data type reaches or exceeds a given value. + +1. To access this page, from your project go to **Alerts**. +1. Click **Manage Rules** -> **Create rule**. +1. Under **Select rule type**, select **Custom threshold**. + +![Rule details (custom threshold)](../images/custom-threshold-rule.png) + +
+ +## Define rule data + +Specify the following settings to define the data the rule applies to: + +* **Select a data view:** Click the data view field to search for and select a data view that points to the indices or data streams that you're creating a rule for. You can also create a _new_ data view by clicking **Create a data view**. Refer to [Create a data view](((kibana-ref))/data-views.html) for more on creating data views. +* **Define query filter (optional):** Use a query filter to narrow down the data that the rule applies to. For example, set a query filter to a specific host name using the query filter `host.name:host-1` to only apply the rule to that host. + +
+ +## Set rule conditions + +Set the conditions for the rule to detect using aggregations, an equation, and a threshold. + +
+ +### Set aggregations + +Aggregations summarize your data to make it easier to analyze. +Set any of the following aggregation types to gather data to create your rule: +`Average`, `Max`, `Min`, `Cardinality`, `Count`, `Sum`, `Percentile`, or `Rate`. +For more information about these options, refer to . + +For example, to gather the total number of log documents with a log level of `warn`: + +1. Set the **Aggregation** to **Count**, and set the **KQL Filter** to `log.level: "warn"`. +1. Set the threshold to `IS ABOVE 100` to trigger an alert when the number of log documents with a log level of `warn` reaches 100.
+ +### Set the equation and threshold + +Set an equation using your aggregations. Based on the results of your equation, set a threshold to define when to trigger an alert. The equations use basic math or boolean logic. Refer to the following examples for possible use cases. + +
+ +### Basic math equation + +Add, subtract, multiply, or divide your aggregations to define conditions for alerting. + +**Example:** +Set an equation and threshold to trigger an alert when a metric is above a threshold. For this example, we'll use average CPU usage—the percentage of CPU time spent in states other than `idle` or `IOWait` normalized by the number of CPU cores—and trigger an alert when CPU usage is above a specific percentage. To do this, set the following aggregations, equation, and threshold: + +1. Set the following aggregations: + * **Aggregation A:** Average `system.cpu.user.pct` + * **Aggregation B:** Average `system.cpu.system.pct` + * **Aggregation C:** Max `system.cpu.cores`. +1. Set the equation to `(A + B) / C * 100` +1. Set the threshold to `IS ABOVE 95` to alert when CPU usage is above 95%. + +
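As a quick sanity check of the arithmetic, the equation above can be evaluated directly (the sample values below are made up for illustration; the rule engine performs this calculation for you):

```python
def cpu_usage_pct(avg_user_pct: float, avg_system_pct: float, max_cores: int) -> float:
    """Evaluate the example equation (A + B) / C * 100.

    A and B are the averaged *.pct fields (which can exceed 1.0 on
    multi-core hosts); C is the maximum reported core count.
    """
    return (avg_user_pct + avg_system_pct) / max_cores * 100

# Made-up sample: 0.75 user + 0.5 system across 2 cores -> 62.5% usage.
usage = cpu_usage_pct(0.75, 0.5, 2)
print(usage)        # 62.5
print(usage > 95)   # False: 62.5 does not breach IS ABOVE 95
```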
+ +### Boolean logic + +Use conditional operators and comparison operators with your aggregations to define conditions for alerting. + +**Example:** +Set an equation and threshold to trigger an alert when the number of stateful pods differs from the number of desired pods. For this example, we'll use `kubernetes.statefulset.ready` and `kubernetes.statefulset.desired`, and trigger an alert when their values differ. To do this, set the following aggregations, equation, and threshold: + +1. Set the following aggregations: + * **Aggregation A:** Sum `kubernetes.statefulset.ready` + * **Aggregation B:** Sum `kubernetes.statefulset.desired` +1. Set the equation to `A == B ? 1 : 0`. If A and B are equal, the result is `1`. If they're not equal, the result is `0`. +1. Set the threshold to `IS BELOW 1` to trigger an alert when the result is `0` and the field values do not match.
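The ternary above can be sketched in Python to confirm how the threshold interacts with the equation result (an illustration of the logic only, not the rule engine's implementation):

```python
def statefulset_equation(ready_sum: int, desired_sum: int) -> int:
    """Evaluate the example equation A == B ? 1 : 0."""
    return 1 if ready_sum == desired_sum else 0

# 3 ready pods vs 3 desired: result 1, which is not below 1 -> no alert.
print(statefulset_equation(3, 3))  # 1
# 2 ready pods vs 3 desired: result 0, which is below 1 -> alert triggers.
print(statefulset_equation(2, 3))  # 0
```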
+ +## Preview chart + +The preview chart provides a visualization of how many entries match your configuration. +The shaded area shows the threshold you've set. + +
+ +## Group alerts by (optional) + +Set one or more **group alerts by** fields for custom threshold rules to perform a composite aggregation against the selected fields. +When any of these groups match the selected rule conditions, an alert is triggered _per group_. + +When you select multiple groupings, the group name is separated by commas. + +For example, if you group alerts by the `host.name` and `host.architecture` fields, and there are two hosts (`Host A` and `Host B`) and two architectures (`Architecture A` and `Architecture B`), the composite aggregation forms multiple groups. + +If the `Host A, Architecture A` group matches the rule conditions, but the `Host B, Architecture B` group doesn't, one alert is triggered for `Host A, Architecture A`. + +If you select one field—for example, `host.name`—and `Host A` matches the conditions but `Host B` doesn't, one alert is triggered for `Host A`. +If both groups match the conditions, alerts are triggered for both groups. + +## Trigger "no data" alerts (optional) + +Optionally configure the rule to trigger an alert when: + +* there is no data, or +* a group that was previously detected stops reporting data. + +To do this, select **Alert me if there's no data**. + +The behavior of the alert depends on whether any **group alerts by** fields are specified: + +* **No "group alerts by" fields**: (Default) A "no data" alert is triggered if the condition fails to report data over the expected time period, or the rule fails to query ((es)). This alert means that something is wrong and there is not enough data to evaluate the related threshold. + +* **Has "group alerts by" fields**: If a previously detected group stops reporting data, a "no data" alert is triggered for the missing group. + + For example, consider a scenario where `host.name` is the **group alerts by** field for CPU usage above 80%. The first time the rule runs, two hosts report data: `host-1` and `host-2`. 
The second time the rule runs, `host-1` does not report any data, so a "no data" alert is triggered for `host-1`. When the rule runs again, if `host-1` starts reporting data again, there are a couple of possible scenarios: + + * If `host-1` reports data for CPU usage and it is above the threshold of 80%, no new alert is triggered. + Instead, the existing alert changes from "no data" to a triggered alert that breaches the threshold. + Keep in mind that no notifications are sent in this case because there is still an ongoing issue. + * If `host-1` reports CPU usage below the threshold of 80%, the alert status is changed to recovered. + + + If a host (for example, `host-1`) is decommissioned, you probably no longer want to see "no data" alerts about it. + To mark an alert as untracked: + Go to the Alerts table, click the icon to expand the "More actions" menu, and click *Mark as untracked*. + + +## Add actions + +You can extend your rules with actions that interact with third-party systems, write to logs or indices, or send user notifications. You can add an action to a rule at any time. You can create rules without adding actions, and you can also define multiple actions for a single rule. + +To add actions to rules, you must first create a connector for that service (for example, an email or external incident management system), which you can then use for different rules, each with its own action frequency. + + +Connectors provide a central place to store connection information for services and integrations with third-party systems. +The following connectors are available when defining actions for alerting rules: + + + +For more information on creating connectors, refer to Connectors. + + + + +After you select a connector, you must set the action frequency. +You can choose to create a summary of alerts on each check interval or on a custom interval.
+Alternatively, you can set the action frequency such that you choose how often the action runs (for example, +at each check interval, only when the alert status changes, or at a custom action interval). +In this case, you must also select the specific threshold condition that affects when actions run: `Alert`, `No Data`, or `Recovered`. + +![Configure when a rule is triggered](../images/custom-threshold-run-when.png) + +You can also further refine the conditions under which actions run by specifying that actions only run when they match a KQL query or when an alert occurs within a specific time frame: + +- **If alert matches query**: Enter a KQL query that defines field-value pairs or query conditions that must be met for notifications to send. The query only searches alert documents in the indices specified for the rule. +- **If alert is generated during timeframe**: Set timeframe details. Notifications are only sent if alerts are generated within the timeframe you define. + +![Configure a conditional alert](../images/logs-threshold-conditional-alert.png) + + + + +Use the default notification message or customize it. +You can add more context to the message by clicking the Add variable icon and selecting from a list of available variables. + +![Action variables list](../images/action-variables-popup.png) + +The following variables are specific to this rule type. +You can also specify [variables common to all rules](((kibana-ref))/rule-action-variables.html). + + + `context.alertDetailsUrl` + + Link to the alert troubleshooting view for further context and details. This will be an empty string if the `server.publicBaseUrl` is not configured. + + `context.cloud` + + The cloud object defined by ECS if available in the source. + + `context.container` + + The container object defined by ECS if available in the source. + + `context.group` + + The object containing groups that are reporting data. 
+ + `context.host` + + The host object defined by ECS if available in the source. + + `context.labels` + + List of labels associated with the entity where this alert triggered. + + `context.orchestrator` + + The orchestrator object defined by ECS if available in the source. + + `context.reason` + + A concise description of the reason for the alert. + + `context.tags` + + List of tags associated with the entity where this alert triggered. + + `context.timestamp` + + A timestamp of when the alert was detected. + + `context.value` + + List of the condition values. + + `context.viewInAppUrl` + + Link to the alert source. + + + + diff --git a/docs/en/serverless/alerting/synthetic-monitor-status-alert.mdx b/docs/en/serverless/alerting/synthetic-monitor-status-alert.mdx new file mode 100644 index 0000000000..38dc6de731 --- /dev/null +++ b/docs/en/serverless/alerting/synthetic-monitor-status-alert.mdx @@ -0,0 +1,120 @@ +--- +slug: /serverless/observability/monitor-status-alert +title: Create a synthetic monitor status rule +description: Get alerts based on the status of synthetic monitors. +tags: [ 'serverless', 'observability', 'how-to', 'alerting' ] +--- + +import Connectors from './alerting-connectors.mdx' + +Within the Synthetics UI, create a **Monitor Status** rule to receive notifications +based on errors and outages. + +1. To access this page, go to **Synthetics** → **Overview**. +1. At the top of the page, click **Alerts and rules** → **Monitor status rule** → **Create status rule**. + +## Filters + +The **Filter by** section controls the scope of the rule. +The rule will only check monitors that match the filters defined in this section. +In this example, the rule will only alert on `browser` monitors located in `Asia/Pacific - Japan`. 
+ +![Filter by section of the Synthetics monitor status rule](../images/synthetic-monitor-filters.png) + +## Conditions + +Conditions for each rule will be applied to all monitors that match the filters in the [**Filter by** section](#filters). +You can choose the number of times the monitor has to be down relative to either a number of checks run +or a time range in which checks were run, and the minimum number of locations the monitor must be down in. + + + Retests are included in the number of checks. + + +The **Rule schedule** defines how often to evaluate the condition. Note that checks are queued, and they run as close +to the defined value as capacity allows. For example, if a check is scheduled to run every 2 minutes, but the check +takes longer than 2 minutes to run, a check will not run until the previous check has finished. + +You can also set **Advanced options** such as the number of consecutive runs that must meet the rule conditions before +an alert occurs. + +In this example, the conditions will be met any time a `browser` monitor is down `3` of the last `5` times +the monitor ran across any locations that match the filter. These conditions will be evaluated every minute, +and you will only receive an alert when the conditions are met three times consecutively. + +![Filters and conditions defining a Synthetics monitor status rule](../images/synthetic-monitor-conditions.png) + +## Action types + +Extend your rules by connecting them to actions that use the following supported built-in integrations. + + + +After you select a connector, you must set the action frequency. +You can choose to create a summary of alerts on each check interval or on a custom interval. 
+For example, send email notifications that summarize the new, ongoing, and recovered alerts each hour: + +![](../images/synthetic-monitor-action-types-summary.png) + +Alternatively, you can set the action frequency such that you choose how often the action runs +(for example, at each check interval, only when the alert status changes, or at a custom action interval). +In this case, you must also select the specific threshold condition that affects when actions run: +when the _Synthetics monitor status_ changes or when it is _Recovered_ (went from down to up). + +![](../images/synthetic-monitor-action-types-each-alert.png) + +You can also further refine the conditions under which actions run by specifying that actions only run +when they match a KQL query or when an alert occurs within a specific time frame: + +* **If alert matches query**: Enter a KQL query that defines field-value pairs or query conditions that must + be met for notifications to send. The query only searches alert documents in the indices specified for the rule. +* **If alert is generated during timeframe**: Set timeframe details. Notifications are only sent if alerts are + generated within the timeframe you define. + +![](../images/synthetic-monitor-action-types-more-options.png) + +### Action variables + +Use the default notification message or customize it. +You can add more context to the message by clicking the icon above the message text box +and selecting from a list of available variables. + +![](../images/synthetic-monitor-action-variables.png) + +The following variables are specific to this rule type. +You can also specify [variables common to all rules](((kibana-ref))/rule-action-variables.html). + + + `context.checkedAt` + Timestamp of the monitor run. + `context.hostName` + Hostname of the location from which the check is performed. + `context.lastErrorMessage` + Monitor last error message. + `context.locationId` + Location ID from which the check is performed.
+ `context.locationName` + Location name from which the check is performed. + `context.locationNames` + Location names from which the checks are performed. + `context.message` + A generated message summarizing the status of monitors currently down. + `context.monitorId` + ID of the monitor. + `context.monitorName` + Name of the monitor. + `context.monitorTags` + Tags associated with the monitor. + `context.monitorType` + Type (for example, HTTP/TCP) of the monitor. + `context.monitorUrl` + URL of the monitor. + `context.reason` + A concise description of the reason for the alert. + `context.recoveryReason` + A concise description of the reason for the recovery. + `context.status` + Monitor status (for example, "down"). + `context.viewInAppUrl` + Open alert details and context in the Synthetics app. + \ No newline at end of file diff --git a/docs/en/serverless/infra-monitoring/container-metrics.mdx new file mode 100644 index 0000000000..ad6bedc4ae --- /dev/null +++ b/docs/en/serverless/infra-monitoring/container-metrics.mdx @@ -0,0 +1,186 @@ +--- +slug: /serverless/observability/container-metrics +title: Container metrics +description: Learn about key container metrics used for container monitoring. +tags: [ 'serverless', 'observability', 'reference' ] +--- + +

+ +
+ +Learn about key container metrics displayed in the Infrastructure UI: + +* Docker +* Kubernetes + + +
+ +## Docker container metrics + +These are the key metrics displayed for Docker containers. + +
+ +### CPU usage metrics + + + + **CPU Usage (%)** + + Average CPU usage for the container. + + **Field Calculation:** `average(docker.cpu.total.pct)` + + + + +
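As a rough illustration of the field calculation above, an `average(docker.cpu.total.pct)`-style aggregation reduces the sampled values of the field to a single mean. The following Python sketch (with invented sample values) shows the arithmetic:

```python
# Illustration of an "average(docker.cpu.total.pct)"-style reduction.
# The sample values below are invented for demonstration.

def average(samples):
    """Mean of the sampled metric values (0.0 for an empty sample set)."""
    return sum(samples) / len(samples) if samples else 0.0

# docker.cpu.total.pct samples collected over the selected time range,
# expressed as fractions of total CPU (0.42 == 42%)
cpu_pct_samples = [0.42, 0.55, 0.47, 0.60]

print(f"CPU Usage (%): {average(cpu_pct_samples) * 100:.1f}")  # prints 51.0
```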
+ +### Memory metrics + + + + **Memory Usage (%)** + + Average memory usage for the container. + + **Field Calculation:** `average(docker.memory.usage.pct)` + + + + +
+ +### Network metrics + + + + **Inbound Traffic (RX)** + + Derivative of the maximum of `docker.network.inbound.bytes` scaled to a 1-second rate. + + **Field Calculation:** `average(docker.network.inbound.bytes) * 8 / (max(metricset.period, kql='docker.network.inbound.bytes: *') / 1000)` + + + + **Outbound Traffic (TX)** + + Derivative of the maximum of `docker.network.outbound.bytes` scaled to a 1-second rate. + + **Field Calculation:** `average(docker.network.outbound.bytes) * 8 / (max(metricset.period, kql='docker.network.outbound.bytes: *') / 1000)` + + + + +### Disk metrics + + + + **Disk Read IOPS** + + Average count of read operations from the device per second. + + **Field Calculation:** `counter_rate(max(docker.diskio.read.ops), kql='docker.diskio.read.ops: *')` + + + + **Disk Write IOPS** + + Average count of write operations from the device per second. + + **Field Calculation:** `counter_rate(max(docker.diskio.write.ops), kql='docker.diskio.write.ops: *')` + + + + +
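The inbound and outbound traffic formulas above convert an average byte count per collection period into a bits-per-second rate: `metricset.period` is in milliseconds (so dividing by 1000 yields seconds), and multiplying by 8 converts bytes to bits. A small Python sketch of that arithmetic, using invented sample values:

```python
# Sketch of the traffic-rate arithmetic shown above:
#   average(bytes) * 8 / (max(metricset.period) / 1000)
# metricset.period is in milliseconds; dividing by 1000 yields seconds,
# and multiplying by 8 converts bytes to bits. Sample values are invented.

def traffic_rate_bps(byte_samples, period_ms_samples):
    """Average bytes per collection period, converted to bits per second."""
    avg_bytes = sum(byte_samples) / len(byte_samples)
    period_seconds = max(period_ms_samples) / 1000  # ms -> s
    return avg_bytes * 8 / period_seconds

# ~125 kB observed per 10-second (10,000 ms) collection period
print(traffic_rate_bps([125_000, 130_000, 120_000], [10_000, 10_000]))  # 100000.0
```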
+ +## Kubernetes container metrics + +These are the key metrics displayed for Kubernetes (containerd) containers. + +
+ +### CPU usage metrics + + + + **CPU Usage (%)** + + Average CPU usage for the container. + + **Field Calculation:** `average(kubernetes.container.cpu.usage.limit.pct)` + + + + +
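Judging by the field name (an assumption, not something stated above), `kubernetes.container.cpu.usage.limit.pct` expresses CPU usage relative to the container's configured CPU limit. A hypothetical Python sketch of that ratio:

```python
# Hypothetical illustration: CPU usage expressed as a fraction of the
# container's CPU limit. The core counts below are invented.

def usage_vs_limit(usage_cores, limit_cores):
    """CPU usage as a fraction of the configured CPU limit."""
    return usage_cores / limit_cores

# a container using 0.25 of its 0.5-core limit -> 50%
print(f"CPU Usage (%): {usage_vs_limit(0.25, 0.5) * 100:.0f}")  # prints 50
```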
+ +### Memory metrics + + + + **Memory Usage (%)** + + Average memory usage for the container. + + **Field Calculation:** `average(kubernetes.container.memory.usage.limit.pct)` + + + \ No newline at end of file diff --git a/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc index 8a06aa4334..3041324c76 100644 --- a/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc +++ b/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc @@ -6,12 +6,23 @@ preview:[] The **Infrastructure Inventory** page provides a metrics-driven view of your entire infrastructure grouped by the resources you are monitoring. All monitored resources emitting a core set of infrastructure metrics are displayed to give you a quick view of the overall health of your infrastructure. To access the **Infrastructure Inventory** page, in your {observability} project, go to **Infrastructure inventory**. [role="screenshot"] @@ -59,6 +70,11 @@ To examine the metrics for a specific time, use the time filter to select the da [[analyze-hosts-inventory]] == View host metrics By default the **Infrastructure Inventory** page displays a waffle map that shows the hosts you are monitoring and the current CPU usage for each host.
Alternatively, you can click the **Table view** icon image:images/table-view-icon.png[Table view icon] diff --git a/docs/en/serverless/serverless-observability.docnav.json b/docs/en/serverless/serverless-observability.docnav.json new file mode 100644 index 0000000000..2172810dce --- /dev/null +++ b/docs/en/serverless/serverless-observability.docnav.json @@ -0,0 +1,662 @@ +{ + "mission": "Elastic Observability", + "id": "serverless-observability", + "landingPageSlug": "/serverless/observability/what-is-observability-serverless", + "icon": "logoObservability", + "description": "Description to be written", + "items": [ + { + "slug": "/serverless/observability/serverless-observability-overview", + "classic-sources": [ "enObservabilityObservabilityIntroduction" ], + "classic-skip": true + }, + { + "slug": "/serverless/observability/quickstarts/overview", + "items": [ + { + "slug": "/serverless/observability/quickstarts/monitor-hosts-with-elastic-agent" + }, + { + "slug": "/serverless/observability/quickstarts/k8s-logs-metrics" + } + ] + }, + { + "slug": "/serverless/observability/observability-billing" + }, + { + "label": "Create an Observability project", + "slug": "/serverless/observability/create-an-observability-project" + }, + { + "slug": "/serverless/observability/log-monitoring", + "classic-sources": ["enObservabilityLogsObservabilityOverview"], + "items": [ + { + "slug": "/serverless/observability/get-started-with-logs" + }, + { + "slug": "/serverless/observability/stream-log-files", + "classic-sources": ["enObservabilityLogsStream"] + }, + { + "slug": "/serverless/observability/correlate-application-logs", + "classic-sources": [ "enObservabilityApplicationLogs" ], + "items": [ + { + "slug": "/serverless/observability/plaintext-application-logs", + "classic-sources": [ + "enObservabilityPlaintextLogs" + ] + }, + { + "slug": "/serverless/observability/ecs-application-logs", + "classic-sources": [ + "enObservabilityEcsLoggingLogs" + ] + }, + { + "slug": 
"/serverless/observability/send-application-logs", + "classic-sources": [ + "enObservabilityApmAgentLogSending" + ] + } + ] + }, + { + "slug": "/serverless/observability/parse-log-data", + "classic-sources": ["enObservabilityLogsParse"] + }, + { + "slug": "/serverless/observability/filter-and-aggregate-logs", + "classic-sources": ["enObservabilityLogsFilterAndAggregate"] + }, + { + "slug": "/serverless/observability/discover-and-explore-logs", + "classic-sources": ["enObservabilityMonitorLogs"], + "classic-skip": true + }, + { + "slug": "/serverless/observability/add-logs-service-name", + "classic-sources": ["enObservabilityAddLogsServiceName"], + "classic-skip": true + }, + { + "slug": "/serverless/observability/run-log-pattern-analysis", + "classic-sources": ["enKibanaRunPatternAnalysisDiscover"] + }, + { + "slug": "/serverless/observability/troubleshoot-logs", + "classic-sources": ["enObservabilityLogsTroubleshooting"] + } + ] + }, + { + "label": "Inventory", + "slug": "/serverless/observability/inventory" + }, + { + "slug": "/serverless/observability/apm", + "classic-sources": [ "enApmGuideApmOverview" ], + "items": [ + { + "slug": "/serverless/observability/apm-get-started", + "classic-sources": [ "enObservabilityIngestTraces" ], + "classic-skip": true + }, + { + "slug": "/serverless/observability/apm-send-data-to-elastic", + "classic-sources": [], + "items": [ + { + "slug": "/serverless/observability/apm-agents-elastic-apm-agents" + }, + { + "slug": "/serverless/observability/apm-agents-opentelemetry", + "classic-sources": [ "enApmGuideOpenTelemetry" ], + "items": [ + { + "slug": "/serverless/observability/apm-agents-opentelemetry-opentelemetry-native-support", + "classic-sources": [ + "enApmGuideOpenTelemetryDirect" + ] + }, + { + "slug": "/serverless/observability/apm-agents-opentelemetry-collect-metrics", + "classic-sources": [ + "enApmGuideOpenTelemetryCollectMetrics" + ] + }, + { + "slug": "/serverless/observability/apm-agents-opentelemetry-limitations", 
+ "classic-sources": [ + "enApmGuideOpenTelemetryKnownLimitations" + ] + }, + { + "slug": "/serverless/observability/apm-agents-opentelemetry-resource-attributes", + "classic-sources": [ + "enApmGuideOpenTelemetryResourceAttributes" + ] + } + ] + }, + { + "slug": "/serverless/observability/apm-agents-aws-lambda-functions", + "classic-sources": [ + "enApmGuideMonitoringAwsLambda", + "enApmLambdaAwsLambdaArch" + ] + } + ] + }, + { + "slug": "/serverless/observability/apm-view-and-analyze-traces", + "classic-sources": [ + "enKibanaXpackApm" + ], + "items": [ + { + "slug": "/serverless/observability/apm-find-transaction-latency-and-failure-correlations", + "classic-sources": [ + "enKibanaCorrelations" + ] + }, + { + "slug": "/serverless/observability/apm-integrate-with-machine-learning", + "classic-sources": [ + "enKibanaMachineLearningIntegration" + ] + }, + { + "slug": "/serverless/observability/apm-create-custom-links", + "classic-sources": [ + "enKibanaCustomLinks" + ] + }, + { + "slug": "/serverless/observability/apm-track-deployments-with-annotations", + "classic-sources": [ + "enKibanaTransactionsAnnotations" + ] + }, + { + "slug": "/serverless/observability/apm-query-your-data", + "classic-sources": [ + "enKibanaAdvancedQueries" + ] + }, + { + "slug": "/serverless/observability/apm-filter-your-data", + "classic-sources": [ + "enKibanaFilters" + ] + }, + { + "slug": "/serverless/observability/apm-observe-lambda-functions", + "classic-sources": [ + "enKibanaApmLambda" + ] + }, + { + "slug": "/serverless/observability/apm-ui-overview", + "classic-sources": [ + "enKibanaApmGettingStarted" + ], + "items": [ + { + "slug": "/serverless/observability/apm-services", + "classic-sources": [ + "enKibanaServices" + ] + }, + { + "slug": "/serverless/observability/apm-traces", + "classic-sources": [ + "enKibanaTraces" + ] + }, + { + "slug": "/serverless/observability/apm-dependencies", + "classic-sources": [ + "enKibanaDependencies" + ] + }, + { + "slug": 
"/serverless/observability/apm-service-map", + "classic-sources": [ + "enKibanaServiceMaps" + ] + }, + { + "slug": "/serverless/observability/apm-service-overview", + "classic-sources": [ + "enKibanaServiceOverview" + ] + }, + { + "slug": "/serverless/observability/apm-transactions", + "classic-sources": [ + "enKibanaTransactions" + ] + }, + { + "slug": "/serverless/observability/apm-trace-sample-timeline", + "classic-sources": [ + "enKibanaSpans" + ] + }, + { + "slug": "/serverless/observability/apm-errors", + "classic-sources": [ + "enKibanaErrors" + ] + }, + { + "slug": "/serverless/observability/apm-metrics", + "classic-sources": [ + "enKibanaMetrics" + ] + }, + { + "slug": "/serverless/observability/apm-infrastructure", + "classic-sources": [ + "enKibanaInfrastructure" + ] + }, { + "slug": "/serverless/observability/apm-logs", + "classic-sources": [ + "enKibanaLogs" + ] + } + ] + } + ] + }, + { + "slug": "/serverless/observability/apm-data-types", + "classic-sources": [ "" ] + }, + { + "slug": "/serverless/observability/apm-distributed-tracing", + "classic-sources": [ + "enApmGuideApmDistributedTracing" + ] + }, + { + "slug": "/serverless/observability/apm-reduce-your-data-usage", + "classic-sources": [ "" ], + "items": [ + { + "slug": "/serverless/observability/apm-transaction-sampling", + "classic-sources": [ + "enApmGuideSampling", + "enApmGuideConfigureHeadBasedSampling" + ] + }, + { + "slug": "/serverless/observability/apm-compress-spans", + "classic-sources": [ + "enApmGuideSpanCompression" + ] + }, + { + "slug": "/serverless/observability/apm-stacktrace-collection" + } + ] + }, + { + "slug": "/serverless/observability/apm-keep-data-secure", + "classic-sources": [ "enApmGuideSecureAgentCommunication" ] + }, + { + "slug": "/serverless/observability/apm-troubleshooting", + "classic-sources": [ + "enApmGuideTroubleshootApm", + "enApmGuideCommonProblems", + "enApmGuideServerEsDown", + "enApmGuideCommonResponseCodes", + "enApmGuideProcessingAndPerformance" + 
] + }, + { + "slug": "/serverless/observability/apm-reference", + "classic-sources": [], + "items": [ + { + "slug": "/serverless/observability/apm-kibana-settings" + }, + { + "slug": "/serverless/observability/apm-server-api", + "classic-sources": [ + "enApmGuideApi", + "enApmGuideApiEvents", + "enApmGuideApiMetadata", + "enApmGuideApiTransaction", + "enApmGuideApiSpan", + "enApmGuideApiError", + "enApmGuideApiMetricset", + "enApmGuideApiConfig", + "enApmGuideApiInfo", + "enApmGuideApiOtlp" + ] + } + ] + } + ] + }, + { + "slug": "/serverless/observability/infrastructure-monitoring", + "classic-sources": ["enObservabilityAnalyzeMetrics"], + "items": [ + { + "slug": "/serverless/observability/get-started-with-metrics" + }, + { + "slug": "/serverless/observability/view-infrastructure-metrics", + "classic-sources": ["enObservabilityViewInfrastructureMetrics"] + }, + { + "slug": "/serverless/observability/analyze-hosts", + "classic-sources": ["enObservabilityAnalyzeHosts"] + }, + { + "slug": "/serverless/observability/detect-metric-anomalies", + "classic-sources": ["enObservabilityInspectMetricAnomalies"] + }, + { + "slug": "/serverless/observability/configure-intra-settings", + "classic-sources": ["enObservabilityConfigureSettings"] + }, + { + "slug": "/serverless/observability/troubleshooting-infrastructure-monitoring", + "items": [ + { + "slug": "/serverless/observability/handle-no-results-found-message" + } + ] + }, + { + "slug": "/serverless/observability/metrics-reference", + "classic-sources": ["enObservabilityMetricsReference"], + "items": [ + { + "slug": "/serverless/observability/host-metrics", + "classic-sources": ["enObservabilityHostMetrics"] + }, + { + "slug": "/serverless/observability/container-metrics", + "classic-sources": ["enObservabilityDockerContainerMetrics"] + }, + { + "slug": "/serverless/observability/kubernetes-pod-metrics", + "classic-sources": ["enObservabilityKubernetesPodMetrics"] + }, + { + "slug": "/serverless/observability/aws-metrics", 
+ "classic-sources": ["enObservabilityAwsMetrics"] + } + ] + }, + { + "slug": "/serverless/observability/infrastructure-monitoring-required-fields", + "classic-sources": ["enObservabilityMetricsAppFields"] + } + ] + }, + { + "label": "Synthetic monitoring", + "slug": "/serverless/observability/monitor-synthetics", + "classic-sources": ["enObservabilityMonitorUptimeSynthetics"], + "items": [ + { + "label": "Get started", + "slug": "/serverless/observability/synthetics-get-started", + "classic-sources": ["enObservabilitySyntheticsGetStarted"], + "items": [ + { + "label": "Use a Synthetics project", + "slug": "/serverless/observability/synthetics-get-started-project", + "classic-sources": ["enObservabilitySyntheticsGetStartedProject"] + }, + { + "label": "Use the Synthetics UI", + "slug": "/serverless/observability/synthetics-get-started-ui", + "classic-sources": ["enObservabilitySyntheticsGetStartedUi"] + } + ] + }, + { + "label": "Scripting browser monitors", + "slug": "/serverless/observability/synthetics-journeys", + "classic-sources": ["enObservabilitySyntheticsJourneys"], + "items": [ + { + "label": "Write a synthetic test", + "slug": "/serverless/observability/synthetics-create-test", + "classic-sources": ["enObservabilitySyntheticsCreateTest"] + }, + { + "label": "Configure individual monitors", + "slug": "/serverless/observability/synthetics-monitor-use", + "classic-sources": ["enObservabilitySyntheticsMonitorUse"] + }, + { + "label": "Use the Synthetics Recorder", + "slug": "/serverless/observability/synthetics-recorder", + "classic-sources": ["enObservabilitySyntheticsRecorder"] + } + ] + }, + { + "label": "Configure lightweight monitors", + "slug": "/serverless/observability/synthetics-lightweight", + "classic-sources": ["enObservabilitySyntheticsLightweight"] + }, + { + "label": "Manage monitors", + "slug": "/serverless/observability/synthetics-manage-monitors", + "classic-sources": ["enObservabilitySyntheticsManageMonitors"] + }, + { + "label": "Work 
with params and secrets", + "slug": "/serverless/observability/synthetics-params-secrets", + "classic-sources": ["enObservabilitySyntheticsParamsSecrets"] + }, + { + "label": "Analyze monitor data", + "slug": "/serverless/observability/synthetics-analyze", + "classic-sources": ["enObservabilitySyntheticsAnalyze"] + }, + { + "label": "Monitor resources on private networks", + "slug": "/serverless/observability/synthetics-private-location", + "classic-sources": ["enObservabilitySyntheticsPrivateLocation"] + }, + { + "label": "Use the CLI", + "slug": "/serverless/observability/synthetics-command-reference", + "classic-sources": ["enObservabilitySyntheticsCommandReference"] + }, + { + "label": "Configure a Synthetics project", + "slug": "/serverless/observability/synthetics-configuration", + "classic-sources": ["enObservabilitySyntheticsConfiguration"] + }, + { + "label": "Configure Synthetics settings", + "slug": "/serverless/observability/synthetics-settings", + "classic-sources": ["enObservabilitySyntheticsSettings"] + }, + { + "label": "Grant users access to secured resources", + "slug": "/serverless/observability/synthetics-feature-roles", + "classic-sources": ["enObservabilitySyntheticsFeatureRoles"] + }, + { + "label": "Manage data retention", + "slug": "/serverless/observability/synthetics-manage-retention", + "classic-sources": ["enObservabilitySyntheticsManageRetention"] + }, + { + "label": "Scale and architect a deployment", + "slug": "/serverless/observability/synthetics-scale-and-architect", + "classic-sources": ["enObservabilitySyntheticsScaleAndArchitect"] + }, + { + "label": "Synthetics Encryption and Security", + "slug": "/serverless/observability/synthetics-security-encryption", + "classic-sources": ["enObservabilitySyntheticsSecurityEncryption"] + }, + { + "label": "Troubleshooting", + "slug": "/serverless/observability/synthetics-troubleshooting", + "classic-sources": ["enObservabilitySyntheticsTroubleshooting"] + } + ] + }, + { + "slug": 
"/serverless/observability/dashboards" + }, + { + "slug": "/serverless/observability/alerting", + "classic-sources": ["enObservabilityCreateAlerts"], + "items": [ + { + "slug": "/serverless/observability/create-manage-rules", + "classic-sources": ["enKibanaCreateAndManageRules"], + "items": [ + { + "label": "Anomaly detection", + "slug": "/serverless/observability/aiops-generate-anomaly-alerts" + }, + { + "label": "APM anomaly", + "slug": "/serverless/observability/create-anomaly-alert-rule" + }, + { + "label": "Custom threshold", + "slug": "/serverless/observability/create-custom-threshold-alert-rule" + }, + { + "label": "Elasticsearch query", + "slug": "/serverless/observability/create-elasticsearch-query-rule", + "classic-sources": ["enKibanaRuleTypeEsQuery"] + }, + { + "label": "Error count threshold", + "slug": "/serverless/observability/create-error-count-threshold-alert-rule" + }, + { + "label": "Failed transaction rate threshold", + "slug": "/serverless/observability/create-failed-transaction-rate-threshold-alert-rule" + }, + { + "label": "Inventory", + "slug": "/serverless/observability/create-inventory-threshold-alert-rule", + "classic-sources": ["enObservabilityInfrastructureThresholdAlert"] + }, + { + "label": "Latency threshold", + "slug": "/serverless/observability/create-latency-threshold-alert-rule" + }, + { + "label": "SLO burn rate", + "slug": "/serverless/observability/create-slo-burn-rate-alert-rule", + "classic-sources": [ "enObservabilitySloBurnRateAlert" ] + }, + { + "label": "Synthetic monitor status", + "slug": "/serverless/observability/monitor-status-alert" + } + ] + }, + { + "slug": "/serverless/observability/aggregationOptions", + "items": [ + { + "slug": "/serverless/observability/rateAggregation" + } + ] + }, + { + "slug": "/serverless/observability/view-alerts", + "classic-sources": ["enObservabilityViewObservabilityAlerts"], + "items": [ + { + "slug": "/serverless/observability/triage-slo-burn-rate-breaches", + "label": "SLO burn 
rate breaches" + }, + { + "slug": "/serverless/observability/triage-threshold-breaches", + "label": "Threshold breaches" + } + ] + } + ] + }, + { + "slug": "/serverless/observability/slos", + "classic-sources": [ "enObservabilitySlo" ], + "items": [ + { + "slug": "/serverless/observability/create-an-slo", + "classic-sources": [ "enObservabilitySloCreate" ] + } + ] + }, + { + "slug": "/serverless/observability/cases", + "classic-sources": [ "enObservabilityCreateCases" ], + "items": [ + { + "slug": "/serverless/observability/create-a-new-case", + "classic-sources": [ "enObservabilityManageCases" ] + }, + { + "slug": "/serverless/observability/case-settings" + } + ] + }, + { + "slug": "/serverless/observability/aiops", + "items": [ + { + "slug": "/serverless/observability/aiops-detect-anomalies", + "classic-sources": [ "enMachineLearningMlAdFindingAnomalies" ], + "classic-skip": true, + "items": [ + { + "slug": "/serverless/observability/aiops-tune-anomaly-detection-job" + }, + { + "slug": "/serverless/observability/aiops-forecast-anomalies" + } + ] + }, + { + "slug": "/serverless/observability/aiops-analyze-spikes", + "classic-sources": [ "enKibanaXpackMlAiops" ] + }, + { + "slug": "/serverless/observability/aiops-detect-change-points" + } + ] + }, + { + "slug": "/serverless/observability/monitor-datasets", + "classic-sources": ["enObservabilityMonitorDatasets"], + "classic-skip": true + }, + { + "slug": "/serverless/observability/ai-assistant", + "classic-sources": [ "enObservabilityObsAiAssistant" ] + }, + { + "slug": "/serverless/observability/elastic-entity-model" + }, + { + "slug": "/serverless/observability/observability-technical-preview-limitations" + } + ] +} \ No newline at end of file diff --git a/docs/en/serverless/transclusion/container-details.mdx b/docs/en/serverless/transclusion/container-details.mdx new file mode 100644 index 0000000000..d483893181 --- /dev/null +++ b/docs/en/serverless/transclusion/container-details.mdx @@ -0,0 +1,56 @@ +{/* This 
is collapsed by default */} + + + +The **Overview** tab displays key metrics about the selected container, such as CPU, memory, network, and disk usage. +The metrics shown may vary depending on the type of container you're monitoring. + +Change the time range to view metrics over a specific period of time. + +Expand each section to view more detail related to the selected container, such as metadata, +active alerts, and metrics. + +Hover over a specific time period on a chart to compare the various metrics at that given time. + +Click **Show all** to drill down into related data. + +![Container overview](../images/overview-overlay-containers.png) + + + + + +The **Metadata** tab lists all the meta information relating to the container: + +* Host information +* Cloud information +* Agent information + +All of this information can help when investigating events—for example, filtering by operating system or architecture. + +![Container metadata](../images/metadata-overlay-containers.png) + + + + + +The **Metrics** tab shows container metrics organized by type. + +![Metrics](../images/metrics-overlay-containers.png) + + + + + +The **Logs** tab displays logs relating to the container that you have selected. By default, the logs tab displays the following columns. + +| | | +|---|---| +| **Timestamp** | The timestamp of the log entry from the `timestamp` field. | +| **Message** | The message extracted from the document. The content of this field depends on the type of log message. If no special log message type is detected, the [Elastic Common Schema (ECS)](((ecs-ref))/ecs-base.html) base field, `message`, is used. | + +To view the logs in the ((logs-app)) for a detailed analysis, click **Open in Logs**. 
+
+![Container logs](../images/logs-overlay-containers.png)
+
+
diff --git a/docs/en/serverless/transclusion/host-details.mdx b/docs/en/serverless/transclusion/host-details.mdx
new file mode 100644
index 0000000000..12defd7960
--- /dev/null
+++ b/docs/en/serverless/transclusion/host-details.mdx
@@ -0,0 +1,141 @@
+{/* This is collapsed by default */}
+
+
+
+The **Overview** tab displays key metrics about the selected host, such as CPU usage,
+normalized load, memory usage, and max disk usage.
+
+Change the time range to view metrics over a specific period of time.
+
+Expand each section to view more detail related to the selected host, such as metadata,
+active alerts, services detected on the host, and metrics.
+
+Hover over a specific time period on a chart to compare the various metrics at that given time.
+
+Click **Show all** to drill down into related data.
+
+![Host overview](../images/overview-overlay.png)
+
+
+
+
+
+The **Metadata** tab lists all the metadata relating to the host,
+including host, cloud, and agent information.
+
+This information can help when investigating events—for example,
+when filtering by operating system or architecture.
+
+![Host metadata](../images/metadata-overlay.png)
+
+
+
+
+
+The **Metrics** tab shows host metrics organized by type and is more complete than the view available in the *Overview* tab.
+
+![Metrics](../images/metrics-overlay.png)
+
+
+
+
+
+The **Processes** tab lists the total number of processes (`system.process.summary.total`) running on the host,
+along with the total number of processes in these various states:
+
+* Running (`system.process.summary.running`)
+* Sleeping (`system.process.summary.sleeping`)
+* Stopped (`system.process.summary.stopped`)
+* Idle (`system.process.summary.idle`)
+* Dead (`system.process.summary.dead`)
+* Zombie (`system.process.summary.zombie`)
+* Unknown (`system.process.summary.unknown`)
+
+The processes listed in the **Top processes** table are based on an aggregation of the top CPU and the top memory consuming processes.
+The number of top processes is controlled by `process.include_top_n.by_cpu` and `process.include_top_n.by_memory`.
+
+| | |
+|---|---|
+| **Command** | Full command line that started the process, including the absolute path to the executable, and all the arguments (`system.process.cmdline`). |
+| **PID** | Process ID (`process.pid`). |
+| **User** | User name (`user.name`). |
+| **CPU** | The percentage of CPU time spent by the process since the last event (`system.process.cpu.total.pct`). |
+| **Time** | The time the process started (`system.process.cpu.start_time`). |
+| **Memory** | The percentage of memory (`system.process.memory.rss.pct`) the process occupied in main memory (RAM). |
+| **State** | The current state of the process and the total number of processes (`system.process.state`). Expected values are: `running`, `sleeping`, `dead`, `stopped`, `idle`, `zombie`, and `unknown`. |
+
+![Host processes](../images/processes-overlay.png)
+
+
+
+
+
+The **Logs** tab displays logs relating to the host that you have selected. By default, the logs tab displays the following columns.
+
+| | |
+|---|---|
+| **Timestamp** | The timestamp of the log entry from the `timestamp` field. |
+| **Message** | The message extracted from the document. The content of this field depends on the type of log message. If no special log message type is detected, the [Elastic Common Schema (ECS)](((ecs-ref))/ecs-base.html) base field, `message`, is used. |
+
+To view the logs in the ((logs-app)) for a detailed analysis, click **Open in Logs**.
+
+![Host logs](../images/logs-overlay.png)
+
+
+
+
+
+The **Anomalies** tab displays a list of each single metric ((anomaly-detect)) job for the specific host. By default, anomaly
+jobs are sorted by time, showing the most recent jobs first.
+
+Along with the name of each anomaly job, detected anomalies with a severity score equal to 50 or higher are listed. These
+scores represent a severity of "warning" or higher in the selected time period. The **summary** value represents the increase between
+the actual value and the expected ("typical") value of the host metric in the anomaly record result.
+
+To drill down and analyze the metric anomaly, select **Actions** → **Open in Anomaly Explorer**.
+You can also select **Actions** → **Show in Inventory** to view the host Inventory page, filtered by the specific metric.
+
+![Anomalies](../images/anomalies-overlay.png)
+
+
+
+
+
+
+One of the following roles is required to use Osquery.
+
+* **Admin:** Has full access to project configuration, including the ability to install, manage, and run Osquery queries through ((agent)). This role supports both ad hoc (live) queries and scheduled queries against monitored hosts. Admins can view and analyze the results directly in ((es)).
+* **Editor:** Has limited access. Editors can run pre-configured queries, but may have restricted permissions for setting up and scheduling new queries, especially queries that require broader access or permissions adjustments.
+* **Viewer:** Has read-only access to data, including viewing Osquery results if configured by a user with higher permissions. Viewers cannot initiate or schedule Osquery queries themselves.
+
+To learn more about roles, refer to .
+
+
+
+You must have an active [((agent))](((fleet-guide))/elastic-agent-installation.html) with an assigned agent policy
+that includes the [Osquery Manager](((integrations-docs))/osquery_manager.html) integration.
+
+
+The **Osquery** tab allows you to build SQL statements to query your host data.
+You can create and run live or saved queries against
+the ((agent)). Osquery results are stored in ((es))
+so that you can use the ((stack)) to search, analyze, and
+visualize your host metrics. To create saved queries and add scheduled query groups,
+refer to [Osquery](((kibana-ref))/osquery.html).
+
+To view more information about the query, click the **Status** tab. A query status can result in
+`success`, `error` (along with an error message), or `pending` (if the ((agent)) is offline).
+
+Other options include:
+
+* View in Discover to search, filter, and view information about the structure of host metric fields. To learn more, refer to [Discover](((kibana-ref))/discover.html).
+* View in Lens to create visualizations based on your host metric fields. To learn more, refer to [Lens](((kibana-ref))/lens.html).
+* View the results in full screen mode.
+* Add, remove, reorder, and resize columns.
+* Sort field names in ascending or descending order.
+
+![Osquery](../images/osquery-overlay.png)
+
+
+
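+Queries entered in the **Osquery** tab use standard osquery SQL. As an illustration only (this example is not part of the original page, and assumes the `processes` table and column names from the public osquery schema), a live query returning the five processes using the most resident memory might look like:
+
+```sql
+-- Top five processes by resident memory, from osquery's built-in `processes` table
+SELECT pid, name, uid, resident_size
+FROM processes
+ORDER BY resident_size DESC
+LIMIT 5;
+```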