From 3f83dd060b2d2cf70fb0231ee256e5c956726885 Mon Sep 17 00:00:00 2001 From: Stef Nestor <26751266+stefnestor@users.noreply.github.com> Date: Fri, 3 May 2024 10:53:12 -0600 Subject: [PATCH] (Doc+) Capture Elasticsearch diagnostic --- .../troubleshooting/diagnostic.asciidoc | 115 ++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 docs/reference/troubleshooting/diagnostic.asciidoc diff --git a/docs/reference/troubleshooting/diagnostic.asciidoc b/docs/reference/troubleshooting/diagnostic.asciidoc new file mode 100644 index 0000000000000..17a499c88c7a9 --- /dev/null +++ b/docs/reference/troubleshooting/diagnostic.asciidoc @@ -0,0 +1,115 @@ +[[diagnostic]] +=== Diagnostic +++++ +Capturing Diagnostic +++++ +:keywords: Elasticsearch diagnostic, diagnostics + +An https://github.com/elastic/support-diagnostics[{es} diagnostic] allows +you to capture a point-in-time snapshot of cluster statistics and most settings. +It works against all {es} versions and requires JRE/JDK ≥v1.8. It is +useful when escalting to https://support.elastic.co[Elastic Support] or +https://discuss.elastic.co[Elastic Discuss] to minimize turnaround time. +It's point-in-time view is also useful when troubleshooting, see +https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files[this +for examples]. + +[TIP] +==== +The {es} diagnostic is included as a sub-library within Elastic's platforms: + +* {ece} which you can pull under {ece} > Deployment > Operations > +Prepare Bundle > {es}. +* {eck}'s https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-take-eck-dump.html[diagnostic] +pulls this by default. +==== + +[discrete] +[[diagnostic-capture]] +==== Capture + +To capture an {es} diagnostic: + +. Download latest `diagnostics-X.X.X-dist.zip` (_not_ the "source code") file +from https://github.com/elastic/support-diagnostics/releases/latest[its +latest releases]. We will reference the unzipped execution file below as +`./diagnostics.sh` below which is for Unix-based systems though Windows will +replace this for `.\diagnostics.bat`. + +. There's https://github.com/elastic/support-diagnostics#diagnostic-types[three +available `type`'s'] to capture your {es} diagnostic. + +** `local` (default, **recommended**): polls the <>, +gathers Operating System info, and captures cluster and GC logs. +Alternatively, you can use `remote` which will establish an ssh session +to the applicable target server to pull the same info. + +** `api` polls the <> but all other data must be +collected manually. + +. Verify network and user permissions are sufficient to connect to your {es} +cluster by checking its <>. For example, +for `host:localhost`, `port:9200`, and `username:elastic` this would curl as: ++ +[source,sh] +--- +curl -X GET -k -u elastic -p https://localhost:9200/_cluster/health +--- + +. You're expecting an HTTP 200 `OK` response that reports the cluster's +`status`. If you can't successfully curl your {es} host, please +pause and review the resulting error as the diagnostic will potentially +not have the expected results. Outlining common errors and their next steps: + +** HTTP 401 `UNAUTHENTICATED`: the error will usually tell you either +that your `username:password` pair is invalid or that your `.security` +index is unavailable and you'll need to setup a temporary +<> user with `role:superuser` to authenticate. + +** HTTP 403 `UNAUTHORIZED`: your attempted `username` is recognized but +has insufficient permissions to run the diagnostic. Either use a different +username or elevate this user's privileges. + +** HTTP 429 `TOO_MANY_REQUESTS` (for example `circuit_breaking_exception`): +your username authenticated and authorized but the cluster is under +sufficiently high strain that it's not responding to API calls. These +responses are usually hit and miss, so potentially indicate that you can +proceed with running the diagnostic (which will pull what it can). + +** HTTP 504 `BAD_GATEWAY`: your network is experiencing issues reaching +the cluster (for example because of proxy or firewall). You might +change where you attempt from, confirm your port, or attempt targeting +the host's IP instead of its URL domain. + +** HTTP 503 `SERVICE_UNAVAILABLE` (for example `master_not_discovered_exception`): +your cluster does not currently have an elected master node (which is +required for it to be API-responsive). This may be temporary while master +node rotates. Otherwise, do not run Step#5 but pivot towards investigating +and first resolve <> +before proceeding. + +. Once you have a working curl request, use those same parameters to fill-in +the https://github.com/elastic/support-diagnostics#standard-options[diagnostic +parameters]. From our example, most common results will appear: ++ +[source,sh] +--- +sudo ./diagnostics.sh --type local --host localhost --port 9200 -u elastic -p --bypassDiagVerify --ssl --noVerify +--- + +. Once this script has completed, verify no errors emitted in the +`diagnostic.log`. Common errors to resolve: + +** `Error: Could not find or load main class com.elastic.support.diagnostics.DiagnosticApp` +indicates that you accidentally downloaded the "source code" file +instead of the diagnostic in Step#1 above. + +** `Could not retrieve the {es} version due to a system or network error - unable to continue.` +indicates an issue for the diagnostic to curl the cluster. You should +expect either Step#3 failed or there's a parameter disconnect between +Step#3 and Step#5 above. + +** `security_exception` with `is unauthorized for user` suggests +insufficient admin permissions to run the diagnostic tool and another +user should be used or current user granted `role:superuser` privileges +to run diagnostic.