
.ml indices that are closed prevent Kibana monitoring from displaying. #91893

Closed
travisestill opened this issue Nov 22, 2022 · 8 comments · Fixed by #91917
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@travisestill

travisestill commented Nov 22, 2022

Elasticsearch Version

8.3.3, 8.4.3

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

.ml indices that are closed prevent Kibana monitoring from displaying. Calling the GET _ml/anomaly_detectors/_stats endpoint returns:

{
  "error": {
    "root_cause": [
      {
        "type": "cluster_block_exception",
        "reason": "index [.ml-anomalies-shared] blocked by: [FORBIDDEN/4/index closed];"
      }
    ],
    "type": "cluster_block_exception",
    "reason": "index [.ml-anomalies-shared] blocked by: [FORBIDDEN/4/index closed];"
  },
  "status": 403
}

Steps to Reproduce

(Reproduced)

  1. Started with the .ml indices closed. (In ESS) Run a plan to change Logging and Monitoring to send to the cluster itself (self-monitoring), not a dedicated monitoring cluster.
  2. The plan change completed but the cluster's Stack Monitoring page would not render any data; it shows the message that you need to enable it in the console.
  3. Opened the .ml* indices and the Stack Monitoring data is displayed.
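
Outside of ESS, the underlying API failure can be reproduced against any cluster that has anomaly detection results. The following is a minimal sketch using the Elasticsearch low-level Java REST client, assuming an unsecured cluster on localhost:9200 where the .ml-anomalies-shared index exists (host, port, and index name are illustrative):

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

public class ClosedMlIndexRepro {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Put the cluster into the state described in the issue: close the ML results index
            client.performRequest(new Request("POST", "/.ml-anomalies-shared/_close"));
            try {
                Response stats = client.performRequest(new Request("GET", "/_ml/anomaly_detectors/_stats"));
                System.out.println(EntityUtils.toString(stats.getEntity()));
            } catch (ResponseException e) {
                // Expected: 403 with a cluster_block_exception root cause, as shown above
                System.out.println("status=" + e.getResponse().getStatusLine().getStatusCode());
                System.out.println(EntityUtils.toString(e.getResponse().getEntity()));
            } finally {
                // Restore normal behaviour by re-opening the index
                client.performRequest(new Request("POST", "/.ml-anomalies-shared/_open"));
            }
        }
    }
}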

Logs (if relevant)

No response

@travisestill travisestill added >bug needs:triage Requires assignment of a team area label labels Nov 22, 2022
@dliappis dliappis removed the needs:triage Requires assignment of a team area label label Nov 22, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

droberts195 commented Nov 23, 2022

I think this is a bug in Kibana monitoring, not the ML stats endpoint.

In this case the user has chosen to incapacitate ML by closing an internal index. It's reasonable that ML APIs should return errors under these circumstances. The only alternative would be to silently return incorrect/incomplete data, but that entirely goes against the Elasticsearch philosophy of making it clear when things aren't working.

Plan change completed but the cluster's Stack Monitoring page would not render any data, it shows the message that you need to enable it in the console.

For me that is the bug. Stack Monitoring must be assembling an overall response by combining the output of many APIs. But if one of those responses is an error, it tells you that you need to enable monitoring, which is wrong. Instead it should tell you that some portion of the stats is not available due to an error, and show you what is available.

If Stack Monitoring is not handling errors appropriately, then the only alternative is that we change all the APIs it calls to never return an error. That would lead to Stack Monitoring silently missing out information in the event of some error, and would also affect other uses of those same APIs.

With get anomaly detector stats, the current error message for a closed index makes it totally clear how to fix the problem. If we made it silently return an empty response in this case then we'd get bugs opened along the lines of, "Get anomaly detector stats is returning an empty response when I have lots of jobs and I cannot figure out why". And then someone would have to spend hours trawling through a support diag to work out that the root cause was a closed index.

@droberts195
Contributor

@elastic/stack-monitoring which repo should this issue be transferred to for further investigation? There must be some point on the journey the data takes from the underlying Elasticsearch APIs, through Metricbeat, and into the Stack Monitoring UI where the absence of this one part of the data causes the UI to display nothing and say Stack Monitoring is not enabled when it is.

@droberts195 droberts195 transferred this issue from elastic/elasticsearch Nov 24, 2022
@botelastic botelastic bot removed the needs_team label Nov 24, 2022
@droberts195
Contributor

droberts195 commented Nov 24, 2022

I chatted with @klacabane about this on Slack.

Metricbeat is calling the ML stats API from: https://github.com/elastic/beats/blob/a106ad28c7c8f76d7bdfbb43ef88b077d6ef2327/metricbeat/module/elasticsearch/ml_job/ml_job.go#L33

An example response to GET kbn:api/monitoring/v1/_health with the ML results index closed is:

{
  "monitoredClusters": {
    "clusters": {
      "mHsD7oYvSi29j2WBton6Zg": {
        "cluster": {
          "mHsD7oYvSi29j2WBton6Zg": {
            "index_summary": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:19.634Z"
              }
            },
            "index_recovery": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:19.625Z"
              }
            }
          }
        },
        "elasticsearch": {
          "eLaxgJRfRG67vd3f8kg_XQ": {
            "shard": {
              "internal-monitoring": {
                "index": ".monitoring-es-7-2022.11.24",
                "lastSeen": "2022-11-24T11:08:48.831Z"
              },
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:20:09.623Z"
              }
            }
          },
          "nacg489iQ32rtFNrLGR5_w": {
            "shard": {
              "internal-monitoring": {
                "index": ".monitoring-es-7-2022.11.24",
                "lastSeen": "2022-11-24T11:08:48.831Z"
              },
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:20:09.623Z"
              }
            }
          },
          "54cyoviyS-aWAc_bJ2GwPg": {
            "node_stats": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:18.273Z"
              }
            }
          },
          "b-HuEJz2QAeAqPQpDgp56g": {
            "node_stats": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-es-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:24.182Z"
              }
            }
          }
        },
        "kibana": {
          "df4d79bc-2e4a-4327-a39e-2e66046d9b45": {
            "stats": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-kibana-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:24.944Z"
              }
            }
          }
        },
        "beats": {
          "apm-server|6de09e55-3a5d-4f4b-a91c-b4984bc508c7": {
            "stats": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-beats-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:22.297Z"
              }
            },
            "state": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-beats-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:22.294Z"
              }
            }
          }
        },
        "enterpriseSearch": {
          "d58c6d90-d609-45c8-98de-a4ba9a56c243": {
            "stats": {
              "metricbeat-8": {
                "index": ".ds-.monitoring-ent-search-8-mb-2022.11.24-000001",
                "lastSeen": "2022-11-24T11:26:23.656Z"
              }
            }
          }
        }
      }
    },
    "execution": {
      "timedOut": false,
      "errors": []
    }
  },
  "metricbeatErrors": {
    "products": {
      "elasticsearch": {
        "cluster_stats": [
          {
            "message": "failed to get stack usage from Elasticsearch: HTTP error 403 in : 403 Forbidden",
            "lastSeen": "2022-11-24T11:26:19.625Z"
          }
        ],
        "index": [
          {
            "message": "HTTP error 400 in : 400 Bad Request",
            "lastSeen": "2022-11-24T11:26:19.626Z"
          }
        ],
        "ml_job": [
          {
            "message": "HTTP error 403 in : 403 Forbidden",
            "lastSeen": "2022-11-24T11:26:19.622Z"
          }
        ]
      }
    },
    "execution": {
      "timedOut": false,
      "errors": []
    }
  },
  "settings": {
    "ccs": true,
    "logsIndex": "filebeat-*",
    "metricbeatIndex": "metricbeat-*",
    "hasRemoteClusterConfigured": false
  }
}

I think the general principle for stack monitoring should be that if any one API call it makes returns an error, the monitoring page should still display what it can.

@klacabane
Contributor

We have an open issue to make Stack Monitoring more tolerant of missing metricsets: elastic/kibana#130577

As for the failure to collect metricsets:

  • cluster_stats is the entry point of Stack Monitoring and it cannot function without this metricset. Apparently the failure to collect it is caused by a call to the _xpack/usage API, which returns the ML error below. The open question here is whether this API should be affected by closed indices, or whether this is expected behavior that Metricbeat should handle gracefully:

{
  "error": {
    "root_cause": [
      {
        "type": "cluster_block_exception",
        "reason": "index [.ml-anomalies-shared] blocked by: [FORBIDDEN/4/index closed];"
      }
    ],
    "type": "cluster_block_exception",
    "reason": "index [.ml-anomalies-shared] blocked by: [FORBIDDEN/4/index closed];"
  },
  "status": 403
}

@droberts195
Contributor

Apparently the failure to collect these is caused by a call to _xpack/usage API which returns the ml issue

I don't think ML not being able to provide stats should make the whole usage endpoint fail, so I'll change it so that it just omits the ML information if it cannot be obtained.

Since you've already got an issue for being more tolerant of missing metricsets in general, I'll transfer this issue back to ML.

@droberts195 droberts195 transferred this issue from elastic/beats Nov 24, 2022
@droberts195 droberts195 added :ml Machine learning and removed Team:Infra Monitoring UI labels Nov 24, 2022
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 24, 2022
@droberts195
Contributor

This issue is back from the beats repo, with the requirement that MachineLearningUsageTransportAction should not propagate exceptions that make X-Pack usage return an error. We should just miss out the ML section if it cannot be populated.
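
To make that requirement concrete, here is a minimal, generic sketch of the pattern (illustrative names only, not the actual Elasticsearch classes or the code from #91917): a failure while gathering one usage section is caught and replaced by an empty section, so the overall response still returns.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class UsageReport {

    // Build an overall usage map; a failing ML stats supplier must not fail the whole report.
    static Map<String, Object> buildUsage(Supplier<Map<String, Object>> mlStatsSupplier) {
        Map<String, Object> usage = new LinkedHashMap<>();
        usage.put("monitoring", Map.of("available", true, "enabled", true));
        try {
            usage.put("ml", mlStatsSupplier.get());
        } catch (RuntimeException e) {
            // e.g. a cluster_block_exception because .ml-anomalies-shared is closed:
            // leave the ML section without stats rather than propagating the error.
            usage.put("ml", Map.of("available", true, "enabled", true));
        }
        return usage;
    }

    public static void main(String[] args) {
        Map<String, Object> ok = buildUsage(() -> Map.of("jobs", Map.of("_all", Map.of("count", 5))));
        Map<String, Object> degraded = buildUsage(() -> {
            throw new RuntimeException("index [.ml-anomalies-shared] blocked by: [FORBIDDEN/4/index closed];");
        });
        System.out.println(ok);       // ML section populated with job counts
        System.out.println(degraded); // ML section present, but without the stats that failed
    }
}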

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 24, 2022
It is possible to meddle with internal ML state such that calls
to the ML stats APIs return errors. It is justifiable for these
single purpose APIs to return errors when the internal state of
ML is corrupted. However, it is undesirable for these low level
problems to completely prevent the overall usage API from returning,
because then callers cannot find out usage information from any
part of the system.

This change makes errors in the ML stats APIs non-fatal to the
overall response of the usage API. When an ML stats API returns
an error, the corresponding section of the ML usage information
will be blank.

Fixes elastic#91893
@droberts195
Contributor

I have confirmed that #91917 fixes this.

[Screenshot 2022-11-24 at 17 24 27]

Now if you close the ML results index, the stack monitoring page still displays, albeit with an incorrect value for the number of ML jobs. I think that's the best that can be expected in the circumstances. Internal features cannot be expected to operate perfectly if their internal state has been changed in unexpected ways. But at least now meddling with the ML internals doesn't completely break stack monitoring.

droberts195 added a commit that referenced this issue Nov 24, 2022
It is possible to meddle with internal ML state such that calls
to the ML stats APIs return errors. It is justifiable for these
single purpose APIs to return errors when the internal state of
ML is corrupted. However, it is undesirable for these low level
problems to completely prevent the overall usage API from returning,
because then callers cannot find out usage information from any
part of the system.

This change makes errors in the ML stats APIs non-fatal to the
overall response of the usage API. When an ML stats API returns
an error, the corresponding section of the ML usage information
will be blank.

Fixes #91893