
Investigate timeout issue and use of time range in stack monitoring queries #189728

Open
jennypavlova opened this issue Aug 1, 2024 · 5 comments
Labels: bug, sdh-linked, Team:Monitoring

@jennypavlova (Member)

Related to https://github.com/elastic/sdh-elasticsearch/issues/8151

There is a reported issue of timeouts while using Stack Monitoring. After some investigation, we saw that Stack Monitoring has queries without a date range filter:

getClustersState

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                CLUSTER_UUID
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

getShardStats

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID]
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "shard.node": [SHARD_NODE]
                }
              },
              {
                "term": {
                  "elasticsearch.node.id":[NODE_ID]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "states": {
          "terms": {
            "field": "shard.state",
            "size": 10
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    },
    "nodes": {
      "terms": {
        "field": "shard.node",
        "size": 10000
      },
      "aggs": {
        "index_count": {
          "cardinality": {
            "field": "shard.index"
          }
        },
        "node_names": {
          "terms": {
            "field": "source_node.name",
            "size": 10
          }
        },
        "node_ids": {
          "terms": {
            "field": "source_node.uuid",
            "size": 1
          }
        }
      }
    }
  }
}

Loading the indices page results in a timeout.
Loading the machine learning page results in a timeout.

Both run this query:

getUnassignedShardData

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid":  [CLUSTER_STATE_UUID]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "state": {
          "filter": {
            "terms": {
              "shard.state": [
                "UNASSIGNED",
                "INITIALIZING"
              ]
            }
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}

The idea here is to investigate how to improve the queries and possibly include a time range while maintaining the same functionality.

@jennypavlova added the bug, sdh-linked, and Team:Monitoring labels on Aug 1, 2024
@jennypavlova changed the title from "[Infra] Investigate timeout issue and use of time range in stack monitoring queries" to "Investigate timeout issue and use of time range in stack monitoring queries" on Aug 1, 2024
@consulthys (Contributor) commented Aug 20, 2024

Regarding the getClustersState query

That query always returns an empty result set, as there are no documents in the monitoring indices (.monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*) with "type": "cluster_state". This is confirmed by the queries run by the customer and shared by @louisong in his comment, for both the custom user and the superuser (see Query 1 in those shared files, also provided below). Even though they return nothing, it would still be interesting to know the took time of those two queries.

Query 1
===============================================================================================
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                "GkDDZY7mT42RyVatNmEnbA"
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

Response
===============================================================================================
{} - no output

=> I think this query can be ruled out and we should probably not focus on it.

Regarding the getShardStats and getUnassignedShardData queries

As stated by @klacabane here, the shard documents are "siloed" by cluster state. In order to have a true picture of the assigned/unassigned shards, the queries need to fetch all shards for a given cluster state, and that's what the "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID] constraint does in both queries. That's semantically equivalent to providing the time range [last_cluster_state_change_time, now].

I assume that the latest cluster state_uuid is used in the two shard queries. Even if the shard documents that are retrieved are most certainly "recent" (i.e. from the latest cluster state), the lack of a time range constraint might indeed prevent Elasticsearch from skipping frozen shards.

A word of caution: adding a time range to these queries might alter the current behavior, since not all clusters change their state at the same pace. That being said, adding a reasonable time range (e.g. the last 10 days) could help increase the odds of leveraging the pre-filtering phase and skipping frozen shards. However, if the cluster state hasn't changed during that period, the result would be empty. Maybe, in a second iteration, this time range should even be configurable to cater for this possibility.
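For illustration, a minimal sketch of what such a clause could look like if appended to the bool.filter array of getShardStats and getUnassignedShardData (the 10-day window and reusing the timestamp field are assumptions for this experiment, not a final implementation):

          {
            "range": {
              "timestamp": {
                "gte": "now-10d",
                "lte": "now"
              }
            }
          }

A range filter on timestamp is what would allow the pre-filtering (can_match) phase to skip shards whose timestamp bounds fall entirely outside the window, including frozen ones.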

And since we're looking at improving performance, it could also make sense to remove the sort clause, since it serves no purpose in an aggregation-only query with size: 0.

Before attempting anything here, I'm going to ask support to have the customer re-run those shard queries with an additional 10-days-ish time range in order to see if that helps at all.

@consulthys (Contributor)

Circling back to this after having discussed with support: adding a time range to those queries makes them run much faster, because they no longer hit the frozen tier.

Knowing that they execute fast when they do not hit the frozen tier, we might not even have to add a time range to these two queries, since it should be impossible for the shard metricset documents that are part of the latest cluster state to be in any tier other than the hot tier. As a result, we could maybe leverage the _tier metadata field and only query the hot tier, which is what I'm going to ask support to try.
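As a sketch of that idea (assuming the clause would simply be appended to the existing bool.filter array; the exact shape still needs to be validated with support):

          {
            "term": {
              "_tier": "data_hot"
            }
          }

Since _tier is an index-level metadata field, such a filter should let Elasticsearch skip non-hot indices without searching them.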

@consulthys (Contributor) commented Sep 3, 2024

Adding a constraint on the _tier metadata field proves to be even more effective than adding a time range (which would be difficult to choose, given that ILM policies can be configured very differently from user to user).

A quick summary of how much the query took time decreased when using a tier constraint or a time range can be found below.

| Query | User | Using _tier | Using time range | No constraint |
| --- | --- | --- | --- | --- |
| getShardStats | User with DLS | 1819 ms | 2066 ms | 12009 ms |
| getShardStats | Superuser (no DLS) | 1561 ms | 5881 ms | 38850 ms |
| getUnassignedShardData | User with DLS | 1445 ms | 1423 ms | 9992 ms |
| getUnassignedShardData | Superuser (no DLS) | 1529 ms | 2603 ms | 7725 ms |

If we would like to pursue this, we see two options (a sketch of option 2 follows the list):

  1. only query "_tier": "data_hot"
  2. similarly to what's being done for APM, simply exclude "_tier": "data_frozen"
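A minimal sketch of option 2, assuming the exclusion is expressed as a must_not clause on the same top-level bool query used by both shard queries:

  "query": {
    "bool": {
      "filter": [
        ...existing filter clauses...
      ],
      "must_not": [
        {
          "term": {
            "_tier": "data_frozen"
          }
        }
      ]
    }
  }

Excluding only the frozen tier is the more conservative variant, since warm and cold data would remain searchable while the slow tier is avoided.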

@consulthys (Contributor) commented Sep 25, 2024

I eventually figured out that we cannot leverage the _tier field, because getShardStats and getUnassignedShardData go hand in hand with another query called getClustersStats that always runs just before them and which takes into account a time range. The time range selected by the user defines which cluster state/stats to consider, and hence, also which shard documents to look at.

So if the user decides to go back in time (e.g. 3 weeks) to monitor the behavior of an index or a node at that time, we need to retrieve the cluster state/stats/shards from that time, hence using _tier here would break that behavior.

What I'm trying next is to enhance the getShardStats and getUnassignedShardData queries with the same time range that is being used in the getClustersStats query. Adding the time range would have the benefit of limiting the blast radius of the queries and not querying the frozen tier for recent data; as we've seen in the previous update, the performance is still much better than querying the whole universe. When going back in time, however, the frozen tier would be hit anyway, but that is also expected.
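A sketch of that experiment, assuming the shard queries would receive the same bounds already passed to getClustersStats (TIME_RANGE_MIN and TIME_RANGE_MAX are placeholders in the same spirit as CLUSTER_UUID above; the epoch_millis format is an assumption):

          {
            "range": {
              "timestamp": {
                "format": "epoch_millis",
                "gte": TIME_RANGE_MIN,
                "lte": TIME_RANGE_MAX
              }
            }
          }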

@consulthys (Contributor) commented Oct 3, 2024

My last experiment described above yielded mixed results, depending on whether the cluster is monitored via internal monitoring or via Metricbeat (or the Elastic Agent).

Internal monitoring

When the cluster is monitored via internal monitoring (i.e. data stored in the .monitoring-es-7-* indices), the shard documents are reindexed every 10 seconds for all indices (even though the cluster state hasn't changed, but that's another discussion). Since the ID of those documents is constant for a given cluster state, each document gets deleted and reindexed with the latest timestamp on every new sampling period. Hence, a few seconds after you click Refresh, a whole new batch of shard documents overwrites the previous one with a timestamp that is outside of the selected time window.

We could argue that we could use only the start time and leave the time range open-ended, but that would only work when selecting "Last XX minutes/hours/days". Selecting any closed time range in the past (e.g. a specific day or week) wouldn't work the same way.
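For reference, the open-ended variant discussed above would only bound the lower end of the range (same placeholder convention as before):

          {
            "range": {
              "timestamp": {
                "gte": TIME_RANGE_MIN
              }
            }
          }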

Metricbeat or Elastic Agent monitoring

When the cluster is monitored via either Metricbeat or the Elastic Agent (i.e. data stored in the .monitoring-es-mb-8-* or metrics-* indices), adding the time range constraint would work, because documents are not being rewritten thanks to the "put if absent" semantics with a concrete ID introduced in ES 8. These semantics ensure that the shard documents do not "time-travel", even though they induced another issue, but that is unrelated to this one.
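To make the "put if absent" behavior concrete: with a deterministic document ID, a shipper can use the create endpoint, which refuses to overwrite an existing document instead of refreshing its timestamp. A minimal illustration (INDEX, DOC_ID and the body are placeholders, not the actual Metricbeat internals):

PUT INDEX/_create/DOC_ID
{
  "timestamp": "2024-10-03T00:00:00.000Z",
  "shard": { "index": "some-index", "state": "STARTED", "primary": true }
}

Re-running the exact same request returns a 409 version_conflict_engine_exception, so the original document, and therefore its original timestamp, is preserved.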

So what's next...

To summarize, the way internal monitoring currently works disqualifies the idea of using a time range. Given that internal monitoring's days are numbered (still being debated), I don't think we should/can ask them to introduce "put if absent" semantics.

I'm on the lookout for further ideas... Stay tuned...
