
Investigate timeout issue and use of time range in stack monitoring queries #189728

Open
jennypavlova opened this issue Aug 1, 2024 · 5 comments
Labels: bug, sdh-linked, Team:Monitoring

@jennypavlova (Member)

Related to https://github.com/elastic/sdh-elasticsearch/issues/8151

There is a reported issue of timeouts while using Stack Monitoring. After some investigation, we saw that Stack Monitoring has queries without a date range filter:

getClustersState

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                CLUSTER_UUID
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

getShardStats

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID]
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "shard.node": [SHARD_NODE]
                }
              },
              {
                "term": {
                  "elasticsearch.node.id":[NODE_ID]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "states": {
          "terms": {
            "field": "shard.state",
            "size": 10
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    },
    "nodes": {
      "terms": {
        "field": "shard.node",
        "size": 10000
      },
      "aggs": {
        "index_count": {
          "cardinality": {
            "field": "shard.index"
          }
        },
        "node_names": {
          "terms": {
            "field": "source_node.name",
            "size": 10
          }
        },
        "node_ids": {
          "terms": {
            "field": "source_node.uuid",
            "size": 1
          }
        }
      }
    }
  }
}

Loading the indices page results in a timeout.
Loading the machine learning page results in a timeout.

Both run this query:

getUnassignedShardData

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid":  [CLUSTER_STATE_UUID]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "state": {
          "filter": {
            "terms": {
              "shard.state": [
                "UNASSIGNED",
                "INITIALIZING"
              ]
            }
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}

The idea here is to investigate how to improve the queries and possibly include a time range while maintaining the same functionality.

@jennypavlova added the bug, sdh-linked, and Team:Monitoring labels on Aug 1, 2024
@jennypavlova changed the title from "[Infra] Investigate timeout issue and use of time range in stack monitoring queries" to "Investigate timeout issue and use of time range in stack monitoring queries" on Aug 1, 2024
@consulthys (Contributor) commented Aug 20, 2024

Regarding the getClustersState query

That query always returns an empty result set, as there are no documents in the monitoring indices (.monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*) with "type": "cluster_state". This is confirmed by the queries run by the customer and shared by @louisong in his comment, for both the custom user and the superuser (see Query 1 in those shared files, also provided below). Even though they return nothing, it would still be interesting to know the took time of those two queries.

Query 1
===============================================================================================
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                "GkDDZY7mT42RyVatNmEnbA"
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

Response
===============================================================================================
{} - no output

=> I think this query can be ruled out and we should probably not focus on it.

Regarding the getShardStats and getUnassignedShardData queries

As stated by @klacabane here, the shard documents are "siloed" by cluster state. In order to have a true picture of the assigned/unassigned shards, the queries need to fetch all shards for a given cluster state, and that's what the "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID] constraint does in both queries. That's semantically equivalent to providing the time range [last_cluster_state_change_time, now].

I assume that the latest cluster state_uuid is used in the two shard queries. Even if the shard documents that are retrieved are most certainly "recent" (i.e. from the latest cluster state), the lack of a time range constraint might indeed prevent Elasticsearch from skipping frozen shards.

A word of caution: adding a time range to these queries might alter the current behavior, since not all clusters change their state at the same pace. That being said, adding a reasonable time range (e.g. the last 10 days) could help increase the odds of leveraging the pre-filtering phase and skipping frozen shards. However, if the cluster state hasn't changed during that period, the result would be empty. Maybe, in a second iteration, this time range should even be configurable to cater for this possibility.
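For illustration, a minimal sketch of what such a clause could look like if appended to the bool.filter array of getShardStats and getUnassignedShardData (the 10-day window and reusing the timestamp field are assumptions for this experiment, not a final implementation):

          {
            "range": {
              "timestamp": {
                "gte": "now-10d",
                "lte": "now"
              }
            }
          }

A range filter on timestamp is what would allow the pre-filtering (can_match) phase to skip shards whose timestamp bounds fall entirely outside the window, including frozen ones.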

And since we're looking at improving performance, it could also make sense to remove the sort clause, since it serves no purpose in an aggregation-only query with size: 0.

Before attempting anything here, I'm going to ask support to have the customer re-run those shard queries with an additional 10-days-ish time range in order to see if that helps at all.

@consulthys (Contributor)

Circling back to this after having discussed with support: adding a time range to those queries makes them run much faster, because they no longer hit the frozen tier.

Knowing that they execute fast when they do not hit the frozen tier, we might not even have to add a time range to these two queries, since it should be impossible for the shard metricset documents that are part of the latest cluster state to be in any tier other than the hot tier. As a result, we could maybe leverage the _tier metadata field and only query the hot tier, which is what I'm going to ask support to try.
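As a sketch of that idea (assuming the clause would simply be appended to the existing bool.filter array; the exact shape still needs to be validated with support):

          {
            "term": {
              "_tier": "data_hot"
            }
          }

Since _tier is an index-level metadata field, such a filter should let Elasticsearch skip non-hot indices without searching them.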

@consulthys (Contributor) commented Sep 3, 2024

Adding a constraint on the _tier metadata field proves to be even more effective than adding a time range (which would be difficult to choose, given that ILM policies can be configured very differently from user to user).

A quick summary of how much the query took time decreased when using a tier constraint or a time range can be found below.

| Query | User | Using _tier | Using time range | No constraint |
| --- | --- | --- | --- | --- |
| getShardStats | User with DLS | 1819 ms | 2066 ms | 12009 ms |
| getShardStats | Superuser (no DLS) | 1561 ms | 5881 ms | 38850 ms |
| getUnassignedShardData | User with DLS | 1445 ms | 1423 ms | 9992 ms |
| getUnassignedShardData | Superuser (no DLS) | 1529 ms | 2603 ms | 7725 ms |

If we would like to pursue this, we see two options (a sketch of option 2 follows the list):

  1. only query "_tier": "data_hot"
  2. similarly to what's being done for APM, simply exclude "_tier": "data_frozen"
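A minimal sketch of option 2, assuming the exclusion is expressed as a must_not clause on the same top-level bool query used by both shard queries:

  "query": {
    "bool": {
      "filter": [
        ...existing filter clauses...
      ],
      "must_not": [
        {
          "term": {
            "_tier": "data_frozen"
          }
        }
      ]
    }
  }

Excluding only the frozen tier is the more conservative variant, since warm and cold data would remain searchable while the slow tier is avoided.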

@consulthys (Contributor) commented Sep 25, 2024

I eventually figured out that we cannot leverage the _tier field, because getShardStats and getUnassignedShardData go hand in hand with another query called getClustersStats that always runs just before them and which takes into account a time range. The time range selected by the user defines which cluster state/stats to consider, and hence, also which shard documents to look at.

So if the user decides to go back in time (e.g. 3 weeks) to monitor the behavior of an index or a node at that time, we need to retrieve the cluster state/stats/shards from that time, hence using _tier here would break that behavior.

What I'm trying next is to enhance the getShardStats and getUnassignedShardData queries with the same time range that is being used in the getClustersStats query. Adding the time range would have the benefit of limiting the blast radius of the queries and not querying the frozen tier for recent data; as we've seen in the previous update, the performance is still much better than querying the whole universe. When going back in time, however, the frozen tier would be hit anyway, but that is also expected.
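A sketch of that experiment, assuming the shard queries would receive the same bounds already passed to getClustersStats (TIME_RANGE_MIN and TIME_RANGE_MAX are placeholders in the same spirit as CLUSTER_UUID above; the epoch_millis format is an assumption):

          {
            "range": {
              "timestamp": {
                "format": "epoch_millis",
                "gte": TIME_RANGE_MIN,
                "lte": TIME_RANGE_MAX
              }
            }
          }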

@consulthys (Contributor) commented Oct 3, 2024

My last experiment described above yielded mixed results, depending on whether the cluster is monitored via internal monitoring or via Metricbeat (or the Elastic Agent).

Internal monitoring

When the cluster is monitored via internal monitoring (i.e. data stored in the .monitoring-es-7-* indices), the shard documents are reindexed every 10 seconds for all indices (even though the cluster state hasn't changed, but that's another discussion). Since the ID of those documents is constant for a given cluster state, each document gets deleted and reindexed with the latest timestamp on every new sampling period. Hence, a few seconds after you click Refresh, a whole new batch of shard documents overwrites the previous one with a timestamp that is outside of the selected time window.

We could argue that we could use only the start time and leave the time range open-ended, but that would only work when selecting "Last XX minutes/hours/days". Selecting any closed time range in the past (e.g. a specific day or week) wouldn't work the same way.
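For reference, the open-ended variant discussed above would only bound the lower end of the range (same placeholder convention as before):

          {
            "range": {
              "timestamp": {
                "gte": TIME_RANGE_MIN
              }
            }
          }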

Metricbeat or Elastic Agent monitoring

When the cluster is monitored via either Metricbeat or the Elastic Agent (i.e. data stored in the .monitoring-es-mb-8-* or metrics-* indices), adding the time range constraint would work, because documents are not being rewritten thanks to the "put if absent" semantics with a concrete ID introduced in ES 8. These semantics ensure that the shard documents do not "time-travel", even though they induced another issue, but that is unrelated to this one.
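To make the "put if absent" behavior concrete: with a deterministic document ID, a shipper can use the create endpoint, which refuses to overwrite an existing document instead of refreshing its timestamp. A minimal illustration (INDEX, DOC_ID and the body are placeholders, not the actual Metricbeat internals):

PUT INDEX/_create/DOC_ID
{
  "timestamp": "2024-10-03T00:00:00.000Z",
  "shard": { "index": "some-index", "state": "STARTED", "primary": true }
}

Re-running the exact same request returns a 409 version_conflict_engine_exception, so the original document, and therefore its original timestamp, is preserved.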

So what's next...

To summarize, the way internal monitoring currently works disqualifies the idea of using a time range. Given that internal monitoring's days are numbered (still being debated), I don't think we should/can ask them to introduce "put if absent" semantics.

I'm on the lookout for further ideas... Stay tuned...
