Should _cat API calls cause index refresh? #11225
Comments
Honestly, the shard idle optimization really seems like a lot of trouble. It's a nice optimization for the system to automatically optimize for bulk-load scenarios that happen when no searches are happening, but I'm curious how common that actually is. The idle behavior is somewhat antithetical to other availability tenets like predictable performance (i.e., your system may work well because shards go idle, but then something changes, sporadic search traffic starts arriving, and now ingestion starts failing because you were unknowingly dependent on the shards being idle).
@andrross - I understand and agree about predictable performance. Problem is, it was sometimes predictably bad performance. In the bad old days, before this optimization, we saw a 25-50% increase in throughput from adjusting refresh_interval from 1s up to a minute. Now that's more like 10%, best case. In other words, while you could set refresh_interval low, it would hurt you. And relying on load-only metrics is a problem. Actually, AFAIK, the OSB workloads don't mix queries and indexing - all of the indexing is up front. We need a test workload that runs mixed query/indexing for exactly this reason. Mixed workloads are the hardest to scale for, since you have competing concerns with different load characteristics (e.g., fewer shards are better for query, more shards are better for indexing). So, letting shards idle is a good optimization, especially for logs workloads that can actually go long periods without queries. I think we need an expanded definition of "query" that includes (some? all?) _cat APIs. And the query should wake both primary shards and replicas to mitigate inconsistent results.
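For readers who want to reproduce the refresh_interval comparison mentioned above, here is a minimal sketch of the settings body one would send with `PUT /<index>/_settings`. The helper name `refresh_settings_body` is hypothetical and the `"60s"` value is just the "up to a minute" case from the comment; nothing here is sent to a cluster. As far as I know, setting refresh_interval explicitly also takes the index out of the search-idle behavior, so treat that as an assumption to verify.

```python
import json


def refresh_settings_body(refresh_interval: str = "60s") -> str:
    """Build a JSON body for PUT /<index>/_settings (hypothetical helper).

    Only the index.refresh_interval setting is included; the value is a
    time string such as "1s" or "60s".
    """
    return json.dumps({"index": {"refresh_interval": refresh_interval}})


# Print the body so it can be pasted into a REST client.
print(refresh_settings_body("60s"))
```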
@Jon-AtAWS Regarding OSB workload supporting indexing and search in parallel, you can use |
See also #9707
Thanks @Jon-AtAWS, the historical perspective is super helpful. Regarding the specific request here, treating |
Thanks @rishabh6788 - didn't know that! I'll go play with it. @andrross - agreed, that's a tradeoff of choosing to poll admin APIs. Having said that, it's usual (? I don't have statistics, but I suspect...) to poll at 30s-1m intervals, so the impact should be pretty low to non-existent. And I would argue that if you're polling these APIs you actually want accurate results. We can choose which APIs should wake shards and try to keep that set small. Apart from the _cat APIs, we should consider _stats, _nodes/stats, <index>/_stats, etc. for waking shards if they don't already. I'm pretty sure cluster health should not wake up shards, but you could convince me...
Given the default idle time of 30s, if you're polling at 30s then you would prevent shards from ever going idle. However, the point stands that if you're polling these APIs then you probably do want accurate results! Agree that cluster health need not wake shards, but any API that reports information dependent on indexed data probably should.
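The interaction between polling interval and idle timeout can be sketched with a toy model. It assumes (this is the proposal under discussion, not current behavior) that every poll wakes the shard and resets its idle timer, and that polls are the only search-like traffic. The function name `idle_fraction` is hypothetical; the 30s default is the idle time quoted above.

```python
def idle_fraction(poll_interval_s: float, idle_after_s: float = 30.0) -> float:
    """Fraction of each polling cycle a shard spends idle (toy model).

    Assumes each poll resets the idle timer, so the shard only becomes
    idle once idle_after_s has elapsed since the previous poll.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    idle_seconds = max(0.0, poll_interval_s - idle_after_s)
    return idle_seconds / poll_interval_s


# Compare a few polling intervals against the 30s idle timeout.
for interval in (10, 30, 60, 300):
    print(f"poll every {interval}s -> idle fraction {idle_fraction(interval):.2f}")
```

Under this model, polling at or below the idle timeout leaves no idle window at all, while polling once a minute still leaves the shard idle for half of each cycle.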
Is your feature request related to a problem? Please describe.
The doc count reported by _cat/indices lags reality when shards are in the idle state, i.e., when the shard has not received queries recently.
Describe the solution you'd like
Calls to the _cat APIs should trigger a shard refresh, at least for calls like _cat/indices that expose shard statistics and metrics.
Describe alternatives you've considered
A call to the _count or another query API partially fixes the problem. However, it won't trigger a refresh on shards that were not queried (e.g., replicas or primaries that did not serve the request). Subsequent _cat/indices calls can still hit non-refreshed shards.
Since the _cat API is administrative, it won’t cause too much perf degradation in normal operating mode.
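The alternative described above can be sketched as a request sequence. This is a minimal sketch that only builds the (method, URL) pairs and sends nothing; the helper name `workaround_requests`, the `base_url` default, and the index name are assumptions for illustration. The `explicit_refresh` variant swaps `_count` for `POST /<index>/_refresh`, which, as far as I know, reaches all shard copies rather than only the ones a query happens to hit.

```python
from urllib.parse import quote


def workaround_requests(index, base_url="http://localhost:9200", explicit_refresh=False):
    """Return the (method, url) pairs for the workaround (hypothetical helper).

    First a request that wakes shards (_count, or _refresh if requested),
    then the _cat/indices call whose doc count we actually want.
    """
    name = quote(index, safe="")
    if explicit_refresh:
        wake = ("POST", f"{base_url}/{name}/_refresh")
    else:
        # Partial fix: only the shard copies that serve the query are woken.
        wake = ("GET", f"{base_url}/{name}/_count")
    cat = ("GET", f"{base_url}/_cat/indices/{name}?h=index,docs.count&format=json")
    return [wake, cat]


for method, url in workaround_requests("logs-2024", explicit_refresh=True):
    print(method, url)
```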