Should _cat API calls cause index refresh? #11225

Open
Jon-AtAWS opened this issue Nov 15, 2023 · 7 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing Search Search query, autocomplete ...etc

Comments

@Jon-AtAWS
Member

Is your feature request related to a problem? Please describe.
The doc count reported by _cat/indices lags reality when shards are idle because no queries have hit them.

Describe the solution you'd like
Calls to the _cat APIs should trigger shard refresh. At least, for calls like _cat/indices that expose shard statistics and metrics.

Describe alternatives you've considered
A call to the _count or other query API partially fixes the problem. However, it won't trigger refresh on shards that are not queried (e.g., replicas or primaries that the query does not hit). Subsequent _cat/indices calls can hit non-refreshed shards.

Since the _cat API is administrative, it won’t cause too much perf degradation in normal operating mode.
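One way to observe the lag described above is to compare the `_count` result with the `docs.count` column of `_cat/indices`. The following is a minimal sketch, not a definitive tool: the base URL, the index name passed in, and the helper names (`cat_doc_count`, `count_lag`, `measure_lag`) are assumptions for illustration, and it assumes a local cluster without authentication.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured test cluster


def cat_doc_count(cat_rows, index):
    """Extract docs.count for `index` from a parsed `_cat/indices?format=json` response."""
    for row in cat_rows:
        if row.get("index") == index:
            value = row.get("docs.count")
            # _cat returns numbers as strings; the value may be missing or null
            return int(value) if value is not None else None
    return None


def count_lag(count_body, cat_rows, index):
    """Documents by which _cat/indices trails _count (None if the index is not listed)."""
    cat_count = cat_doc_count(cat_rows, index)
    if cat_count is None:
        return None
    return count_body["count"] - cat_count


def fetch_json(path):
    """GET a path on the cluster and parse the JSON body."""
    with urlopen(BASE_URL + path) as resp:
        return json.load(resp)


def measure_lag(index):
    """Query both APIs for one index and report the difference."""
    count_body = fetch_json(f"/{index}/_count")
    cat_rows = fetch_json(f"/_cat/indices/{index}?format=json")
    return count_lag(count_body, cat_rows, index)
```

Because `_count` may itself wake the shards it hits, the order of the two calls affects what this sketch observes; a non-zero result only suggests that some shards were still stale.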

@Jon-AtAWS Jon-AtAWS added enhancement Enhancement or improvement to existing feature or request untriaged labels Nov 15, 2023
@andrross
Member

Honestly the shard idle optimization really seems like a lot of trouble. It's a nice optimization for the system to automatically optimize for bulk load scenarios that happen when no searches are happening, but I'm curious how common that actually is. The idle behavior is somewhat antithetical to other availability tenets like predictable performance (i.e. your system may work well because shards go idle, but then something changes to start sending sporadic search traffic and now ingestion starts failing because you were unknowingly dependent on the shards being idle).

@Jon-AtAWS
Member Author

@andrross - I understand and agree about predictable performance. Problem is, it was sometimes predictably bad performance. In the bad old days, before this optimization, we saw a 25-50% increase in throughput from adjusting refresh_interval from 1s up to a minute. Now that's more like 10%, best case. In other words, while you could set refresh_interval low, it would hurt you.

And, relying on load-only metrics is a problem. Actually, AFAIK, the OSB workloads don't mix queries and indexing - all of the indexing is up front. We need a test workload that runs mixed query/indexing for exactly this reason. Mixed workloads are the hardest to scale for, since you have competing concerns that have different load characteristics (e.g., fewer shards are better for query, more shards are better for indexing).

So, letting shards idle is a good optimization, especially for logs workloads that can actually go long periods without queries. I think we need an expanded definition of "query" that includes (some? all?) _cat APIs. And the query should wake shards and replicas to mitigate against inconsistent results.
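For readers unfamiliar with the refresh_interval adjustment mentioned above, here is a minimal sketch of how such a change could be prepared with the index settings API. The base URL, index name, and the helper name `refresh_interval_request` are hypothetical; the code only builds the request and does not send it.

```python
import json
from urllib.request import Request


def refresh_interval_request(base_url, index, interval):
    """Build a PUT /<index>/_settings request that sets index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical example: move a logs index from the 1s default to one minute.
req = refresh_interval_request("http://localhost:9200", "my-logs", "60s")
```

Passing `req` to `urllib.request.urlopen` would apply the change on a cluster that accepts unauthenticated requests; other setups would need credentials or TLS options added.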

@rishabh6788
Contributor

@Jon-AtAWS Regarding an OSB workload that supports indexing and search in parallel, you can use the pmc workload. We recently added an indexing-querying test procedure that does this. Use the --test-procedure="indexing-querying" OSB parameter with the pmc workload.

@msfroh
Collaborator

msfroh commented Nov 16, 2023

> Honestly the shard idle optimization really seems like a lot of trouble.

See also #9707

@msfroh msfroh added Search Search query, autocomplete ...etc Indexing Indexing, Bulk Indexing and anything related to indexing labels Nov 16, 2023
@msfroh msfroh removed the untriaged label Nov 16, 2023
@andrross
Member

Thanks @Jon-AtAWS, the historical perspective is super helpful.

Regarding the specific request here, treating _cat APIs like queries with regard to waking shards up makes a lot of sense. Returning very stale data is not a good experience. The biggest risk I see is that any deployments that have external automated monitoring systems (e.g. managed services) polling the _cat APIs could effectively disable the shard idle optimization with this change.

@Jon-AtAWS
Member Author

Jon-AtAWS commented Nov 16, 2023

Thanks @rishabh6788 - didn't know that! I'll go play with it.

@andrross - agreed, that's a tradeoff of choosing to poll admin APIs. Having said that, it's usual (? I don't have statistics, but I suspect...) to poll at 30s-1m intervals, so the impact should be pretty low to non-existent. And, I would argue that if you're polling these APIs you actually want accurate results. We can choose which APIs should wake shards and try to minimize as well.

Apart from _cat APIs, we should consider _stats, _nodes/stats, <index>/_stats, etc. for waking shards if they don't already.

I'm pretty sure cluster health should not wake up shards, but you could convince me...

@andrross
Member

> to poll at 30s-1m intervals, so the impact should be pretty low to non-existent

Given the default idle time of 30s, if you're polling at 30s then you would prevent shards from ever going idle. However, the point stands that if you're polling these APIs then you probably do want accurate results!

Agree that cluster health need not wake shards, but any API that reports information dependent on indexed data probably should.
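The polling arithmetic in this comment can be sketched with a toy model. It assumes the proposed behavior (each poll counts as a shard access), perfectly periodic polls, and no other search traffic; the function name `idle_fraction` and the model itself are illustrative assumptions, not how the engine computes idleness.

```python
DEFAULT_IDLE_AFTER_S = 30.0  # default idle time quoted in the thread


def idle_fraction(poll_interval_s, idle_after_s=DEFAULT_IDLE_AFTER_S):
    """Fraction of each polling period a shard spends idle in this toy model.

    After every poll the shard stays awake for `idle_after_s` seconds; whatever
    remains of the period before the next poll is counted as idle time.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    idle_time = max(0.0, poll_interval_s - idle_after_s)
    return idle_time / poll_interval_s


for interval in (10, 30, 60, 300):
    print(f"poll every {interval:>3}s -> idle {idle_fraction(interval):.0%} of the time")
```

Under these assumptions, polling every 30s or faster leaves no idle time at all, while polling once a minute leaves the shard idle for half of each period.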

Projects
Status: Later (6 months plus)

4 participants