[BUG] Make AsyncShardFetch cache in GatewayAllocator bounded with eviction policy #10316

sam-herman · 2023-10-02T22:17:33Z

Describe the bug
Currently, AsyncShardFetch requests triggered during reroute in the GatewayAllocator use a cache that is not bounded. When a cluster with many nodes and shards starts up after a ClusterManager restart, this cache can eventually cause the ClusterManager to run out of heap memory.

To Reproduce
When provisioning large clusters with many nodes and shards, take a heap dump and observe that the GatewayAllocator retains most of the heap.

Expected behavior

The cache will have limits and will start evicting older entries when it reaches them.
Limits should be configurable, with a default heuristic that calculates the desired cache size and eviction policy based on heap size.
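
To make the expected behavior concrete, here is a minimal sketch of a bounded cache with least-recently-used eviction and a heap-based default size. This is not the OpenSearch implementation: the class name `BoundedFetchCache`, the method `defaultMaxEntries`, and the parameters `heapFraction` and `assumedBytesPerEntry` are all hypothetical, and the heuristic is only one possible way to derive a limit from heap size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical bounded cache with least-recently-used (LRU) eviction.
 * Illustrative only; not thread-safe and not the GatewayAllocator's actual cache.
 */
final class BoundedFetchCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedFetchCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be >= 1");
        }
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry
        return size() > maxEntries;
    }

    /**
     * Assumed heuristic: spend a fixed fraction of the maximum heap on the cache
     * and divide by an estimated per-entry size. Both inputs are guesses a caller
     * would need to tune or expose as settings.
     */
    static int defaultMaxEntries(long maxHeapBytes, double heapFraction, long assumedBytesPerEntry) {
        long budgetBytes = (long) (maxHeapBytes * heapFraction);
        long entries = budgetBytes / assumedBytesPerEntry;
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, entries));
    }
}
```

A caller could pass `Runtime.getRuntime().maxMemory()` as `maxHeapBytes` to get a default, and override the resulting limit through configuration. Since `LinkedHashMap` is not thread-safe, a real implementation would need external synchronization or a concurrent cache.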

Bukhtawar · 2023-10-03T07:48:05Z

Tagging the meta issue #8098 for better visibility.

gtahhan · 2023-10-05T16:03:15Z

I can work on this

amkhar · 2023-10-16T04:44:39Z

@samuel-oci

Thanks for raising this. It's interesting, as it will help the overall resiliency of this particular flow. We're currently also trying to reduce the size of the cache itself by going one level deeper and removing unnecessary data held in the cache, such as the number of DiscoveryNode objects, as @Bukhtawar pointed out when referencing the meta issue for improving this area.

Quick question

"Cache will have limits and will start evicting older entries when reaching the limits"

We're still clearing the cache after the shard is started, and it makes sense to keep the ShardEntry in the cache to avoid re-fetching the metadata from data nodes. Evicting an entry for a shard whose initialization has not been called yet will require us to do an async fetch again for that shard. I'd like to understand your thoughts on this.

sam-herman · 2023-10-20T17:42:32Z

Hey @amkhar, that's a correct understanding of the proposal. You are right to point out that without sufficient memory for the cache there will be repeated asyncFetch requests to retrieve the same entry. This is certainly a tradeoff, proposed for the case where the cache would otherwise overfill the heap.
I believe @gtahhan has done some experimentation and found that the eviction penalty of refetching is manageable. Perhaps he can provide more details.
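
The tradeoff discussed here can be illustrated with a small sketch in which a cache miss triggers a fresh fetch, so a tighter limit means more repeated fetches. The class `RefetchingCache` and its `fetcher` callback are hypothetical stand-ins, not the AsyncShardFetch API, and the fetch is synchronous here purely to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical read-through cache: on a miss (including after eviction)
 * the value is fetched again, and the number of fetches is counted.
 * Illustrative only; not thread-safe.
 */
final class RefetchingCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> fetcher; // stand-in for a fetch from data nodes
    private int fetchCount = 0;

    RefetchingCache(final int maxEntries, Function<K, V> fetcher) {
        this.fetcher = fetcher;
        // Access-ordered map that evicts the least recently used entry past the limit
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            // Miss: the entry was never fetched or has been evicted, so fetch again
            value = fetcher.apply(key);
            fetchCount++;
            entries.put(key, value);
        }
        return value;
    }

    int fetchCount() {
        return fetchCount;
    }
}
```

With a limit of one entry, reading shard A, then shard B, then shard A again costs three fetches; with a limit of two, the same sequence costs two. Whether that extra cost is acceptable depends on measurements like the ones mentioned above.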

amkhar · 2024-01-11T09:51:57Z

Hi @gtahhan, are you still working on this?

peternied · 2024-03-06T16:51:22Z

[Triage - attendees 1 2 3 4 5]
@samuel-oci Thanks for opening this issue.

rwali-aws · 2024-05-09T09:44:52Z

@gtahhan Please let us know in case you are still working on this.

gtahhan · 2024-05-09T15:50:08Z

Hi @rwali-aws @amkhar, sorry, I haven't had the chance to work on it yet. If it's urgent, please feel free to unassign it from me. Thank you, and sorry again.

sam-herman added the bug and untriaged labels Oct 2, 2023

gbbafna assigned gbbafna and gtahhan and unassigned gbbafna Oct 6, 2023

anasalkouz added the Other label Dec 14, 2023

ankitkala added Cluster Manager and removed Other labels Dec 17, 2023

peternied removed the untriaged label Mar 6, 2024

github-project-automation bot added this to Cluster Manager Project Board Mar 6, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board Mar 6, 2024

rwali-aws moved this from 🆕 New to Later (6 months plus) in Cluster Manager Project Board May 9, 2024
