[BUG] Make AsyncShardFetch cache in GatewayAllocator bounded with eviction policy #10316
Comments
Tagging the meta issue #8098 for better visibility.
I can work on this.
@samuel-oci Thanks for putting up this thought; it's interesting, as it will help the overall resiliency of this particular flow. We are currently also trying to reduce the size of the cache itself by going one level down and removing unnecessary data present in the cache, such as the many DiscoveryNode objects, as @Bukhtawar pointed out in the meta issue for improving this area. Quick question:
We still clear the cache after a shard is started, and it makes sense to keep ShardEntry objects in the cache to avoid re-fetching metadata from the data nodes. Evicting an entry whose shard has not yet been initialized will require us to do the async fetch again for that shard. I'd like to understand your thoughts on this.
Hey @amkhar, that's a correct understanding of the proposal. You are right to point out that without sufficient cache capacity there will be repeated asyncFetch requests to retrieve the same entry. That is indeed the tradeoff proposed for the case where the cache would otherwise overfill the heap.
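To make the tradeoff concrete, here is a minimal sketch of a count-bounded LRU cache where an evicted entry forces a simulated re-fetch on the next access. All names here (`ShardFetchCache`, `getOrFetch`, `fetchFromDataNodes`) are illustrative assumptions, not the actual GatewayAllocator API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a size-bounded LRU cache for shard fetch results.
// A miss (entry never cached, or evicted earlier) triggers a re-fetch,
// which models the extra asyncFetch round trips discussed above.
class ShardFetchCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;
    int refetchCount = 0; // counts simulated fetches from data nodes

    ShardFetchCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // eviction policy: drop least recently used
    }

    V getOrFetch(K key, Function<K, V> fetchFromDataNodes) {
        V value = get(key);
        if (value == null) { // miss: evicted or never cached, so fetch again
            refetchCount++;
            value = fetchFromDataNodes.apply(key);
            put(key, value);
        }
        return value;
    }
}
```

With a bound of 2 entries, touching a third shard evicts the least recently used one, and revisiting the evicted shard costs one extra fetch; bounded memory is traded for occasional repeated fetches.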
Hi @gtahhan, are you still working on this?
@gtahhan Please let us know if you are still working on this.
Hi @rwali-aws @amkhar, sorry, I haven't had the chance to work on it yet. If it's urgent, please feel free to unassign it from me. Thank you, and sorry again.
Describe the bug
Currently, AsyncShardFetch requests triggered during reroute in the GatewayAllocator use a cache that is not bounded, which can eventually cause the ClusterManager to run out of heap memory when a cluster with many nodes and shards starts up after a ClusterManager restart.
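Since the failure mode is heap exhaustion rather than entry count, one plausible shape for the fix is a cache bounded by estimated entry weight in bytes, evicting least-recently-used entries until the total fits a budget. This is only a sketch under assumed names (`WeightBoundedCache`, the byte-array payload standing in for fetched shard metadata), not the proposed implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a memory-budgeted cache: each entry carries an
// estimated byte weight, and least-recently-used entries are evicted
// until the running total stays under the budget.
class WeightBoundedCache<K> {
    private final long maxBytes;
    private long currentBytes = 0;
    // accessOrder = true so iteration order is least-recently-used first
    private final LinkedHashMap<K, byte[]> map = new LinkedHashMap<>(16, 0.75f, true);

    WeightBoundedCache(long maxBytes) { this.maxBytes = maxBytes; }

    void put(K key, byte[] payload) {
        byte[] old = map.put(key, payload);
        if (old != null) currentBytes -= old.length;
        currentBytes += payload.length;
        // evict LRU entries until the cache is back within its byte budget
        Iterator<Map.Entry<K, byte[]>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<K, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    boolean contains(K key) { return map.containsKey(key); }
    long bytes() { return currentBytes; }
    int size() { return map.size(); }
}
```

A production version would more likely use an existing weigher-based cache (for example Caffeine's `maximumWeight`), but the byte-budget eviction loop above captures the core idea.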
To Reproduce
When provisioning large clusters with many nodes and shards, take a heap dump and observe that the GatewayAllocator retains most of the heap.
Expected behavior
The AsyncShardFetch cache in the GatewayAllocator is bounded and evicts entries according to an eviction policy, so it cannot exhaust the ClusterManager heap.