Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Make AsyncShardFetch cache in GatewayAllocator bounded with eviction policy #10316

Open
sam-herman opened this issue Oct 2, 2023 · 8 comments
Assignees
Labels
bug Something isn't working Cluster Manager

Comments

@sam-herman
Copy link
Contributor

Describe the bug
Currently AsyncShardFetch requests triggered during reroute in the GatewayAllocator have a cache that is not bounded and eventually can cause the ClusterManager to run out of heap memory when clusters with multiple nodes and shards are starting up after a ClusterManager restart.

To Reproduce
When provisioning large clusters with multiple nodes and shards you can take a heap dump and observe that the GatewayAllocator retain most of the heap.

github issue for OpenSearch regarding cache

Expected behavior

  1. Cache will have limits and will start evict older entries when reaching the limits
  2. Limits should be configurable with default heuristic that calculates desired cache size and eviction policy based on heap size.
@sam-herman sam-herman added bug Something isn't working untriaged labels Oct 2, 2023
@Bukhtawar
Copy link
Collaborator

Tagging the meta issue #8098 for better visibility.

@gtahhan
Copy link

gtahhan commented Oct 5, 2023

I can work on this

@gbbafna gbbafna assigned gbbafna and gtahhan and unassigned gbbafna Oct 6, 2023
@amkhar
Copy link
Contributor

amkhar commented Oct 16, 2023

@samuel-oci

Thanks for putting up this thought, this is interesting as it'll help in overall resiliency of this particular flow. Currently we're also trying to reduce the size of cache itself by going deep one level down and removing unnecessary data present in cache like number of DiscoveryNode objects as @Bukhtawar pointed out about meta issue for improving this area.

Quick question

Cache will have limits and will start evict older entries when reaching the limits

We're still clearing the cache after shard is started. And it makes sense to keep ShardEntry in cache to avoid re-fetching the metadata from data nodes. Evicting an entry for which shard initialize is not called yet will require us to do async fetch again for that shard. Want to understand your thoughts on this.

@sam-herman
Copy link
Contributor Author

Hey @amkhar that's correct understanding of the proposal. You are correct to point out that without sufficient memory for cache there will be repeated asyncFetch requests to retrieve the same entry. This for sure is a tradeoff proposed in the event where the cache overfill the heap.
I believe @gtahhan had done some experimentation and found that eviction penalty of refetching is manageable. Perhaps he can provide more details.

@amkhar
Copy link
Contributor

amkhar commented Jan 11, 2024

Hi @gtahhan are you still working on this ?

@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5]
@samuel-oci Thanks for opening this issue.

@rwali-aws
Copy link

@gtahhan Please let us know in case you are still working on this.

@rwali-aws rwali-aws moved this from 🆕 New to Later (6 months plus) in Cluster Manager Project Board May 9, 2024
@gtahhan
Copy link

gtahhan commented May 9, 2024

Hi @rwali-aws @amkhar sorry I didnt get the chance to work on it as of now... if its urgent, please you can un assign it from myself... Thank you and sorry again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Cluster Manager
Projects
Status: Later (6 months plus)
Development

No branches or pull requests

9 participants