[Feature Request] Support vertical scaling for snapshot repository data cache limit #16298
Comments
Hello, I’d like to work on this issue! I’ll do my best to resolve it as quickly as possible and submit a PR. Thank you for creating this as a good first issue.
I just submitted a PR for the feature addition related to this issue! Thank you for leaving such an interesting issue. 😄
Thanks for raising the PR. I will take a look at this soon.
Do you have any data on how big the size tends to scale?
I would like to share my own experience in case it might be of some use. According to the official documentation, snapshots tend to be saved frequently over short intervals. Additionally, from examining the testReadAndWriteSnapshotsThroughIndexFile method in BlobStoreRepositoryTests, I found that RepositoryData.EMPTY required 59 bytes of cache.
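For anyone who wants to reproduce that kind of measurement outside the test suite, a minimal sketch using plain java.util.zip DEFLATE follows; the JSON payload is only a placeholder, not the actual serialized form of RepositoryData.EMPTY, and the real repository code uses its own compressor wrapper.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

// Minimal sketch: measure how many bytes a blob occupies after DEFLATE
// compression, the same codec applied before the repository data cache check.
public final class CompressedSizeProbe {

    static int compressedSize(byte[] serializedRepositoryData) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(serializedRepositoryData);
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder payload; the referenced test serializes RepositoryData.EMPTY,
        // which reportedly comes out to about 59 bytes after compression.
        byte[] payload = "{\"snapshots\":[],\"indices\":{}}".getBytes(StandardCharsets.UTF_8);
        System.out.println("compressed size = " + compressedSize(payload) + " bytes");
    }
}
```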
Thanks for the detail. I guess what I'm looking for is what sort of max value ("sensible upper bound") we might consider. I have no argument with increasing it beyond 500 KB, but the entire heap shouldn't be dedicated to this cache either. With a 64 GB heap, 500 KB is 0.76% of the heap. Is 1% enough? 2%? Even memory-hungry features are often limited to 10%. Where in this range is sensible? (Edit: I see I'm off by a factor of 1024. I did this pre-coffee.)
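For reference, the factor-of-1024 correction works out as follows, assuming binary units:

$$
\frac{500\ \text{KB}}{64\ \text{GB}} = \frac{500}{64 \times 1024 \times 1024} \approx 7.5 \times 10^{-6} \approx 0.00075\%,
\qquad
0.76\% \times 64\ \text{GB} \approx 500\ \text{MB}.
$$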
Thank you for the detailed insights! I agree with your perspective. I also think that setting too high a cache ratio, like 100%, is likely a misconfiguration, even if the ratio is made configurable for the user. I looked into how OpenSearch handles custom cache ratios. For instance, the index-related settings "indices.requests.cache.size" and "indices.fielddata.cache.size" don't impose a strict limit. With that in mind, and assuming 12 indexes without replicas, I made a rough estimate of how large the repository data could grow.
I made these estimates based on available information, but I would be grateful if you could review them and share your thoughts. (Note: I think there may have been a typo! I believe you meant 500 MB for 0.76% of a 64 GB heap 😄)
Would it be worth considering using a soft reference for this cached metadata? We could mitigate the risk of holding on to a large amount of memory if it were able to be GC'd when the system is under memory pressure. I have never used soft references in practice, and I know there are definitely some downsides (e.g. non-deterministic behavior), but I think the behavior we want here is "hold on to this object if we've got heap memory to spare because there is a good chance we'll need it again", which is the problem soft references are intended to solve.
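A minimal sketch of what a soft-reference-backed holder might look like; the class and method names here are hypothetical illustrations, not existing OpenSearch code.

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Hypothetical holder: the cached repository data stays reachable only through a
// SoftReference, so the JVM may reclaim it under memory pressure, and we fall
// back to re-fetching from the blob store when the reference has been cleared.
final class SoftCachedRepositoryData<T> {

    private volatile SoftReference<T> cached = new SoftReference<>(null);

    // loader re-downloads and deserializes the repository data from the repository.
    T getOrLoad(Supplier<T> loader) {
        T value = cached.get();
        if (value == null) {
            value = loader.get();                 // miss, or reference cleared by GC
            cached = new SoftReference<>(value);
        }
        return value;
    }

    void invalidate() {
        cached = new SoftReference<>(null);       // e.g. after a new snapshot changes the generation
    }
}
```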
Well, I typed what I was thinking but you're right, I was off by a factor of 1024. Oops. :)
I like this direction, particularly if we allow a "larger" limit.
Thanks @inpink for the calculation. To summarise, the repository data size increases with each snapshot. The incremental size per snapshot is a function of the number of indexes, the number of primary shards, and index metadata changes, so we should be able to come up with a deterministic sizing function accordingly (see the sketch below). This size can grow tremendously for log-analytics use cases with frequent snapshots. We may want a reasonably higher limit (maybe 3-5%) where snapshotting is a mission-critical use case.
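To illustrate what such a deterministic function could look like, here is a rough, hypothetical estimator; the per-entry byte costs are illustrative placeholders, not constants measured from RepositoryData.

```java
// Hypothetical back-of-the-envelope estimator for uncompressed repository data size.
// All per-entry byte costs below are made-up placeholders for illustration only.
final class RepositoryDataSizeEstimator {

    static long estimateUncompressedBytes(int snapshots, int indices, int primaryShardsPerIndex) {
        long perSnapshotEntry = 200;      // snapshot id, state, version metadata
        long perIndexEntry = 150;         // index id and name mapping
        long perIndexPerSnapshot = 80;    // index membership recorded per snapshot
        long perShardGeneration = 50;     // shard generation UUIDs per index copy
        return (long) snapshots * perSnapshotEntry
            + (long) indices * perIndexEntry
            + (long) snapshots * indices * perIndexPerSnapshot
            + (long) snapshots * indices * primaryShardsPerIndex * perShardGeneration;
    }

    public static void main(String[] args) {
        // e.g. a log-analytics style repository: many snapshots over a handful of indexes.
        System.out.println(estimateUncompressedBytes(1000, 12, 5) + " bytes (placeholder constants)");
    }
}
```

With these placeholder constants, 1,000 snapshots over 12 indexes with 5 primary shards each comes out to roughly 4 MB uncompressed; the point is only that the terms multiplied by the snapshot count dominate as snapshots accumulate.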
I also like this direction, as long as it does not introduce inconsistent behaviour in any of the snapshot operations.
Is your feature request related to a problem? Please describe
As of today, the repository data is not cached if its compressed size (using the default DEFLATE codec) exceeds 500 KB. This limit has stayed the same for a long time and has not been changed as heap sizes have grown. If a repository has enough snapshots that the repository data exceeds 500 KB, then it has to be downloaded repeatedly during clone, restore, and finalize snapshot operations, as well as snapshot status/GET calls, among many other use cases, leading to elevated latency for all of them. Whether we scale vertically or horizontally, the limit stays as is.
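Roughly, today's behaviour amounts to a fixed-threshold guard along these lines; this is a hypothetical paraphrase, not the actual OpenSearch source, and the field and method names are made up.

```java
// Hypothetical paraphrase of the current behaviour: the compressed repository data
// is kept in memory only when it fits under a hard-coded 500 KB ceiling; anything
// larger is re-downloaded from the blob store by every operation that needs it.
final class FixedRepositoryDataCache {

    private static final long CACHE_LIMIT_BYTES = 500L * 1024;

    private volatile byte[] cachedCompressedRepositoryData;   // null when too large to cache

    void maybeCache(byte[] compressedRepositoryData) {
        if (compressedRepositoryData.length <= CACHE_LIMIT_BYTES) {
            cachedCompressedRepositoryData = compressedRepositoryData;
        } else {
            cachedCompressedRepositoryData = null;
        }
    }
}
```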
Describe the solution you'd like
To mitigate the issue mentioned above, I propose making the cache size limit a configurable x% of the heap size. This would allow vertical scaling to avoid hitting the remote store on every fetch of repository data when it has not been updated.
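A minimal sketch of the proposed limit, derived directly from the JVM's max heap; the 1% figure and class name are placeholders, not a final proposal, and in practice this would more likely be expressed through OpenSearch's settings infrastructure as a memory-size setting.

```java
// Hypothetical sketch: size the repository data cache limit as a fraction of the
// heap, so vertically scaled nodes automatically get a proportionally larger cache.
final class HeapRelativeCacheLimit {

    private static final double HEAP_FRACTION = 0.01;   // placeholder: x% of heap

    static long limitBytes() {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();   // effectively -Xmx
        return (long) (maxHeapBytes * HEAP_FRACTION);
    }

    public static void main(String[] args) {
        System.out.printf("repository data cache limit: %d bytes (%.2f MB)%n",
            limitBytes(), limitBytes() / (1024.0 * 1024.0));
    }
}
```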
Related component
Storage:Snapshots
Describe alternatives you've considered
No response
Additional context
No response