
[Feature Request] Support vertical scaling for snapshot repository data cache limit #16298

Closed
ashking94 opened this issue Oct 12, 2024 · 10 comments · Fixed by #16489
Labels
enhancement (Enhancement or improvement to existing feature or request), good first issue (Good for newcomers), skip-changelog, Storage:Snapshots, v2.19.0 (Issues and PRs related to version 2.19.0), v3.0.0 (Issues and PRs related to version 3.0.0)

Comments

@ashking94
Member

Is your feature request related to a problem? Please describe

As of today, the repository data is not cached if its compressed size (using the default DEFLATE compression) exceeds 500 KB. This limit has remained unchanged for a long time, even as heap sizes have grown. If a repository holds enough snapshots to push the repository data beyond 500 KB, it must be downloaded repeatedly during clone, restore, snapshot finalisation, and snapshot status/GET operations, among many other use cases. This leads to elevated latency for these operations. Whether we scale vertically or horizontally, the limit stays as is.

Describe the solution you'd like

To mitigate the issue mentioned above, I propose making the cache size a configurable percentage (x%) of the heap size. This would let solutions like vertical scaling avoid hitting the remote store on every fetch of repository data that has not been updated. (A rough sketch follows the issue details below.)

Related component

Storage:Snapshots

Describe alternatives you've considered

No response

Additional context

No response
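
Illustratively, a minimal sketch of deriving the cache limit from the heap size; the class name and the 1% figure here are assumptions for illustration, not the implementation that was eventually merged:

public final class RepositoryDataCacheLimit {

    // Assumed fraction; the discussion below considers values in the 1-10% range.
    private static final double CACHE_HEAP_FRACTION = 0.01;

    // Compute the cache limit from the maximum heap available to the JVM,
    // so vertical scaling (a bigger heap) raises the limit automatically.
    static long cacheLimitBytes() {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        return (long) (maxHeapBytes * CACHE_HEAP_FRACTION);
    }

    public static void main(String[] args) {
        System.out.println("Repository data cache limit: " + cacheLimitBytes() + " bytes");
    }
}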

@ashking94 added the enhancement, untriaged, and skip-changelog labels and removed the untriaged label on Oct 12, 2024
@ashking94 changed the title from "[Feature Request] Support dynamic limit for snapshot repository data cache limit" to "[Feature Request] Support vertical scaling for snapshot repository data cache limit" on Oct 12, 2024
@gbbafna added the good first issue label and removed the untriaged label on Oct 17, 2024
@inpink
Contributor

inpink commented Oct 24, 2024

Hello, I’d like to work on this issue! I’ll do my best to resolve it as quickly as possible and submit a PR. Thank you for creating this as a good first issue.

@inpink
Contributor

inpink commented Oct 27, 2024

I just submitted a PR for the feature addition related to this issue! Thank you for leaving such an interesting issue. 😄

@ashking94
Member Author

I just submitted a PR for the feature addition related to this issue! Thank you for leaving such an interesting issue. 😄

Thanks for raising the PR. I will take a look at this soon.

@dbwiddis
Member

If there are numerous snapshots in a repo that leads to size being more than 500KB

Do you have any data on how big the size tends to scale?

@inpink
Contributor

inpink commented Oct 29, 2024

I would like to share my own experience in case it might be of some use.

The repository data mentioned here stores metadata of snapshots (such as snapshotId, snapshotState, shardGenerations, indexMetaBlobs, and newIdentifiers).
In BlobStoreRepository, repositoryData is cached, and this issue proposes raising that cache limit.

I referred to the addRandomSnapshotsToRepoData method in BlobStoreRepositoryTests.
Referring to the code below, each time a snapshot containing an index was saved, the compressed size of the cached data increased by approximately 300 bytes.
The size of repositoryData varied depending on metadata length and the number of indices.

// Adapted from addRandomSnapshotsToRepoData in BlobStoreRepositoryTests:
// build a snapshot with a single-shard index and add it to the repository data.
SnapshotId snapshotId = new SnapshotId(randomAlphaOfLength(8), UUIDs.randomBase64UUID());
final ShardGenerations.Builder builder = ShardGenerations.builder();
builder.put(new IndexId(randomAlphaOfLength(1), UUIDs.randomBase64UUID()), 0, "1");

final ShardGenerations shardGenerations = builder.build();
// Map each index to a random 256-character metadata identifier.
final Map<IndexId, String> indexLookup = shardGenerations.indices()
    .stream()
    .collect(Collectors.toMap(Function.identity(), ind -> randomAlphaOfLength(256)));
repoData = repoData.addSnapshot(
    snapshotId,
    randomFrom(SnapshotState.SUCCESS, SnapshotState.PARTIAL, SnapshotState.FAILED),
    Version.CURRENT,
    shardGenerations,
    indexLookup,
    indexLookup.values().stream().collect(Collectors.toMap(Function.identity(), ignored -> UUIDs.randomBase64UUID(random())))
);

According to the official documentation, snapshots tend to be taken frequently at short intervals.
If a snapshot contains multiple indices and many snapshots are saved, the size of the repositoryData that needs to be cached could grow significantly.

Additionally, from examining the testReadAndWriteSnapshotsThroughIndexFile method in BlobStoreRepositoryTests, I found that RepositoryData.EMPTY occupied 59 bytes when cached.
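
For reference, this kind of measurement can be reproduced with plain JDK DEFLATE compression; the probe below is illustrative and not part of the OpenSearch test suite:

import java.util.zip.Deflater;

public final class CompressedSizeProbe {

    // Returns the DEFLATE-compressed size of a payload, the same compression
    // the 500 KB repository data cache threshold is applied against.
    static int compressedSize(byte[] payload) {
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] sample = "serialized repository data would go here".getBytes();
        System.out.println(compressedSize(sample) + " bytes after DEFLATE");
    }
}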

@dbwiddis
Member

dbwiddis commented Oct 29, 2024

Thanks for the detail. I guess what I'm looking for is what sort of max value ("sensible upper bound") we might consider. I've no argument with increasing it over 500k, but the entire heap shouldn't be dedicated to this cache either.

With a 64GB heap, 500K is 0.76% of the heap. Is 1% enough? 2%? Even memory-hungry features are often limited to 10%. Where in this range is sensible? (Edit: I see I'm off by a factor of 1024. I did this pre-coffee.)

@inpink
Contributor

inpink commented Oct 29, 2024

Thank you for the detailed insights! I agree with your perspective. I also think that setting too high a cache ratio, like 100%, is likely a misconfiguration, even if the ratio is user-configurable.

I looked into how OpenSearch handles configurable cache sizes. For instance, the index-related settings "indices.requests.cache.size" and "indices.fielddata.cache.size" don't impose a strict upper limit, while the KNN plugin's "knn.model.cache.size.limit" caps the maximum at 25%.

Given that BlobStoreRepository caches only metadata (not actual data), I believe a smaller cache ratio is suitable.
While I lack real-world data, I made estimates based on official documents.
According to AWS OpenSearch guidelines, the maximum recommended shard count for a node is proportional to the JVM heap memory, targeting around 25 shards per GiB.

Assuming 12 indexes without replicas, approximately 12 × 300 = 3,600 bytes are needed per snapshot per 1 GiB of heap.
AWS also states that hourly snapshots are taken and kept for 14 days, totaling 336 snapshots (AWS Snapshot Documentation).

3,600 bytes × 336 snapshots = 1,209,600 bytes ≈ 1.15 MiB, which is about 0.11% of a 1 GiB heap.
In this case, a 0.11% cache allocation would require 1.15 MiB of memory.
For larger metadata or more snapshots, a maximum cache threshold of 5-10% is advisable.

I made these estimates based on available information, but I would be grateful if you could review them and share your thoughts.

(Note: I think there may have been a typo! I believe you mentioned 500MB for 0.76% of 64GB heap memory 😄)

@andrross
Member

Would it be worth considering using a soft reference for this cached metadata? It would mitigate the risk of holding on to a large amount of memory, since the object could be GC'd when the system is under memory pressure. I have never used soft references in practice, and I know there are definitely some downsides (e.g. non-deterministic behavior), but the behavior we want here is "hold on to this object if we've got heap memory to spare, because there is a good chance we'll need it again", which is exactly the problem soft references are intended to solve.
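
For illustration, a minimal sketch of that idea using java.lang.ref.SoftReference; the class and field names are hypothetical, not the actual BlobStoreRepository code:

import java.lang.ref.SoftReference;

final class SoftRepositoryDataCache {

    // The GC may clear a SoftReference under memory pressure, so the cache
    // holds on to the data only while there is heap to spare.
    private volatile SoftReference<byte[]> cached = new SoftReference<>(null);

    void put(byte[] compressedRepositoryData) {
        cached = new SoftReference<>(compressedRepositoryData);
    }

    // Returns null when nothing is cached or the entry has been reclaimed;
    // callers then fall back to downloading the repository data again.
    byte[] get() {
        return cached.get();
    }
}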

@dbwiddis
Member

(Note: I think there may have been a typo! I believe you mentioned 500MB for 0.76% of 64GB heap memory 😄)

Well, I typed what I was thinking but you're right, I was off by a factor of 1024. Oops. :)

Would it be worth considering using a soft reference for this cached metadata?

I like this direction, particularly if we allow a "larger" limit.

@ashking94
Member Author

If there are numerous snapshots in a repo that leads to size being more than 500KB

Do you have any data on how big the size tends to scale?

Thanks @inpink for the calculation.

To summarise, the repository data size increases with each snapshot. The incremental size per snapshot is a function of the number of indexes, the number of primary shards, and index metadata changes. We should be able to come up with a deterministic function for the size accordingly (see the sketch after this comment). The size can grow tremendously for log-analytics use cases with frequent snapshots. We may want a reasonably higher limit (maybe 3-5%) where snapshotting is a mission-critical use case.

Would it be worth considering using a soft reference for this cached metadata?

I also like this direction, as long as it doesn't introduce inconsistent behaviour in any of the snapshot operations.
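
As a rough illustration of such a deterministic sizing function: a hypothetical back-of-the-envelope estimator built from the ~300 bytes per index per snapshot and the 59-byte empty baseline measured earlier in this thread, not the actual implementation:

public final class RepositoryDataSizeEstimator {

    // Constants taken from the measurements shared earlier in this thread.
    private static final long BYTES_PER_INDEX_PER_SNAPSHOT = 300;
    private static final long EMPTY_REPOSITORY_DATA_BYTES = 59;

    static long estimateCompressedBytes(long snapshots, long indicesPerSnapshot) {
        return EMPTY_REPOSITORY_DATA_BYTES + snapshots * indicesPerSnapshot * BYTES_PER_INDEX_PER_SNAPSHOT;
    }

    public static void main(String[] args) {
        // 336 hourly snapshots kept for 14 days, 12 indices each: ~1.15 MiB.
        System.out.println(estimateCompressedBytes(336, 12) + " bytes");
    }
}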

@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.19.0 Issues and PRs related to version 2.19.0 labels Nov 8, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Nov 12, 2024