Add vertical scaling and SoftReference for snapshot repository data cache #16489

inpink · 2024-10-27T06:43:03Z

Description

Background

Currently, snapshot repository data is not cached if its compressed size exceeds 500KB.
This static limit causes repeated downloads in operations like clone, restore, and status checks, increasing latency.
The limit has not been adjusted for larger heap sizes, impacting repositories with numerous snapshots.
It doesn’t adjust with vertical or horizontal scaling. This restricts caching efficiency, even with enhanced system resources.

Changes

The new setting, opensearch.snapshot.cache.threshold, allows users to adjust the cache limit as a percentage of heap size. To maintain consistency in how cache sizes are set across OpenSearch, I referenced existing configurations like indices.requests.cache.size and indices.fielddata.cache.size.

snapshot.repository_data.cache.threshold (String): Defines the maximum threshold of the snapshot repository data cache. This value can be specified either as an absolute value (e.g., 2GB) or as a percentage of the node’s heap memory (e.g., 3%). It is a static setting, meaning it must be defined in the opensearch.yml file.
Based on the discussions, I set the user-selectable cache size range within 500kb ... min(max(500kb, 1% of heap memory), Integer.MAX - 8)), with a default of min(max(500kb, 1% of heap memory), Integer.MAX - 8)).
For implementing the maximum limit, I referred to the knn plugin and its knn.model.cache.size.limit setting.
(It would be helpful if the Setting class could directly support maximum memory size limits in the future.)
I applied a SoftReference to latestKnownRepositoryData, which is already managed with an AtomicReference for thread safety. The approach of combining both references was inspired by Gradle’s implementation.
The previous warning threshold for cached data was set at 5MB, but this has now been changed to issue a warning at min(configured limit * 10, Integer.MAX - 8).
I believed that a static setting, rather than dynamic, was more suitable here to avoid further amplifying the unpredictability of SoftReference.

(I plan to contribute documentation for this feature to the OpenSearch project. I have already opened an issue for documentation contribution.)

Testing

Setting Tests

I conducted tests in MemorySizeSettingsTests to verify that the snapshot.repository_data.cache.threshold option allocates memory as intended. I also confirmed that exceptions are triggered if the specified percentage exceeds the maximum limit or falls below the minimum limit.

E2E Test with Docker

I assembled the code I modified and ran it in Docker to conduct E2E testing to ensure everything functions properly overall.
(If any additional data is needed for these E2E tests, I am happy to provide it at any time.)

Cache Threshold	Image
Default (1%)
In Range(1024KiB)
Over Limit(10%)
Below Minimum(499KiB)

(Click the image to view it in full size.)

Cache Testing

I wrote test code to confirm that the SoftReference works as expected. However, since latestKnownRepositoryData is private, I temporarily changed its access level to protected in my local environment, and confirmed that the tests passed.

testCacheRepositoryData: Verifies that caching works as expected in repository.latestKnownRepositoryData.
testSoftReferenceRepositoryDataCacheCleared: Confirms that the cache is cleared when there is memory pressure and no objects are referencing latestKnownRepositoryData.

    public void testCachingRepositoryData() throws Exception {
        final BlobStoreRepository repository = setupRepo();
        final long pendingGeneration = repository.metadata.pendingGeneration();

        // Check the initial state of repository data and cache status
        assertThat(OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(repository)
            .getSnapshotIds().size(), equalTo(0));
        final RepositoryData emptyData = RepositoryData.EMPTY;
        writeIndexGen(repository, emptyData, emptyData.getGenId());
        RepositoryData repoData = OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(
            repository);

        // Verify caching status
        SoftReference<Tuple<Long, BytesReference>> cachedRef = repository.latestKnownRepositoryData.get();
        assertNotNull(cachedRef);
        Tuple<Long, BytesReference> cachedData = cachedRef.get();
        assertNotNull(cachedData);

        // Verify data consistency
        assertEquals(repoData, emptyData);
        assertEquals(repoData.getIndices().size(), 0);
        assertEquals(repoData.getSnapshotIds().size(), 0);
        assertEquals(pendingGeneration + 1L, repoData.getGenId());
    }

    public void testSoftReferenceRepositoryDataCacheCleared() throws Exception {
        final BlobStoreRepository repository = setupRepo();

        // Check the initial state of repository data and cache status
        assertThat(OpenSearchBlobStoreRepositoryIntegTestCase.getRepositoryData(repository)
            .getSnapshotIds().size(), equalTo(0));
        final RepositoryData emptyData = RepositoryData.EMPTY;
        writeIndexGen(repository, emptyData, emptyData.getGenId());

        // Induce memory pressure to test SoftReference eviction
        try {
            final ArrayList<Object[]> allocations = new ArrayList<>();
            while (true) {
                int size = (int) Math.min(Runtime.getRuntime().freeMemory() / 2, Integer.MAX_VALUE);
                allocations.add(new Object[size]);
            }
        } catch (OutOfMemoryError e) {
            // Memory shortage situation occurred
            System.out.println("Memory pressure caused OutOfMemoryError as expected");
        }

        System.gc();

        // Check if cache has been evicted
        SoftReference<Tuple<Long, BytesReference>> cachedRef = repository.latestKnownRepositoryData.get();
        assertNotNull(cachedRef);
        Tuple<Long, BytesReference> cachedData = cachedRef.get();
        assertNull(cachedData);
    }

Closing Thoughts

I’m very pleased to have contributed to OpenSearchㅡit was an exciting feature to add! Participating in discussions with excellent maintainers to provide a robust caching feature for users has been an incredibly valuable experience for me. If there’s anything that needs adjustment, I’ll make the changes as quickly as possible.

Related Issues

Resolves #16298

Check List

Functionality includes testing.
~~[ ] API changes companion pull request created, if applicable.~~
~~[ ]Public documentation issue/PR created, if applicable.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-10-27T07:11:38Z

❌ Gradle check result for e746f68: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

reta · 2024-11-08T18:23:49Z

Despite keeping everything up to date, the CI continues to fail. I’ll give it another try in a few hours! :)


> Task :server:compileTestJava
/var/jenkins/workspace/gradle-check/search/server/src/test/java/org/opensearch/repositories/blobstore/BlobStoreRepositoryTests.java:661: warning: [static] static method should be qualified by type name, BlobStoreRepository, instead of by an expression
        long maxThreshold = repository.calculateMaxSnapshotRepositoryDataCacheThreshold();
                                      ^
error: warnings found and -Werror specified

github-actions · 2024-11-08T18:35:53Z

❌ Gradle check result for a607693: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

reta · 2024-11-08T19:20:12Z

Thanks @inpink , @andrross LGTY?

inpink · 2024-11-08T19:22:08Z

Task :server:compileTestJava
/var/jenkins/workspace/gradle-check/search/server/src/test/java/org/opensearch/repositories/blobstore/BlobStoreRepositoryTests.java:661: warning: [static] static method should be qualified by type name, BlobStoreRepository, instead of by an expression
long maxThreshold = repository.calculateMaxSnapshotRepositoryDataCacheThreshold();
^
error: warnings found and -Werror specified

Oh, wow, thank you so much. I hadn’t caught this error when running with JUnit. In my future contributions, I’ll be sure to verify tests with Gradle before pushing!

github-actions · 2024-11-08T20:21:04Z

✅ Gradle check result for 46b46be: SUCCESS

…data cache - Applies `SoftReference` to cached repository data for efficient memory management under heap pressure. - Enables cache size configuration in `opensearch.yml`, adjustable within a range of 500KB to 1% of heap memory. - Sets the default cache size to `Math.max(ByteSizeUnit.KB.toBytes(500), CACHE_MAX_THRESHOLD / 2)` so it’s generally proportional to heap size. In cases where 1% of the heap is less than 1000KB, indicating a low-memory environment, the default reverts to 500KB as before. - Since `BytesReference` internally uses `byte[]`, the compressed array size is capped at `Integer.MAX_VALUE - 8` to ensure compatibility with JDK limitations on array sizes. Therefore, the maximum cache size cannot exceed this limit. Signed-off-by: inpink <[email protected]>

github-actions · 2024-11-12T04:07:09Z

✅ Gradle check result for eeb0be0: SUCCESS

opensearch-trigger-bot · 2024-11-12T12:52:24Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16489-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 53d41d3fac28c9f72d5883467d5a6211ad09feac
# Push it to GitHub
git push --set-upstream origin backport/backport-16489-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16489-to-2.x.

…data cache (opensearch-project#16489) - Applies `SoftReference` to cached repository data for efficient memory management under heap pressure. - Enables cache size configuration in `opensearch.yml`, adjustable within a range of 500KB to 1% of heap memory. - Sets the default cache size to `Math.max(ByteSizeUnit.KB.toBytes(500), CACHE_MAX_THRESHOLD / 2)` so it’s generally proportional to heap size. In cases where 1% of the heap is less than 1000KB, indicating a low-memory environment, the default reverts to 500KB as before. - Since `BytesReference` internally uses `byte[]`, the compressed array size is capped at `Integer.MAX_VALUE - 8` to ensure compatibility with JDK limitations on array sizes. Therefore, the maximum cache size cannot exceed this limit. Signed-off-by: inpink <[email protected]> (cherry picked from commit 53d41d3)

…data cache (opensearch-project#16489) - Applies `SoftReference` to cached repository data for efficient memory management under heap pressure. - Enables cache size configuration in `opensearch.yml`, adjustable within a range of 500KB to 1% of heap memory. - Sets the default cache size to `Math.max(ByteSizeUnit.KB.toBytes(500), CACHE_MAX_THRESHOLD / 2)` so it’s generally proportional to heap size. In cases where 1% of the heap is less than 1000KB, indicating a low-memory environment, the default reverts to 500KB as before. - Since `BytesReference` internally uses `byte[]`, the compressed array size is capped at `Integer.MAX_VALUE - 8` to ensure compatibility with JDK limitations on array sizes. Therefore, the maximum cache size cannot exceed this limit. Signed-off-by: inpink <[email protected]> (cherry picked from commit 53d41d3) Signed-off-by: Andriy Redko <[email protected]>

injae-kim · 2024-11-12T14:14:31Z

Nice work @inpink !! 👍👍👍

inpink · 2024-11-12T16:32:00Z

Thank you so much, @injae-kim! 🎉
I really appreciate the encouragement and advice! 🙏 I’ll keep striving to grow and improve! 😊

…data cache (#16489) (#16624) - Applies `SoftReference` to cached repository data for efficient memory management under heap pressure. - Enables cache size configuration in `opensearch.yml`, adjustable within a range of 500KB to 1% of heap memory. - Sets the default cache size to `Math.max(ByteSizeUnit.KB.toBytes(500), CACHE_MAX_THRESHOLD / 2)` so it’s generally proportional to heap size. In cases where 1% of the heap is less than 1000KB, indicating a low-memory environment, the default reverts to 500KB as before. - Since `BytesReference` internally uses `byte[]`, the compressed array size is capped at `Integer.MAX_VALUE - 8` to ensure compatibility with JDK limitations on array sizes. Therefore, the maximum cache size cannot exceed this limit. (cherry picked from commit 53d41d3) Signed-off-by: inpink <[email protected]> Signed-off-by: Andriy Redko <[email protected]> Co-authored-by: inpink <[email protected]>

github-actions bot added enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers skip-changelog Storage:Snapshots labels Oct 27, 2024

inpink force-pushed the 16298-solve branch 2 times, most recently from 9061419 to e746f68 Compare October 27, 2024 06:54

inpink force-pushed the 16298-solve branch from e746f68 to 1fb0d58 Compare October 27, 2024 07:17

inpink force-pushed the 16298-solve branch 3 times, most recently from 2a3cbc5 to 46b46be Compare November 8, 2024 19:18

reta approved these changes Nov 8, 2024

View reviewed changes

reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.19.0 Issues and PRs related to version 2.19.0 labels Nov 8, 2024

andrross force-pushed the 16298-solve branch from 46b46be to eeb0be0 Compare November 12, 2024 01:19

andrross approved these changes Nov 12, 2024

View reviewed changes

reta merged commit 53d41d3 into opensearch-project:main Nov 12, 2024
37 of 38 checks passed

reta added backport 2.x Backport to 2.x branch and removed skip-changelog labels Nov 12, 2024

opensearch-trigger-bot bot added the backport-failed label Nov 12, 2024

reta mentioned this pull request Nov 12, 2024

[Backport] [2.x] feat: add vertical scaling and SoftReference for snapshot repository data cache (#16489) #16624

Merged

This was referenced Nov 12, 2024

[AUTOCUT] Gradle Check Flaky Test Report for SpecificClusterManagerNodesIT #15944

Open

[AUTOCUT] Gradle Check Flaky Test Report for SearchWeightedRoutingIT #15843

Open

[AUTOCUT] Gradle Check Flaky Test Report for SearchWithRandomExceptionsIT #15828

Closed

github-actions bot added the skip-changelog label Dec 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vertical scaling and SoftReference for snapshot repository data cache #16489

Add vertical scaling and SoftReference for snapshot repository data cache #16489

inpink commented Oct 27, 2024 •

edited

Loading

github-actions bot commented Oct 27, 2024

reta commented Nov 8, 2024

github-actions bot commented Nov 8, 2024

reta commented Nov 8, 2024

inpink commented Nov 8, 2024

github-actions bot commented Nov 8, 2024

github-actions bot commented Nov 12, 2024

opensearch-trigger-bot bot commented Nov 12, 2024

injae-kim commented Nov 12, 2024

inpink commented Nov 12, 2024

Add vertical scaling and SoftReference for snapshot repository data cache #16489

Add vertical scaling and SoftReference for snapshot repository data cache #16489

Conversation

inpink commented Oct 27, 2024 • edited Loading

Description

Background

Changes

Testing

Closing Thoughts

Related Issues

Check List

github-actions bot commented Oct 27, 2024

reta commented Nov 8, 2024

github-actions bot commented Nov 8, 2024

reta commented Nov 8, 2024

inpink commented Nov 8, 2024

github-actions bot commented Nov 8, 2024

github-actions bot commented Nov 12, 2024

opensearch-trigger-bot bot commented Nov 12, 2024

injae-kim commented Nov 12, 2024

inpink commented Nov 12, 2024

inpink commented Oct 27, 2024 •

edited

Loading