
[Tiered Caching] [META] Performance benchmark plan #11464

Open
kiranprakash154 opened this issue Dec 5, 2023 · 3 comments
Assignees: kiranprakash154
Labels: benchmarking, enhancement, Performance, Roadmap:Cost/Performance/Scale, Search:Performance

Comments

@kiranprakash154 (Contributor)

This issue captures the performance benchmarks we plan to run as part of evaluating Tiered Caching, which is tracked here.
We will gather feedback from the community to catch anything we may have missed.

Tiered Caching is described here

Goals of Benchmarking

  • Ensure No Regression:
    • Verify no regressions for on-heap cache lookup when disk cache is enabled or disabled.
  • Latency Analysis:
    • Assess worst-case and best-case latency (cost of going to disk) with tiered caching, including on-heap and disk cache latency.
    • Understand when going to disk makes sense and when it does not.
    • How is latency affected in the scenarios below?
      • With tiered caching enabled and disabled, how does increasing the size of a cache value (10KB/100KB/1MB) affect latency?
      • With tiered caching enabled and disabled, how does increasing the number of keys of a given size affect latency? Example: 1/1k/1M keys with cache values of size 10KB/100KB/1MB.
  • Resource Utilization:
    • Measure CPU, memory, and disk I/O usage under all scenarios.
  • Caching Efficiency:
    • Evaluate cache hit/miss ratios for both on-heap and disk-based caches.
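The caching-efficiency goal reduces to per-tier hit/miss counters. A minimal stdlib sketch (class and method names are illustrative, not the OpenSearch cache-stats API):

```java
import java.util.concurrent.atomic.LongAdder;

// Per-tier hit/miss counters: one instance for the on-heap tier, one for the disk tier.
class TierStats {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    void recordHit()  { hits.increment(); }
    void recordMiss() { misses.increment(); }

    // Hit ratio = hits / (hits + misses); defined as 0.0 before any lookups.
    double hitRatio() {
        long h = hits.sum(), m = misses.sum();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
    }

    public static void main(String[] args) {
        TierStats heap = new TierStats();
        heap.recordHit();
        heap.recordMiss();
        System.out.println("on-heap hit ratio: " + heap.hitRatio());
    }
}
```

A benchmark run would then report the on-heap and disk ratios separately, so a heavy shift of hits from heap to disk shows up directly.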

Scenarios

1. Readonly vs Changing data

Mimic log-analytics behavior by using an index rollover policy and performing searches on:

  • An index with no active writes
  • An index with active writes
    • This index should have a refresh interval configured
    • Run multiple iterations with varying refresh intervals.

2. Concurrent Queries Caching into the Disk-Based Cache

Stress test the tiered cache with continuous search traffic by forcing requests to be cached. This will help us understand whether there is a limit we need to be aware of, or a sweet spot beyond which tiered caching stops making sense.

Concurrency is limited by the number of threads in the search threadpool, so we should test different instance sizes (xl/2xl/8xl/16xl), which increases the number of concurrent queries running on the node, and observe how the latency profile changes as concurrency grows.
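The concurrency sweep can be prototyped outside OSB with a fixed thread pool standing in for the search threadpool. A stdlib-only sketch (the simulated lookup cost is a placeholder, not a real cache call):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

class ConcurrencyProbe {
    // Issue `requests` simulated cache lookups across `clients` threads and
    // return the observed latencies in nanoseconds, sorted for percentile reads.
    static List<Long> run(int clients, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            futures.add(pool.submit(() -> {
                long start = System.nanoTime();
                Thread.sleep(1);          // stand-in for an on-heap or disk lookup
                return System.nanoTime() - start;
            }));
        }
        List<Long> latencies = new ArrayList<>();
        try {
            for (Future<Long> f : futures) latencies.add(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        Collections.sort(latencies);
        return latencies;
    }

    // Nearest-rank percentile over a sorted list.
    static long percentile(List<Long> sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }

    public static void main(String[] args) {
        for (int clients : new int[] {2, 8, 32}) {
            List<Long> lat = run(clients, 200);
            System.out.printf("clients=%d p50=%dns p99=%dns%n",
                clients, percentile(lat, 50), percentile(lat, 99));
        }
    }
}
```

The real sweep would replace the sleep with forced-cache search requests and plot p50/p99 against client count per instance size.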

3. Long-Running Performance Test

Run a workload for a long duration.

4. Tune parameters of Ehcache

We use Ehcache as the disk cache, and it exposes many parameters for tuning to specific use cases. We can experiment with these to see how they impact overall performance, for example:

    // Ehcache disk write minimum threads for its pool
    public final Setting<Integer> DISK_WRITE_MINIMUM_THREADS;

    // Ehcache disk write maximum threads for its pool
    public final Setting<Integer> DISK_WRITE_MAXIMUM_THREADS;

    // Not to be confused with the number of disk segments; this is different. Defines
    // distinct write queues created for the disk store, where a group of segments shares a write
    // queue. This is implemented in Ehcache using a partitioned thread pool executor. By default all
    // segments share a single write queue, i.e. write concurrency is 1. See OffHeapDiskStoreConfiguration
    // and DiskWriteThreadPool.
    public final Setting<Integer> DISK_WRITE_CONCURRENCY;

    // Defines how many segments the disk cache is separated into. Higher number achieves greater concurrency but
    // will hold that many file pointers.
    public final Setting<Integer> DISK_SEGMENTS;
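The segment-to-write-queue grouping that DISK_WRITE_CONCURRENCY controls boils down to a partition function like the following (an illustration of the idea, not Ehcache's actual code):

```java
class WriteQueuePartitioner {
    // With `concurrency` write queues, segment i is served by queue i % concurrency,
    // so a group of segments shares each queue; concurrency = 1 means every
    // segment funnels into one shared queue.
    static int queueFor(int segment, int concurrency) {
        return segment % concurrency;
    }

    public static void main(String[] args) {
        int segments = 16, concurrency = 4;
        for (int s = 0; s < segments; s++) {
            System.out.printf("segment %2d -> write queue %d%n", s, queueFor(s, concurrency));
        }
    }
}
```

Raising the concurrency setting therefore trades more parallel disk writers for more contention on file handles, which is exactly the dimension worth sweeping in this scenario.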

5. Invalidation While Adding to the Disk-Based Cache (High Disk I/O)

Drive a high throughput of search queries that are forced to be cached while, in parallel, frequent invalidations occur due to refreshes, with bulk writes in the mix.

6. Varying Cleanup Intervals

Mimic a use case with varying refresh intervals; currently the on-heap caches are cleaned up every minute. What is the behavior when cleanup runs too often vs. too rarely?
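A cleanup-interval sweep can be prototyped with a periodic cleaner whose period is the variable under test. A sketch, where the age-based staleness criterion is an assumption for illustration:

```java
import java.util.Map;
import java.util.concurrent.*;

class CacheCleaner {
    // Remove entries whose insertion time is older than `maxAgeMillis`;
    // returns how many entries were removed.
    static int clean(Map<String, Long> insertTimes, long nowMillis, long maxAgeMillis) {
        int before = insertTimes.size();
        insertTimes.values().removeIf(t -> nowMillis - t > maxAgeMillis);
        return before - insertTimes.size();
    }

    public static void main(String[] args) {
        Map<String, Long> cache = new ConcurrentHashMap<>();
        cache.put("q1", 0L);
        cache.put("q2", 900L);
        // Schedule the cleaner at the interval under test (1 minute mirrors the current default).
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            () -> clean(cache, System.currentTimeMillis(), 60_000L),
            0, 60, TimeUnit.SECONDS);
        scheduler.shutdown();
    }
}
```

Running the benchmark with the schedule period set well below and well above the refresh interval would expose the "too often vs. too rarely" behavior this scenario asks about.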

7. Varying Disk Threshold

Mimic a use case with a varying disk threshold. We have considered 50% as the default, so the cache cleaner only cleans the disk cache when the keys to clean up account for at least 50% of the keys (by count, not by space) on disk.
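The threshold check itself is just a stale-key ratio by count; a minimal sketch (names are hypothetical):

```java
class DiskCleanupPolicy {
    // Clean the disk tier only when stale keys account for at least `threshold`
    // (e.g. 0.5) of all keys on disk, counted by key, not by bytes.
    static boolean shouldClean(long staleKeys, long totalKeys, double threshold) {
        if (totalKeys == 0) return false;
        return (double) staleKeys / totalKeys >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldClean(40, 100, 0.5));  // 40% stale: skip cleanup
        System.out.println(shouldClean(60, 100, 0.5));  // 60% stale: clean
    }
}
```

Sweeping `threshold` in the benchmark then trades cleanup I/O frequency against how long stale entries linger on disk.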

Test Dimensions

  1. Workload
    Concurrent segment search was benchmarked with http_logs & nyc_taxis. We need to generate unique queries to create many cache entries; we will take the existing search queries and introduce randomness into them.

  2. Search Clients
    OSB lets you control the number of clients through a parameter.

  3. Shard size
    Following our recommendation of shard sizes between 20 and 50 GB

  4. Various instance types
    EBS vs SSD
    EBS-based instance types - r5.large, r5.2xlarge, r5.8xlarge
    SSD-based instance types - i3.large, i3.2xlarge, i3.8xlarge
    Number of cores - few to many
    Graviton vs Intel
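The query-randomization idea in dimension 1 can be sketched by perturbing a range filter so each request produces a distinct cache key (the timestamp field and bounds are placeholders, not part of the actual workloads):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class RandomQueryGenerator {
    // Build a range query whose bounds are shifted by a seeded random offset,
    // so repeated runs yield many distinct (separately cacheable) request bodies.
    static String rangeQuery(Random rng) {
        int start = rng.nextInt(1_000_000);
        int span = 1_000 + rng.nextInt(9_000);
        return String.format(
            "{\"query\":{\"range\":{\"timestamp\":{\"gte\":%d,\"lt\":%d}}}}",
            start, start + span);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);          // fixed seed keeps runs reproducible
        Set<String> unique = new HashSet<>();
        for (int i = 0; i < 1_000; i++) unique.add(rangeQuery(rng));
        System.out.println("distinct queries: " + unique.size());
    }
}
```

A fixed seed is worth keeping so that two benchmark runs being compared issue the same sequence of "random" queries.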

@kiranprakash154 kiranprakash154 added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 5, 2023
@kiranprakash154 kiranprakash154 self-assigned this Dec 5, 2023
@kiranprakash154 kiranprakash154 added benchmarking Issues related to benchmarking or performance. Performance This is for any performance related enhancements or bugs and removed untriaged labels Dec 5, 2023
@kiranprakash154 kiranprakash154 added Search:Performance Search Search query, autocomplete ...etc labels Dec 5, 2023
@kiranprakash154 (Contributor, Author) commented Dec 5, 2023

@reta @anasalkouz @andrross would like to get your feedback!

@anasalkouz (Member) commented:

Thanks @kiranprakash154 for the performance benchmark plan. Some things to consider for the disk-based cache:

  • Make sure the disk-based cache will not overwhelm the cluster and won't compete with indexed data.
  • Make sure to cover other cases/features that use disk storage significantly, like searchable snapshots and force merges. How do we assign enough disk space for each of them? Maybe try to benchmark the caching feature while doing force merges or using searchable snapshots.
  • Can the replication method impact eviction? Just make sure it works with SegRep.
  • Is caching per node? If so, how do we route the same queries to the same node for a better cache hit rate?

@kiranprakash154 (Contributor, Author) commented Dec 20, 2023

Thanks for your comments @anasalkouz

Make sure disk based cache will not overwhelm cluster and won't compete with indexed data.

Yes, we limit the size of the disk cache, beyond which it starts evicting, so in terms of space we should not have any issues unless the customer sets the config very high.
Scenario 5 (Invalidation While Adding to the Disk-Based Cache, High Disk I/O) will help us gather more data on how Ehcache behaves in a high-throughput environment.

Is caching per node? then how to route same queries to same node for better cache hit.

We don't have to handle this separately. The way it works today, when the in-memory cache (which is already at the node level) fills up, instead of just evicting from the heap we evict and add to the disk tier.
So the route a query took to reach this node does not change.
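The spillover behavior described here can be sketched with an access-ordered on-heap LRU whose evictions land in a disk-tier map (a stdlib illustration only, not the actual OpenSearch implementation):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class SpilloverCache<K, V> {
    private final Map<K, V> diskTier = new HashMap<>();
    private final Map<K, V> heapTier;

    SpilloverCache(int heapCapacity) {
        // Access-ordered LRU: when the heap tier exceeds capacity, the eldest
        // entry is not dropped but moved down to the disk tier.
        heapTier = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > heapCapacity) {
                    diskTier.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(K key, V value) { heapTier.put(key, value); }

    // Heap lookup first; fall back to the (slower) disk tier on a heap miss.
    V get(K key) {
        V v = heapTier.get(key);
        return v != null ? v : diskTier.get(key);
    }

    boolean onHeap(K key) { return heapTier.containsKey(key); }
    boolean onDisk(K key) { return diskTier.containsKey(key); }

    public static void main(String[] args) {
        SpilloverCache<String, Integer> cache = new SpilloverCache<>(2);
        cache.put("q1", 1);
        cache.put("q2", 2);
        cache.put("q3", 3);   // heap is full: q1 spills to the disk tier
        System.out.println("q1 on disk: " + cache.onDisk("q1"));
    }
}
```

Because spillover happens locally at eviction time, query routing is untouched: the node that owned the on-heap entry also owns its disk-tier copy.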

May be try to benchmark the caching feature while you are doing force merges or using searchable snapshots.

I think scenario 5 should cover the behavior during high disk I/O regardless of what is causing it, but I get your point; I will also make sure this does not regress the SegRep feature.

@andrross andrross added the Roadmap:Cost/Performance/Scale Project-wide roadmap label label May 31, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024