[RFC] [META] Variance Analysis for Performance Runs #398
Comments
Last week, I set up two EC2 instances: one that triggers a test with traditional ingestion and subsequent search, and another that runs snapshot restoration and subsequent search. Both tests target clusters of the same configuration (OpenSearch 2.9 single-node clusters) and ingest their metrics and results into the same metrics store. Based on the last 5 days that the setup has run, runs that use snapshot restoration show less variance than runs that use traditional ingestion.
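For context, the snapshot-restoration runs skip re-ingesting and instead restore pre-built indices; a rough sketch of that restore step (the repository and snapshot names below are placeholders, not the ones used in this setup):

```bash
# Restore the workload indices from an existing snapshot instead of re-ingesting.
# "benchmark-repo" and "http-logs-baseline" are placeholder names.
curl -X POST "localhost:9200/_snapshot/benchmark-repo/http-logs-baseline/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "logs-*",
  "include_global_state": false
}'
```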
Follow ups
|
@IanHoang -- I was talking with @rishabh6788 a few weeks ago and speculating that segment topology (i.e. how big the different segments in each shard are) could have some impact. If we have snapshots that perform well and other snapshots that perform poorly, could we restore those and dump their segment layouts to compare? In particular, I think we can learn from the "good" segment topologies and tune merge settings to make good performance the default. (See opensearch-project/OpenSearch#11163 for details.) |
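For reference, the merge settings in question can be adjusted per index; a minimal sketch of tuning them on one index (the values shown are illustrative only, not recommendations):

```bash
# Adjust tiered merge policy knobs on a single index; values here are examples only.
curl -X PUT "localhost:9200/logs-241998/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.floor_segment": "500mb",
  "index.merge.policy.segments_per_tier": 10
}'
```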
Based on @msfroh's feedback, I implemented snapshots to be taken after each traditional run and got the following results across 7 days. To narrow our focus, I grabbed the snapshots that performed poorly and well for a single operation: in this case, the snapshots from 12/9/2023 9AM CDT and 12/13/2023 9AM CDT, since the first timestamp performed poorly and the second performed well for that operation.

12/9/2023 9AM CDT Segments (Worse Performance)

12/13/2023 9AM CDT Segments (Better Performance)
|
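For anyone reproducing the per-shard segment comparisons above, the _cat segments API gives a quick view; a minimal sketch (the index pattern and column selection are just an illustration):

```bash
# List segments for the http_logs indices, one row per segment.
curl -s "localhost:9200/_cat/segments/logs-*?v&h=index,shard,segment,docs.count,size,committed,searchable"
```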
I profiled queries for the desc-sort-timestamp operation in the worst-performing snapshot to see where time is mostly being spent. In summary, desc-sort-timestamp reaches 35 shards, and all shards spend around the same amount of time except for the shards in logs-241998. logs-241998 is the largest index in http_logs, approximately 10x bigger than the other indices in the cluster. As one can see, the shards in logs-241998 differ in time not only from shards in other indices but also from one another within the same index. When looking at …

Next steps:
|
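As a pointer for reproducing the profiling step above, per-request profiling can be enabled in the search body; a rough sketch approximating a descending timestamp sort on the http_logs indices (the field name @timestamp is an assumption based on the workload's mapping):

```bash
# Profile a descending sort on @timestamp across the http_logs indices.
curl -s -X GET "localhost:9200/logs-*/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": { "match_all": {} },
  "sort": [ { "@timestamp": { "order": "desc" } } ]
}'
```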
Held back by the 1.2.0 release and other tasks.
|
The segments are just files on disk on the data nodes. If you can get onto the data nodes, you can see the contents of each shard, like:
That's a very small shard with 10 segments, where most segments are about 17 kilobytes (but segments _5, _6, and _7 are a little smaller). |
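The listing itself isn't reproduced above; for anyone trying this, a rough sketch of where those files live on a data node (the data path and shard number below are assumptions that depend on your install):

```bash
# Resolve the index UUID, then list the Lucene segment files for shard 0.
# /var/lib/opensearch is an assumed data path; adjust for your deployment.
INDEX_UUID=$(curl -s "localhost:9200/_cat/indices/logs-241998?h=uuid" | tr -d '[:space:]')
ls -lh "/var/lib/opensearch/nodes/0/indices/${INDEX_UUID}/0/index/"
```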
Thanks @msfroh! This is helpful. I currently have a test running with logs-241998 only but will ssh into the data node and see what shows up. Will also check what shows up for the original snapshots we were inspecting. |
Segment Distribution for Shards in the logs-241998 Index
|
Not seeing …
@msfroh Based on the files above for each shard, are there any that stand out to you? |
For numeric range queries (which tend to dominate http_logs), the … In this particular case, all the shards are merged to a single segment. In general, we would expect two indices constructed from the same data and merged to a single segment across all shards to behave similarly. (There may be slight variations based on the document ordering within shards, but I wouldn't expect more than a low single-digit percentage change.) How does it compare to another run with different performance characteristics? If we have two runs, each with the same data and each merged to a single segment (per shard), and performance is wildly different, then we'll need to figure out an explanation for that. |
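For reference, the single-segment-per-shard condition can be reproduced with the force merge API; a minimal sketch using the index discussed above:

```bash
# Merge every shard of the index down to one segment (expensive; test clusters only).
curl -X POST "localhost:9200/logs-241998/_forcemerge?max_num_segments=1"
```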
Rerun of same configuration
|
Segment Tuning Experiment 1
Results

Segments from Disk
Results |
Number of segments in cluster after experiment 1:
After running experiment 1, I ran a modified version of …

Plugging these two timestamps into an epoch converter, we get the following dates:
|
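For reference, the same conversion can be done locally; a small sketch assuming the segment timestamps are epoch milliseconds (the value below is illustrative, not one of the timestamps discussed):

```bash
# Convert an epoch-millisecond timestamp to a human-readable UTC date (GNU date).
ts_millis=1702134000000   # illustrative value only
date -u -d "@$((ts_millis / 1000))"
```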
Segment Tuning Experiment 2
Results

Number of Segments: 29

Segments from _cat API
Segments from Disk
Segment Timestamps
|
Converted times
|
Segment Tuning Experiment 3
Created the index first and updated the …

Results

Number of Segments: 22

Segments from _cat API
Segments from Disk
Segment Timestamps
|
Converted timestamps for experiment 3
|
Segment Tuning Experiment 4
Created the index first and updated the …

Results

Number of Segments: 12

Segments from _cat API
Segments from Disk
Segment Timestamps
|
Converted Segment Timestamps
|
@IanHoang -- Can you please update the units for the floor segment size to MB in each of your comments above? A floor segment size of 500GB would merge really aggressively. (I hope it was 500MB.) Also, this last run makes me very curious to see how the … Ideally, we will be able to set parameters so that we end up with segments that have less overlap (so each segment covers approximately half a day's worth of docs). |
@msfroh I have updated the comments above to show the floor segment sizes in MB. Regarding the latest run, it was a typo and was meant to show MB. I captured the following in my worklog when I curled the settings before I started the run:

Also, could you elaborate on what you mean by "segment skipping"? |
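For reference, the settings check mentioned above looks roughly like this (index name assumed; output omitted):

```bash
# Dump the effective merge policy settings, including defaults, as flat keys.
curl -s "localhost:9200/logs-241998/_settings?include_defaults=true&flat_settings=true&pretty" \
  | grep "index.merge.policy"
```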
Segment Tuning Experiment 5
Created the index first and updated the …

Results

Number of Segments: 11

Segments from _cat API
Segments from Disk
Segment Timestamps
|
Converted Timestamps for Segment Tuning Experiment 5
|
Tiered Merge Policy Segment Tuning Experiment 1
Results

Task Latency

Number of Segments: 1

Segments from _cat API
Segment Timestamps
|
Log Byte Size Merge Policy Segment Tuning Experiment 1
Results

Task Latency

Number of Segments: 24

Segments from _cat API
Segment Timestamps
|
@IanHoang -- Do you have multiple clients running? Does the test run shuffle documents? For … With each segment containing docs from Jul 13 or 14, it sounds like later documents are being added throughout the run. |
Had an offline discussion with @msfroh, @rishabhmaurya, and @gkamat. This will be the plan going forward:
Testing Scenarios
I will conduct these experiments first and paste the results in a Google spreadsheet here afterwards. This will make analysis easier and prevent polluting this issue. |
Segment Tuning Analysis Results

Purpose

These tests were conducted to determine if the variance seen in query service times in …

Test Setup

Eight OpenSearch 2.11 clusters were provisioned, along with an EC2 instance for each of the eight clusters. Each EC2 instance contains the necessary scripts to configure its cluster based on four categories -- Max Segment Size (GB), Floor Segment Size (MB), Merge Policy, and Segments Per Tier -- and to trigger a series of …

While RSD provides a percentage that represents the variance for each scenario, we must also plot the raw data by query to better visualize the variance. These results are shown in the next section.

Results

Average Scenario Results and Relative Standard Deviation (RSD)
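For clarity, RSD here is the standard deviation of a query's service times divided by the mean, expressed as a percentage; a minimal sketch of the calculation (the input file name is hypothetical, one service time in milliseconds per line):

```bash
# Compute mean and relative standard deviation (RSD) from a list of service times.
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       var = sumsq / n - mean * mean
       if (var < 0) var = 0          # guard against tiny negative values from rounding
       sd = sqrt(var)                # population standard deviation
       printf "mean=%.2f ms  rsd=%.2f%%\n", mean, (sd / mean) * 100
     }' service_times_ms.txt
```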
Raw Runs Plotted by Query

Notes:

Conclusions

The experiments demonstrate that modifications to the four categories -- Max Segment Size (GB), Floor Segment Size (MB), Merge Policy, and Segments Per Tier -- that influence segment distribution and sizes certainly impact query service time and variance. However, the results observed above do not highlight a superior scenario. It can be inferred that modifying these four categories can potentially improve performance for specific sort queries when compared with the default configuration (Scenario 2), but despite this improvement for some sort queries, variance still persists. Folks interested in recreating the testing apparatus and retrieving their own results are welcome to follow the steps and scripts provided in this public repository; the README has all the required steps. For future experiments, it would be interesting to see segment tuning analysis done with other workloads, such as the new … |
Thanks @IanHoang -- If we squint, it looks like scenario 5 is "sometimes" better than scenario 2, but it's not consistent enough that I would recommend that we change the default. We do still have opensearch-project/OpenSearch#7160 as an open issue to explore lowering the max segment size to better-balance segment sizes, but that's focused on improving performance of concurrent segment search (which I don't think was covered by these experiments -- and concurrent segment search has interesting side-effects on inter-segment dynamic pruning). |
Synopsis and Motivation
OpenSearch Benchmark is currently running nightly runs with various configurations. In a few cases, some queries have been detected to have more variance than others. For example, a nightly run on a multi-node cluster with the `http_logs` workload has shown variance for `sort` queries, as depicted below.

Variance in test runs prevents the community from accurately identifying regressions. The goal of this RFC is to find optimal OpenSearch configurations and rule out any causes of variance, which will in turn improve the effort of obtaining consistent and reproducible numbers.
Questions
We should aim to answer the following:
Strategy
Instead of using the data from nightly runs (or at benchmarks.opensearch.org), we should replicate the nightly runs setup but on a smaller scale. Our setup will include the following characteristics:
- Workload: `http_logs`
- OpenSearch versions: `2.9` (or any other stable release should be sufficient)
- Test cluster configurations (subject to change throughout the process):
The tests will be run on a nightly basis and will replace the ingestion phase with restoring from a snapshot. This should have a couple of benefits. First, it should reduce the overall testing time. Second, it could potentially reduce query variance, since it's suspected that the ingestion phase can be inconsistent and influence query performance. Metrics and test results will be channeled into an external metrics datastore, which will be used to build visualizations for each query. These visualizations will give us a better idea of which queries have variance and will spawn deeper investigations.
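As a rough illustration of the kind of run being replicated, a sketch of an OpenSearch Benchmark invocation against an already-provisioned cluster (the endpoint and option values are assumptions for illustration):

```bash
# Run the http_logs workload against an existing cluster and write results to a file;
# node-stats telemetry captures CPU/JVM metrics from the target nodes.
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=http_logs \
  --target-hosts="https://<cluster-endpoint>:443" \
  --client-options="use_ssl:true,verify_certs:true" \
  --telemetry=node-stats \
  --results-file=results.md
```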
Next Steps
Preliminary Testing
Further Testing
Run tests with `--telemetry=node-stats` to gather CPU and JVM metrics. A profiler should also be enabled on each cluster so that we can understand where most of the time is spent during each search operation in each cluster.

Some Additional Areas to Check
Current Experiments
How Can You Help?