[Bug]: Understand and mitigate the source of Index latency increase in 2.0-RC #2054
Comments
@CEHENKLE @dblock @kotwanikunal What's the current status of this issue? Did we find the root cause of the performance issue?
CC'ing @andrross as he was actively looking into it.
I am actively working on it. The status is that the latency and throughput difference between 1.3 and 2.0 has disappeared with runs against recent builds. However, it appears that the cause is that 1.3 has gotten slower. I'm still working on this, but my theory is that a change in Java versions and/or the default garbage collector is introducing the differences here.
@andrross The G1GC fix was backported into 1.3 and should be part of 1.3.2; see opensearch-project/OpenSearch#2971
@dblock Yes, we see the impact of G1GC clearly in the "old" GC metrics. However, the indexing latency and throughput metrics also get roughly 20% worse at the same time for this nyc_taxis workload. What I've found so far is that 1.3 with the concurrent mark sweep (CMS) collector was about 10% better in indexing latency and throughput than 2.0 for the nyc_taxis workload, which is what motivated the creation of this issue. However, when both 1.3 and 2.0 use the G1GC collector, 2.0 performs better across the board. Obviously the real-world performance of 1.3 with CMS was quite bad, so our performance tests do not seem to be measuring the right thing. The data right now shows that 2.0 is as good as or better than 1.3.2, but we still don't have an answer for why we didn't catch the performance issue in 1.3 that our users saw, and this issue adds the further nuance that the nyc_taxis workload actually performs slightly better with the CMS garbage collector.
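For anyone re-running the two configurations compared above, the collector is selected through `config/jvm.options`. A minimal sketch, assuming standard HotSpot flag names (the exact lines shipped in each release are not quoted in this thread):

```
## Sketch of config/jvm.options GC settings (assumed flags, not copied from a release)

## G1GC, the collector used by 2.0 and backported as the default for 1.3.2:
-XX:+UseG1GC

## To reproduce the older 1.3 behavior on a JDK that still supports CMS:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
```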
I have run the Stack Overflow dataset workload on my test host (c6i.8xlarge). The Stack Overflow dataset is the recommended workload for testing indexing performance.

[Summary metrics tables: Baseline=1.3.2 (CMS) vs. Contender=2.0.0, and Baseline=1.3.2 (G1GC) vs. Contender=2.0.0]
The upshot here is that 2.0.0 offers the best indexing throughput and p50 latency. At higher percentiles, 1.3.2 does better, particularly with the old CMS garbage collector.
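For context, the baseline/contender summaries above are produced by the benchmark tool's compare step. A hedged sketch of the invocation, assuming the `opensearch-benchmark` CLI (subcommand and flag names follow its esrally lineage and may differ by version; the IDs are placeholders, not values from this run):

```sh
# After benchmarking each cluster separately, compare the two stored test executions.
# The IDs below are placeholders; use the ones reported at the end of each run.
opensearch-benchmark compare \
  --baseline=<baseline-test-execution-id> \
  --contender=<contender-test-execution-id>
```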
@CEHENKLE I assume we can close this issue since we found the root cause of the latency increase? Can you please confirm?
Closing this issue as we have identified the root cause of the latency increase.
Describe the bug
While running performance tests for 2.0-RC1, we saw indexing latency appear to increase by ~13%. This issue is to investigate the source of the increase so we can propose ways to mitigate it.
#1624 (comment)
To reproduce
#1624 (comment)
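The linked comment has the exact setup; a rough sketch of that kind of run is below, assuming the `opensearch-benchmark` CLI and the `nyc_taxis` workload discussed above (the endpoint is a placeholder):

```sh
# Placeholder reproduction sketch; see #1624 for the actual test configuration.
opensearch-benchmark execute-test \
  --workload=nyc_taxis \
  --pipeline=benchmark-only \
  --target-hosts=<cluster-endpoint>:9200
```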
Expected behavior
No response
Screenshots
No response
Host / Environment
No response
Additional context
No response
Relevant log output
No response