[BUG] org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness flaky #3603
Comments
Seed for repro
@imRishN Can you have a look at this test? Do we need 15 nodes for this test or could we scale it back to something like 12? three nodes per AZ? This is pretty demanding for an integration test and I'm wondering if that's what's causing these intermittent timeout failures? |
I'll take a look at this test and try to fix it |
@imRishN let's take a look and get back |
Another failure here: #6333 (comment) |
Another failure : #8758 (comment) |
Another failure here - #10777 (comment) |
Investigated this issue. OpenSearch follows a greedy approach when allocating shards and does not compute the optimal allocation across all the shards that need to be allocated. Instead, it uses a set of filters and rules to control which nodes shards are assigned to. The unassigned shards that cause the test failure are a consequence of this: the node where the shard was supposed to be assigned conflicted with the awareness allocation decider. The cluster then gets stuck waiting for space to allocate the unassigned shard, because it cannot assign it to the only node that has space. This is more likely to happen in this particular test case because it creates a 15-node cluster with over 120 shards, which increases the probability of ending up in such a state. A smaller cluster with fewer shards would be less likely to hit it. Judging from the open issues in Elasticsearch/OpenSearch, this also appears to be a known issue. I have added the same note in #7401 |
Closing the issue. The test has been muted and will be fixed and re-enabled after the fix in the allocator |
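To illustrate the failure mode described above, here is a minimal toy sketch, not OpenSearch's actual allocator or deciders. It assumes a first-fit greedy loop, a hypothetical per-node capacity standing in for the load-awareness limit, and a simplified awareness rule that allows at most one copy of a shard per zone. All names (`greedy_allocate`, `nodes`, the zone and node labels) are made up for the example.

```python
from collections import Counter

def greedy_allocate(copies, nodes):
    """First-fit greedy allocation (toy model, not OpenSearch code).

    copies: list of shard ids, one entry per shard copy, in allocation order.
    nodes:  dict of node name -> (zone, capacity); capacity is a hypothetical
            per-node shard limit standing in for load awareness.
    Returns (assigned, unassigned): assigned maps copy index -> node name,
    unassigned lists (copy index, shard id) pairs that found no node.
    """
    load = Counter()   # shard copies currently placed on each node
    zones_used = {}    # shard id -> zones that already hold a copy
    assigned, unassigned = {}, []
    for idx, shard in enumerate(copies):
        for name, (zone, capacity) in nodes.items():
            if load[name] >= capacity:
                continue  # node is full
            if zone in zones_used.get(shard, set()):
                continue  # simplified awareness rule: one copy per zone
            assigned[idx] = name
            load[name] += 1
            zones_used.setdefault(shard, set()).add(zone)
            break
        else:
            unassigned.append((idx, shard))  # no node passed both checks
    return assigned, unassigned

# Three zones; total capacity (4) equals the number of shard copies (4).
nodes = {"a1": ("a", 1), "b1": ("b", 1), "c1": ("c", 2)}

# Order 1: both copies of s0 first. The last copy of s1 gets stuck because
# the only node with free space (c1) is in a zone that already holds s1.
stuck_assigned, stuck_unassigned = greedy_allocate(["s0", "s0", "s1", "s1"], nodes)

# Order 2: interleaved copies. Every copy is placed, so a valid allocation
# exists; the greedy pass in order 1 simply did not find it.
ok_assigned, ok_unassigned = greedy_allocate(["s0", "s1", "s0", "s1"], nodes)

print(stuck_unassigned)
print(ok_unassigned)
```

In this toy setup the first ordering leaves one copy of `s1` unassigned while the second ordering places everything, which mirrors the observation that a greedy pass can paint itself into a corner even when a feasible allocation exists. A larger cluster with more shards gives the greedy pass more opportunities to reach such a state.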
This issue isn't fixed - the test is disabled. If we were to delete the test I'd be happy to close this issue; however, I suspect that we want to fix the underlying issue. If there is another issue tracking the scroll issue please link it here. |
@peternied, the merged PR that mutes the test links it to the actual underlying issue. The PR muting the test is #11767, and it links the test to the actual issue, #5908. Feel free to close this issue if that suffices; otherwise we can keep it open |
Describe the bug
New flaky test after #3563 got merged:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
On CI, the test is flaky; see https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6061.log for an example.
Plugins
Standard.