[BUG] org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart intermittent failure #5145
Comments
@samarthg1705, could you pick this up for OSCI? |
Hi! Sure, I'll do that. |
Ran the test multiple times, both from the IDE and from the terminal using the following command, and it passed every time. I am using the main branch with this commit id as HEAD.
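For illustration only (the Gradle task name here is an assumption and the command actually used may have differed), a typical way to run this single test from the terminal is something like:

```bash
# Illustrative sketch: the ':server:internalClusterTest' task name is an assumption.
./gradlew ':server:internalClusterTest' \
  --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart"
```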
|
@Gaurav614 There have been two failures recently:
Can you check and confirm the failure cause? Not being able to reproduce it is not a reason to close unless we're sure why it failed earlier. |
Looked into it.
Issue: From the logs in the previously mentioned failures, the test seems to be failing on the specific code path where we poll for 10 seconds until all the persistent tasks are removed from the cluster state after completion. The logs from build failure 30063 show a progressive reduction of the persistent tasks in the cluster state, indicating the failure could be due to insufficient time to complete the execution.
Reproducing the failure: To simulate this, I ran the failed seed close to 200 times with extra logging to see how much time it takes to execute the code path mentioned above:
The time spent on that code path varied from 138 msecs to 4138 msecs, depending on the number of persistent tasks created in that specific run. However, there were no failures. One observation from this exercise was that the very FIRST run in the Gradle test run took the longest (4138 msecs) to execute the targeted code path, and subsequent runs were drastically shorter, even for the same number of persistent tasks created (which is 10 for this seed). This is most probably due to thread resources being reused within the test suite. Taking it forward, I repeated the test using an external script instead of passing a repeat count to the test runner, and also added some CPU stress in between (from iteration 4 to 19) using stress to bring CPU utilisation close to 80%.
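A rough sketch of that kind of repeat loop (not the actual script used; the Gradle task name, the seed placeholder, and the stress worker count are all assumptions):

```bash
#!/usr/bin/env bash
# Sketch only: re-run the failing seed in a loop and apply CPU load with `stress`
# during iterations 4-19 to mimic an ephemeral CPU spike.
SEED="<FAILED_SEED>"   # placeholder for the seed from the failed gradle-check run

for i in $(seq 1 20); do
  if [ "$i" -eq 4 ]; then
    stress --cpu 3 &   # tune the worker count until utilisation sits near 80%
    STRESS_PID=$!
  fi

  ./gradlew ':server:internalClusterTest' \
    --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" \
    -Dtests.seed="$SEED" || echo "iteration $i FAILED"

  if [ "$i" -eq 19 ] && [ -n "${STRESS_PID:-}" ]; then
    kill "$STRESS_PID" 2>/dev/null
  fi
done
```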
This generated edge cases close to the failure scenario seen in the flaky runs, and I also saw one failure.
Resolution: Most likely, this flakiness could be the result of ephemeral CPU spikes at the time of testing, or something similar. |
Describe the bug
See #5144.
https://build.ci.opensearch.org/job/gradle-check/6554/console