Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart intermittent failure #5145

Closed
dblock opened this issue Nov 8, 2022 · 5 comments · Fixed by #13350
Labels
bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run

Comments

@dblock
Copy link
Member

dblock commented Nov 8, 2022

Describe the bug

See #5144.

2> REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.seed=5FE27A861515F3EB -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=be-BY -Dtests.timezone=Europe/Guernsey -Druntime.java=17
  2> java.lang.AssertionError: 
    Expected: an empty collection
         but: <[{"id":"rsvnV4QBA7bGcm02BYEn","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"SeUlKLdiT-W73R0c5Mn0tg","explanation":""},"allocation_id_on_last_status_update":0}]>
        at __randomizedtesting.SeedInfo.seed([5FE27A861515F3EB:ECD6B97E039BB44]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.lambda$testFullClusterRestart$2(PersistentTasksExecutorFullRestartIT.java:125)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1049)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1022)
        at org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart(PersistentTasksExecutorFullRestartIT.java:123)

https://build.ci.opensearch.org/job/gradle-check/6554/console

@sejli
Copy link
Member

sejli commented Oct 17, 2023

@samarthg1705, could you pick this up for OSCI?

@samarthg1705
Copy link

@samarthg1705, could you pick this up for OSCI?

Hi! Sure I'll just do that.

@Gaurav614
Copy link
Contributor

Ran the tests multiple times over IDE and over terminal by using following command

./gradlew :server:internalClusterTest --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.iters=100

and it passed every time.

I am using the main branch with this commit id as HEAD

commit 5bb6caec906f9e89d330332ebb74789571409eb1 (HEAD -> main, origin/main, origin/HEAD)
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Nov 23 14:34:32 2023 -0500

@r1walz
Copy link
Contributor

r1walz commented Nov 28, 2023

@Gaurav614 There are two failures recently:

Can you check and confirm their failure cause? Not able to repro is not a reason to close unless we're sure why it had failed earlier.

@Pranshu-S
Copy link
Contributor

Looked into it.

Issue:

From the logs in the previous mentioned failures, the test seems to be failing on this specific code path where we poll for 10 seconds until all the PersistentTasks are removed from the cluster state post completion. The logs from build failure 30063 show a progressive reduction of the Persistent tasks in the Cluster State indicating the failure could be due to insufficient time to complete the execution.

Note: Logs pasted below are in the order of Latest to Oldest.

java.lang.AssertionError: 
Expected: an empty collection
     but: <[{"id":"WpM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":13,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}]>
.
.
.
Expected: an empty collection
     but: <[{"id":"VpM81YsBjHiuQBQ8uO96","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":11,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WpM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":13,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":14,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"V5M81YsBjHiuQBQ8uO96","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":15,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"W5M81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":16,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"VZM81YsBjHiuQBQ8uO93","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":17,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XZM81YsBjHiuQBQ8uO98","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":18,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XpM81YsBjHiuQBQ8uO98","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":19,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WZM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":20,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}]>

Reproducing Failure:

To simulate the same, I ran the failed seed close to 200 times with extra logging to see how much time it takes to complete execution for the code path mentioned above:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.seed=786576D401A40E5B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar -Dtests.timezone=Atlantic/Stanley -Druntime.java=21 -Dtests.iters=500 -Dtests.output=true > temp.text

The time spent on the code path varied from 4138 msecs to 138 msecs depending on the number of persistent task created in that specific run. However, there were no failures.

One observation in this exercise was that the very FIRST run in the Gradle test run would take the highest time (4138 msecs) for execution of the targeted code path and subsequent runs would be drastically less, even for the same number of persistent task created (which is 10 for this seed). This could most probably be due to Thread resources being reused in the test suite.

Taking it forward - I simulated the same test but by repeating the test using a script instead of passing the -Dtests.iters command -

for num in {0..200}; do ./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" --rerun -Dtests.seed=786576D401A40E5B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar -Dtests.timezone=Atlantic/Stanley -Druntime.java=21 -Dtests.output=true -Dtests.iters=2 > ./temp/temp$num.text; sleep 10; done

Also added some amount of CPU stress in between (from iteration 4 to 19) using stress to get the CPU utilisation % close to 80%

stress --cpu 4 --timeout 400

This resulted in generation of edge cases close to the failure scenario in the flaky tests. Also saw 1 failure.

temp0.text:    [2024-04-22T12:06:49,553][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4133
temp0.text:    [2024-04-22T12:06:52,270][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 266
temp1.text:    [2024-04-22T12:07:32,412][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4139
temp1.text:    [2024-04-22T12:07:35,801][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp10.text:    [2024-04-22T12:14:14,691][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4128
temp10.text:    [2024-04-22T12:14:16,811][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp11.text:    [2024-04-22T12:15:00,181][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10061
temp11.text:    [2024-04-22T12:15:02,990][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 532
temp12.text:    [2024-04-22T12:15:42,053][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4126
temp12.text:    [2024-04-22T12:15:46,471][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 532
temp13.text:    [2024-04-22T12:16:30,502][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10048
temp13.text:    [2024-04-22T12:16:32,594][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 268
temp14.text:    [2024-04-22T12:17:12,045][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4138
temp14.text:    [2024-04-22T12:17:13,970][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 271
temp15.text:    [2024-04-22T12:17:51,980][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4131
temp15.text:    [2024-04-22T12:17:53,996][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp16.text:    [2024-04-22T12:18:31,062][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4117
temp16.text:    [2024-04-22T12:18:34,793][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 267
temp17.text:    [2024-04-22T12:19:14,459][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4149
temp17.text:    [2024-04-22T12:19:16,532][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 268
temp18.text:    [2024-04-22T12:20:00,210][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4127
temp18.text:    [2024-04-22T12:20:02,151][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 272
temp19.text:    [2024-04-22T12:20:45,399][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10071
temp19.text:    [2024-04-22T12:20:50,548][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp2.text:    [2024-04-22T12:08:16,419][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp2.text:    [2024-04-22T12:08:19,649][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp20.text:            at org.opensearch.cluster.service.ClusterApplierService.addTimeoutListener(ClusterApplierService.java:304) [main/:?]
temp20.text:    [2024-04-22T12:21:50,609][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4141
temp20.text:    [2024-04-22T12:21:54,405][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 531
temp21.text:    [2024-04-22T12:22:31,112][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4115
temp21.text:    [2024-04-22T12:22:32,988][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp3.text:    [2024-04-22T12:08:59,433][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4127
temp3.text:    [2024-04-22T12:09:03,178][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 269
temp4.text:    [2024-04-22T12:09:43,949][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4126
temp4.text:    [2024-04-22T12:09:45,937][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 284
temp5.text:    [2024-04-22T12:10:29,236][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4118
temp5.text:    [2024-04-22T12:10:31,256][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 272
temp6.text:    [2024-04-22T12:11:15,608][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10103
temp6.text:    [2024-04-22T12:11:17,539][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 270
temp7.text:    [2024-04-22T12:12:09,563][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp7.text:    [2024-04-22T12:12:11,513][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp8.text:    [2024-04-22T12:12:51,270][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4125
temp8.text:    [2024-04-22T12:12:53,201][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 270
temp9.text:    [2024-04-22T12:13:31,632][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp9.text:    [2024-04-22T12:13:36,061][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277

Resolution

Most likely, this flakiness could be a result of ephemeral CPU Spikes at the time of testing or something similar.
Increasing the timeout to 20 seconds should resolve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.