[BUG] org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart intermittent failure #5145

dblock · 2022-11-08T16:04:24Z

Describe the bug

2> REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.seed=5FE27A861515F3EB -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=be-BY -Dtests.timezone=Europe/Guernsey -Druntime.java=17
  2> java.lang.AssertionError: 
    Expected: an empty collection
         but: <[{"id":"rsvnV4QBA7bGcm02BYEn","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"SeUlKLdiT-W73R0c5Mn0tg","explanation":""},"allocation_id_on_last_status_update":0}]>
        at __randomizedtesting.SeedInfo.seed([5FE27A861515F3EB:ECD6B97E039BB44]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.lambda$testFullClusterRestart$2(PersistentTasksExecutorFullRestartIT.java:125)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1049)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1022)
        at org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart(PersistentTasksExecutorFullRestartIT.java:123)

https://build.ci.opensearch.org/job/gradle-check/6554/console

sejli · 2023-10-17T20:41:39Z

@samarthg1705, could you pick this up for OSCI?

samarthg1705 · 2023-10-19T23:19:29Z

Hi! Sure I'll just do that.

Gaurav614 · 2023-11-28T08:10:59Z

I ran the test multiple times, both from the IDE and from the terminal, using the following command:

./gradlew :server:internalClusterTest --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.iters=100

and it passed every time.

I am using the main branch with this commit as HEAD:

commit 5bb6caec906f9e89d330332ebb74789571409eb1 (HEAD -> main, origin/main, origin/HEAD)
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Nov 23 14:34:32 2023 -0500

r1walz · 2023-11-28T10:45:27Z

@Gaurav614 There have been two failures recently:

Can you check and confirm what caused them? Being unable to reproduce the failure is not a reason to close the issue unless we are sure why it failed earlier.

Pranshu-S · 2024-04-23T12:24:31Z

Looked into it.

Issue:

From the logs of the previously mentioned failures, the test seems to fail on the code path where we poll for up to 10 seconds until all the PersistentTasks are removed from the cluster state after completion. The logs from build failure 30063 show the number of persistent tasks in the cluster state shrinking progressively, which suggests the failure is due to insufficient time to complete execution.

Note: The logs pasted below are ordered from latest to oldest.
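The polling behaviour described here can be illustrated with a minimal, standalone sketch. It is not the OpenSearch `assertBusy` implementation; the class name `PollUntil`, the method `pollUntil`, and the backoff values are hypothetical and chosen only to show why a condition that converges slowly can still be unmet when a fixed time budget runs out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not part of OpenSearch.
class PollUntil {

    /**
     * Re-checks the condition with a growing sleep between attempts until it
     * holds or the time budget is spent. Returns true if the condition held.
     */
    static boolean pollUntil(BooleanSupplier condition, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1; // assumed starting backoff
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // budget exhausted while the condition was still false
            }
            long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            try {
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            sleepMillis = Math.min(sleepMillis * 2, 500); // assumed backoff cap
        }
    }

    public static void main(String[] args) {
        // Simulated stand-in for "tasks remaining in the cluster state":
        // one task disappears on every check.
        AtomicInteger remainingTasks = new AtomicInteger(10);
        boolean drained = pollUntil(() -> remainingTasks.decrementAndGet() <= 0, 10, TimeUnit.SECONDS);
        System.out.println("All tasks removed within budget: " + drained);
    }
}
```

If the simulated tasks drained more slowly than the budget allows, `pollUntil` would return false with some tasks still present, which matches the shape of the assertion failures quoted in this comment.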

java.lang.AssertionError: 
Expected: an empty collection
     but: <[{"id":"WpM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":13,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}]>
.
.
.
Expected: an empty collection
     but: <[{"id":"VpM81YsBjHiuQBQ8uO96","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":11,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WpM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":12,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":13,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WJM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":14,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"V5M81YsBjHiuQBQ8uO96","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":15,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"W5M81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":16,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"VZM81YsBjHiuQBQ8uO93","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":17,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"XZM81YsBjHiuQBQ8uO98","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":18,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, 
{"id":"XpM81YsBjHiuQBQ8uO98","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":19,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}, {"id":"WZM81YsBjHiuQBQ8uO97","task":{"cluster:admin/persistent/test":{"params":{"param":"Blah"}}},"allocation_id":20,"assignment":{"executor_node":"giai6tlYRXymI4x-98qBcw","explanation":""},"allocation_id_on_last_status_update":0}]>

Reproducing Failure:

To simulate this, I ran the failing seed close to 200 times with extra logging to measure how long the code path mentioned above takes to complete:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" -Dtests.seed=786576D401A40E5B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar -Dtests.timezone=Atlantic/Stanley -Druntime.java=21 -Dtests.iters=500 -Dtests.output=true > temp.text

The time spent on the code path varied from 138 ms to 4138 ms, depending on the number of persistent tasks created in that specific run. However, there were no failures.

One observation from this exercise was that the very first run in a Gradle test invocation took the longest (4138 ms) on the targeted code path, and subsequent runs took drastically less time, even for the same number of persistent tasks created (10 for this seed). This is most probably because thread resources are reused within the test suite.

Taking this further, I repeated the test using a script that invokes Gradle in a loop, rather than relying on -Dtests.iters alone:
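The extra logging used for these measurements is not shown in the thread. A minimal sketch of how such timing could be captured is below; the class `TimedSection` and the method `timeMillis` are hypothetical names, and the sleep only stands in for the real polling block.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing wrapper for illustration; the actual logging added to the test is not shown in this issue.
class TimedSection {

    /** Runs the section and returns the elapsed time in milliseconds, measured with the monotonic clock. */
    static long timeMillis(Runnable section) {
        long start = System.nanoTime();
        section.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long spent = timeMillis(() -> {
            try {
                Thread.sleep(50); // stand-in for the polling block under test
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Mirrors the "Time spent: N" message format seen in the logs.
        System.out.println("Time spent: " + spent);
    }
}
```

`System.nanoTime()` is used rather than `System.currentTimeMillis()` because it is monotonic and therefore unaffected by wall-clock adjustments during a long test run.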

for num in {0..200}; do ./gradlew ':server:internalClusterTest' --tests "org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart" --rerun -Dtests.seed=786576D401A40E5B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar -Dtests.timezone=Atlantic/Stanley -Druntime.java=21 -Dtests.output=true -Dtests.iters=2 > ./temp/temp$num.text; sleep 10; done

I also added CPU stress partway through (iterations 4 to 19), using stress to bring CPU utilisation close to 80%:

stress --cpu 4 --timeout 400

This produced edge cases close to the failure scenario seen in the flaky test. I also saw 1 failure.

temp0.text:    [2024-04-22T12:06:49,553][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4133
temp0.text:    [2024-04-22T12:06:52,270][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 266
temp1.text:    [2024-04-22T12:07:32,412][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4139
temp1.text:    [2024-04-22T12:07:35,801][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp10.text:    [2024-04-22T12:14:14,691][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4128
temp10.text:    [2024-04-22T12:14:16,811][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp11.text:    [2024-04-22T12:15:00,181][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10061
temp11.text:    [2024-04-22T12:15:02,990][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 532
temp12.text:    [2024-04-22T12:15:42,053][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4126
temp12.text:    [2024-04-22T12:15:46,471][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 532
temp13.text:    [2024-04-22T12:16:30,502][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10048
temp13.text:    [2024-04-22T12:16:32,594][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 268
temp14.text:    [2024-04-22T12:17:12,045][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4138
temp14.text:    [2024-04-22T12:17:13,970][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 271
temp15.text:    [2024-04-22T12:17:51,980][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4131
temp15.text:    [2024-04-22T12:17:53,996][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp16.text:    [2024-04-22T12:18:31,062][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4117
temp16.text:    [2024-04-22T12:18:34,793][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 267
temp17.text:    [2024-04-22T12:19:14,459][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4149
temp17.text:    [2024-04-22T12:19:16,532][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 268
temp18.text:    [2024-04-22T12:20:00,210][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4127
temp18.text:    [2024-04-22T12:20:02,151][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 272
temp19.text:    [2024-04-22T12:20:45,399][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10071
temp19.text:    [2024-04-22T12:20:50,548][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
temp2.text:    [2024-04-22T12:08:16,419][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp2.text:    [2024-04-22T12:08:19,649][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp20.text:            at org.opensearch.cluster.service.ClusterApplierService.addTimeoutListener(ClusterApplierService.java:304) [main/:?]
temp20.text:    [2024-04-22T12:21:50,609][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4141
temp20.text:    [2024-04-22T12:21:54,405][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 531
temp21.text:    [2024-04-22T12:22:31,112][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4115
temp21.text:    [2024-04-22T12:22:32,988][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp3.text:    [2024-04-22T12:08:59,433][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4127
temp3.text:    [2024-04-22T12:09:03,178][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 269
temp4.text:    [2024-04-22T12:09:43,949][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4126
temp4.text:    [2024-04-22T12:09:45,937][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 284
temp5.text:    [2024-04-22T12:10:29,236][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4118
temp5.text:    [2024-04-22T12:10:31,256][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 272
temp6.text:    [2024-04-22T12:11:15,608][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10103
temp6.text:    [2024-04-22T12:11:17,539][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 270
temp7.text:    [2024-04-22T12:12:09,563][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp7.text:    [2024-04-22T12:12:11,513][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 275
temp8.text:    [2024-04-22T12:12:51,270][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4125
temp8.text:    [2024-04-22T12:12:53,201][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 270
temp9.text:    [2024-04-22T12:13:31,632][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4130
temp9.text:    [2024-04-22T12:13:36,061][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 277
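
Output like the above can be scanned mechanically for runs that reached the 10-second polling budget. The following is a rough sketch under the assumption that each relevant line ends with `Time spent: <millis>`; the class and method names are hypothetical, and the sample lines are copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scanner for illustration; assumes the "Time spent: <millis>" format shown above.
class TimeSpentScan {

    private static final Pattern TIME_SPENT = Pattern.compile("Time spent: (\\d+)");

    /** Extracts the millisecond values from lines containing "Time spent: N"; other lines are skipped. */
    static List<Long> extractMillis(List<String> lines) {
        List<Long> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TIME_SPENT.matcher(line);
            if (m.find()) {
                out.add(Long.parseLong(m.group(1)));
            }
        }
        return out;
    }

    /** Counts measurements at or above the given threshold in milliseconds. */
    static long countAtLeast(List<Long> millis, long thresholdMillis) {
        return millis.stream().filter(v -> v >= thresholdMillis).count();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "temp5.text:    [2024-04-22T12:10:29,236][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 4118",
            "temp6.text:    [2024-04-22T12:11:15,608][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 10103",
            "temp6.text:    [2024-04-22T12:11:17,539][INFO ][o.o.p.PersistentTasksExecutorFullRestartIT] [testFullClusterRestart] Time spent: 270",
            "temp20.text:            at org.opensearch.cluster.service.ClusterApplierService.addTimeoutListener(ClusterApplierService.java:304) [main/:?]"
        );
        List<Long> millis = extractMillis(sample);
        System.out.println("Measurements: " + millis);
        System.out.println("At or above 10000 ms: " + countAtLeast(millis, 10_000));
    }
}
```

A value at or just above 10000 ms is consistent with the poll running for its full budget, so such a count says how often the limit was reached, not how long the work would have taken with a larger budget.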

Resolution

This flakiness is most likely the result of transient CPU spikes at the time of testing, or something similar.
Increasing the timeout to 20 seconds should resolve this.

dblock added the bug, untriaged, and flaky-test labels Nov 8, 2022

dblock mentioned this issue Nov 8, 2022

Only run backport workflow it a backport label was added. #5144

Merged

andrross added the distributed framework label Nov 8, 2022

anasalkouz removed the untriaged label Nov 9, 2022

Poojita-Raj mentioned this issue Nov 15, 2022

[Meta] Fix random test failures #1715

Closed

37 tasks

anasalkouz added Migration:Backlog and removed Migration:Backlog labels Mar 17, 2023

gaobinlong mentioned this issue Oct 9, 2023

Fix dissect ingest processor parsing empty brackets failed #9255

Merged

6 tasks

RamakrishnaChilaka assigned owaiskazi19 Oct 24, 2023

minalsha assigned gauravruhela and unassigned owaiskazi19 Nov 15, 2023

vikasvb90 assigned amkhar and unassigned gauravruhela Jan 24, 2024

andrross added the Cluster Manager label Feb 21, 2024

github-project-automation bot added this to Cluster Manager Project Board Feb 21, 2024

andrross removed the distributed framework label Feb 21, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board Feb 21, 2024

rwali-aws unassigned amkhar Apr 16, 2024

rwali-aws moved this from 🆕 New to Now(This Quarter) in Cluster Manager Project Board Apr 22, 2024

Pranshu-S mentioned this issue Apr 23, 2024

Fix Flaky Test PersistentTasksExecutorFullRestartIT.testFullClusterRe… #13350

Merged

8 tasks

dblock closed this as completed in #13350 Apr 25, 2024

github-project-automation bot moved this from Now(This Quarter) to ✅ Done in Cluster Manager Project Board Apr 25, 2024
