[CELEBORN-1711][TEST] Fix flaky test caused by master/worker setup issue #2906

Open. turboFei wants to merge 1 commit into main.

turboFei
Member

@turboFei turboFei commented Nov 12, 2024

What changes were proposed in this pull request?

  1. Retry on BindException when starting the master/worker HTTP server.
  2. Record the ports already in use and pre-check whether the selected port is used or bound before binding.
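
For readers who have not looked at the diff, the two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this patch: the class `PortRetrySketch`, its methods `isPortFree`, `selectPort`, and `bindWithRetry`, and the port range are all hypothetical names chosen for the example, and a plain `java.net.ServerSocket` stands in for the real HTTP server.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; names are illustrative and not taken from the Celeborn code base. */
public class PortRetrySketch {
  // Ports already handed out in this JVM, so two components never select the same one.
  private static final Set<Integer> USED_PORTS = ConcurrentHashMap.newKeySet();

  /** Pre-check: returns true if the port can be bound at this moment. */
  public static boolean isPortFree(int port) {
    try (ServerSocket probe = new ServerSocket()) {
      probe.setReuseAddress(false);
      probe.bind(new InetSocketAddress(port));
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  /** Picks a random port in [min, max) that is neither recorded nor currently bound. */
  public static int selectPort(int min, int max) {
    for (int i = 0; i < 1000; i++) {
      int port = ThreadLocalRandom.current().nextInt(min, max);
      // add() returns false when the port was recorded earlier, so it is skipped.
      // A port that fails the pre-check stays recorded and is not tried again.
      if (USED_PORTS.add(port) && isPortFree(port)) {
        return port;
      }
    }
    throw new IllegalStateException("no free port found in [" + min + ", " + max + ")");
  }

  /**
   * Binds a server socket, retrying on BindException. The pre-check cannot rule out
   * another process grabbing the port between the probe and the real bind, which is
   * why the retry is still needed.
   */
  public static ServerSocket bindWithRetry(int min, int max, int maxAttempts)
      throws IOException {
    BindException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      int port = selectPort(min, max);
      ServerSocket server = new ServerSocket();
      try {
        server.bind(new InetSocketAddress(port));
        return server;
      } catch (IOException e) {
        server.close();
        if (!(e instanceof BindException)) {
          throw e; // only bind conflicts are retried
        }
        last = (BindException) e;
      }
    }
    throw last != null ? last : new BindException("no bind attempts were made");
  }
}
```

A caller would use something like `PortRetrySketch.bindWithRetry(20000, 60000, 5)` and read the chosen port from `getLocalPort()`. Recording ports per JVM only helps against collisions inside the same test process; conflicts with other processes are what the retry is for.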

Why are the changes needed?

To fix flaky tests.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GitHub Actions (GA).

@turboFei turboFei marked this pull request as draft November 12, 2024 05:22
@turboFei turboFei changed the title [CELEBORN-1711][TEST] Setup celeborn master with retry on port BindException [WIP][CELEBORN-1711][TEST] Setup celeborn master with retry on port BindException Nov 12, 2024
@turboFei turboFei changed the title [WIP][CELEBORN-1711][TEST] Setup celeborn master with retry on port BindException [WIP][CELEBORN-1711][TEST] Setup celeborn master/worker http server with retry on port BindException Nov 12, 2024
@turboFei turboFei changed the title [WIP][CELEBORN-1711][TEST] Setup celeborn master/worker http server with retry on port BindException [CELEBORN-1711][TEST] Setup celeborn master/worker http server with retry on port BindException Nov 12, 2024
@turboFei turboFei marked this pull request as ready for review November 12, 2024 07:51
@turboFei
Member Author

[info] org.apache.celeborn.tests.spark.PushDataTimeoutTest *** ABORTED ***
[info]   java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:208)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7(MiniClusterFeature.scala:234)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7$adapted(MiniClusterFeature.scala:234)
[info]   at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
[info]   at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
[info]   at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
[info]   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
[info]   at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpWorkers(MiniClusterFeature.scala:234)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpWorkers$(MiniClusterFeature.scala:181)
[info]   at org.apache.celeborn.tests.spark.PushDataTimeoutTest.setUpWorkers(PushDataTimeoutTest.scala:34)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpMiniCluster(MiniClusterFeature.scala:254)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpMiniCluster$(MiniClusterFeature.scala:250)
[info]   at org.apache.celeborn.tests.spark.PushDataTimeoutTest.setUpMiniCluster(PushDataTimeoutTest.scala:34)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setupMiniClusterWithRandomPorts(MiniClusterFeature.scala:77)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setupMiniClusterWithRandomPorts$(MiniClusterFeature.scala:49)
[info]   at org.apache.celeborn.tests.spark.PushDataTimeoutTest.setupMiniClusterWithRandomPorts(PushDataTimeoutTest.scala:34)
[info]   at org.apache.celeborn.tests.spark.PushDataTimeoutTest.beforeAll(PushDataTimeoutTest.scala:47)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.spark.PushDataTimeoutTest.run(PushDataTimeoutTest.scala:34)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:750)

@turboFei
Member Author

[info] org.apache.celeborn.tests.spark.SkewJoinSuite *** ABORTED ***
[info]   java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:208)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7(MiniClusterFeature.scala:234)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7$adapted(MiniClusterFeature.scala:234)
[info]   at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
[info]   at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
[info]   at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
[info]   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
[info]   at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpWorkers(MiniClusterFeature.scala:234)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpWorkers$(MiniClusterFeature.scala:181)
[info]   at org.apache.celeborn.tests.spark.SkewJoinSuite.setUpWorkers(SkewJoinSuite.scala:32)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpMiniCluster(MiniClusterFeature.scala:254)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpMiniCluster$(MiniClusterFeature.scala:250)
[info]   at org.apache.celeborn.tests.spark.SkewJoinSuite.setUpMiniCluster(SkewJoinSuite.scala:32)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setupMiniClusterWithRandomPorts(MiniClusterFeature.scala:77)
[info]   at org.apache.celeborn.service.deploy.MiniClusterFeature.setupMiniClusterWithRandomPorts$(MiniClusterFeature.scala:49)
[info]   at org.apache.celeborn.tests.spark.SkewJoinSuite.setupMiniClusterWithRandomPorts(SkewJoinSuite.scala:32)
[info]   at org.apache.celeborn.tests.spark.SparkTestBase.beforeAll(SparkTestBase.scala:47)
[info]   at org.apache.celeborn.tests.spark.SparkTestBase.beforeAll$(SparkTestBase.scala:45)
[info]   at org.apache.celeborn.tests.spark.SkewJoinSuite.beforeAll(SkewJoinSuite.scala:32)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.spark.SkewJoinSuite.run(SkewJoinSuite.scala:32)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)

@turboFei turboFei marked this pull request as draft November 12, 2024 09:14
@turboFei turboFei marked this pull request as ready for review November 12, 2024 17:39
@turboFei
Member Author

24/11/12 20:24:33,875 ERROR [ScalaTest-main-running-DiscoverySuite] ReusedExchangeSuite: cannot start all workers after 60000 ms
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7(MiniClusterFeature.scala:246)
	at org.apache.celeborn.service.deploy.MiniClusterFeature.$anonfun$setUpWorkers$7$adapted(MiniClusterFeature.scala:246)
	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
	at org.apache.celeborn.service.deploy.MiniClusterFeature.setUpWorkers(MiniClusterFeature.scala:246)

@turboFei turboFei force-pushed the retry_master_suite branch 3 times, most recently from 7088ead to 03b0282 Compare November 13, 2024 01:46
@turboFei turboFei changed the title [CELEBORN-1711][TEST] Setup celeborn master/worker http server with retry on port BindException [CELEBORN-1711][TEST] Fix flaky test because of master/worker setup Nov 13, 2024
@turboFei turboFei changed the title [CELEBORN-1711][TEST] Fix flaky test because of master/worker setup [CELEBORN-1711][TEST] Fix flaky test because of master/worker setup issue Nov 13, 2024
@turboFei turboFei changed the title [CELEBORN-1711][TEST] Fix flaky test because of master/worker setup issue [CELEBORN-1711][TEST] Fix flaky test caused by master/worker setup issue Nov 13, 2024
@turboFei
Member Author

turboFei commented Nov 13, 2024

cc @FMX @RexXiong @SteNicholas This PR aims to reduce flaky test failures.

@turboFei
Member Author

Gentle ping @FMX @RexXiong @SteNicholas

@turboFei
Member Author

turboFei commented Nov 21, 2024

It seems the port conflict issue has been happening frequently of late.

Hope this patch can help reduce it.

@turboFei
Member Author

turboFei commented Nov 22, 2024

Interesting, so many failures recently.

image

@turboFei turboFei force-pushed the retry_master_suite branch 2 times, most recently from feb3ac5 to c462899 Compare November 22, 2024 04:35
…ception

reduce sleep time

not needed

remove unused

used ports

nit

sleep and then check timeout and config the timeout

retest

please

do not sleep

check portBounded

more random worker ports

retest

please

synchronized

shutdown

increase timeout

codecov bot commented Nov 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.52%. Comparing base (aa62549) to head (c92199e).
Report is 29 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2906      +/-   ##
==========================================
+ Coverage   32.07%   32.52%   +0.46%     
==========================================
  Files         331      334       +3     
  Lines       19749    19940     +191     
  Branches     1778     1799      +21     
==========================================
+ Hits         6332     6484     +152     
- Misses      13071    13096      +25     
- Partials      346      360      +14     
