[Bug] [Master] Network exception occurred between Master and ZooKeeper, triggering failover mechanism, which caused duplicate task execution on the next node #16759
Comments
Which version are you using? @1105560808
Using dev branch code
The
The waiting strategy will be deprecated; please use the stop strategy.
I discovered that this issue still exists in the latest 3.2.2-release version. During Master failover (even when the original Master hasn't actually exited), multiple Masters may execute the same workflow simultaneously.
We will remove
With the stop strategy, the following can happen. After the master dispatches a task to a worker, the worker submits the task to its thread pool for execution and tries to reply to the master. If the master has already stopped by then, the worker retries the connection to that master several times and then throws an exception. Meanwhile, the worker may already be executing the task, and because the original master has failed, failover is triggered and another master takes over. Could this situation lead to duplicate task execution?
Search before asking
What happened
Because of network issues, the Master lost its connection to ZooKeeper, which triggered the failover mechanism. The original Master was still running, however, with tasks in execution and their successor nodes waiting in memory. Meanwhile, the other Master nodes detected the lost connection and regenerated the task DAG. When the preceding node completed, both Masters submitted the next node at the same time, so multiple Worker nodes processed the same task. This may lead to inconsistent task state in subsequent tasks.
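One way to picture the failure is that the original Master keeps dispatching after it can no longer prove it is still registered. A minimal sketch of a lease-based self-fencing check follows. It assumes the Master records the time of its last successful registry heartbeat and that the lease is configured shorter than the registry session timeout. `RegistryLeaseGuard`, `renew`, and `mayDispatch` are hypothetical names for illustration, not DolphinScheduler APIs.

```java
/**
 * Hypothetical self-fencing guard for a master node.
 * Illustrative only; none of these names exist in DolphinScheduler.
 */
final class RegistryLeaseGuard {
    private final long leaseMillis;
    private volatile long lastRenewedAtMillis;

    RegistryLeaseGuard(long leaseMillis, long nowMillis) {
        this.leaseMillis = leaseMillis;
        this.lastRenewedAtMillis = nowMillis;
    }

    /** Call after each successful heartbeat to the registry. */
    void renew(long nowMillis) {
        lastRenewedAtMillis = nowMillis;
    }

    /** True only while the last successful heartbeat is within the lease. */
    boolean mayDispatch(long nowMillis) {
        return nowMillis - lastRenewedAtMillis < leaseMillis;
    }
}
```

The time is passed in explicitly so the check is easy to test. A guard like this only narrows the window; it relies on local clocks and does not by itself rule out two Masters dispatching the same node.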
What you expected to happen
After the Master loses its connection to ZooKeeper due to network issues, concurrent execution of the same task should not occur.
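The expected behavior amounts to "at most one Master submits a given node of a given workflow instance". A rough sketch of that invariant, using an in-memory map as a stand-in for a shared store (in practice this would have to be something both Masters can see, such as a database row with a unique key). `TaskClaimRegistry` and `tryClaim` are hypothetical names, and this is not how DolphinScheduler currently handles it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical first-claim-wins registry. The in-memory map only models
 * a shared store that every master would consult before submitting.
 */
final class TaskClaimRegistry {
    private final ConcurrentMap<String, String> ownerByTask = new ConcurrentHashMap<>();

    /**
     * Returns true if masterId owns the claim for this task node,
     * either because it claimed it now or had claimed it earlier.
     */
    boolean tryClaim(long workflowInstanceId, long taskCode, String masterId) {
        String key = workflowInstanceId + ":" + taskCode;
        // putIfAbsent is atomic: exactly one caller sees null for a given key.
        String previousOwner = ownerByTask.putIfAbsent(key, masterId);
        return previousOwner == null || previousOwner.equals(masterId);
    }
}
```

With this shape, a Master that loses the claim skips submission of that node, while a retry by the owning Master still succeeds.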
How to reproduce
Steps:
Anything else
Proposed Solution:
Before submitting the next node's task, the Master should:
Version
3.2.x
Are you willing to submit PR?
Code of Conduct