[Bug] [Master] Network exception occurred between Master and ZooKeeper, triggering failover mechanism, which caused duplicate task execution on the next node #16759

Open
3 tasks done
1105560808 opened this issue Nov 1, 2024 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@1105560808

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Due to network issues, the Master lost its connection to ZooKeeper, which triggered the failover mechanism. However, the original Master process was still running: it had tasks in execution and their downstream nodes waiting in memory. Meanwhile, another Master detected the failure and rebuilt the workflow DAG. When the running node completed, both Masters submitted the next node at the same time, so multiple Workers processed the same task. This can lead to inconsistent task state in subsequent nodes.

What you expected to happen

After a Master loses its connection to ZooKeeper due to network issues, the same task should never be executed concurrently by more than one Master.

How to reproduce

Steps:

  1. Identify a workflow with a long-running node
  2. During node execution:
    • Disconnect Master from ZooKeeper
    • Use the pause (waiting) registry-disconnect strategy, not stop
    • Trigger Master failover
  3. Wait for current node completion
  4. Verify:
    • Check for duplicate execution of subsequent nodes
    • Monitor task state consistency

Anything else

Proposed Solution:
Before submitting the next node's task, the Master should:

  1. Read the host recorded in processInstance
  2. Compare it with the current Master's host
  3. Skip the submission and exit if the two do not match
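
The check above can be sketched roughly as follows. This is only an illustration under stated assumptions, not DolphinScheduler's actual code: `ProcessInstanceStore`, `HostGuard`, and `shouldSubmitNextTask` are hypothetical names, and the in-memory map stands in for the database record that holds the workflow instance's host.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the table that records which Master owns each workflow instance.
class ProcessInstanceStore {
    private final Map<Integer, String> hostByInstanceId = new ConcurrentHashMap<>();

    // Called when a Master takes ownership of an instance, e.g. during failover.
    void setHost(int processInstanceId, String host) {
        hostByInstanceId.put(processInstanceId, host);
    }

    // Returns the currently recorded owner, or null if the instance is unknown.
    String getHost(int processInstanceId) {
        return hostByInstanceId.get(processInstanceId);
    }
}

// Hypothetical guard that a Master consults before submitting the next node's task.
class HostGuard {
    private final ProcessInstanceStore store;
    private final String localHost;

    HostGuard(ProcessInstanceStore store, String localHost) {
        this.store = store;
        this.localHost = localHost;
    }

    // True only if the recorded owner is this Master; a null owner counts as a mismatch.
    boolean shouldSubmitNextTask(int processInstanceId) {
        String owner = store.getHost(processInstanceId);
        return localHost.equals(owner);
    }
}
```

Note that this is a check-then-act sequence: ownership could still change between the check and the actual submission, so a real fix would probably need the check and the state change to happen atomically (for example, a conditional update keyed on the host).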

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

1105560808 added the "bug" (Something isn't working) and "Waiting for reply" labels on Nov 1, 2024
@SbloodyS
Member

SbloodyS commented Nov 1, 2024

Which version are you using? @1105560808

SbloodyS added the "need more information" label and removed the "Waiting for reply" label on Nov 1, 2024
@1105560808
Author

Which version are you using? @1105560808

I am using the dev branch code.

@SbloodyS
Member

SbloodyS commented Nov 4, 2024

dev is an unreleased branch, so it may have many stability issues.

@ruanwenjun
Member

The waiting strategy will be deprecated; please use the stop strategy.
Right now it is difficult to guarantee, once a server has received the SUSPENDED and RECONNECTED events, whether the other servers have already received the delete event for its node. See https://curator.apache.org/docs/tech-note-14
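
To make the difference between the two strategies concrete, here is a deliberately simplified model. Everything in it is an assumption for illustration: the enums and `MasterModel` are hypothetical and are not Curator or DolphinScheduler types, and real connection handling involves more states than these two events.

```java
// Illustrative event and strategy names only; not the real Curator or DolphinScheduler types.
enum ConnectionEvent { SUSPENDED, RECONNECTED }

enum DisconnectStrategy { STOP, WAITING }

// Toy model of how a Master reacts to registry connection events under each strategy.
class MasterModel {
    private final DisconnectStrategy strategy;
    private boolean stopped = false;
    private boolean paused = false;

    MasterModel(DisconnectStrategy strategy) {
        this.strategy = strategy;
    }

    void onEvent(ConnectionEvent event) {
        if (stopped) {
            return; // a stopped Master never resumes
        }
        switch (event) {
            case SUSPENDED:
                if (strategy == DisconnectStrategy.STOP) {
                    stopped = true;  // give up ownership for good
                } else {
                    paused = true;   // keep in-memory state and wait
                }
                break;
            case RECONNECTED:
                // Resumes without knowing whether another Master already took over.
                paused = false;
                break;
        }
    }

    boolean canSubmitTasks() {
        return !stopped && !paused;
    }
}
```

In this model, a Master using the waiting strategy can submit tasks again after RECONNECTED even if failover has already happened in the meantime, which is the window in which duplicate submission becomes possible. A Master using the stop strategy cannot.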

@1105560808
Author

1105560808 commented Nov 4, 2024

dev is an unreleased branch, so it may have many stability issues.

I found that this issue still exists in the latest 3.2.2 release. During Master failover, if the original Master has not actually exited, multiple Masters may execute the same workflow at the same time.

@SbloodyS
Member

SbloodyS commented Nov 5, 2024

We will remove the waiting strategy in the next version.

@1105560808
Author

We will remove the waiting strategy in the next version.

With the stop strategy, the following problem may occur. After the Master dispatches a task to a Worker, the Worker submits the task to its thread pool for execution and tries to reply to the Master. If the Master has already stopped by then, the Worker retries the connection to that Master several times before throwing an exception. At this point the Worker may already be executing the task, but because the original Master has failed and triggered failover, another Master takes over the workflow. Could this situation also lead to duplicate task execution?
