[Bug] [Master] Network exception occurred between Master and ZooKeeper, triggering failover mechanism, which caused duplicate task execution on the next node #16759

1105560808 · 2024-11-01T06:39:06Z

Search before asking

I have searched the issues and found no similar issues.

What happened

Due to network issues, the Master lost its connection to ZooKeeper, which triggered the failover mechanism. However, the original Master was still running, with tasks in execution and the next nodes waiting in memory. Meanwhile, the other Master nodes detected the disconnection and regenerated the task DAG. When the previous node completed, both Masters submitted the next node at the same time, so multiple Workers processed the same task. This can lead to inconsistent task state later on.

What you expected to happen

After a Master loses its connection to ZooKeeper due to network issues, the same task should not be executed concurrently.

How to reproduce

Steps:

1. Identify a workflow with a long-running node.
2. During node execution:
- Disconnect Master from ZooKeeper
- Use pause strategy (not stop)
- Trigger Master failover
3. Wait for the current node to complete.
4. Verify:
- Check for duplicate execution of subsequent nodes
- Monitor task state consistency

Anything else

Proposed Solution:
Before submitting the next node's task, the Master should:

1. Read the host recorded in processInstance.
2. Compare it with the current Master's host.
3. Exit if they do not match.
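
The check proposed above could be sketched roughly as follows. This is only an illustration under assumptions: `MasterOwnershipGuard`, `InstanceRecord`, and `mayDispatchNextTask` are hypothetical names, not existing DolphinScheduler classes, and the record is assumed to have been re-read from the database just before the call.

```java
import java.util.Objects;

/** Hypothetical guard; the names here are illustrative, not DolphinScheduler APIs. */
final class MasterOwnershipGuard {

    /** Minimal stand-in for the persisted processInstance row (only the fields needed here). */
    static final class InstanceRecord {
        final int id;
        final String host;

        InstanceRecord(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    private final String localMasterHost;

    MasterOwnershipGuard(String localMasterHost) {
        this.localMasterHost = Objects.requireNonNull(localMasterHost);
    }

    /**
     * Returns true only when the freshly loaded record still names this Master
     * as the owner. A missing record or a different host means another Master
     * has taken over, so the caller should stop instead of submitting the next node.
     */
    boolean mayDispatchNextTask(InstanceRecord latest) {
        return latest != null && localMasterHost.equals(latest.host);
    }
}
```

A read followed by a separate submit still leaves a small window in which ownership can change, so a check like this narrows the race rather than closing it; making the ownership test part of an atomic conditional update would be stricter.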

Version

3.2.x

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

SbloodyS · 2024-11-01T08:59:46Z

Which version are you using? @1105560808

1105560808 · 2024-11-04T01:37:47Z

Which version are you using? @1105560808

I am using the dev branch code.

SbloodyS · 2024-11-04T01:41:14Z

dev is an unreleased branch, so it may contain many unstable issues.

ruanwenjun · 2024-11-04T02:29:37Z

The waiting strategy will be deprecated; please use the stop strategy.
Right now it is difficult to know, once a server receives the SUSPENDED and RECONNECTED events, whether the other servers have received the delete event for its node. See https://curator.apache.org/docs/tech-note-14
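
To illustrate the idea behind a stop strategy, here is a minimal sketch in which a server stops for good on the first disconnect and never resumes dispatching after a reconnect. It assumes a local `RegistryEvent` enum whose names merely mirror the connection states mentioned above; `StopOnDisconnect`, `onEvent`, and `mayDispatch` are hypothetical and are not Curator or DolphinScheduler APIs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a stop-on-disconnect handler; not an actual DolphinScheduler class. */
final class StopOnDisconnect {

    /** Local stand-in for registry connection events (names are assumptions). */
    enum RegistryEvent { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable shutdownHook;

    StopOnDisconnect(Runnable shutdownHook) {
        this.shutdownHook = shutdownHook;
    }

    void onEvent(RegistryEvent event) {
        if (event == RegistryEvent.SUSPENDED || event == RegistryEvent.LOST) {
            // Run the shutdown hook exactly once, on the first disconnect signal.
            if (stopped.compareAndSet(false, true)) {
                shutdownHook.run();
            }
        }
        // RECONNECTED is deliberately ignored: once stopped, this server never
        // resumes, because it cannot tell whether peers already saw its node deleted.
    }

    /** Callers check this before submitting any further task. */
    boolean mayDispatch() {
        return !stopped.get();
    }
}
```

The point of the design is that the ambiguous SUSPENDED then RECONNECTED sequence no longer needs to be interpreted: any disconnect is treated as final.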

1105560808 · 2024-11-04T03:54:37Z

The dev is an unreleased branch. It may have many unstable issues.

I found that this issue still exists in the latest 3.2.2-release version. During Master failover, even when the original Master has not actually exited, multiple Masters may execute the same workflow at the same time.

SbloodyS · 2024-11-05T09:06:34Z

We will remove the waiting strategy in the next version.

1105560808 · 2024-11-05T12:04:51Z

We will remove the waiting strategy in the next version.

With the stop strategy, the following issue may occur. After the Master dispatches a task to a Worker, the Worker submits the task to its thread pool for execution and tries to reply to the Master. If the Master has already stopped by then, the Worker retries the connection to that Master several times before throwing an exception. At this point the Worker may already be executing the task, but because the original Master has failed and triggered failover, another Master takes over. Could this situation lead to duplicate task execution?

1105560808 added the bug and Waiting for reply labels Nov 1, 2024

SbloodyS added the need more information label and removed the Waiting for reply label Nov 1, 2024

SbloodyS removed the need more information label Nov 4, 2024

SbloodyS assigned ruanwenjun Nov 5, 2024
