Honor all non-completion commands #569

dandavison · 2024-07-01T16:44:24Z

Fixes #528

With this change we honor all non-completion commands emitted by workflow coroutines, even if they come after a completion command (i.e. complete/CAN/cancel/fail). A consequence is that when an update completion is returned in the same WFT response as a workflow completion, the client will always get the update response; previously that was only the case if the update handler returned prior to any completion command being emitted by another coroutine.

The solution involves devolving responsibility for this logic to core: see the explanatory code comments in temporalio/sdk-core#776.
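A minimal sketch of the behavior described above, modeling commands as plain strings rather than SDK protobuf messages. The function name and the update-related labels are hypothetical, and placing the surviving completion command last is an assumption about what core does, not something this description states.

```python
from typing import List

# Completion commands named in the description: complete / CAN / cancel / fail.
COMPLETION_COMMANDS = {
    "complete_workflow_execution",
    "continue_as_new_workflow_execution",
    "cancel_workflow_execution",
    "fail_workflow_execution",
}


def honor_non_completion_commands(commands: List[str]) -> List[str]:
    """Keep every non-completion command and only the first completion command.

    Assumption: the surviving completion command is emitted last, so that
    everything else in the WFT response is applied before the workflow closes.
    """
    non_completion = [c for c in commands if c not in COMPLETION_COMMANDS]
    first_completion = [c for c in commands if c in COMPLETION_COMMANDS][:1]
    return non_completion + first_completion


# Hypothetical WFT: the update handler returns after the main coroutine has returned.
raw = ["update_accepted", "complete_workflow_execution", "update_completed"]
honored = honor_non_completion_commands(raw)
print(honored)
```

Under this model the update completion survives, so the client gets its update response even though the workflow completed in the same WFT.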

Evidence that this is correct

$ pytest -k "completion" tests/worker/test_workflow.py

# These two fail on main
tests/worker/test_workflow.py::test_update_completion_is_honored_when_after_workflow_return_1 PASSED
tests/worker/test_workflow.py::test_update_completion_is_honored_when_after_workflow_return_2 PASSED

# These two check that we respect ordering of completion commands. They pass on main as well as with the new changes.
tests/worker/test_workflow.py::test_first_of_two_signal_completion_commands_is_honored PASSED
tests/worker/test_workflow.py::test_workflow_return_is_honored_when_it_precedes_signal_completion_command PASSED

cretz

LGTM, but want to wait until temporalio/sdk-core#776 settles and the submodule can be updated to master.

cretz · 2024-07-15T22:26:54Z

temporalio/worker/_workflow_instance.py

-        # If there are successful commands, we must remove all
-        # non-query-responses after terminal workflow commands. We must do this
-        # in place to avoid the copy-on-write that occurs when you reassign.
-        seen_completion = False
-        i = 0
-        while i < len(self._current_completion.successful.commands):
-            command = self._current_completion.successful.commands[i]
-            if not seen_completion:
-                seen_completion = (
-                    command.HasField("complete_workflow_execution")
-                    or command.HasField("continue_as_new_workflow_execution")
-                    or command.HasField("fail_workflow_execution")
-                    or command.HasField("cancel_workflow_execution")
-                )
-            elif not command.HasField("respond_to_query"):
-                del self._current_completion.successful.commands[i]
-                continue
-            i += 1
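For readers who want to try the removed logic in isolation, here is a standalone model of that loop. It works on a plain list of strings instead of the protobuf command list, and the function name is hypothetical. The in-place `del` mirrors the original's choice to mutate rather than reassign.

```python
from typing import List

TERMINAL_COMMANDS = {
    "complete_workflow_execution",
    "continue_as_new_workflow_execution",
    "fail_workflow_execution",
    "cancel_workflow_execution",
}


def drop_post_terminal_commands(commands: List[str]) -> None:
    """Mutate `commands` in place: after the first terminal command,
    delete everything except query responses."""
    seen_completion = False
    i = 0
    while i < len(commands):
        command = commands[i]
        if not seen_completion:
            seen_completion = command in TERMINAL_COMMANDS
        elif command != "respond_to_query":
            # Deleting shifts the next element into index i, so do not advance.
            del commands[i]
            continue
        i += 1


cmds = [
    "start_timer",
    "complete_workflow_execution",
    "start_timer",
    "respond_to_query",
    "fail_workflow_execution",
]
drop_post_terminal_commands(cmds)
print(cmds)
```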


Any concerns that removing this is backwards incompatible with already-completed workflows, or can you confirm that in no-flag-replaying situations the core behavior was always the same (sans query stuff)? One thing you can do is make a workflow that has a post-complete command, run it on an older SDK, grab the JSON history, and run the replayer in tests here with the new code.

I'm not quite following this bit of the question:

> or can you confirm that in no-flag-replaying situations the core behavior was always the same

Here's how I am thinking of it:

sdk-python v1.0 was released in Jan 2023, and dropped post-terminal commands from the beginning, until this change.

Therefore, prior to this change, all Python WFTs had their post-terminal commands dropped.

Incidentally, Core has also been dropping post-terminal commands since March 2023: see "Drop all post-terminal commands & sort activation jobs" (sdk-core#502).

The new SDK code drops post-terminal commands when replaying without the flag set, and there is test coverage for this: https://github.com/temporalio/sdk-core/blob/master/core/src/core_tests/workflow_tasks.rs#L2558-L2577. Therefore we do not expect NDEs: the command sequence applied to core state machines when replaying without the flag will be the same as it was prior to this change.
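The no-NDE argument above can be checked on a toy model. This is only a sketch under stated assumptions: commands are plain strings, query responses are ignored, all function names are hypothetical, and the flag-set branch (non-terminal commands first, then the first terminal command) is an assumption about core rather than something confirmed in this thread.

```python
from itertools import product
from typing import List

TERMINAL = {"complete_workflow_execution", "fail_workflow_execution"}


def truncate_at_first_terminal(commands: List[str]) -> List[str]:
    """Keep commands up to and including the first terminal command."""
    kept: List[str] = []
    for command in commands:
        kept.append(command)
        if command in TERMINAL:
            break
    return kept


def old_pipeline(commands: List[str]) -> List[str]:
    # Before this change: Python truncated, then core truncated again (a no-op).
    return truncate_at_first_terminal(truncate_at_first_terminal(commands))


def new_pipeline(commands: List[str], flag_set: bool) -> List[str]:
    # After this change: Python passes everything through; core decides.
    if not flag_set:
        return truncate_at_first_terminal(commands)
    non_terminal = [c for c in commands if c not in TERMINAL]
    first_terminal = [c for c in commands if c in TERMINAL][:1]
    return non_terminal + first_terminal


# Exhaustively compare both pipelines on all short sequences, flag unset.
ALPHABET = ["start_timer", "complete_workflow_execution", "fail_workflow_execution"]
sequences = [list(s) for n in range(4) for s in product(ALPHABET, repeat=n)]
mismatches = [s for s in sequences if old_pipeline(s) != new_pipeline(s, flag_set=False)]
print(f"{len(mismatches)} mismatches when replaying without the flag")
```

In this model the two pipelines agree on every sequence when the flag is unset, which is the property the replay tests are meant to establish for real histories.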

> I'm not quite following this bit of the question:

Think about a user with an old workflow (i.e. sans flag). If you remove the old Python behavior that runs sans flag, it now relies on the old Core behavior sans flag. If that old behavior doesn't match Python's old behavior, they will get a non-determinism error. So we need to confirm that old Core code does the same thing as old Python code before removing old Python code. Did they drop post-terminal commands the same way? If so, we're all good here.

> The new SDK code drops post-terminal commands when replaying without the flag set, and there is test coverage for this

IMO you should grab a workflow history JSON or two from a workflow that had post-terminal commands on a Python SDK version before this change, then run it through a replayer in a test on this version. There are a couple of other JSON files in the test suite; you can see how their tests do this. Also, I assume the test in this PR is testing that commands after workflow complete are now properly included?

> Think about a user with an old workflow (i.e. sans flag). If you remove the old Python behavior that runs sans flag, it now relies on the old Core behavior sans flag. If that old behavior doesn't match Python's old behavior, they will get a non-determinism error. So we need to confirm that old Core code does the same thing as old Python code before removing old Python code. Did they drop post-terminal commands the same way? If so, we're all good here.

Personally I would substitute s/old Core/new Core/ throughout this paragraph, since we're never going to be running old Core code: rather it's new Core code which, when replaying without the flag, is intended to behave as old Core did (i.e. truncating at first terminal command). This is tested in two different ways in the Core test suite, but I agree that SDK-specific tests replaying old workflows with post-terminal commands would be good too.

👍 Makes sense, yeah whatever the terms are that mean "Workflows with post-complete commands on previous Python SDK versions work the exact same with this PR"

Added the replay backward compatibility test. This should be ready to go.

Sushisource · 2024-08-05T18:12:29Z

tests/worker/test_replayer.py

+    The UpdateCompletionAfterWorkflowReturn workflow above features an update handler that returns
+    after the main workflow coroutine has exited. It will (if an update is sent in the first WFT)
+    generate a raw command sequence (before sending to core) of


Yeah, great comment, makes sense 👍

dandavison force-pushed the sdk-528-let-coroutines-complete-before-setting-completion branch from 61e8a3c to ffa761c Compare July 15, 2024 11:24

dandavison changed the title ~~Let workflow coroutines settle before setting completion~~ Honor all non-completion commands Jul 15, 2024

dandavison force-pushed the sdk-528-let-coroutines-complete-before-setting-completion branch 2 times, most recently from d553b0b to 9266c22 Compare July 15, 2024 13:03

dandavison mentioned this pull request Jul 15, 2024

Honor all non-terminal commands temporalio/sdk-core#776

Merged

dandavison marked this pull request as ready for review July 15, 2024 22:18

dandavison requested a review from a team as a code owner July 15, 2024 22:18

cretz reviewed Jul 15, 2024


dandavison force-pushed the sdk-528-let-coroutines-complete-before-setting-completion branch 6 times, most recently from bc53e49 to 337cae1 Compare July 23, 2024 20:17

dandavison added 5 commits August 2, 2024 16:58

Honor commands generated after the first completion command

bd7930c

Add type annotation needed by mypy

64ece8e

Update core

42d66db

Add test that timer can be started after workflow completion

ab55ee6

Skip update tests under Java server

8c5e9c3

dandavison force-pushed the sdk-528-let-coroutines-complete-before-setting-completion branch 4 times, most recently from 72961fa to 7a847d5 Compare August 3, 2024 01:30

Test replay backwards compatibility

7a847d5

cretz approved these changes Aug 5, 2024


Sushisource approved these changes Aug 5, 2024


dandavison merged commit 50914c4 into main Aug 5, 2024
12 checks passed

dandavison deleted the sdk-528-let-coroutines-complete-before-setting-completion branch August 5, 2024 18:19

dandavison added a commit to dandavison/temporalio-samples-python that referenced this pull request Aug 5, 2024

Replay test wf for temporalio/sdk-python#569

41f918f

dandavison added a commit to temporalio/samples-python that referenced this pull request Aug 12, 2024

Replay test wf for temporalio/sdk-python#569

e7775e6

dandavison added a commit to dandavison/temporalio-samples-python that referenced this pull request Aug 27, 2024

Replay test wf for temporalio/sdk-python#569

5da3ca9
