
Allow proper stack trace on eviction deadlock #530

Merged
merged 3 commits into temporalio:main from eviction-deadlock-stack on May 28, 2024

Conversation

cretz
Member

@cretz commented May 20, 2024

What was changed

Wrapped the tasks in a list when gathering the stack trace so that anything mutating the task set during iteration doesn't affect the trace. Also added another deadlock test for this scenario, shrank the deadlock timeout in the worker to 1 second during the test, fixed a minor sandbox issue, and made a minor change to the error message on eviction failure.
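
For illustration, here's a minimal sketch of the pattern behind the fix (the names below are made up for the example, not the SDK's actual code): copying the task set into a list before iterating takes a point-in-time snapshot, so a task added or removed while the stack trace is being gathered can no longer raise the "RuntimeError: Set changed size during iteration" seen in the review log below.

import asyncio
import traceback


def dump_task_stacks(tasks: set[asyncio.Task]) -> str:
    """Format a stack trace for every task in a set that may be mutated concurrently."""
    lines = []
    # list(tasks) copies the current membership, so later additions to or
    # removals from `tasks` cannot affect this loop.
    for task in list(tasks):
        frames = task.get_stack(limit=10)
        if not frames:
            continue
        lines.append(f"Task {task.get_name()}:")
        lines.append("".join(traceback.format_stack(frames[-1], limit=10)))
    return "\n\n".join(lines)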

Checklist

  1. Closes [Bug] Can crash SDK with non-halting workflow #527

@cretz requested a review from a team as a code owner May 20, 2024 13:24

# Send cancel but don't wait
worker_task.cancel()

Contributor

I'm finding that this test passes even with the bug fix reverted.

(I know that some people disagree, but personally I think that it's a good idea to structure PRs like this as an initial commit with a failing test, followed by a commit that fixes the failure.)

sdk-python(eviction-deadlock-stack) .venv/bin/pytest tests/worker/test_workflow.py::test_workflow_eviction_deadlock                                                                               09:46:21 [3/5530]
================================================================================================ test session starts =================================================================================================
platform darwin -- Python 3.10.13, pytest-7.4.3, pluggy-1.0.0
rootdir: /Users/dan/src/temporalio/sdk-python
configfile: pyproject.toml
plugins: timeout-2.2.0, asyncio-0.21.1
timeout: 600.0s
timeout method: signal
timeout func_only: True
asyncio: mode=auto
collected 1 item

tests/worker/test_workflow.py::test_workflow_eviction_deadlock
--------------------------------------------------------------------------------------------------- live log call ----------------------------------------------------------------------------------------------------
09:46:20 [   ERROR] Failed handling activation on workflow with run ID 70ba6973-f045-46e2-b5c0-78a944e41904 (_workflow.py:271)
Traceback (most recent call last):
  File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 249, in _handle_activation
    completion = await asyncio.wait_for(
  File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 253, in _handle_activation
    raise RuntimeError(
RuntimeError: [TMPRL1101] Potential deadlock detected, workflow didn't yield within 1 second(s)
09:46:20 [   ERROR] Failed running eviction job, not evicting. Since eviction could not be processed, this worker cannot complete and the slot will remain forever used. (_workflow.py:264)
Traceback (most recent call last):
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 249, in _handle_activation
    completion = await asyncio.wait_for(
  File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_runner.py", line 157, in activate
    self._run_code(
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_runner.py", line 172, in _run_code
    exec(code, self.globals_and_locals, self.globals_and_locals)
  File "<string>", line 2, in <module>
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_in_sandbox.py", line 81, in activate
    return self.instance.activate(act)
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow_instance.py", line 361, in activate
    + f"Stack traces below:\n\n{self._stack_trace()}"
  File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow_instance.py", line 1779, in _stack_trace
    for task in self._tasks:
RuntimeError: Set changed size during iteration
09:46:21 [    INFO] Worker cancelled, shutting down (_worker.py:479)
09:46:21 [    INFO] Beginning worker shutdown, will wait 0:00:00 before cancelling activities (_worker.py:485)
09:46:21 [ WARNING] Shutting down workflow worker, but 1 workflow(s) could not be evicted previously, so the shutdown will hang (_workflow.py:172)
PASSED

Member Author

Yes, there's no easy way to assert which eviction-failure reason occurred without extracting logs or adding other testing-only helpers. The test in this case was more about the observed behavior, as I struggled to justify adding injection points just for this test. But if we think it's important, I can add hooks in the runtime code for this one test.
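
(If asserting the reason did become important, one option for the log-extraction route would be something like the rough sketch below. It is not what this PR does, and the logger name and message fragment are assumptions based on the error output above.)

import logging

import pytest


@pytest.mark.asyncio
async def test_eviction_failure_is_logged(caplog: pytest.LogCaptureFixture) -> None:
    # Assumes the worker's eviction-failure message propagates through the
    # standard logging module under a "temporalio" logger, as the output above suggests.
    caplog.set_level(logging.ERROR, logger="temporalio")
    ...  # run the worker and drive the deadlocking workflow here
    assert any(
        "Failed running eviction job" in record.getMessage()
        for record in caplog.records
    )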

Comment on lines +264 to +267
logger.exception(
    "Failed running eviction job, not evicting. "
    + "Since eviction could not be processed, this worker cannot complete and the slot will remain forever used."
)
Member

I'm wondering if it'd make more sense to just panic and crash at this point

Member Author

This PR isn't really about changing that behavior; we can have a general discussion on it if we want, though I don't think it should hold up this change.

To do this, we might need some Core support. Today Core won't shut down a worker if there is an outstanding eviction, and we don't want to complete an eviction if the workflow can't properly tear down, because we don't want to accept more work for that workflow. What will Core do if we fail the eviction instead of hanging it?

Also, a deadlocked workflow fataling the worker is a bit rough on users. Unfortunately, from a Python point of view a deadlock often means a failed eviction stuck in the cache forever. Technically we could move can't-evict workflows off to some other dict, but that just shifts the memory burden; it doesn't help the end user. This whole situation should be rare, hence why I was hesitant to build in some continually-retry-eviction logic in case the user's non-deterministic code started working again (rare/unlikely).

Now that I think about it, I think other SDKs like Go and Java have this issue too. If they can't tear down their coroutines/threads and it hangs, there's nothing they can do, but I don't think they fatal the worker.

Member

If you fail the eviction I believe it just gets re-issued. We could make a special case for this but I'd rather not.

And yeah, fair point about the other SDKs and how this is likely when you do deadlock. I'm fine with just eating up the slot.

@cretz merged commit afadc15 into temporalio:main May 28, 2024
12 checks passed
@cretz deleted the eviction-deadlock-stack branch May 28, 2024 22:48