Allow proper stack trace on eviction deadlock #530
Conversation
# Send cancel but don't wait
worker_task.cancel()
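For context on the "don't wait" part of that comment: `Task.cancel()` only requests cancellation; the task is not finished until the event loop runs it again and something awaits it. A minimal sketch of those semantics (names here are illustrative, not the SDK's):

```python
import asyncio


async def worker() -> None:
    # Stand-in for the real worker loop; parks until cancelled
    await asyncio.sleep(3600)


async def main() -> tuple[bool, bool]:
    worker_task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker start and park in sleep()

    # Send cancel but don't wait: cancel() only *requests* cancellation
    worker_task.cancel()
    done_immediately = worker_task.done()  # still False: nothing has run yet

    # Only awaiting the task actually observes the cancellation completing
    try:
        await worker_task
    except asyncio.CancelledError:
        pass
    return done_immediately, worker_task.cancelled()


done_immediately, cancelled_after_await = asyncio.run(main())
print(done_immediately, cancelled_after_await)  # False True
```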
I'm finding that this test passes even with the bug fix reverted.
(I know that some people disagree, but personally I think that it's a good idea to structure PRs like this as an initial commit with a failing test, followed by a commit that fixes the failure.)
sdk-python(eviction-deadlock-stack) .venv/bin/pytest tests/worker/test_workflow.py::test_workflow_eviction_deadlock
================================================================================================ test session starts =================================================================================================
platform darwin -- Python 3.10.13, pytest-7.4.3, pluggy-1.0.0
rootdir: /Users/dan/src/temporalio/sdk-python
configfile: pyproject.toml
plugins: timeout-2.2.0, asyncio-0.21.1
timeout: 600.0s
timeout method: signal
timeout func_only: True
asyncio: mode=auto
collected 1 item
tests/worker/test_workflow.py::test_workflow_eviction_deadlock
--------------------------------------------------------------------------------------------------- live log call ----------------------------------------------------------------------------------------------------
09:46:20 [ ERROR] Failed handling activation on workflow with run ID 70ba6973-f045-46e2-b5c0-78a944e41904 (_workflow.py:271)
Traceback (most recent call last):
File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 249, in _handle_activation
completion = await asyncio.wait_for(
File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 253, in _handle_activation
raise RuntimeError(
RuntimeError: [TMPRL1101] Potential deadlock detected, workflow didn't yield within 1 second(s)
09:46:20 [ ERROR] Failed running eviction job, not evicting. Since eviction could not be processed, this worker cannot complete and the slot will remain forever used. (_workflow.py:264)
Traceback (most recent call last):
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow.py", line 249, in _handle_activation
completion = await asyncio.wait_for(
File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/Users/dan/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_runner.py", line 157, in activate
self._run_code(
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_runner.py", line 172, in _run_code
exec(code, self.globals_and_locals, self.globals_and_locals)
File "<string>", line 2, in <module>
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/workflow_sandbox/_in_sandbox.py", line 81, in activate
return self.instance.activate(act)
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow_instance.py", line 361, in activate
+ f"Stack traces below:\n\n{self._stack_trace()}"
File "/Users/dan/src/temporalio/sdk-python/temporalio/worker/_workflow_instance.py", line 1779, in _stack_trace
for task in self._tasks:
RuntimeError: Set changed size during iteration
09:46:21 [ INFO] Worker cancelled, shutting down (_worker.py:479)
09:46:21 [ INFO] Beginning worker shutdown, will wait 0:00:00 before cancelling activities (_worker.py:485)
09:46:21 [ WARNING] Shutting down workflow worker, but 1 workflow(s) could not be evicted previously, so the shutdown will hang (_workflow.py:172)
PASSED
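The `RuntimeError: Set changed size during iteration` at the bottom of that trace can be reproduced in isolation. This is a minimal sketch of the failure mode (not the SDK's actual code): iterating a set while it is mutated mid-loop, as can happen when the stack-trace code walks the live task set.

```python
tasks = {f"task-{i}" for i in range(10)}

error = None
try:
    for t in tasks:
        # Simulates a task being removed from the live set while the
        # stack-trace code is still iterating over it
        tasks.discard(t)
except RuntimeError as e:
    error = str(e)

print(error)  # Set changed size during iteration
```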
Yes, there's no easy way without extracting logs or adding other testing-only helpers to assert which reason exists for eviction failure. The test in this case was more for observed behavior as I struggled to justify adding injection points for this test. But if we think it's important, I can add hooks in the runtime code for this one test.
logger.exception(
    "Failed running eviction job, not evicting. "
    + "Since eviction could not be processed, this worker cannot complete and the slot will remain forever used."
)
I'm wondering if it'd make more sense to just panic and crash at this point
This PR is not really about changing this behavior, but we can have a general discussion on that if we want, though I don't think it should hold up this PR.
To do this, we might need some Core support. Today Core won't shutdown a worker if there is an outstanding eviction, and we don't want to return an eviction if it can't properly tear down because we don't want to accept more work for this workflow. What will Core do if we fail the eviction instead of hang it?
Also, a deadlocked workflow fatal'ing the worker is a bit rough on users. Unfortunately deadlocked many times means failed-eviction-in-cache-forever from a Python POV. Technically we could move can't-evict workflows off to some other dict, but that's just shifting the memory burden, it doesn't help the end user. This whole situation should be rare, hence why I was hesitant to build in some continually-retry-eviction type of logic in case the user's non-deterministic stuff started working (rare/unlikely).
Now that I think about it, I think other SDKs like Go and Java have this issue too. If they can't tear down their coroutines/threads and it hangs, there's nothing they can do, but I don't think they fatal the worker.
If you fail the eviction I believe it just gets re-issued. We could make a special case for this but I'd rather not.
And yeah, fair point about the other SDKs and how this is likely when you do deadlock. I'm fine with just eating up the slot.
What was changed
Wrapped the task set in a list when gathering the stack trace, so that anything mutating the task set during iteration doesn't break the trace. Also added another deadlock test for this scenario, shrunk the deadlock timeout in the worker to 1 second during the test, fixed a minor sandbox issue, and made a minor change to the error message on eviction failure.
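The described fix, snapshotting the set with `list()` before iterating, can be sketched like this (a hypothetical simplification, not the actual `_stack_trace` code):

```python
tasks = {f"task-{i}" for i in range(10)}

# Snapshot into a list first: subsequent mutation of `tasks` no longer
# invalidates the iterator the loop is using, so the trace can be
# gathered safely even if tasks are added or removed concurrently.
for t in list(tasks):
    tasks.discard(t)  # would raise RuntimeError if iterating `tasks` directly

remaining = len(tasks)
print(remaining)  # 0
```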
Checklist