Safe Eviction #499
Conversation
Largely makes sense to me, just want to confirm something.
```python
try:
    # Wait for signal count to reach 2
    await asyncio.sleep(0.01)
```
What's this guy here for?
Hrmm, I originally thought I added it just to have more things to try to trip up eviction, but I am seeing some oddities if I remove it. I am investigating.
After testing, I am seeing some strange server/core behavior here. If you remove this and the other 0.01 sleep in this test, it hangs. If you enable logging of the protos (i.e. setting `LOG_PROTOS = True` in `worker/_workflow.py`), it hangs but only for 10s. There is some timing issue causing a hang somewhere (and even with logging, the 10s hang is confusing too).

If you have the time, can you help debug this from a core POV? To replicate, clone the repo and set up the local dev environment (https://github.com/temporalio/sdk-python?tab=readme-ov-file#local-sdk-development-environment), then run `poe test -s --log-cli-level=DEBUG -k test_cache_eviction_tear_down`. Then remove the two 0.01 sleeps, run again, and see if it hangs for you. Similarly, enable that logging and see if it just has the 10s hiccup.
Ok, I cannot replicate against a modern server, only against an old CLI server.

Overall, yes, this `asyncio.sleep(0.01)` is just there to add extra things to trip up the task collector; it can be removed and everything works without issue.
```
@@ -155,6 +160,13 @@ async def run(self) -> None:
        if self._throw_after_activation:
            raise self._throw_after_activation

    def notify_shutdown(self) -> None:
        if self._could_not_evict_count:
            logger.warn(
```
I wonder if this warrants 'error' level. Sounds like something the user should most definitely know about. I know in our case the task will just get SIGKILL'ed, but this is just a red flag in general.
True, but this could also happen while we are handling an exception that itself causes shutdown. But I think I agree this maybe should throw.
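For concreteness, here is a rough sketch of the "error instead of warning" alternative being discussed. Only `notify_shutdown` and `_could_not_evict_count` come from the diff above; the surrounding class is scaffolding for illustration, not the SDK's actual code.

```python
# Illustrative sketch of escalating unevictable workflows to error level.
import logging

logger = logging.getLogger(__name__)


class WorkflowWorkerSketch:  # placeholder class, not the SDK's worker
    def __init__(self) -> None:
        self._could_not_evict_count = 0

    def notify_shutdown(self) -> None:
        if self._could_not_evict_count:
            # error (or even raising) makes stuck evictions loud: each one
            # permanently holds a workflow task slot and blocks worker shutdown
            logger.error(
                "Shutting down workflow worker with %d workflow(s) that could not be evicted",
                self._could_not_evict_count,
            )
```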
```
@@ -234,6 +247,17 @@ async def _handle_activation(
                        f"[TMPRL1101] Potential deadlock detected, workflow didn't yield within {self._deadlock_timeout_seconds} second(s)"
                    )
        except Exception as err:
            # We cannot fail a cache eviction, we must just log and not complete
```
Would something like an exception or syntax error in a `finally` block of the workflow code cause this?
I.e. is this just a workflow task activation failure that happens to occur during eviction?
> Would something like an exception or syntax error in a `finally` block of the workflow code cause this?

No, workflow code exceptions would not bubble out here; we swallow them in the workflow instance.

> I.e. is this just a workflow task activation failure that happens to occur during eviction?

Correct, and I actually have a hard time replicating it with anything besides a deadlock.
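To make the confirmed behavior concrete, here is a minimal sketch of the control flow. The function name, signature, and helper callables are hypothetical, not the SDK's actual `_handle_activation`; the point is only that a failure during an eviction is logged and the activation is never completed as failed.

```python
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def handle_activation_sketch(
    run_jobs: Callable[[], Awaitable[None]],          # hypothetical: runs the activation's jobs
    complete_as_failed: Callable[[BaseException], Awaitable[None]],  # hypothetical: fails the task
    *,
    is_eviction: bool,
) -> None:
    try:
        await run_jobs()
    except Exception as err:
        if is_eviction:
            # We cannot fail a cache eviction; log and do not complete. The run
            # then keeps its cache/task slot, which is why a stuck eviction can
            # block worker shutdown as described in the PR text.
            logger.exception("Failed to evict workflow run")
            return
        await complete_as_failed(err)
```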
What was changed
See #494. Today, if you evict a workflow that has incomplete coroutines, we simply delete it from the workflow map. This means Python will garbage collect these coroutines at some point in the future. Garbage collection of coroutines involves throwing a `GeneratorExit` within them, which can cause the coroutine to wake up on any thread to handle this `GeneratorExit`. Therefore, `finally` may execute in another thread which may be running another workflow at the time. This is very bad, and despite all attempts, we cannot reasonably intercept Python's coroutine garbage collection or the `GeneratorExit` behavior here.

So we have refactored the eviction process to cancel all outstanding tasks and to ignore or raise during any side effects attempted (e.g. commands). This is similar to other SDKs that have to tear down coroutines. However, there are cases where a user may have done something invalid and the cancel may not complete the coroutine. This will log an error, hang the eviction, and forever use up that task slot. It will also prevent worker shutdown. We can discuss ways to improve this if needed.
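As a standalone illustration of the problem described above (this is not SDK code): when CPython collects a suspended coroutine, it throws `GeneratorExit` into it, so the coroutine's `finally` runs at collection time rather than under the workflow's control, and in a worker that collection can happen on a thread that is busy running a different workflow.

```python
import asyncio
import gc


async def workflow_like() -> None:
    try:
        await asyncio.sleep(1000)  # suspended "forever", like an unfinished handler
    finally:
        print("finally ran via GeneratorExit when the coroutine was collected")


async def main() -> None:
    coro = workflow_like()
    coro.send(None)  # manually advance the coroutine to its first suspension point
    # Drop the only reference: CPython finalizes the suspended coroutine by
    # throwing GeneratorExit into it, so the finally block above runs here.
    del coro
    gc.collect()


asyncio.run(main())
```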
This supersedes #325/#341, which we previously thought were good enough to handle `GeneratorExit`.

What changed:

- `disable_safe_workflow_eviction` which, if set to `True`, will perform the old behavior of letting GC collect coroutines (usage sketched below)
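For completeness, a hedged usage sketch of that escape hatch, assuming `disable_safe_workflow_eviction` is exposed as a keyword argument on `temporalio.worker.Worker`; the workflow, task queue, and server address here are placeholders.

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class ExampleWorkflow:  # placeholder workflow for illustration
    @workflow.run
    async def run(self) -> None:
        pass


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="example-task-queue",
        workflows=[ExampleWorkflow],
        # Escape hatch only: opts back into the old behavior of letting GC
        # collect coroutines instead of the new safe eviction (assumed kwarg).
        disable_safe_workflow_eviction=True,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```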