
Safe Eviction #499

Merged: 6 commits merged into temporalio:main from generator-exit-prevention on Apr 5, 2024
Conversation

@cretz (Member) commented Apr 2, 2024

What was changed

See #494. Today, if you evict a workflow that has incomplete coroutines, we simply delete it from the workflow map. This means Python will garbage collect those coroutines at some point in the future. Garbage collection of a coroutine involves throwing a GeneratorExit within it, which can cause the coroutine to wake up on any thread to handle that GeneratorExit. Therefore, a finally block may execute on another thread, one that may be running another workflow at the time. This is very bad, and despite all attempts, we cannot reasonably intercept Python's coroutine garbage collection or the GeneratorExit behavior here.
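
To illustrate the hazard in isolation, here is a minimal standalone sketch (plain Python, not SDK code; Forever and workflow_like are made-up names) showing that an abandoned, suspended coroutine has GeneratorExit thrown into it at collection time, so its finally block runs outside anyone's control:

import gc

class Forever:
    def __await__(self):
        yield  # suspend here and never resume

async def workflow_like():
    try:
        await Forever()
    finally:
        # Under plain GC this runs whenever (and on whatever thread)
        # collection happens to occur; that is the hazard described above.
        print("finally ran via GeneratorExit at collection time")

coro = workflow_like()
coro.send(None)  # start the coroutine; it suspends at the yield
del coro         # abandon it: finalization throws GeneratorExit into the
gc.collect()     # frame, so the finally block executes during collection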

So we have refactored the eviction process to cancel all outstanding tasks and to ignore, or raise on, any side effects attempted during teardown (e.g. commands). This is similar to other SDKs that have to tear down coroutines. However, there are cases where a user may have done something invalid and the cancel may not complete the coroutine. This will log an error, hang the eviction, and use up that task slot forever. It will also prevent worker shutdown. We can discuss ways to improve this if needed.
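
Conceptually, the teardown is something like the following sketch (a simplification under assumed names; tear_down and its tasks argument are illustrative, not the SDK's actual internals):

import asyncio

async def tear_down(tasks: set[asyncio.Task]) -> None:
    # Cancel every outstanding task belonging to the evicted workflow...
    for task in tasks:
        task.cancel()
    # ...then wait for all of them to finish unwinding. return_exceptions
    # keeps one task's CancelledError from cutting the wait short.
    await asyncio.gather(*tasks, return_exceptions=True)
    # Note: if user code swallows cancellation (e.g. a bare
    # "except BaseException: pass" around an await), the gather above
    # never completes; that is the hung-eviction case described above.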

This supersedes #325 and #341, which we previously thought were good enough to handle GeneratorExit.

What changed:

  • Refactored eviction to assume that eviction now comes in its own activation (see sdk-core#712, "Evictions in their own activations")
  • Refactored eviction to send the eviction job to the workflow instance and not remove from cache until it is completed
  • Added code in workflow instance on eviction job to cancel all outstanding tasks and wait on their completion
  • Added disable_safe_workflow_eviction which, if set to True, restores the old behavior of letting GC collect coroutines (see the usage sketch after this list)
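
Opting out might look like the following usage sketch (client, the task queue, and MyWorkflow are placeholders; only the disable_safe_workflow_eviction option itself comes from this PR):

from temporalio.worker import Worker

worker = Worker(
    client,                  # placeholder: an already-connected Client
    task_queue="my-queue",   # placeholder task queue
    workflows=[MyWorkflow],  # placeholder workflow class
    # Opt back into the old behavior of letting GC collect coroutines on
    # eviction; the default (False) uses the new safe-eviction path.
    disable_safe_workflow_eviction=True,
)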

Checklist

  1. Closes [Bug] Commands sent during finally of a cache eviction may cross workflow contexts #494

@cretz marked this pull request as ready for review April 2, 2024 13:27
@cretz requested a review from a team as a code owner April 2, 2024 13:27
@Sushisource (Member) left a comment:

Largely makes sense to me, just want to confirm something.

try:
    # Wait for signal count to reach 2
    await asyncio.sleep(0.01)
Member commented:

What's this guy here for?

@cretz (Member Author) replied:

Hrmm, I originally thought I added it just to have more things to try to trip up eviction, but I am seeing some oddities if I remove it. I am investigating.

@cretz (Member Author) commented Apr 3, 2024:

After testing, I am seeing some strange server/core behavior here. If you remove this and the other 0.01 sleep in this test, it hangs. If you enable logging of the protos (i.e. set LOG_PROTOS = True in worker/_workflow.py), it still hangs, but only for 10s. There is some timing issue causing a hang somewhere (and even with logging, the 10s hang is confusing too).

If you have the time, can you help debug this from a core POV? To replicate, clone the repo, set up the local dev environment (https://github.com/temporalio/sdk-python?tab=readme-ov-file#local-sdk-development-environment), and run poe test -s --log-cli-level=DEBUG -k test_cache_eviction_tear_down. Then remove the two 0.01 sleeps, run again, and see if it hangs for you. Similarly, enable that logging and see if it just has the 10s hiccup.

@cretz (Member Author) commented:

Ok, I cannot replicate against a modern server, only against an old CLI server.

Overall, yes, this asyncio.sleep(0.01) is just there to add extra things to trip up the task collector; it can be removed and everything works without issue.

tests/worker/test_workflow.py (thread resolved)
@@ -155,6 +160,13 @@ async def run(self) -> None:
        if self._throw_after_activation:
            raise self._throw_after_activation

    def notify_shutdown(self) -> None:
        if self._could_not_evict_count:
            logger.warn(
Contributor commented:

I wonder if this warrants 'error' level. Sounds like something the user should most definitely know about.

I know in our case the task will just get SIGKILLed, but this is just a red flag in general.

@cretz (Member Author) replied:

True, but this could also happen while handling an exception that causes shutdown. That said, I think I agree this maybe should throw.
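
For what it's worth, a throwing variant might look like this sketch (hypothetical; the merged code logs a warning, and this raise only reflects the suggestion above, not what shipped):

def notify_shutdown(self) -> None:
    if self._could_not_evict_count:
        # Fail loudly at shutdown instead of only warning; hung evictions
        # permanently consume task slots and block shutdown.
        raise RuntimeError(
            f"Shutting down worker with {self._could_not_evict_count} "
            "workflow(s) that could not be evicted"
        )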

@@ -234,6 +247,17 @@ async def _handle_activation(
                f"[TMPRL1101] Potential deadlock detected, workflow didn't yield within {self._deadlock_timeout_seconds} second(s)"
            )
        except Exception as err:
            # We cannot fail a cache eviction, we must just log and not complete
Contributor commented:

Would something like an exception or syntax error in finally of the workflow code cause this?
I.e. this is just a workflow task activation failure that happens to happen during eviction?

@cretz (Member Author) replied Apr 3, 2024:

> Would something like an exception or syntax error in finally of the workflow code cause this?

No, workflow code exceptions would not bubble out here; we swallow them in the workflow instance.

> I.e. this is just a workflow task activation failure that happens to happen during eviction?

Correct, and I actually have a hard time replicating this with anything besides deadlock.

@cretz merged commit 466da16 into temporalio:main Apr 5, 2024 (11 checks passed)
@cretz deleted the generator-exit-prevention branch April 5, 2024 14:00