
Safe Eviction #499

Merged: 6 commits merged into temporalio:main from generator-exit-prevention on Apr 5, 2024
Conversation

@cretz (Member) commented Apr 2, 2024

What was changed

See #494. Today, if you evict a workflow that has incomplete coroutines, we simply delete it from the workflow map. This means Python will garbage collect those coroutines at some point in the future. Garbage collection of a coroutine involves throwing a GeneratorExit within it, which can cause the coroutine to wake up on any thread to handle that GeneratorExit. Therefore, a finally block may execute on another thread, one that may be running another workflow at the time. This is very bad, and despite all attempts, we cannot reasonably intercept Python's coroutine garbage collection or the GeneratorExit behavior here.
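
To illustrate the hazard in isolation, here is a minimal standalone sketch (plain Python, not SDK code; Forever and workflow_like are made-up names) showing that an abandoned, suspended coroutine has GeneratorExit thrown into it at collection time, so its finally block runs outside anyone's control:

import gc

class Forever:
    def __await__(self):
        yield  # suspend here and never resume

async def workflow_like():
    try:
        await Forever()
    finally:
        # Under plain GC this runs whenever (and on whatever thread)
        # collection happens to occur; that is the hazard described above.
        print("finally ran via GeneratorExit at collection time")

coro = workflow_like()
coro.send(None)  # start the coroutine; it suspends at the yield
del coro         # abandon it: finalization throws GeneratorExit into the
gc.collect()     # frame, so the finally block executes during collection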

So we have refactored the eviction process to cancel all outstanding tasks and to ignore, or raise on, any side effects attempted during teardown (e.g. commands). This is similar to other SDKs that have to tear down coroutines. However, there are cases where a user may have done something invalid and the cancel may not complete the coroutine. This will log an error, hang the eviction, and use up that task slot forever. It will also prevent worker shutdown. We can discuss ways to improve this if needed.
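
Conceptually, the teardown is something like the following sketch (a simplification under assumed names; tear_down and its tasks argument are illustrative, not the SDK's actual internals):

import asyncio

async def tear_down(tasks: set[asyncio.Task]) -> None:
    # Cancel every outstanding task belonging to the evicted workflow...
    for task in tasks:
        task.cancel()
    # ...then wait for all of them to finish unwinding. return_exceptions
    # keeps one task's CancelledError from cutting the wait short.
    await asyncio.gather(*tasks, return_exceptions=True)
    # Note: if user code swallows cancellation (e.g. a bare
    # "except BaseException: pass" around an await), the gather above
    # never completes; that is the hung-eviction case described above.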

This supersedes #325 and #341, which we previously thought were good enough to handle GeneratorExit.

What changed:

  • Refactored eviction to assume that eviction now comes in its own activation (see sdk-core#712, "Evictions in their own activations")
  • Refactored eviction to send the eviction job to the workflow instance and not remove from cache until it is completed
  • Added code in workflow instance on eviction job to cancel all outstanding tasks and wait on their completion
  • Added disable_safe_workflow_eviction which, if set to True, restores the old behavior of letting GC collect coroutines (see the usage sketch after this list)
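
Opting out might look like the following usage sketch (client, the task queue, and MyWorkflow are placeholders; only the disable_safe_workflow_eviction option itself comes from this PR):

from temporalio.worker import Worker

worker = Worker(
    client,                  # placeholder: an already-connected Client
    task_queue="my-queue",   # placeholder task queue
    workflows=[MyWorkflow],  # placeholder workflow class
    # Opt back into the old behavior of letting GC collect coroutines on
    # eviction; the default (False) uses the new safe-eviction path.
    disable_safe_workflow_eviction=True,
)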

Checklist

  1. Closes [Bug] Commands sent during finally of a cache eviction may cross workflow contexts #494

@cretz marked this pull request as ready for review April 2, 2024 13:27
@cretz requested a review from a team as a code owner April 2, 2024 13:27
@Sushisource (Member) left a comment:

Largely makes sense to me, just want to confirm something.

try:
    # Wait for signal count to reach 2
    await asyncio.sleep(0.01)
Member commented:

What's this guy here for?

@cretz (Member Author) replied:

Hrmm, I originally thought I added it just to have more things to try to trip up eviction, but I am seeing some oddities if I remove it. I am investigating.

@cretz (Member Author) commented Apr 3, 2024:

After testing, I am seeing some strange server/core behavior here. If you remove this and the other 0.01 sleep in this test, it hangs. If you enable logging of the protos (i.e. set LOG_PROTOS = True in worker/_workflow.py), it still hangs, but only for 10s. There is some timing issue causing a hang somewhere (and even with logging, the 10s hang is confusing too).

If you have the time, can you help debug this from a core POV? To replicate, clone the repo, set up the local dev environment (https://github.com/temporalio/sdk-python?tab=readme-ov-file#local-sdk-development-environment), and run poe test -s --log-cli-level=DEBUG -k test_cache_eviction_tear_down. Then remove the two 0.01 sleeps, run again, and see if it hangs for you. Similarly, enable that logging and see if it just has the 10s hiccup.

@cretz (Member Author) commented:

Ok, I cannot replicate against a modern server, only against an old CLI server.

Overall, yes, this asyncio.sleep(0.01) is just there to add extra things to trip up the task collector; it can be removed and everything works without issue.

tests/worker/test_workflow.py (thread resolved)
@@ -155,6 +160,13 @@ async def run(self) -> None:
        if self._throw_after_activation:
            raise self._throw_after_activation

    def notify_shutdown(self) -> None:
        if self._could_not_evict_count:
            logger.warn(
Contributor commented:

I wonder if this warrants 'error' level. Sounds like something the user should most definitely know about.

I know in our case the task will just get SIGKILLed, but this is just a red flag in general.

@cretz (Member Author) replied:

True, but this could also happen while handling an exception that causes shutdown. That said, I think I agree this maybe should throw.
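
For what it's worth, a throwing variant might look like this sketch (hypothetical; the merged code logs a warning, and this raise only reflects the suggestion above, not what shipped):

def notify_shutdown(self) -> None:
    if self._could_not_evict_count:
        # Fail loudly at shutdown instead of only warning; hung evictions
        # permanently consume task slots and block shutdown.
        raise RuntimeError(
            f"Shutting down worker with {self._could_not_evict_count} "
            "workflow(s) that could not be evicted"
        )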

@@ -234,6 +247,17 @@ async def _handle_activation(
                f"[TMPRL1101] Potential deadlock detected, workflow didn't yield within {self._deadlock_timeout_seconds} second(s)"
            )
        except Exception as err:
            # We cannot fail a cache eviction, we must just log and not complete
Contributor commented:

Would something like an exception or syntax error in finally of the workflow code cause this?
I.e. this is just a workflow task activation failure that happens to happen during eviction?

@cretz (Member Author) replied Apr 3, 2024:

> Would something like an exception or syntax error in finally of the workflow code cause this?

No, workflow code exceptions would not bubble out here; we swallow them in the workflow instance.

> I.e. this is just a workflow task activation failure that happens to happen during eviction?

Correct, and I actually have a hard time replicating this with anything besides deadlock.

@cretz merged commit 466da16 into temporalio:main Apr 5, 2024 (11 checks passed)
@cretz deleted the generator-exit-prevention branch April 5, 2024 14:00