Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChildWorkflowFuture.Get hangs forever (and leaks a goroutine) if parent workflow times out #1407

Open
petergardfjall opened this issue Dec 3, 2024 · 5 comments

Comments

@petergardfjall
Copy link

Describe the bug
If a workflow times out while calling Get on a child workflow future (as illustrated below), the Get call never returns and a goroutine is leaked.

func MyWorkflow(ctx workflow.Context) error {
    // .. snip
    future := workflow.ExecuteChildWorkflow(childCtx, childWorkflow)
    future.Get(childCtx, nil) // <- hangs forever if MyWorkflow exceeds START_TO_CLOSE timeout.
    // .. snip
}

Notably, all other involved "entities" behave the way (I think) they shuld:

  • The caller of the parent workflow (MyWorkflow above) correctly sees a TimeoutType: START_TO_CLOSE.
  • The child workflow (childWorkflow above) correctly sees a CanceledError.
  • Any activities started by the child workflow correctly see a context canceled.

Steps to reproduce the behavior:
Find a sample application attached (sample.tar.gz) that can be used to reproduce and illustrate the issue. Assuming that a Cadence cluster is serving a cadence-frontend on localhost:7833 the following steps can be used to reproduce the issue:

  1. Start a workflow worker:
    go run ./worker/
  2. Execute the parent workflow with a 60s start-to-close timeout:
    go run ./client/

After one minute the client will fail (as expected):

{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"starting workflow ..."}
{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"awaiting workflow 1733214402 ..."}
{"level":"error","error":"TimeoutType: START_TO_CLOSE","time":"2024-12-03T09:27:42+01:00","message":"parent workflow execution failed: TimeoutType: START_TO_CLOSE (*internal.TimeoutError)\n&internal.TimeoutError{timeoutType:0, details:internal.ErrorDetailsValues(nil)}"}

Looking at the worker output, we will see that the child workflow failed (as expected):

{"level":"error","error":"CanceledError","time":"2024-12-03T09:27:42+01:00","message":"ChildWorkflow.Run: doActivity failed: CanceledError (*internal.CanceledError)\n&internal.CanceledError{details:internal.ErrorDetailsValues(nil)}"}

Similarly, we will see that the activity timed out (as expected):

{"level":"error","error":"context canceled","time":"2024-12-03T09:27:56+01:00","message":"ChildWorkflow.doActivity: context done: context canceled (*errors.errorString)"}

But, and here is the big BUT, looking at the debug output from pprof on http://localhost:6060/debug/pprof/goroutine?debug=1 we will see that the parent workflow still hangs on the Get call. Something like:


1 @ 0x47672e 0x40b8dc 0x40b492 0xbf87df 0xbf780f 0xbf7787 0xbfc3f5 0xedae65 0x4d26c6 0x4d17d9 0xc03cb9 0xbeea52 0xbf6ecc 0xbf9119 0x47e781
# 0xbf87de  go.uber.org/cadence/internal.(*coroutineState).initialYield+0x5e      /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:798
# 0xbf780e  go.uber.org/cadence/internal.(*coroutineState).yield+0x22e        /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:808
# 0xbf7786  go.uber.org/cadence/internal.(*channelImpl).Receive+0x1a6       /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:623
# 0xbfc3f4  go.uber.org/cadence/internal.(*decodeFutureImpl).Get+0x54       /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:1301
# 0xedae64  main.(*ParentWorkflow).Run+0x144              /home/peterg/dev/cadence-bug/worker/parent_workflow.go:51

That call never returns, which I think violates the intended semantics (in the documentation) and results in a resource leak.

Expected behavior
I would expect the Get call made by the (timed out) parent workflow to eventually return with TimeoutError.

@Groxx
Copy link
Member

Groxx commented Dec 5, 2024

Excellent find, thank you! We've been attempting to figure out a(?) leak for a while, but all the code it has been hitting internally have been incredibly complicated and nearly impossible to reproduce.

I'll see what I can find here ASAP. Child workflow state handling is one of the more complicated and buggier sections of the core logic, so it seems plausible and this is by far the best lead I've seen. Hopefully not too many distractions before I get to it :)

@Groxx
Copy link
Member

Groxx commented Dec 7, 2024

Ah. This one is basically known, and kinda sorta "intended". What's happening is:

  • 0s: parent starts with an execution timeout of 1m
  • 0s: child starts with an execution timeout of 5m
  • 0s: child starts activity for ~1m30s
  • 1m: parent times out, which tells the child to cancel
    • this is a server-side only event, and it doesn't trigger any "work" (because it's a hard-terminate due to timeout, not a cooperative "stop soon" like cancellation).
  • 1m: child told to cancel:
    • child tells activity to cancel
    • child's activity's future.Get returns immediately, because it isn't waiting for the activity's cancellation
    • child's workflow function returns
  • 1m: child workflow ends ^ because the function returned
  • ~1m30s: child's activity's context is canceled due to the deadline:
    • the select{...} unblocks
    • the activity's function returns
    • this fails to record because the workflow that started it is already completed and won't be modified any more, which is why you see errors like "WorkerType":"LocallyDispatchedActivityWorker","error":"EntityNotExistsError{Message: Workflow execution already completed., CurrentCluster: , ActiveCluster: }"} at this poin

The activity and child workflow are ending "normally", i.e. their functions are returning and they are returning values, so their call stack / program counter / etc have gone away naturally.

The parent workflow however is timing out and ending server-side, which stops everything...
... including new decision tasks. Because all history changes end immediately.

So nothing ever actually communicates back to the worker that it's "done". So it never finishes.

This workflow is still cached "normally" though, and when the "sticky workflow cache" runs out of space it'll evict in LRU order.
When this workflow is evicted, that internal.(*coroutineState).initialYield call will unblock, because it'll push this func into that channel:

func (s *coroutineState) exit() {
if !s.closed {
s.unblock <- func(status string, stackDepth int) bool {
runtime.Goexit()
return true
}
}
}

which executes it here:
f := <-s.unblock
keepBlocked = f(status, stackDepth+1)

and that'll tear down all of the workflow's goroutines (in this case only one) due to calling runtime.Goexit().

You can simulate all this much more easily by setting a sticky cache size of like 2 (iirc it kinda sorta behaves like "N-1", so I just avoid 1), and using a single workflow like this:

func work(ctx workflow.Context) error {
  workflow.GetSignalChannel(ctx, "ignored").Receive(ctx, nil) // and never send a signal
  return nil
}

and letting it time out. It only needs to be a couple seconds. Regardless of how many you start and with what timing, you shouldn't see the stuck-goroutines exceed the cache size.

You can also / probably usually want to simply avoid this, by making sure your workflows end normally before their hard cutoff time. In this case that'd mean keeping your child timeout shorter than your remaining parent time. This is also necessary for having the parent record the child's end state (if that's useful, e.g. if you want to do something with it or report it to parent-workflow observers).


So both good and bad news, and both unavoidable details and things we can / should obviously improve:

Good news: this doesn't actually block forever. Just until your worker has handled tasks from (by default) 10,000 other workflows.

Bad news: yes that's often quite a while later. More than long enough for it to be confusing and/or cause problems, e.g. if each one holds a lot of memory.

Unavoidable: even if we did schedule a decision task to "clean up", it wouldn't be guaranteed to clean up, because the worker may have lost its claim to the cache. This can happen if the worker is too slow to pick up the "next" task -> the server gives up on the sticky cache attempt and sends to any random worker -> now there's an "old" goroutine on the old worker that nothing remembers, and a "new" one somewhere else. We can't really hold "try to send cleanup tasks to every worker ever, forever" so at some point it has to give up and stuff can be left dangling.

Could improve:

  1. We could obviously try to dispatch a cleanup-task, and that'd work great for most short-lived things like this, and maybe like 99% of all workflows in aggregate (widely varying by user patterns tho).
    • I'm not sure if we have any way to actually distribute this task though, since the workflow has ended, and that basically always means "stop everything"... so this might not be as easy as it sounds.
  2. We've talked internally about building some kind of periodic "is this workflow still alive / am I an old worker?" check, and it'd be able to take care of this dangling cache eventually. This would probably have to be on the order of minutes, but that's much better than hours/days/etc later.
    • I kinda hope we do this eventually, because it's the only real way to clean up cases where workers lose their sticky claim (as workflow timeouts could be months away, and 3 below would be ineffective).
  3. We could have a client-side timeout that matches the server-side timeout, plus/minus some fuzz-factor to avoid clock skew.
    • I suspect we'll do this at some point, but for safety it'll probably ignore like an hour of skew (likely configurable), so small samples like this will still show it.

Sad news: unfortunately this wasn't the leak I was hoping for. We appear to have a leak somewhere else that is truly leaking, e.g. can lead to > sticky cache size top-level goroutines hanging around. AFAICT this isn't related tho.

@petergardfjall
Copy link
Author

petergardfjall commented Dec 9, 2024

Thanks for the thorough explanation!

The conclusion is a bit of a letdown though I must say.

We have some quite long-running parent workflows (they are normally limited to around a week) that we would like to guarantee cleanup after. It would be ideal if that could happen in the workflow code itself. I was hoping to catch a TimedoutError and then perform the cleanup in a detached workflow context.

I somehow feel like there is a lack of symmetry in the API semantics. "Everything" (caller, child workflow, activity) behaves as one would intuitively expect, except for the parent workflow which just hangs ...
Or, as you point out, technically it blocks (for a very long time, for some 10000 workflows to pass through the worker which would take a couple of months probably in our use case).
In practice that's a hang for our use-case at least, and generally speaking waiting for 10000 goroutines to build up before cleanup can take place sounds like a good way of ensuring that cleanup will never take place (the likelihood of the worker being restarted, for example following a rolling update, is significant, and I suppose then all hope is lost of a cleanup action?).

So I'm wondering what we can do to work around this issue.

From your reasoning it appears that the problem is that the parent workflow times out only on the cadence server, and the worker never hears again from the server. So can we somehow force server communication to wake up the timed out workflow? What about a workflow.Timer and a Select?

  future := workflow.ExecuteChildWorkflow(childCtx, wf.childWorkflow.Run)

  var wfErr error
  sel := workflow.NewSelector(wfCtx)
  sel.AddFuture(future, func(future workflow.Future) {
    // Workflow ended normally ...
  })

  timeout := time.Duration(workflow.GetInfo(wfCtx).ExecutionStartToCloseTimeoutSeconds)*time.Second - 1*time.Second
  sel.AddFuture(workflow.NewTimer(wfCtx, timeout), func(f workflow.Future) {
    // Workflow timed out
    // ... handle cleanup with a detached context ...
  })
  sel.Select(wfCtx)

Or is that still not guarenteed to work? Could Cadence being under pressure or the worker being late to pick up the timer result still lead to everything timing out only at the server?

Is there some prior art in this area? Perhaps even an idiomatic way of handling cleanup for a timed out long-running workflow? (ideally without resorting to some janitor goroutine that needs to query cadence for workflows)

@tfcace
Copy link

tfcace commented Dec 9, 2024

Will attempt to add my 2 cents after reading this (quite fascinating) issue.

Would it make sense to borrow the "Heartbeat" semantics to child-Workflows as well, in the context of the worker cache?

I haven't looked in the code, but... can the worker associate a cache entry with a heartbeat for child workflows?
The only difference would be that the heartbeat check is done against the Cadence server?
If there's no workflow that's deemed "running" on the server (for example, it timed out), evict a cache entry?

Sorry in advance if this was a brain fart.

@Groxx
Copy link
Member

Groxx commented Dec 9, 2024

I was hoping to catch a TimedoutError and then perform the cleanup in a detached workflow context.

You can, you just can't do anything after your execution timeout - it's your hard cutoff, not a cooperative cancel / canceled context. Otherwise we wouldn't have a hard cutoff and everything could run for infinite time.
The easy option is to take your execution timeout and subtract [some time] and cancel your own context at that time. Then either use your original not-canceled context or create a NewDisconnectedContext if you're stuck deep in some call stack.

I think we might be missing a func workflow.With{Deadline,Timeout}(workflow.Context, ...) workflow.Context (that seems like an obvious API worth adding), but it's not hard to build:

ctx, cancel := workflow.WithCancel(ctx)
workflow.Go(func(ctx) {
  workflow.Sleep(ctx, timeout)
  cancel()
  // ignoring ^ sleep's err return is fine, it'll only err
  // if the ctx is canceled, and this is just canceling the ctx.
  // either way it's canceled at the right time or does nothing.
})
// this is essentially the same cost as a WithDeadline would be, it's totally fine

Basically just get your workflowCtx.Deadline() (every top-level workflow context will have a deadline... and hopefully disconnected too since it's a hard deadline, but I'm not sure OTOH) and sleep for deadline.Sub(workflow.Now()).Sub(time.Hour) to ensure you have an hour to do cleanup. Or a few days if you want to be resistant to short outages. You'll still be able to find ones that actually timed out, because they'll be "real" timeouts (Timed Out state) instead of normal success/fail exits.

When doing this though, I will broadly caution/remind that cancellation is cooperative. Missing a ctx.Done() channel read in selectors is kinda common:

// normal go
func thing(ctx) {
  select {
  case <-doSomething:
  case <-etc:
  }
  // no `case <-ctx.Done()` so this does not unblock when canceled,
  // and might never unblock if other channels don't receive anything
}

// cadence
func thing(ctx) {
  // cancel ctx after 3h
  workflow.NewSelector(ctx).
    AddReceive(workflow.GetSignalChannel("something"), ...).
    AddReceive(workflow.GetSignalChannel("etc"), ...).
    Select(ctx)
  // same thing, nothing about ^ that says to unblock when canceled,
  // so it does not unblock, and probably locks up until ExecutionTimeout
}

This a bit of an awkward corner with Go because we need ctx to be able to unblock Select (and similar things) when appropriate, but we do not implicitly respect cancellation here to be more like plain select statements. So it might be tickling those "ctx arg means cancelable" feelings and be easy to miss even though it isn't true. More details: https://pkg.go.dev/go.uber.org/cadence/workflow#hdr-Canceled_workflow_Context_behavior


(the likelihood of the worker being restarted, for example following a rolling update, is significant, and I suppose then all hope is lost of a cleanup action?)

As long as it's in the workflow, no, it'll resume just like any other workflow code :) Hence why you can't use the hard cutoff / ExecutionTimeout for it. It'd be unreliable.


So can we somehow force server communication to wake up the timed out workflow?

Not with ExecutionTimeout, but internally yes. Like that timer + select.

Or is that still not guarenteed to work? Could Cadence being under pressure or the worker being late to pick up the timer result still lead to everything timing out only at the server?

We can't guarantee there isn't some many-hours backlog or multi-day service outage (yours or Cadence), or some kind of deadlock in your code, so in that sense no. It's why ExecutionTimeout exists: it stops work no matter what. If you need time for cleanup, you need to do it within that time / expand the ExecutionTimeout long enough to ensure you have the time you need to make that stuff happen.


As far as the implied "this is confusing, can it be improved" question... yes probably :)

Both "soft timeout" and a "hard timeout" are useful and fairly common across languages (Go context deadlines and Java thread interrupts, vs a timer for total process death) and it makes sense to streamline it. In Go that would probably just cancel your context at [a time], because that's the normal "stop work" signal for Go.
And yea, in retrospect it is a bit of a surprise that we don't have that already.

Client-side-only that's not too hard to build, e.g. a workflow interceptor could ensure you have an hour to clean up on everything with the ^ sample above, and that'll work today (with versioning). I think it might make more sense to coordinate for a server-side CancelWorkflow timer though, and it can just have a custom cause/reason/details 🤔 I'm not sure how much work that'd take though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants