Long-running workflow with a promise is being unexpectedly suspended after 60 seconds #35

pstemporowski · 2024-12-01T07:57:35Z

I am using a promise in a long-running workflow to wait until it gets resolved or cancelled. However, the Rust workflow is consistently being cancelled exactly one minute after it starts. The equivalent minimal example in TypeScript (see the section at the bottom) works as intended.

Error logs:

restate_sdk::endpoint::futures::handler_state_aware: Error while processing handler Suspended rpc.system="restate" rpc.service=Subscriber rpc.method=run

restate_sdk::hyper: Handler failure: Error(Suspended)

Minimal Reproduction Code

Here is a minimal code example to reproduce the issue
Restate SDK Version: 0.3

Steps to Reproduce

1. Run the Rust code below.
2. Start the workflow with run.
3. Wait for longer than 60 seconds without resolving the promise.
4. Observe the workflow being cancelled with the logged error.

Rust (Not working)

use log::info;
use restate_sdk::errors::HandlerResult;
use restate_sdk::prelude::{
    ContextPromises, Endpoint, HttpServer, SharedWorkflowContext, WorkflowContext,
};

#[restate_sdk::workflow]
pub(crate) trait LongRunningWorkflow {
    async fn run() -> HandlerResult<()>;
    #[shared]
    async fn resolve_promise() -> HandlerResult<()>;
}

pub struct LongRunningWorkflowImpl;

impl LongRunningWorkflow for LongRunningWorkflowImpl {
    async fn run(&self, ctx: WorkflowContext<'_>) -> HandlerResult<()> {
        info!("Started long running workflow");

        let some_promise = ctx.promise::<bool>("somePromise").await?;

        info!("Promise resolved: {}", some_promise);

        Ok(())
    }

    async fn resolve_promise(&self, ctx: SharedWorkflowContext<'_>) -> HandlerResult<()> {
        info!("Resolving promise");
        ctx.resolve_promise::<bool>("somePromise", true);

        Ok(())
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let endpoint = Endpoint::builder()
        .bind(LongRunningWorkflowImpl.serve())
        .build();

    HttpServer::new(endpoint)
        .listen_and_serve("0.0.0.0:9080".parse().unwrap())
        .await;
}

TypeScript (Working)

import * as restate from "@restatedev/restate-sdk";
import type { WorkflowContext, WorkflowSharedContext } from "@restatedev/restate-sdk";

const longrunningWorkflow = restate.workflow({
  name: "longrunning",
  handlers: {
    run: async (ctx: WorkflowContext) => {
      console.info("Started long running workflow");

      const somePromise = await ctx
        .promise<boolean>("somePromise");

      console.info(`Promise resolved: ${somePromise}`);
    },
    resolvePromise: (ctx: WorkflowSharedContext) => {
      console.info("Resolving promise");
      ctx
        .promise<boolean>("somePromise")
        .resolve(true);
    }
  },
});

export type LongrunningWorkflow = typeof longrunningWorkflow;

restate.endpoint().bind(longrunningWorkflow).listen();

Restate on docker compose

version: '3.8'

services:
  restate_server:
    image: docker.io/restatedev/restate:1.1
    ports:
      - "8080:8080"
      - "9070:9070"
      - "9071:9071"
    extra_hosts:
      - "host.docker.internal:host-gateway"

slinkydeveloper · 2024-12-02T08:22:53Z

What you see there is the workflow being suspended, not cancelled. This is expected when the handler is waiting on an operation before it can make progress: Restate will resume the workflow from the point where it left off as soon as it can make further progress!
Check out this page https://docs.restate.dev/concepts/durable_execution to learn more.

You can tune that "1 minute" timeout, making it larger, for all services in the runtime by tuning the inactivity timeout (see https://docs.restate.dev/operate/configuration/server/#configuration-file). From 1.2 you'll be able to tune this timeout on a per-service basis.

pstemporowski · 2024-12-02T09:20:40Z

You're absolutely right. However, how is it possible that with the same server configuration, TypeScript is not getting suspended? Additionally, I have a scenario where a service could potentially run for hours, days or even weeks. How should I handle this? Should I consider disabling the timeout entirely in such cases? When will the 1.2 be out then?

slinkydeveloper · 2024-12-02T11:40:44Z

> TypeScript is not getting suspended?

I'm not 100% sure about this TBH, but it looks like the Rust SDK is actually behaving correctly here. Could it be that the promise was already resolved? Or that it was resolved within the 60-second time bound?

> Additionally, I have a scenario where a service could potentially run for hours, days or even weeks. How should I handle this?

This is one of the features Restate gives you. When waiting for days, Restate will close the physical request between your service and the Restate server, and when the promise you await there is resolved, Restate will "resume" the request from the point where it left off. So you don't really need to do anything about it, Restate takes care of all of this for you! This blog post perhaps gives you a bit more detail as well: https://restate.dev/blog/we-replaced-400-lines-of-stepfunctions-asl-with-40-lines-of-typescript-by-making-lambdas-suspendable/
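
As a rough illustration of the suspend-and-resume cycle described above, here is a toy model in plain Rust. It is not the Restate SDK or its real protocol: `ToyRuntime`, `Attempt`, and their methods are hypothetical names made up for this sketch, and the only thing borrowed from the issue is the promise name `somePromise`. The point it tries to show is that a `Suspended` outcome is not a failure: the attempt just ends, and a later attempt picks up the recorded promise value and completes.

```rust
use std::collections::HashMap;

/// Outcome of a single attempt to run the handler (hypothetical type).
#[derive(Debug, PartialEq)]
enum Attempt {
    /// The handler is waiting on a promise that nobody has resolved yet.
    Suspended,
    /// The handler ran to completion and observed this promise value.
    Completed(bool),
}

/// Toy stand-in for the durable state kept on the server side for one workflow.
#[derive(Default)]
struct ToyRuntime {
    /// Promise values recorded so far, keyed by promise name.
    promises: HashMap<String, bool>,
    /// How many times the `run` handler has been attempted.
    attempts: u32,
}

impl ToyRuntime {
    /// Plays the role of the shared `resolve_promise` handler: store the value.
    fn resolve_promise(&mut self, name: &str, value: bool) {
        self.promises.insert(name.to_string(), value);
    }

    /// One attempt of the `run` handler. If the promise is still pending the
    /// attempt ends as `Suspended`; nothing is lost, and a later attempt
    /// completes once the value has been recorded.
    fn run(&mut self) -> Attempt {
        self.attempts += 1;
        match self.promises.get("somePromise") {
            Some(value) => Attempt::Completed(*value),
            None => Attempt::Suspended,
        }
    }
}
```

In this model, calling `run` before `resolve_promise` yields `Attempt::Suspended`, and calling `run` again afterwards yields `Attempt::Completed(true)`. The real server decides when to re-invoke the handler; here the caller does it by hand.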

slinkydeveloper · 2024-12-02T11:41:14Z

> When will the 1.2 be out then?

We don't know yet, probably beginning of next year

pstemporowski · 2024-12-02T17:35:14Z

Yeah that makes sense.

> Could it be that the promise was already resolved? Or has been resolved within the 60 seconds time bound?

Nah, I don't think so. It's a similar implementation, only in TS.

I have another use case involving jobs (workflows in Restate) that should never be suspended. These jobs function as subscriptions using an unbounded mpsc channel from Tokio, designed to run indefinitely until explicitly canceled by a user. Is it possible to set suspension limits to months or even years? Do you think that restate could handle this? Now or in 1.2?

slinkydeveloper · 2024-12-03T08:19:24Z

> I have another use case involving jobs (workflows in Restate) that should never be suspended. These jobs function as subscriptions using an unbounded mpsc channel from Tokio, designed to run indefinitely until explicitly canceled by a user.

Have you considered using Restate's promises/awakeables for the same purpose? Or have you considered subscribing to the Tokio mpsc channel from an ad-hoc "regular Tokio task", and calling the Restate handler from there? Could you perhaps elaborate on your use case?
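
To make the second suggestion a bit more concrete, here is a minimal sketch of the shape it could take. To stay dependency-free it uses `std::sync::mpsc` and a plain thread in place of Tokio's channel and task, and `forward_events` and `call_handler` are hypothetical names: in a real setup `call_handler` would be whatever client code invokes the Restate handler, which is not shown here. The subscription loop lives outside any Restate handler, so no handler has to stay open indefinitely.

```rust
use std::sync::mpsc::Receiver;
use std::thread::{self, JoinHandle};

/// Drains `rx` on a background thread and passes every event to
/// `call_handler` (a hypothetical stand-in for invoking the Restate handler).
/// The thread stops once all senders are dropped and returns the number of
/// events it forwarded.
fn forward_events<F>(rx: Receiver<String>, mut call_handler: F) -> JoinHandle<usize>
where
    F: FnMut(String) + Send + 'static,
{
    thread::spawn(move || {
        let mut forwarded = 0;
        // Iterating a Receiver blocks for the next event and ends when the
        // channel is closed, i.e. when the subscription is cancelled.
        for event in rx {
            call_handler(event);
            forwarded += 1;
        }
        forwarded
    })
}
```

With this split, each event becomes a short handler invocation, and cancelling the subscription amounts to dropping the sending side of the channel.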

> Is it possible to set suspension limits to months or even years? Do you think that restate could handle this? Now or in 1.2?

You can technically set any inactivity timeout you want, up to the max duration. But I'm not sure it makes sense to do so, because if the connection crashes and the handler replays, what would be the expected behavior on that subscription?

pstemporowski changed the title ~~Long-running workflow with a promise is being unexpectedly canceled after 60 seconds.~~ Long-running workflow with a promise is being unexpectedly suspended after 60 seconds Dec 2, 2024