Long-running workflow with a promise is being unexpectedly suspended after 60 seconds #35

Open
pstemporowski opened this issue Dec 1, 2024 · 6 comments

pstemporowski commented Dec 1, 2024

I am using a promise in a long-running workflow to wait until it gets resolved or cancelled. However, the workflow in Rust is consistently being cancelled exactly one minute after it starts. The equivalent minimal example in TypeScript (see the section at the bottom) works as intended.

Error logs:

restate_sdk::endpoint::futures::handler_state_aware: Error while processing handler Suspended rpc.system="restate" rpc.service=Subscriber rpc.method=run

restate_sdk::hyper: Handler failure: Error(Suspended)

Minimal Reproduction Code

Here is a minimal code example to reproduce the issue.
Restate SDK Version: 0.3

Steps to Reproduce

  1. Run the Rust code below.
  2. Start the workflow by invoking the run handler.
  3. Wait for longer than 60 seconds without resolving the promise.
  4. Observe the workflow being cancelled with the logged error.

Rust (Not working)

use log::info;
use restate_sdk::errors::HandlerResult;
use restate_sdk::prelude::{
    ContextPromises, Endpoint, HttpServer, SharedWorkflowContext, WorkflowContext,
};

#[restate_sdk::workflow]
pub(crate) trait LongRunningWorkflow {
    async fn run() -> HandlerResult<()>;
    #[shared]
    async fn resolve_promise() -> HandlerResult<()>;
}

pub struct LongRunningWorkflowImpl;

impl LongRunningWorkflow for LongRunningWorkflowImpl {
    async fn run(&self, ctx: WorkflowContext<'_>) -> HandlerResult<()> {
        info!("Started long running workflow");

        let some_promise = ctx.promise::<bool>("somePromise").await?;

        info!("Promise resolved: {}", some_promise);

        Ok(())
    }

    async fn resolve_promise(&self, ctx: SharedWorkflowContext<'_>) -> HandlerResult<()> {
        info!("Resolving promise");
        ctx.resolve_promise::<bool>("somePromise", true);

        Ok(())
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let endpoint = Endpoint::builder()
        .bind(LongRunningWorkflowImpl.serve())
        .build();

    HttpServer::new(endpoint)
        .listen_and_serve("0.0.0.0:9080".parse().unwrap())
        .await;
}

TypeScript (Working)

import * as restate from "@restatedev/restate-sdk";
import type { WorkflowContext, WorkflowSharedContext } from "@restatedev/restate-sdk";

const longrunningWorkflow = restate.workflow({
  name: "longrunning",
  handlers: {
    run: async (ctx: WorkflowContext) => {
      console.info("Started long running workflow");

      const somePromise = await ctx
        .promise<boolean>("somePromise");

      console.info(`Promise resolved: ${somePromise}`);
    },
    resolvePromise: async (ctx: WorkflowSharedContext) => {
      console.info("Resolving promise");
      // resolve() returns a promise; await it so a failure surfaces in the handler
      await ctx
        .promise<boolean>("somePromise")
        .resolve(true);
    }
  },
});

export type LongrunningWorkflow = typeof longrunningWorkflow;

restate.endpoint().bind(longrunningWorkflow).listen();

Restate on Docker Compose

version: '3.8'

services:
  restate_server:
    image: docker.io/restatedev/restate:1.1
    ports:
      - "8080:8080"
      - "9070:9070"
      - "9071:9071"
    extra_hosts:
      - "host.docker.internal:host-gateway"

slinkydeveloper (Collaborator) commented Dec 2, 2024

What you see there is not the workflow being cancelled, but being suspended. This is expected when the handler is waiting on an operation before it can make progress. Restate will resume the workflow from the point where it left off as soon as it can make further progress!
Check out this page https://docs.restate.dev/concepts/durable_execution to learn more.

You can increase that "1 minute" timeout for all services by tuning the inactivity timeout in the runtime configuration (see https://docs.restate.dev/operate/configuration/server/#configuration-file). From 1.2 you'll be able to tune this timeout on a per-service basis.

pstemporowski (Author) commented

You're absolutely right. However, how is it possible that with the same server configuration, TypeScript is not getting suspended? Additionally, I have a scenario where a service could potentially run for hours, days or even weeks. How should I handle this? Should I consider disabling the timeout entirely in such cases? When will the 1.2 be out then?

pstemporowski changed the title from "Long-running workflow with a promise is being unexpectedly canceled after 60 seconds." to "Long-running workflow with a promise is being unexpectedly suspended after 60 seconds" on Dec 2, 2024

slinkydeveloper (Collaborator) commented

> TypeScript is not getting suspended?

I'm not 100% sure about this TBH, but it looks like the Rust SDK is actually behaving correctly here. Could it be that the promise was already resolved? Or has been resolved within the 60 seconds time bound?

> Additionally, I have a scenario where a service could potentially run for hours, days or even weeks. How should I handle this?

This is one of the features Restate gives you. When a handler waits for days, Restate closes the physical request between your service and the Restate server. When the promise you are awaiting gets resolved, Restate "resumes" the request from the point where it left off. So you don't really need to do anything about it, Restate takes care of all of this for you! This blog post gives a bit more detail as well: https://restate.dev/blog/we-replaced-400-lines-of-stepfunctions-asl-with-40-lines-of-typescript-by-making-lambdas-suspendable/
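
To illustrate the suspend-and-resume behavior described above, here is a toy model in plain Rust. It does not use the Restate SDK, and `ToyServer`, `Attempt`, `invoke`, and `resolve` are hypothetical names invented for this sketch. It only assumes that completed steps are recorded in a journal and that re-invoking the handler replays them instead of re-executing them.

```rust
use std::collections::HashMap;

/// Outcome of one attempt to run the handler (hypothetical type).
#[derive(Debug, PartialEq)]
enum Attempt {
    /// The handler ran to the end.
    Completed(String),
    /// The handler cannot make progress yet; this is not a failure.
    Suspended,
}

/// Toy stand-in for the state a durable execution server would persist.
#[derive(Default)]
struct ToyServer {
    journal: Vec<String>,            // results of already completed steps
    promises: HashMap<String, bool>, // promises that have been resolved
    step_executions: u32,            // how often the first step really ran
}

impl ToyServer {
    /// Run (or resume) the handler from the top, replaying journaled steps.
    fn invoke(&mut self) -> Attempt {
        // Step 1: executed once, then replayed from the journal on resume.
        if self.journal.is_empty() {
            self.step_executions += 1;
            self.journal.push("started".to_string());
        }
        // Step 2: wait on the promise; suspend while it is unresolved.
        match self.promises.get("somePromise") {
            Some(value) => {
                Attempt::Completed(format!("{} -> promise = {}", self.journal[0], value))
            }
            None => Attempt::Suspended,
        }
    }

    /// Resolve a promise, which makes a later `invoke` able to finish.
    fn resolve(&mut self, key: &str, value: bool) {
        self.promises.insert(key.to_string(), value);
    }
}

fn main() {
    let mut server = ToyServer::default();
    println!("first attempt: {:?}", server.invoke());
    server.resolve("somePromise", true);
    println!("after resolve: {:?}", server.invoke());
    println!("first step ran {} time(s)", server.step_executions);
}
```

In this model the `Suspended` outcome corresponds to the `Error(Suspended)` log line from the report: the attempt ends, but the workflow is still alive and finishes on a later attempt.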

slinkydeveloper (Collaborator) commented

> When will the 1.2 be out then?

We don't know yet, probably at the beginning of next year.

pstemporowski (Author) commented Dec 2, 2024

Yeah that makes sense.

> Could it be that the promise was already resolved? Or has been resolved within the 60 seconds time bound?

Nah, I don't think so. It's a similar implementation, just in TS.

I have another use case involving jobs (workflows in Restate) that should never be suspended. These jobs function as subscriptions using an unbounded mpsc channel from Tokio, designed to run indefinitely until explicitly canceled by a user. Is it possible to set suspension limits to months or even years? Do you think that restate could handle this? Now or in 1.2?

slinkydeveloper (Collaborator) commented

> I have another use case involving jobs (workflows in Restate) that should never be suspended. These jobs function as subscriptions using an unbounded mpsc channel from Tokio, designed to run indefinitely until explicitly canceled by a user.

Have you considered using Restate's promises/awakeables for the same purpose? Or have you considered subscribing to the Tokio mpsc channel from an ad-hoc, regular Tokio task and calling the Restate handler from there? Could you elaborate on your use case?
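
The second suggestion could be sketched roughly as follows. This is an illustration under assumptions, not Restate API: it uses `std::sync::mpsc` and a plain thread in place of Tokio so that it runs without dependencies, and `call_handler` is a hypothetical stand-in for whatever actually invokes the Restate handler (for example an HTTP request to the ingress).

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for invoking a Restate handler with one event.
/// Here it only records the call so the sketch stays self-contained.
fn call_handler(log: &mut Vec<String>, event: &str) {
    log.push(format!("handled {}", event));
}

/// Drain the channel in a background thread that lives outside any handler
/// and forward each event. The loop ends once every sender is dropped,
/// which plays the role of the user cancelling the subscription.
fn spawn_forwarder(rx: mpsc::Receiver<String>) -> thread::JoinHandle<Vec<String>> {
    thread::spawn(move || {
        let mut log = Vec::new();
        for event in rx {
            call_handler(&mut log, &event);
        }
        log
    })
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let forwarder = spawn_forwarder(rx);

    for i in 1..=3 {
        tx.send(format!("event-{}", i)).unwrap();
    }
    drop(tx); // cancel the subscription

    let log = forwarder.join().unwrap();
    println!("{:?}", log);
}
```

With this shape, the long-lived subscription is owned by an ordinary task, and each handler invocation is short, so suspension and replay never have to account for the in-memory channel.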

> Is it possible to set suspension limits to months or even years? Do you think that restate could handle this? Now or in 1.2?

You can technically set any inactivity timeout you want, up to the max duration. But I'm not sure it makes sense to do so, because if the connection crashes and the handler replays, what would be the expected behavior on that subscription?
