First collection of Restate patterns #49

Merged
StephanEwen merged 10 commits into restatedev:main on Dec 5, 2023

Conversation

StephanEwen
Contributor

No description provided.

Contributor
@tillrohrmann left a comment

Thanks a lot for creating the patterns @StephanEwen. They are really helpful :-) I've got a question concerning the locking pattern: is there an unkeyed service missing to make it work correctly?

Comment on lines 48 to 52
acquireBlocking: async (ctx: restate.RpcContext, lockId: string): Promise<string> => {
  const awakeable = ctx.awakeable<string>();
  ctx.send(lockServiceApi).acquireAsync(lockId, awakeable.id);
  return awakeable.promise;
},
Contributor

To my understanding, this call should not work since an Awakeable is bound to a journal and we are not releasing the lock until the awakeable is fulfilled (assuming that returning a promise in the TS SDK awaits its completion). I think the problem is that acquireAsync is defined on the same service.

Contributor Author

Oh, yes. I actually had it like that before and then thought the two services could be merged after all, but you are right, I need to change it back.
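
For reference, here is a minimal editorial sketch of the two-service split being described here, with acquireBlocking moved to a separate unkeyed service. Only the acquireBlocking body is taken from the snippet above; the router/ServiceApi wiring, the service path, and the resolveAwakeable hint are assumptions about the 2023-era TypeScript SDK, not code from this PR.

import * as restate from "@restatedev/restate-sdk";

// Keyed lock service: one key per lockId, so requests for the same lock are serialized.
const lockService = restate.keyedRouter({
  acquireAsync: async (ctx: restate.RpcContext, lockId: string, awakeableId: string) => {
    // Hand out the lock or queue the awakeable id in state; once the lock is free,
    // resolve the awakeable (e.g. via ctx.resolveAwakeable(awakeableId, lockId)).
  },
});

// Hypothetical API handle for the keyed service, as used by ctx.send(...) above.
const lockServiceApi: restate.ServiceApi<typeof lockService> = { path: "lockservice" };

// Unkeyed client service: it creates the awakeable, forwards the request, and then
// suspends on the awakeable. Because it runs in its own journal, separate from the
// keyed lock service, this suspension does not block acquireAsync from executing.
const lockClient = restate.router({
  acquireBlocking: async (ctx: restate.RpcContext, lockId: string): Promise<string> => {
    const awakeable = ctx.awakeable<string>();
    ctx.send(lockServiceApi).acquireAsync(lockId, awakeable.id);
    return awakeable.promise;
  },
});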

typescript/patterns/src/cross_db_transactions.ts (outdated thread, resolved)
"type": "commonjs",
"scripts": {
"build": "tsc --noEmitOnError"
},
Contributor

The patterns don't seem to be integrated into CI, which runs verify for all TS examples. To ensure that the patterns don't break, it would be great to integrate them: https://github.com/restatedev/examples/blob/main/.github/workflows/test.yml

"devDependencies": {
"typescript": "^5.0.2"
}
}
Contributor

Should we also add a README.md listing the different patterns?

Contributor Author

Later, once this has converged.

typescript/patterns/src/xa_transactions.ts (outdated thread, resolved)

prepareTxn(txnId: string): Promise<void> // prepared txn for commit under given ID
commitPreparedTxn(txnId: string): Promise<void> // commits a txn prepared under given ID
rollbackPreparedTxn(txnId: string): Promise<void> // rolls bacl a txn prepared under given ID
Contributor

Suggested change
rollbackPreparedTxn(txnId: string): Promise<void> // rolls bacl a txn prepared under given ID
rollbackPreparedTxn(txnId: string): Promise<void> // rolls back a txn prepared under given ID

typescript/patterns/src/xa_transactions.ts (outdated thread, resolved)

if (commitDecision.commit) {
  // now we need to ensure this one gets committed
  await ctx.sideEffect(() => database.commitPreparedTxn(txnId));
Contributor

Can the commitPreparedTxn call ever fail permanently (so no amount of retries would let it succeed)?

Contributor Author

It probably can, yes. This whole code is very PoC-ish: the basic pattern is in place, but detailed handling of the various possible errors is critical here.

} else {
  // we clean up this query. it might be that this query was never prepared, but
  // that does not matter here
  await ctx.sideEffect(() => database.rollbackPreparedTxn(txnId));
Contributor

What happens in the following scenario?

Assume a first call manages to persist its txnId and then pauses before running the weRunTheQuery if block. Now Restate retries the invocation. The retry attempt reaches this point and rolls back a prepared txn that hasn't been created yet. Then the original invocation continues running and creates a prepared txn, which acquires all the locks. If the second invocation attempt then enters the weRunTheQuery block and runs runSql, what would happen? Would the runSql execution fail because it cannot acquire the locks, or would it wait on acquiring them?

Contributor Author

You are right, this still seems possible.

Contributor
@pcholakov left a comment

The XA example is definitely a bit difficult to follow! Maybe that's inherent though.

Comment on lines 67 to 70
const txnId = await ctx.sideEffect(() => {
  weRunTheQuery = true;
  return Promise.resolve(randomUUID());
});
Contributor

I'm wondering if the state of the transaction isn't better off persisted as a state key instead? I worry that we're training people to "leak" state past the Restate API with this approach.

I'd like to offer simple guidance to customers on how to write side effect bodies: the functions must communicate with their containing scope only via read-only parameters and their return values.

In the future we might want to expose a version of sideEffect that returns some extra metadata. I don't know whether it's even a great idea to expose whether the side effect committed for the first time or whether this is a replay execution, but that could avoid having to use a "side channel" variable. I imagine the extra metadata would be very useful for other purposes though, e.g. we could return the journal entry's sequence number (SN), as a source of guaranteed monotonic SNs is super useful in almost any transactional system!

In the meantime, maybe something like this would be clearer?

let transactionId = await ctx.get<string>("transactionId");
if (!transactionId) {
  transactionId = ctx.rand.uuidv4();
  ctx.set("transactionId", transactionId);
}
...

Not 100% sure this is semantically equivalent – working through an updated version myself – but I find it much easier to follow. 😊
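
A minimal editorial illustration of the guidance above about side effect bodies, assuming a keyed service (so ctx.set is available); the payment client and handler names are hypothetical and only exist to show the shape: the side effect body reads nothing but its inputs and communicates back solely through its journaled return value.

import * as restate from "@restatedev/restate-sdk";

// Hypothetical external client, declared only for illustration.
declare const paymentClient: {
  charge(cardId: string, amountCents: number): Promise<{ id: string }>;
};

const chargeCard = async (
  ctx: restate.RpcContext,
  key: string,
  req: { cardId: string; amountCents: number }
) => {
  // The closure does not write to any variable in the enclosing scope;
  // Restate journals only its return value.
  const receipt = await ctx.sideEffect(() => paymentClient.charge(req.cardId, req.amountCents));
  // Anything the handler needs later is derived from the journaled result or kept in Restate state.
  ctx.set("lastReceiptId", receipt.id);
  return receipt;
};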

Comment on lines 66 to 70
let weRunTheQuery = false;
const txnId = await ctx.sideEffect(() => {
  weRunTheQuery = true;
  return Promise.resolve(randomUUID());
});
Contributor

Does this pattern also work with a Lambda deployment? From skimming over the TypeScript SDK, it seems that before returning the side effect result we await the storage ack from the runtime. If this is the case, then reaching beyond this point on Lambda would require a replay, and therefore weRunTheQuery = true would never be set, I believe.

Contributor Author

Yes, Jack and I found that issue as well. I'm currently thinking about whether there is a way to express this that is compatible with "always suspend" executions.
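
To spell out the failure mode discussed above, here is the same snippet with editorial comments; no new API, just the sequence described in the comment, assuming the invocation suspends while awaiting the storage ack (the wrapper function name is made up).

import { randomUUID } from "crypto";
import * as restate from "@restatedev/restate-sdk";

async function startTxn(ctx: restate.RpcContext) {
  let weRunTheQuery = false;

  const txnId = await ctx.sideEffect(() => {
    weRunTheQuery = true;                  // lives only in the memory of this physical execution
    return Promise.resolve(randomUUID());  // only this return value gets journaled
  });

  // On Lambda, awaiting the storage ack ends the current execution: the invocation
  // suspends and the journal is replayed in a fresh process. During replay the closure
  // above is not re-run (the journaled txnId is returned instead), so weRunTheQuery is
  // still false here even though the side effect ran exactly once.
  if (weRunTheQuery) {
    // only the original, now-terminated execution would ever enter this branch
  }
  return txnId;
}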

}


export async function runXaDatabaseTransaction<QueryT, ResultT>(
Contributor

All the side effects here are making this hard to follow and think through. I don't think we need any of them – just a single journal operation to persist the id - to get at-most-once execution. This is the simplest code I could come up with.

export async function runXaDatabaseTransaction<QueryT, ResultT>(
    ctx: restate.RpcContext,
    database: Database<QueryT, ResultT>,
    query: QueryT): Promise<ResultT> {

    let txnId = await ctx.get<string>("transactionId");
    if (txnId === null) {
        // First attempt - generate a new ID and save it. (We do it this way to avoid writing a "leaky side effect".)
        txnId = randomUUID();
        ctx.set("transactionId", txnId);
    } else {
        // A previous run failed. Roll back the prepared transaction (if any).
        await database.rollbackPreparedTxn(txnId);
        ctx.clear("transactionId");
        return { status: "ABORTED" } as ResultT;
    }

    try {
        await database.beginTxn();
        const result = await database.runSql(query);
        await database.prepareTxn(txnId);

        await database.commitPreparedTxn(txnId);
        ctx.clear("transactionId");

        return result;
    } catch (e) {
        await database.abortTxn();
        // Note! We deliberately don't do this in a "finally" block - if aborting fails, we want to remember the id!
        ctx.clear("transactionId");
        return { status: "ABORTED" } as ResultT;
    }
}

Do we need anything more complicated than this?

Contributor Author

I think this example breaks in multiple places:

  • Through durable execution, a retry replays the journaled result of ctx.get, so we always follow the path where the txn ID is initially unset.
  • A failure between await database.prepareTxn(txnId); and await database.commitPreparedTxn(txnId); means the prepared transaction stays. Because it won't ever be cleaned up (point 1), it will block the database forever.
  • Imagine you have a process that makes it to const result = await database.runSql(query); and then stalls. A second (failover) execution makes it all the way through, then the stalled process continues. You now have committed everything twice.

Those are exactly the points I was trying to address in my implementation. Maybe there is a slightly simpler way than I did it, but I think this is inherently complex, because

  • (a) Postgres lacks some primitives that would make this easier, like fencing off previous transaction IDs (see the sketch below for the primitives it does offer)
  • (b) we are implementing the txn coordinator ourselves, with all of its recovery logic
  • (c) for that txn coordinator, durable execution is surprisingly not a perfect match, because we want to inspect the log while replaying, not simply skip over already-completed actions.
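
For concreteness, here is an editorial sketch (not code from the PR) of how the Database methods discussed in this thread could map onto Postgres prepared transactions, using the node-postgres client. It assumes txnId is a trusted, pre-validated string (these identifiers cannot be bound as query parameters) and that the server has max_prepared_transactions > 0.

import { Client, QueryResult } from "pg";

// One concrete instance of the Database shape, with plain SQL strings as queries.
class PostgresXaDatabase {
  constructor(private readonly client: Client) {}

  async beginTxn(): Promise<void> {
    await this.client.query("BEGIN");
  }

  async runSql(query: string): Promise<QueryResult> {
    return this.client.query(query);
  }

  async prepareTxn(txnId: string): Promise<void> {
    // After PREPARE TRANSACTION, the transaction survives connection loss and even
    // server restarts, holding its locks until it is explicitly committed or rolled back.
    await this.client.query(`PREPARE TRANSACTION '${txnId}'`);
  }

  async commitPreparedTxn(txnId: string): Promise<void> {
    await this.client.query(`COMMIT PREPARED '${txnId}'`);
  }

  async rollbackPreparedTxn(txnId: string): Promise<void> {
    await this.client.query(`ROLLBACK PREPARED '${txnId}'`);
  }

  async abortTxn(): Promise<void> {
    // Rolls back the current (not yet prepared) transaction on this connection.
    await this.client.query("ROLLBACK");
  }
}

The durable locks held by a prepared-but-never-finished transaction are exactly what makes the recovery paths discussed above so delicate.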

Contributor

Doh, of course ctx.get() won't work on replay – and in Lambda in particular where everything is replayed even on the first run.

A failure between await database.prepareTxn(txnId); and await database.commitPreparedTxn(txnId); means the prepared transaction stays. Because it won't ever be cleaned up (point 1), it will block the database forever.

Won't such a failure trigger automatic rollback, because presumably the session/connection gets terminated?

Imagine you have a process that makes it to const result = await database.runSql(query); and then stalls. A second (failover) execution makes it all the way through, then the stalled process continues. You now have committed everything twice.

This was absolutely not part of my mental failure model, thank you for pointing that out!

Thanks for indulging me – this was very educational!

Contributor Author

I think the hardest problems (and the main class of errors Jepsen tends to find) are related to

  • partial network partitions
  • long stalls in a process that make another process think it has timed out and disappeared (but where the stalled process then comes back). These are rare corner cases, but I have seen them actually happen during container migrations.

@StephanEwen merged commit ca40202 into restatedev:main on Dec 5, 2023
2 checks passed
@StephanEwen
Contributor Author

Thanks for all the feedback. I updated this by:

  • addressing inline comments
  • fixing the distributed lock pattern
  • removing XA transactions and cross-DB actions for now
  • adding Pavel's DynamoDB pattern
