First collection of Restate patterns #49
Conversation
Thanks a lot for creating the patterns @StephanEwen. They are really helpful :-) I've got a question concerning the locking pattern: is there an unkeyed service missing to make it work correctly?
```ts
acquireBlocking: async (ctx: restate.RpcContext, lockId: string): Promise<string> => {
  const awakeable = ctx.awakeable<string>();
  ctx.send(lockServiceApi).acquireAsync(lockId, awakeable.id);
  return awakeable.promise;
},
```
To my understanding, this call should not work, since an Awakeable is bound to a journal and we are not releasing the lock until the awakeable is fulfilled (assuming that returning a promise in the TS SDK awaits its completion). I think the problem is that `acquireAsync` is defined on the same service.
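To make that split concrete, here is a rough sketch with the blocking wrapper on an unkeyed service and the keyed lock service resolving the awakeable. The handler names, state keys, and the use of `ctx.resolveAwakeable` are assumptions for illustration, not code from this PR:

```ts
// Unkeyed wrapper service: suspending here does not block the keyed lock service.
const acquireBlocking = async (ctx: restate.RpcContext, lockId: string): Promise<string> => {
  const awakeable = ctx.awakeable<string>();
  ctx.send(lockServiceApi).acquireAsync(lockId, awakeable.id); // one-way call, keyed by lockId
  return awakeable.promise; // suspends until the lock service resolves the awakeable
};

// Keyed lock service (key = lockId): grant the lock immediately or queue the waiter.
const acquireAsync = async (ctx: restate.RpcContext, lockId: string, awakeableId: string) => {
  const holder = await ctx.get<string>("holder");
  if (!holder) {
    ctx.set("holder", awakeableId);
    // assumed SDK call: completes the awakeable held by the wrapper invocation
    ctx.resolveAwakeable(awakeableId, lockId);
  } else {
    const waiters = (await ctx.get<string[]>("waiters")) ?? [];
    waiters.push(awakeableId);
    ctx.set("waiters", waiters);
  }
};
```

A matching `release` handler would pop the next waiter and resolve its awakeable; that part is unchanged by the split.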
oh, yes. I actually had it like that before, and then thought the two services could be merged after all, but you are right, I need to change it back.
"type": "commonjs", | ||
"scripts": { | ||
"build": "tsc --noEmitOnError" | ||
}, |
The patterns don't seem to be integrated into the CI, which runs `verify` for all TS examples. To ensure that the patterns don't break, it would be great to integrate them: https://github.com/restatedev/examples/blob/main/.github/workflows/test.yml
"devDependencies": { | ||
"typescript": "^5.0.2" | ||
} | ||
} |
Should we also add a README.md listing the different patterns?
Later, once this has converged.
```ts
prepareTxn(txnId: string): Promise<void>          // prepares a txn for commit under given ID
commitPreparedTxn(txnId: string): Promise<void>   // commits a txn prepared under given ID
rollbackPreparedTxn(txnId: string): Promise<void> // rolls bacl a txn prepared under given ID
```
Suggested change:
```diff
- rollbackPreparedTxn(txnId: string): Promise<void> // rolls bacl a txn prepared under given ID
+ rollbackPreparedTxn(txnId: string): Promise<void> // rolls back a txn prepared under given ID
```
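As background for readers who have not used two-phase commit in Postgres: these three calls correspond to `PREPARE TRANSACTION`, `COMMIT PREPARED`, and `ROLLBACK PREPARED`. A hypothetical mapping using the `pg` client could look like this (class name, identifier quoting, and error handling are illustrative only, not part of this PR):

```ts
import { Client } from "pg";

class PgXaDatabase {
  constructor(private readonly client: Client) {}

  async prepareTxn(txnId: string): Promise<void> {
    // detaches the currently open transaction; it survives crashes and disconnects
    await this.client.query(`PREPARE TRANSACTION '${txnId}'`);
  }

  async commitPreparedTxn(txnId: string): Promise<void> {
    // can be issued from any session, not only the one that prepared the txn
    await this.client.query(`COMMIT PREPARED '${txnId}'`);
  }

  async rollbackPreparedTxn(txnId: string): Promise<void> {
    await this.client.query(`ROLLBACK PREPARED '${txnId}'`);
  }
}
```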
```ts
if (commitDecision.commit) {
  // now we need to ensure this one gets committed
  await ctx.sideEffect(() => database.commitPreparedTxn(txnId));
```
Can the `commitPreparedTxn` call ever fail permanently (so no amount of retries would let it succeed)?
It probably can, yes. This whole code is very PoC-ish - the basic pattern is in place, but detailed handling of the various possible errors is critical here.
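If it can, one option is to classify the failure inside the side effect so that retries stop. A sketch, assuming the SDK exposes a non-retryable error type (`TerminalError`); `isPermanentCommitFailure` is a made-up helper:

```ts
await ctx.sideEffect(async () => {
  try {
    await database.commitPreparedTxn(txnId);
  } catch (e) {
    if (isPermanentCommitFailure(e)) {
      // surface the failure to the caller instead of retrying forever
      throw new restate.TerminalError(`commit of prepared txn ${txnId} failed permanently`);
    }
    throw e; // transient failure: let the side effect retry
  }
});
```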
```ts
} else {
  // we clean up this query. it might be that this query was never prepared, but
  // that does not matter here
  await ctx.sideEffect(() => database.rollbackPreparedTxn(txnId));
```
What happens in the following scenario? Assume we have a first call that manages to persist its `txnId` and then pauses before running the `weRunTheQuery` if-block. Now Restate retries the invocation. The retry attempt will reach this point and roll back the prepared txn, which hasn't been created yet. Then the original invocation continues running and creates a prepared txn, which acquires all the locks. If the second invocation attempt now enters the `weRunTheQuery` block and runs `runSql`, what would happen? Would the `runSql` execution fail because it cannot acquire the locks, or would it wait on acquiring them?
You are right, this still seems possible.
The XA example is definitely a bit difficult to follow! Maybe that's inherent though.
```ts
const txnId = await ctx.sideEffect(() => {
  weRunTheQuery = true;
  return Promise.resolve(randomUUID());
});
```
I'm wondering if the state of the transaction isn't better off persisted as a state key instead? I worry that we're training people to "leak" state past the Restate API with this approach.
I'd like to offer simple guidance to customers on how to write side effect bodies: the functions must communicate with their containing scope only via read-only parameters and their return values.
In the future we might want to expose a version of `sideEffect` that returns some extra metadata. I'm not sure it's a great idea to even expose whether the side effect committed for the first time or is being replayed, but that could avoid having to use a "side channel" variable. I imagine the extra metadata would be very useful for other purposes though, e.g. we could return the journal entry's SN, as a source of guaranteed monotonic SNs is super useful in almost any transactional system!
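For concreteness, a purely hypothetical shape such an API could take (nothing like this exists in the SDK today):

```ts
interface SideEffectMeta {
  firstExecution: boolean; // false when the value was served from journal replay
  journalEntrySn: number;  // monotonically increasing journal entry sequence number
}

declare function sideEffectWithMeta<T>(
  ctx: restate.RpcContext,
  fn: () => Promise<T>
): Promise<[T, SideEffectMeta]>;

// usage sketch:
// const [txnId, meta] = await sideEffectWithMeta(ctx, () => Promise.resolve(randomUUID()));
// if (meta.firstExecution) { /* this attempt actually executed the side effect */ }
```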
In the meantime, maybe something like this would be clearer?
```ts
let transactionId = ctx.get("transactionId");
if (!transactionId) {
  transactionId = ctx.rand.uuidv4();
  ctx.set("transactionId", transactionId);
}
// ...
```
Not 100% sure this is semantically equivalent – working through an updated version myself – but I find it much easier to follow. 😊
```ts
let weRunTheQuery = false;
const txnId = await ctx.sideEffect(() => {
  weRunTheQuery = true;
  return Promise.resolve(randomUUID());
});
```
Does this pattern also work with a Lambda deployment? From skimming over the TypeScript SDK, it seems that before returning the side effect result we await the storage ack from the runtime. If this is the case, then reaching beyond this point on Lambda would require a replay, and therefore `weRunTheQuery = true` would never be set, I believe.
Yes, Jack and I found that issue as well. I'm currently thinking about whether there is a way to express this that is compatible with "always suspend" executions.
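To spell out the failure mode in code form (illustrative comments only; this assumes the pattern later gates the query on the flag, as discussed above):

```ts
let weRunTheQuery = false;
const txnId = await ctx.sideEffect(() => {
  weRunTheQuery = true; // runs only when the closure actually executes
  return Promise.resolve(randomUUID());
});
// On Lambda, the invocation suspends once the side effect result is journaled.
// The next attempt replays the journal: ctx.sideEffect returns the stored txnId
// without re-running the closure, so weRunTheQuery is still false even though
// this execution "owns" txnId and was meant to run the query.
if (weRunTheQuery) {
  // not reached after a suspend/replay cycle
}
```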
```ts
}

export async function runXaDatabaseTransaction<QueryT, ResultT>(
```
All the side effects here are making this hard to follow and think through. I don't think we need any of them; a single journal operation to persist the id should be enough to get at-most-once execution. This is the simplest code I could come up with.
```ts
export async function runXaDatabaseTransaction<QueryT, ResultT>(
  ctx: restate.RpcContext,
  database: Database<QueryT, ResultT>,
  query: QueryT): Promise<ResultT> {

  let txnId = ctx.get("transactionId");
  if (txnId === undefined) {
    // First attempt - generate a new ID and save it. (We do it this way to avoid writing a "leaky side effect".)
    txnId = randomUUID();
    ctx.set("transactionId", txnId);
  } else {
    // A previous run failed. Roll back the prepared transaction (if any).
    await database.rollbackPreparedTxn(txnId);
    ctx.clear("transactionId");
    return { status: "ABORTED" } as ResultT;
  }

  try {
    await database.beginTxn();
    const result = await database.runSql(query);
    await database.prepareTxn(txnId);
    await database.commitPreparedTxn(txnId);
    ctx.clear("transactionId");
    return result;
  } catch (e) {
    await database.abortTxn();
    // Note! We deliberately don't do this in a "finally" block - if aborting fails, we want to remember the id!
    ctx.clear("transactionId");
    return { status: "ABORTED" } as ResultT;
  }
}
```
Do we need anything more complicated than this?
I think this example breaks in multiple places:

1. Through the durable execution, we always follow the path where the txn ID is initially unset.
2. A failure between `await database.prepareTxn(txnId);` and `await database.commitPreparedTxn(txnId);` means the prepared transaction stays. Because it won't ever be cleaned up (point 1), it will block the database forever.
3. Imagine you have a process that makes it to `const result = await database.runSql(query);` and then stalls. A second (failover) execution makes it all the way through, then the stalled process continues. You have now committed everything twice.

Those are exactly the points I was trying to address in my implementation. Maybe there is a slightly simpler way than I did it, but I think this is inherently complex, because

- (a) Postgres is missing some primitives that would make this easier (like fencing off previous IDs)
- (b) we are implementing the txn coordinator, with all its recovery logic
- (c) for that txn coordinator, durable execution is surprisingly not the perfect match, because we want to look at the log while replaying, not just skip over actions.
Doh, of course `ctx.get()` won't work on replay – and on Lambda in particular, where everything is replayed even on the first run.

> A failure between `await database.prepareTxn(txnId);` and `await database.commitPreparedTxn(txnId);` means the prepared transaction stays. Because it won't ever be cleaned up (point 1), it will block the database forever.

Won't such a failure trigger an automatic rollback, because presumably the session/connection gets terminated?

> Imagine you have a process that makes it to `const result = await database.runSql(query);` and then stalls. A second (failover) execution makes it all the way through, then the stalled process continues. You have now committed everything twice.

This was absolutely not part of my mental failure model, thank you for pointing that out!

Thanks for indulging me – this was very educational!
I think the hardest problems (and the main class of errors Jepsen tends to find) are related to:

- partial network partitions
- long stalls in processes that make another process think it timed out and disappeared (but where the process then comes back). These are rare corner cases, but I have seen those problems actually happen during container migrations.
Thanks for all the feedback. I updated this by