Advanced transaction API #10

Open
vigoo opened this issue Apr 11, 2024 · 0 comments

The initially implemented Rust transaction API works like this:

  • It defines the transaction as an atomic region.
  • If an operation fails in user code (returns Result::Err), it performs the rollback steps in reverse order.
  • It then uses set-oplog-index to restart from the beginning of the transaction.
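
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the actual library API: `Step`, `run_transaction`, `Demo` and the demo operations are hypothetical names, and a plain retry loop stands in for restarting the atomic region with set-oplog-index.

```rust
// Hypothetical sketch of the simple API; none of these names come from the real library.

/// One transaction step: an action plus its compensation, both acting on shared state `S`.
pub struct Step<S> {
    pub name: &'static str,
    pub execute: fn(&mut S) -> Result<(), String>,
    pub rollback: fn(&mut S),
}

/// Runs all steps; on failure compensates in reverse order and retries.
/// The retry loop stands in for restarting the atomic region via set-oplog-index.
/// Returns the number of the attempt that succeeded.
pub fn run_transaction<S>(state: &mut S, steps: &[Step<S>], max_attempts: usize) -> Result<usize, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        let mut completed = 0;
        let mut failure = None;
        for step in steps {
            match (step.execute)(state) {
                Ok(()) => completed += 1,
                Err(e) => {
                    failure = Some(format!("{} failed: {}", step.name, e));
                    break;
                }
            }
        }
        match failure {
            None => return Ok(attempt),
            Some(e) => {
                // Undo only the steps that completed, newest first.
                for step in steps[..completed].iter().rev() {
                    (step.rollback)(state);
                }
                last_err = e;
            }
        }
    }
    Err(last_err)
}

/// Demo state: an effect log plus a counter that makes `charge` fail once.
#[derive(Default)]
pub struct Demo {
    pub log: Vec<String>,
    pub charge_calls: u32,
}

fn reserve(d: &mut Demo) -> Result<(), String> {
    d.log.push("reserve".to_string());
    Ok(())
}
fn unreserve(d: &mut Demo) {
    d.log.push("undo reserve".to_string());
}
fn charge(d: &mut Demo) -> Result<(), String> {
    d.charge_calls += 1;
    if d.charge_calls == 1 {
        return Err("card declined".to_string());
    }
    d.log.push("charge".to_string());
    Ok(())
}
fn refund(d: &mut Demo) {
    d.log.push("refund".to_string());
}

/// Runs a two-step transaction whose second step fails on the first attempt.
pub fn demo() -> (Result<usize, String>, Vec<String>) {
    let steps = [
        Step { name: "reserve", execute: reserve, rollback: unreserve },
        Step { name: "charge", execute: charge, rollback: refund },
    ];
    let mut state = Demo::default();
    let outcome = run_transaction(&mut state, &steps, 3);
    (outcome, state.log)
}
```

In `demo`, the first attempt reserves, fails to charge, and undoes the reservation; the second attempt succeeds. Note that the compensation only runs because the failure is an ordinary `Err` that the loop can observe.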

This means that if the worker gets restarted during the transaction for any other reason (a panic, or anything that triggers the worker executor to restart, such as hardware failures, deployment, etc.), the compensation actions of the partially executed transaction will not be executed. There is no way to catch this and run code, as the executor itself may no longer exist. When the worker gets recovered in another executor, it simply reaches the transaction during replay and, because of the atomic region, decides to rerun it. Golem currently exposes no way for user code (meaning the transaction lib) to determine which operations were partially done, although they are recorded in the oplog.

We can however take advantage of more Golem host features to solve this, in the following way:

  • First we need to make sure the operations are serializable. The simplest way to do so is to list all the operations in advance in the transaction function (only their types; they are still parametrized by an input value), and then call these "registered" operations within the transaction. This way, if an operation's input is serializable, the operation itself is also serializable, because the registration step associates a simple numeric ID with it.
  • We also generate a transaction UUID before the atomic region starts. This runs only once, and we get the same UUID during replay, so we have a persistent unique identifier for our transaction.
  • Before executing an operation, we use Golem's WASI key-value implementation to store the begin (or end, depending on the idempotence mode?) of each operation together with its inputs.
  • We can keep the same user-land rollback logic as in the simple version, extended so that it removes these entries from the key-value store as the operations get rolled back.
  • Before actually starting the transaction, but after we already have the transaction ID, we add a non-persisted region in which we check whether the key-value store has any leftover operations to be rolled back. If it has, we perform these rollback actions before restarting the transaction (the operation ID and the serialized input are stored, so we can reconstruct each operation and call rollback on it).
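
A rough sketch of how these pieces could fit together is shown below. Everything in it is an assumption made for illustration: a `BTreeMap` stands in for the key-value store, a fixed string stands in for the transaction UUID, inputs are already-serialized strings, and `OpDef`, `Transaction` and the demo functions are hypothetical names. The sketch journals at the begin of each operation, so rollbacks have to tolerate an operation that never completed, and it clears the journal on commit, which the proposal above does not spell out.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the key-value store (the real one would be a host API).
pub type Kv = BTreeMap<String, String>;
/// Stand-in for the external effects that operations have.
pub type World = Vec<String>;

/// A registered operation type. Its index in the registry is its numeric ID.
#[derive(Clone, Copy)]
pub struct OpDef {
    pub execute: fn(&mut World, &str) -> Result<(), String>,
    pub rollback: fn(&mut World, &str), // must tolerate an operation that never completed
}

pub struct Transaction<'a> {
    tx_id: String, // assumed to be generated once, before the atomic region
    ops: Vec<OpDef>,
    kv: &'a mut Kv,
    next_seq: usize,
}

impl<'a> Transaction<'a> {
    /// Starts (or restarts) the transaction, rolling back leftovers first.
    pub fn begin(tx_id: &str, ops: Vec<OpDef>, kv: &'a mut Kv, world: &mut World) -> Self {
        let mut tx = Transaction { tx_id: tx_id.to_string(), ops, kv, next_seq: 0 };
        tx.rollback_journaled(world);
        tx
    }

    /// Journals "operation ID : serialized input" and then executes the operation.
    pub fn run(&mut self, op_id: usize, input: &str, world: &mut World) -> Result<(), String> {
        let op = *self.ops.get(op_id).ok_or("unregistered operation")?;
        let key = format!("{}/{:08}", self.tx_id, self.next_seq);
        self.kv.insert(key, format!("{}:{}", op_id, input));
        self.next_seq += 1;
        (op.execute)(world, input)
    }

    /// Rolls back every journaled operation, newest first, deleting each entry afterwards.
    pub fn rollback_journaled(&mut self, world: &mut World) {
        let prefix = format!("{}/", self.tx_id);
        let keys: Vec<String> =
            self.kv.keys().filter(|k| k.starts_with(&prefix)).cloned().collect();
        for key in keys.into_iter().rev() {
            let value = self.kv.get(&key).cloned().unwrap_or_default();
            if let Some((id, input)) = value.split_once(':') {
                if let Some(op) = id.parse::<usize>().ok().and_then(|i| self.ops.get(i)) {
                    (op.rollback)(world, input);
                }
            }
            self.kv.remove(&key); // removed only after its rollback ran
        }
        self.next_seq = 0;
    }

    /// Assumption of this sketch: a successful transaction clears its journal.
    pub fn commit(self) {
        let Transaction { tx_id, kv, .. } = self;
        let prefix = format!("{}/", tx_id);
        kv.retain(|k, _| !k.starts_with(&prefix));
    }
}

fn reserve(world: &mut World, seat: &str) -> Result<(), String> {
    world.push(format!("reserved {}", seat));
    Ok(())
}
fn unreserve(world: &mut World, seat: &str) {
    let tag = format!("reserved {}", seat);
    world.retain(|e| *e != tag); // no-op if the reservation never happened
}

/// Simulates a crash after one operation, then recovery and a successful rerun.
pub fn crash_then_recover() -> (usize, World, World, usize) {
    let ops = vec![OpDef { execute: reserve, rollback: unreserve }];
    let (mut kv, mut world) = (Kv::new(), World::new());
    {
        let mut tx = Transaction::begin("tx-1", ops.clone(), &mut kv, &mut world);
        tx.run(0, "seat-12", &mut world).unwrap();
    } // "crash": the transaction is dropped with no rollback and no commit
    let leftover = kv.len();
    let mut tx = Transaction::begin("tx-1", ops, &mut kv, &mut world); // recovery happens here
    let after_recovery = world.clone();
    tx.run(0, "seat-12", &mut world).unwrap();
    tx.commit();
    (leftover, after_recovery, world, kv.len())
}
```

The zero-padded sequence number in the key keeps the journal in execution order, so iterating the keys in reverse gives the rollback order. Because an entry is deleted only after its rollback has run, a crash during rollback can cause the same rollback to run again on the next recovery, so in this design the rollback functions need to be idempotent.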

This way, if the execution dies in the middle of a transaction, or even during the rollback, the leftover rollback actions will always be performed the next time the worker gets recovered.
