Wait for Crucible agent to do requested work
Several endpoints of the Crucible agent do not perform work synchronously with that endpoint's handler: changes are instead requested, and the actual work proceeds asynchronously. Nexus, however, assumes that the work is synchronous.

This commit creates three functions in common_storage that properly deal with an agent that does asynchronous work (`delete_crucible_region`, `delete_crucible_snapshot`, and `delete_crucible_running_snapshot`), and calls those.

Part of testing this commit was creating "disk" antagonists in the omicron-stress tool. This uncovered cases where the transactions related to disk creation and deletion failed to complete. One of these cases is in `regions_hard_delete`: this transaction is now retried until it succeeds. The `TransactionError::retry_transaction` function can be used to see if CRDB is telling Nexus to retry the transaction. Another is `decrease_crucible_resource_count_and_soft_delete_volume`: this was broken into two steps, the first of which enumerates the read-only resources to delete (because those don't currently change!).

Another fix that's part of this commit, exposed by the disk antagonist: it should only be ok to delete a disk in state `Creating` if you're in the disk create saga. If this is not enforced, a delete of a disk that is currently in the disk create saga can cause that saga to fail unwinding and remain stuck.

This commit also bundles idempotency-related fixes for the simulated Crucible agent, as these were exposed by the retries that the new `delete_crucible_*` functions perform.

What shook out of this is that `Nexus::volume_delete` was removed, as this was not correct to call during the unwind of the disk and snapshot create sagas: it would conflict with what those sagas were doing as part of their unwind.
The volume delete saga is still required during disk and snapshot deletion, but there is enough context during the disk and snapshot create sagas to properly unwind a volume, even in the case of read-only parents. `Nexus::volume_delete` ultimately isn't safe, so it was removed: instead, any future sagas should either embed the volume delete saga as a sub saga, or there should be enough context in the outputs of said future saga's nodes to properly unwind what was created.

Depends on oxidecomputer/crucible#838, which fixes a few non-idempotency bugs in the actual Crucible agent.

Fixes oxidecomputer#3698
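
The "request, then wait" pattern described above for the `delete_crucible_*` functions can be illustrated with a minimal, self-contained sketch. This is not the code from this commit: `FakeAgent`, `RegionState`, and `delete_region_and_wait` are hypothetical names, the agent is simulated in memory, and a bounded poll count stands in for whatever backoff the real functions use.

```rust
use std::collections::HashMap;

/// Hypothetical region states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegionState {
    Created,
    Tombstoned, // deletion was requested, work is still in progress
    Destroyed,
}

/// Hypothetical in-memory stand-in for an agent that deletes asynchronously.
struct FakeAgent {
    // Region id -> (state, polls remaining before the delete completes).
    regions: HashMap<String, (RegionState, u32)>,
}

impl FakeAgent {
    fn new() -> Self {
        FakeAgent { regions: HashMap::new() }
    }

    fn create(&mut self, id: &str, work_steps: u32) {
        self.regions
            .insert(id.to_string(), (RegionState::Created, work_steps));
    }

    /// Request deletion and return immediately. Repeating the request for a
    /// region that is already tombstoned, destroyed, or unknown is a no-op.
    fn request_delete(&mut self, id: &str) {
        if let Some((state, _)) = self.regions.get_mut(id) {
            if *state == RegionState::Created {
                *state = RegionState::Tombstoned;
            }
        }
    }

    /// Report the region's state. Each call advances the simulated
    /// background work by one step.
    fn get(&mut self, id: &str) -> Option<RegionState> {
        let (state, remaining) = self.regions.get_mut(id)?;
        if *state == RegionState::Tombstoned {
            if *remaining == 0 {
                *state = RegionState::Destroyed;
            } else {
                *remaining -= 1;
            }
        }
        Some(*state)
    }
}

/// Request deletion, then poll until the agent reports the region as gone.
/// Returns the number of polls that were needed.
fn delete_region_and_wait(
    agent: &mut FakeAgent,
    id: &str,
    max_polls: u32,
) -> Result<u32, String> {
    agent.request_delete(id);
    for poll in 1..=max_polls {
        match agent.get(id) {
            // Gone, or never existed: both count as success, so retries
            // of this function are idempotent.
            None | Some(RegionState::Destroyed) => return Ok(poll),
            // Work still in progress; real code would sleep or back off here.
            Some(RegionState::Tombstoned) => {}
            // The request was somehow not recorded; ask again.
            Some(RegionState::Created) => agent.request_delete(id),
        }
    }
    Err(format!("region {id} not destroyed after {max_polls} polls"))
}
```

Treating "not found" and "already destroyed" as success is what makes the call safe to repeat, which matters when a saga node is re-executed or an unwind runs more than once.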