Wait for ZFS snapshot directory + agent fixes #838

Merged
merged 4 commits into oxidecomputer:main from agent_wait_for_snapshot on Jul 26, 2023

Conversation

jmpesp
Contributor

@jmpesp jmpesp commented Jul 19, 2023

When `zfs snapshot dataset@name` is called, the ZFS snapshot is not instantaneously mounted at the corresponding `.../.zfs/snapshot/<name>` path (delays as long as 172 ms were measured for a zpool with file-backed vdevs). Two fixes result from this:

  1. Use `zfs list -t snapshot ...` to tell if a snapshot exists, not the presence of the mounted snapshot directory.

  2. Wait for the ZFS snapshot directory to be mounted before starting a read-only downstairs service that points to it, waiting a maximum of 5 seconds before returning an error (a sketch of both checks follows this list).
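
A minimal sketch of both checks, for illustration only (this is not the code in this PR; the helper names and exact `zfs` flags are assumptions, while the use of `zfs list -t snapshot` and the 5-second cap come from the description above):

```rust
use std::path::Path;
use std::process::Command;
use std::time::{Duration, Instant};

/// Hypothetical helper for fix 1: ask ZFS directly whether the snapshot
/// exists instead of checking for the mounted .zfs/snapshot/<name> directory.
fn snapshot_exists(dataset: &str, snapshot_name: &str) -> std::io::Result<bool> {
    let out = Command::new("zfs")
        .arg("list")
        .arg("-t")
        .arg("snapshot")
        .arg("-H")
        .arg("-o")
        .arg("name")
        .arg(format!("{}@{}", dataset, snapshot_name))
        .output()?;

    // `zfs list` exits non-zero when the named snapshot does not exist.
    Ok(out.status.success())
}

/// Hypothetical helper for fix 2: wait up to 5 seconds for the snapshot
/// directory to be mounted after `zfs snapshot` returns.
fn wait_for_snapshot_mount(
    dataset_mountpoint: &Path,
    snapshot_name: &str,
) -> Result<(), String> {
    let snapshot_path = dataset_mountpoint
        .join(".zfs")
        .join("snapshot")
        .join(snapshot_name);

    let deadline = Instant::now() + Duration::from_secs(5);
    while Instant::now() < deadline {
        if snapshot_path.is_dir() {
            return Ok(());
        }
        std::thread::sleep(Duration::from_millis(10));
    }

    Err(format!("{:?} was not mounted within 5 seconds", snapshot_path))
}
```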

Also bundled in this commit is a fix for an agent bug: a snapshot can have a running read-only downstairs that is in the destroyed state (but not yet deleted), and this prevented deleting that snapshot, because the deletion code only checked for the presence of a running read-only snapshot, not its state. Fix this by matching on the state and allowing deletion.
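
A minimal sketch of that state check (a simplified stand-in for the agent's state enum; variant names other than Destroyed and Failed are assumptions):

```rust
/// Simplified stand-in for the agent's resource state (variant set partly assumed).
#[derive(Debug, PartialEq)]
enum State {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

/// Hypothetical check: a running read-only downstairs only blocks snapshot
/// deletion while it is still live, not once it is destroyed or failed.
fn running_snapshot_blocks_deletion(state: &State) -> bool {
    !matches!(state, State::Destroyed | State::Failed)
}
```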

Also (!) bundled in this commit is another fix for another agent bug: when getting snapshots for a region, the endpoint returned 500 if the region wasn't in state Created. This is wrong: states other than Created can be handled gracefully. The only state that should return 500 is Failed.
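
And a hedged sketch of the intended endpoint behavior (hypothetical names, plain Result types instead of the agent's real HTTP handler):

```rust
/// Hypothetical, simplified region state for the endpoint fix.
enum RegionState {
    Created,
    Destroyed,
    Failed,
}

/// Only a Failed region should surface as a 500; every other state can
/// still be answered.
fn get_snapshots_for_region(
    state: &RegionState,
    snapshots: Vec<String>,
) -> Result<Vec<String>, String> {
    match state {
        RegionState::Failed => Err("region is in state Failed".to_string()),
        _ => Ok(snapshots),
    }
}
```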

Fixes #827.

@jmpesp jmpesp requested a review from leftwo July 19, 2023 03:12
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 21, 2023
Several endpoints of the Crucible agent do not perform work
synchronously with that endpoint's handler: changes are instead
requested and the actual work proceeds asynchronously. Nexus assumes
that the work is synchronous though.

This commit creates three functions in common_storage that properly deal
with an agent that does asynchronous work (`delete_crucible_region`,
`delete_crucible_snapshot`, and `delete_crucible_running_snapshot`), and
calls those.

Part of testing this commit was creating "disk" antagonists in the
omicron-stress tool. This uncovered cases where the transactions related
to disk creation and deletion failed to complete. One of these cases is
in `regions_hard_delete` - this transaction is now retried until it
succeeds. The `TransactionError::retry_transaction` function can be used
to see if CRDB is telling Nexus to retry the transaction. Another is
`decrease_crucible_resource_count_and_soft_delete_volume` - this was
broken into two steps, the first of which enumerates the read-only
resources to delete (because those don't currently change!).

Another fix that's part of this commit, exposed by the disk antagonist:
it should only be ok to delete a disk in state `Creating` if you're in
the disk create saga. If this is not enforced, deleting a disk that is
still in the disk create saga can cause that saga to fail during unwind
and remain stuck.

This commit also bundles idempotency related fixes for the simulated
Crucible agent, as these were exposed with the retries that are
performed by the new `delete_crucible_*` functions.

What shook out of this is that `Nexus::volume_delete` was removed, as
this was not correct to call during the unwind of the disk and snapshot
create sagas: it would conflict with what those sagas were doing as part
of their unwind. The volume delete saga is still required during disk
and snapshot deletion, but there is enough context during the disk and
snapshot create sagas to properly unwind a volume, even in the case of
read-only parents. `Nexus::volume_delete` ultimately isn't safe, so it
was removed: instead, any future sagas should either embed the volume
delete saga as a sub saga, or there should be enough context in the
outputs of said future saga's nodes to properly unwind what was created.

Depends on oxidecomputer/crucible#838, which
fixes a few non-idempotency bugs in the actual Crucible agent.

Fixes oxidecomputer#3698
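
As an illustration of the asynchronous-deletion pattern this commit message describes, a minimal sketch (hypothetical trait and helper names; not omicron's actual `delete_crucible_region`):

```rust
use std::time::Duration;

/// Hypothetical agent-side region state, mirroring the asynchronous model
/// described above.
#[derive(Debug, Clone, PartialEq)]
enum RegionState {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

/// Hypothetical client standing in for the Crucible agent API.
trait AgentClient {
    fn region_delete(&self, region_id: &str) -> Result<(), String>;
    fn region_state(&self, region_id: &str) -> Result<RegionState, String>;
}

/// Sketch of a `delete_crucible_region`-style helper: request deletion, then
/// poll until the agent reports the region destroyed, since the agent only
/// performs the actual work asynchronously.
fn delete_crucible_region(client: &dyn AgentClient, region_id: &str) -> Result<(), String> {
    client.region_delete(region_id)?;

    loop {
        match client.region_state(region_id)? {
            RegionState::Destroyed => return Ok(()),
            RegionState::Failed => {
                return Err(format!("region {} failed during delete", region_id))
            }
            // Work is asynchronous on the agent side; keep polling.
            _ => std::thread::sleep(Duration::from_secs(1)),
        }
    }
}
```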
@leftwo leftwo left a comment

Some questions for you

);
}

State::Destroyed | State::Failed => {
Contributor

Which state are we looking at here? The state of a downstairs somewhere?

If the state is Failed, could it later come back out of Failed and still want this snapshot?

Contributor Author

This is the stored state for a "running snapshot" (a.k.a. a read-only downstairs booted at the snapshot's mount dir; I like and don't like the "running snapshot" name, but here we are).

As far as I can tell, there's no path in the state machine to transition out of Failed for any of the resources that the agent can interact with.

agent/src/datafile.rs (resolved)
agent/src/datafile.rs (outdated; resolved)
.arg("creation")
.arg(snapshot)
.output()?;

Contributor

Do we need an `if !cmd.status.success() {` check here? Or if we fail, does `cmd_stdout` just end up with nothing in it?

Contributor Author

Error check added in dfb1358
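
For context, a hedged sketch of the kind of check being discussed (the surrounding `zfs get` invocation and flags are assumptions based on the diff fragment above; the actual change is in dfb1358):

```rust
use std::process::Command;

/// Hypothetical version of the call being discussed: read a snapshot's
/// creation time with `zfs get`, and fail loudly if the command itself
/// fails rather than silently parsing empty output.
fn snapshot_creation_time(snapshot: &str) -> Result<String, String> {
    let cmd = Command::new("zfs")
        .arg("get")
        .arg("-H")
        .arg("-o")
        .arg("value")
        .arg("-p")
        .arg("creation")
        .arg(snapshot)
        .output()
        .map_err(|e| e.to_string())?;

    if !cmd.status.success() {
        return Err(format!(
            "zfs get creation failed for {}: {}",
            snapshot,
            String::from_utf8_lossy(&cmd.stderr),
        ));
    }

    Ok(String::from_utf8_lossy(&cmd.stdout).trim().to_string())
}
```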

jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 24, 2023
@jmpesp jmpesp merged commit 0d90edc into oxidecomputer:main Jul 26, 2023
16 checks passed
@jmpesp jmpesp deleted the agent_wait_for_snapshot branch July 26, 2023 15:59
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request Aug 2, 2023
Successfully merging this pull request may close these issues.

Snapshots don't instantly mount in the dataset's .zfs/snapshot directory