Wait for ZFS snapshot directory + agent fixes #838

Merged
merged 4 commits into oxidecomputer:main from agent_wait_for_snapshot on Jul 26, 2023

Conversation

jmpesp
Contributor

@jmpesp jmpesp commented Jul 19, 2023

When `zfs snapshot dataset@name` is called, the ZFS snapshot is not instantaneously mounted at the corresponding `.../.zfs/snapshot/<name>` path (delays as long as 172 ms were measured for a zpool with file-backed vdevs). Two fixes result from this:

  1. Use `zfs list -t snapshot ...` to tell if a snapshot exists, not the presence of the mounted snapshot directory.

  2. Wait for the ZFS snapshot directory to be mounted before starting a read-only downstairs service that points to it, waiting a maximum of 5 seconds before returning an error (a sketch of both checks follows this list).
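
A minimal sketch of both checks, for illustration only (this is not the code in this PR; the helper names and exact `zfs` flags are assumptions, while the use of `zfs list -t snapshot` and the 5-second cap come from the description above):

```rust
use std::path::Path;
use std::process::Command;
use std::time::{Duration, Instant};

/// Hypothetical helper for fix 1: ask ZFS directly whether the snapshot
/// exists instead of checking for the mounted .zfs/snapshot/<name> directory.
fn snapshot_exists(dataset: &str, snapshot_name: &str) -> std::io::Result<bool> {
    let out = Command::new("zfs")
        .arg("list")
        .arg("-t")
        .arg("snapshot")
        .arg("-H")
        .arg("-o")
        .arg("name")
        .arg(format!("{}@{}", dataset, snapshot_name))
        .output()?;

    // `zfs list` exits non-zero when the named snapshot does not exist.
    Ok(out.status.success())
}

/// Hypothetical helper for fix 2: wait up to 5 seconds for the snapshot
/// directory to be mounted after `zfs snapshot` returns.
fn wait_for_snapshot_mount(
    dataset_mountpoint: &Path,
    snapshot_name: &str,
) -> Result<(), String> {
    let snapshot_path = dataset_mountpoint
        .join(".zfs")
        .join("snapshot")
        .join(snapshot_name);

    let deadline = Instant::now() + Duration::from_secs(5);
    while Instant::now() < deadline {
        if snapshot_path.is_dir() {
            return Ok(());
        }
        std::thread::sleep(Duration::from_millis(10));
    }

    Err(format!("{:?} was not mounted within 5 seconds", snapshot_path))
}
```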

Also bundled in this commit is a fix for an agent bug: a snapshot can have a running read-only downstairs that is in the destroyed state (but not yet deleted), and this prevented deleting that snapshot, because the deletion code only checked for the presence of a running read-only snapshot, not its state. Fix this by matching on the state and allowing deletion.
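
A minimal sketch of that state check (a simplified stand-in for the agent's state enum; variant names other than Destroyed and Failed are assumptions):

```rust
/// Simplified stand-in for the agent's resource state (variant set partly assumed).
#[derive(Debug, PartialEq)]
enum State {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

/// Hypothetical check: a running read-only downstairs only blocks snapshot
/// deletion while it is still live, not once it is destroyed or failed.
fn running_snapshot_blocks_deletion(state: &State) -> bool {
    !matches!(state, State::Destroyed | State::Failed)
}
```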

Also (!) bundled in this commit is another fix for another agent bug: when getting snapshots for a region, the endpoint returned 500 if the region wasn't in state Created. This is wrong: states other than Created can be handled gracefully. The only state that should return 500 is Failed.
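
And a hedged sketch of the intended endpoint behavior (hypothetical names, plain Result types instead of the agent's real HTTP handler):

```rust
/// Hypothetical, simplified region state for the endpoint fix.
enum RegionState {
    Created,
    Destroyed,
    Failed,
}

/// Only a Failed region should surface as a 500; every other state can
/// still be answered.
fn get_snapshots_for_region(
    state: &RegionState,
    snapshots: Vec<String>,
) -> Result<Vec<String>, String> {
    match state {
        RegionState::Failed => Err("region is in state Failed".to_string()),
        _ => Ok(snapshots),
    }
}
```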

Fixes #827.

@jmpesp jmpesp requested a review from leftwo July 19, 2023 03:12
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 21, 2023
Several endpoints of the Crucible agent do not perform work
synchronously with that endpoint's handler: changes are instead
requested and the actual work proceeds asynchronously. Nexus assumes
that the work is synchronous though.

This commit creates three functions in common_storage that properly deal
with an agent that does asynchronous work (`delete_crucible_region`,
`delete_crucible_snapshot`, and `delete_crucible_running_snapshot`), and
calls those.

Part of testing this commit was creating "disk" antagonists in the
omicron-stress tool. This uncovered cases where the transactions related
to disk creation and deletion failed to complete. One of these cases is
in `regions_hard_delete` - this transaction is now retried until it
succeeds. The `TransactionError::retry_transaction` function can be used
to see if CRDB is telling Nexus to retry the transaction. Another is
`decrease_crucible_resource_count_and_soft_delete_volume` - this was
broken into two steps, the first of which enumerates the read-only
resources to delete (because those don't currently change!).

Another fix that's part of this commit, exposed by the disk antagonist:
it should only be ok to delete a disk in state `Creating` if you're in
the disk create saga. If this is not enforced, deleting a disk that is
still in the disk create saga can cause that saga to fail during unwind
and remain stuck.

This commit also bundles idempotency related fixes for the simulated
Crucible agent, as these were exposed with the retries that are
performed by the new `delete_crucible_*` functions.

What shook out of this is that `Nexus::volume_delete` was removed, as
this was not correct to call during the unwind of the disk and snapshot
create sagas: it would conflict with what those sagas were doing as part
of their unwind. The volume delete saga is still required during disk
and snapshot deletion, but there is enough context during the disk and
snapshot create sagas to properly unwind a volume, even in the case of
read-only parents. `Nexus::volume_delete` ultimately isn't safe, so it
was removed: instead, any future sagas should either embed the volume
delete saga as a sub saga, or there should be enough context in the
outputs of said future saga's nodes to properly unwind what was created.

Depends on oxidecomputer/crucible#838, which
fixes a few non-idempotency bugs in the actual Crucible agent.

Fixes oxidecomputer#3698
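
As an illustration of the asynchronous-deletion pattern this commit message describes, a minimal sketch (hypothetical trait and helper names; not omicron's actual `delete_crucible_region`):

```rust
use std::time::Duration;

/// Hypothetical agent-side region state, mirroring the asynchronous model
/// described above.
#[derive(Debug, Clone, PartialEq)]
enum RegionState {
    Requested,
    Created,
    Tombstoned,
    Destroyed,
    Failed,
}

/// Hypothetical client standing in for the Crucible agent API.
trait AgentClient {
    fn region_delete(&self, region_id: &str) -> Result<(), String>;
    fn region_state(&self, region_id: &str) -> Result<RegionState, String>;
}

/// Sketch of a `delete_crucible_region`-style helper: request deletion, then
/// poll until the agent reports the region destroyed, since the agent only
/// performs the actual work asynchronously.
fn delete_crucible_region(client: &dyn AgentClient, region_id: &str) -> Result<(), String> {
    client.region_delete(region_id)?;

    loop {
        match client.region_state(region_id)? {
            RegionState::Destroyed => return Ok(()),
            RegionState::Failed => {
                return Err(format!("region {} failed during delete", region_id))
            }
            // Work is asynchronous on the agent side; keep polling.
            _ => std::thread::sleep(Duration::from_secs(1)),
        }
    }
}
```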
@leftwo leftwo left a comment

Some questions for you

);
}

State::Destroyed | State::Failed => {
Contributor

Which state are we looking at here? The state of a downstairs somewhere?

If the state is Failed, could it later come back out of Failed and still want this snapshot?

Contributor Author

This is the stored state for a "running snapshot" (a.k.a. a read-only downstairs booted at the snapshot's mount dir; I like and don't like the "running snapshot" name, but here we are).

As far as I can tell, there's no path in the state machine to transition out of Failed for any of the resources that the agent can interact with.

agent/src/datafile.rs (resolved)
agent/src/datafile.rs (outdated; resolved)
.arg("creation")
.arg(snapshot)
.output()?;

Contributor

Do we need an `if !cmd.status.success() {` check here? Or if we fail, does `cmd_stdout` just end up with nothing in it?

Contributor Author

Error check added in dfb1358
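
For context, a hedged sketch of the kind of check being discussed (the surrounding `zfs get` invocation and flags are assumptions based on the diff fragment above; the actual change is in dfb1358):

```rust
use std::process::Command;

/// Hypothetical version of the call being discussed: read a snapshot's
/// creation time with `zfs get`, and fail loudly if the command itself
/// fails rather than silently parsing empty output.
fn snapshot_creation_time(snapshot: &str) -> Result<String, String> {
    let cmd = Command::new("zfs")
        .arg("get")
        .arg("-H")
        .arg("-o")
        .arg("value")
        .arg("-p")
        .arg("creation")
        .arg(snapshot)
        .output()
        .map_err(|e| e.to_string())?;

    if !cmd.status.success() {
        return Err(format!(
            "zfs get creation failed for {}: {}",
            snapshot,
            String::from_utf8_lossy(&cmd.stderr),
        ));
    }

    Ok(String::from_utf8_lossy(&cmd.stdout).trim().to_string())
}
```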

jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 24, 2023
@jmpesp jmpesp merged commit 0d90edc into oxidecomputer:main Jul 26, 2023
16 checks passed
@jmpesp jmpesp deleted the agent_wait_for_snapshot branch July 26, 2023 15:59
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request Aug 2, 2023
Successfully merging this pull request may close these issues.

Snapshots don't instantly mount in the dataset's .zfs/snapshot directory