Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not 500 on snapshot delete for deleted region #1162

Merged
merged 2 commits into from
Feb 15, 2024

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Feb 14, 2024

The code path that validates a snapshot delete request checks that a region exists, and returns 404 if it does not:

match rc.context().get(&p.id) {
    Some(_) => (),
    None => {
        return Err(HttpError::for_not_found(
            None,
            format!("region {:?} not found", p.id),
        ));
    }
}

However there's a flaw in this check: if the region existed but was deleted, this will not return a 404, and proceed into delete_snapshot.

delete_snapshot will check if a running snapshot exists, and if it does will bail out: it's an invalid request to delete a snapshot for which there's a read-only downstairs running for.

However, the very next thing the code does is try to make a ZFSDataset object out of the region's data path. If the region was deleted this will lead to zfs list exiting with a non-zero return code, and will bubble up to the user as a 500.

Catch ZFSDataset returning Err, and if the region existed but was deleted, we can assume the snapshot was deleted prior to this (as the agent would not allow otherwise). If the region existed and isn't deleted, and zfs list for that region path failed, then the agent's datafile doesn't match reality.

The code path that validates a snapshot delete request checks that a
region exists, and returns 404 if it does not:

    match rc.context().get(&p.id) {
        Some(_) => (),
        None => {
            return Err(HttpError::for_not_found(
                None,
                format!("region {:?} not found", p.id),
            ));
        }
    }

However there's a flaw in this check: if the region existed but was
deleted, this will not return a 404, and proceed into `delete_snapshot`.

`delete_snapshot` will check if a running snapshot exists, and if it
does will bail out: it's an invalid request to delete a snapshot for
which there's a read-only downstairs running for.

However, the very next thing the code does is try to make a ZFSDataset
object out of the region's data path. If the region was deleted this
will lead to `zfs list` exiting with a non-zero return code, and will
bubble up to the user as a 500.

Catch ZFSDataset returning Err, and if the region existed but was
deleted, we can assume the snapshot was deleted prior to this (as the
agent would not allow otherwise). If the region existed and isn't
deleted, and `zfs list` for that region path failed, then the agent's
datafile doesn't match reality.
@jmpesp jmpesp requested a review from leftwo February 14, 2024 23:08
@@ -95,7 +98,9 @@ impl ZFSDataset {
.output()?;

if !cmd.status.success() {
bail!("zfs list failed!");
let stderr =
String::from_utf8_lossy(&cmd.stderr).trim_end().to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

@leftwo leftwo merged commit ace10f4 into oxidecomputer:main Feb 15, 2024
18 checks passed
@jmpesp jmpesp deleted the no_500_on_snapshot_delete branch February 21, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants