Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace a downstairs via a new VCR #863

Merged
merged 10 commits into from
Aug 10, 2023
Merged

Replace a downstairs via a new VCR #863

merged 10 commits into from
Aug 10, 2023

Conversation

leftwo
Copy link
Contributor

@leftwo leftwo commented Aug 8, 2023

This adds support for replacing a crucible downstairs from the Volume layer by submitting a current and
new VolumeConstructionRequest (VCR). If the new VCR matches the current with only a few allowed differences
(gen, a single new target) then the guest layer replace_downstairs is called with the old and new downstairs targets.

To make this code a little more readable, new structure types were created for each of the enums in the VolumeConstructionRequest, and the VolumeConstructionRequest now uses the new types.
These changes can be seen in crucible-client-types/src/lib.rs, and that's probably a good place to start.

A bunch of new tests were created to verify the old/new VCRs only differ in the required ways, and any other
differences are blocked.

The meat of the changes here are the new compare_vcr_for_replacement() call that compares two VCRs and, if it
finds them to match enough, returns the old and new target downstairs. This can then be passed on to the
existing replace_downstairs() call that will tell the upstairs to perform the replacement.

There is a fair chunk of code that is just testing all the possible differences that two VCRs might have, which is sadly
like 85% the same for each test. I'm happy to take suggestions on how to turn that into less code if possible, but
trying to keep a balance such that it is pretty clear as to what the test was doing.

Alan Hanson added 2 commits July 28, 2023 23:24
Made each field of the VolumeConstructionRequest enum a standalone
structure of itself.

Added a volume layer VCR target (downstairs) replacement method. This
will validate an old/new VCR and then call the underlying
replace_downstairs method that exists now.

Created a bunch of new tests to validate VCR differences are noticed.
@leftwo
Copy link
Contributor Author

leftwo commented Aug 9, 2023

Notes to self:
Do we need to update the volumeself? Just the target (and the gen) or the whole thing?
What does it mean to have a new generation number in the Volume, but still be active
with the old one? Okay I think.

@leftwo
Copy link
Contributor Author

leftwo commented Aug 9, 2023

Notes to self: Do we need to update the volumeself? Just the target (and the gen) or the whole thing? What does it mean to have a new generation number in the Volume, but still be active with the old one? Okay I think.

Answer to self: We don't need to update self. The Volume holds a SubVolume, which in itself does not contain
either the downstairs targets, or a generation number. All that info is passed down to the upstairs when the VCR
is translated into a Volume. As we are updating the Upstairs itself here (with the target replace), the upstairs will
have the latest info.

Note that whomever calls this VCR update, must use the new VCR going forward.

@leftwo leftwo requested a review from gjcolombo August 9, 2023 17:01
@leftwo leftwo marked this pull request as ready for review August 9, 2023 17:01
VolumeConstructionRequest::Url { .. } => {
crucible_bail!(
ReplaceRequestInvalid,
"Replacement URL VCR invalid "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

//
// On success, we record the old and new downstairs target and return
// those to the caller.
pub async fn compare_vcr_for_replacement(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A high-level thought: would it make sense to use this during live migration to ensure that two Crucible backends have been configured compatibly? (You'd have the old VCR on the source and the new VCR on the target; the source would send its VCR to the target; the target could then verify that its VCR is plausibly compatible with the one on the source.) We'd have a little work to do in Propolis to enable this but it's nothing complicated.

The reason I ask is that if this could be used this way, it may be nicer to phrase the routine name and error messages throughout in terms of "compatibility" instead of "replaceability."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be a more generic check, with a little more polish.

For a migration, the generation on the new side should be > than the old, which we check here. For a migration, the read only parent could be None on the new side, which we check here. I'm assuming we don't want to migrate and replace a downstairs at the same time (at least initailly).

To make this generic, it would have to differentiate between the "okay for migration" VCR and "ok for downstairs replacement" VCR.

I think a way to do that would be to return Ok(None) if the VCRs are the same (given the above constraints) and 'Ok(Some(old_target, new_target))` if the VCRs are the same with the exception of one target being different.

How does that sound?

upstairs/src/volume.rs Show resolved Hide resolved
upstairs/src/volume.rs Show resolved Hide resolved
&self,
original: VolumeConstructionRequest,
replacement: VolumeConstructionRequest,
log: &Logger,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the interface Propolis is going to call? It looks great, except that we don't have a logger handy in the CrucibleBackend, because the backend doesn't hang onto the logger that it passed down to the upstairs in Volume::construct. Can the top-level Volume hold onto that logger in addition to passing it down to the subvolumes so that the caller doesn't have to construct a new one here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this interface was what I was hoping Propolis could use.

I've wanted a few times to add a Logger field to the volume layer, but so far have not done so. Maybe this is the time to do that. Let me see if I can figure out how much work it is to do that. I think since we have updated Propolis to supply a logger at Volume::construct, crucible can attach it there. It's really all the test code I'll have to change :)

Copy link
Contributor Author

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments!

I've tried to answer them, and I'm going to work on some suggestions you had and I'll push up changes when I have that work done.

//
// On success, we record the old and new downstairs target and return
// those to the caller.
pub async fn compare_vcr_for_replacement(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be a more generic check, with a little more polish.

For a migration, the generation on the new side should be > than the old, which we check here. For a migration, the read only parent could be None on the new side, which we check here. I'm assuming we don't want to migrate and replace a downstairs at the same time (at least initailly).

To make this generic, it would have to differentiate between the "okay for migration" VCR and "ok for downstairs replacement" VCR.

I think a way to do that would be to return Ok(None) if the VCRs are the same (given the above constraints) and 'Ok(Some(old_target, new_target))` if the VCRs are the same with the exception of one target being different.

How does that sound?

VolumeConstructionRequest::Url { .. } => {
crucible_bail!(
ReplaceRequestInvalid,
"Replacement URL VCR invalid "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

&self,
original: VolumeConstructionRequest,
replacement: VolumeConstructionRequest,
log: &Logger,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this interface was what I was hoping Propolis could use.

I've wanted a few times to add a Logger field to the volume layer, but so far have not done so. Maybe this is the time to do that. Let me see if I can figure out how much work it is to do that. I think since we have updated Propolis to supply a logger at Volume::construct, crucible can attach it there. It's really all the test code I'll have to change :)

Copy link
Contributor Author

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've addressed all comments, this is ready for a re-review.

Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, sorry for inciting all those logging changes... One more suggestion about the comparison routines, but otherwise this is looking pretty good.

// On success, we record the old and new downstairs target and return
// those to the caller.
pub async fn compare_vcr_for_replacement(
// We return Ok(None) if requirements 1, 2, 3A, 3B are met. This would
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, since we only need to return a socket address pair if we're replacing, but it feels a little weird to me to have the caller check for the flavor of Ok it got back when it could have specified to the callee which checks to do. That is, I think the downstairs-replacement path is going to end up with something like this:

match compare_vcr_for_update(original, replacement, log)? {
    Ok(Some(_)) => todo!("okay to proceed"),
    Ok(None) => bail!("target replacement check failed"),
}

Might it make sense to have two functions? compare_vcr_for_update checks 1, 2, 3A, and 3B and returns a Result<()>; migration calls that one. compare_vcr_for_replacement calls compare_vcr_for_update and, if that succeeds, does check 3C, and if that succeeds it returns Result<(SocketAddr, SocketAddr)>. Then the downstairs replacement code can just let (old_addr, new_addr) = compare_vcr_for_replacement()?;.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge here is that, and perhaps I need to make it clear in the comment as well, is:

  • For migration, I don't want 3C, the targets must be the same.
  • For replacement, I do want 3C, the targets must be different.

The compare_vcr_for_replacement does not build on top of compare_vcr_for_migration

I think I can do it with a wrapper in front of compare_vcr_for_update, where I just hide the Ok check like you have shown above and have a migration_check and a replacement_check,
migration_check just looks for Ok(None), and replacement_check wants the Ok(Some(_))
Then I'm at least sweeping the Ok check under the rug and the callers don't have to match.

I could also make a "check everything except the sub_volume targets" function (what is common between the replace/migrate cases). If I did that I would have to make another two functions that then did the translation/selection from VolumeConstructionRequest -> VcrVolume -> SubVolume -> Target Vec, and returned the Ok/Err depending on what they wanted, which would then look something like this:

fn compare_vcr_for_migration(..) -> Result<()> {
   compare_vcr_common(..)?
   compare_vcr_subtarget_the_same()
}

and

fn compare_vcr_for_migration(..) -> Result<(SocketAddr, SocketAddr)> {
   compare_vcr_common(..)?
   compare_vcr_subtarget_one_different()
}

Or, maybe I make a give_me_the_subvolume_target_from_this_vcr function, then compare the two targets after I have done the compare_vcr_common, but.. I don't know what ends up being the better solution here. I kind of like just hiding the Ok check.

Let me know if you feel strongly about either of these two solutions (or some third idea). I do think it's worth updating this to be a little better and having two functions to call depending on what the caller wants, but I need a little direction on which path you think might be better here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and tried option A (the wrapper around compare_vcr_for_update)
Let me know what you think.

Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it!

@leftwo leftwo merged commit b5949e1 into main Aug 10, 2023
16 checks passed
@leftwo leftwo deleted the alan/wip-for-vcr branch August 10, 2023 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants