-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SingleSp::set_power_state
idempotent
#271
base: main
Are you sure you want to change the base?
Conversation
Presently, the `SingleSp::set_power_state` method can fail if the system is already in the requested power state, which is a bit of a bummer. This is because the Hubris sequencer task returns an `IllegalTransition` error when the requested and current power states are the same. This commit changes it to first check the power state, and not ask the SP to do anything else if it's already in the requested state. Closes #270
Alternatively, we could just make the Hubris sequencer handle this case, but @jclulow suggested that the current behavior made sense at the level of the sequencer...I don't have a strong opinion on this. I'd be happy to change whatever. |
I like the property that the lowest level thing will report that you asked it to do something non-sensical, and that it's up to the higher level orchestration software to get both hands around the state of the thing. If the sequencer task also ignores a transition from A2 to A2, then it will be harder to detect unexpected things happening upstack in the future. |
// First, let's make sure the system isn't already *in* the desired power | ||
// state --- the SP will return an error if we try to set the power | ||
// state to the state it's already in. | ||
if power_state == self.power_state().await? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care that there's a TOCTOU issue here (i.e., some other request flips the power state to A2 in between this check and line 665)? It looks like the SP gives back a PowerStateError(u32)
where one of the u32 values presumably maps to SeqError::IllegalTransition
(at least for gimlet - I didn't check sidecar).
A couple options:
- Merge this as is, maybe noting the TOCTOU and that we probably (?) don't care about it.
- Add a more specific error variant that the SP can return for this; then this method could issue
SetPowerState
directly and choose to ignore if we get the "you're already in this state" error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just had an error performing a mupdate in the factory that I think means we do care about this TOCTOU:
I believe what happened here was:
- Sled was in A2
- wicket instructed MGS to change the sled's power state to A0
- MGS sends that request to the SP, then a couple things happen:
- The SP does receive this request, and transitions the sled to A0
- MGS did not receive a response within 2 seconds
- MGS's current config is to repeat requests every 2 seconds until success or timeout, so it resent the request to go to A0 after 2 seconds
- The SP rejected this request because the sled was in A0
This is based on a couple things:
- The sled was in A0 when we checked by hand after this failure
- The error is at exactly 2 seconds, which matches how long MGS waits before resending its request
The change in this PR would not have prevented this error, I think? Because the initial check would have passed (the sled was in A2), but then the "operation not idempotent" failure happened later. I think the way MGS does retries is relying on individual requests/packets being idempotent, so I think we probably want something like my second suggestion above (teach MGS how to tell if the error is "I'm already in this state" and then treat that as success).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, the potential TOCTOU issue was an argument in favor of making this change inside the SP instead...but, I think we could alternatively make MGS handle that specific error, instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have 3 places where we could choose to do that:
- SP sequencer task
- SP control-plane-agent task
- MGS
I don't have any particularly strong preference. A weak preference is that right now, the specific error code control-plane-agent sends to MGS in this case is opaque and not good to match on, so we'd need SP changes anyway. Maybe we could put swallowing this error into control-plane-agent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to swallow it in the SP I think we should at least include a bit in the result that one could use to reason about whether the action had any effect or was for some reason a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that, but I'm not sure it makes sense to in the situation outlined above? If the sled is in A2 and MGS requests a transition to A0, we'll usually get back "success and I made the change", but if that response gets lost or MGS times out or whatever and resends the request, we could get back "success and it was already in A0", even though the only reason it was already in A0 was that we actually did receive and handle the first attempt of this same request.
Presently, the
SingleSp::set_power_state
method can fail if the systemis already in the requested power state, which is a bit of a bummer.
This is because the Hubris sequencer task returns an
IllegalTransition
error when the requested and current power states are the same.
This commit changes it to first check the power state, and not ask the
SP to do anything else if it's already in the requested state.
Closes #270