Make `SingleSp::set_power_state` idempotent #271

hawkw · 2024-08-23T22:15:16Z

Presently, the SingleSp::set_power_state method can fail if the system
is already in the requested power state, which is a bit of a bummer.
This is because the Hubris sequencer task returns an IllegalTransition
error when the requested and current power states are the same.

This commit changes it to first check the power state, and not ask the
SP to do anything else if it's already in the requested state.

Closes #270

Presently, the `SingleSp::set_power_state` method can fail if the system is already in the requested power state, which is a bit of a bummer. This is because the Hubris sequencer task returns an `IllegalTransition` error when the requested and current power states are the same. This commit changes it to first check the power state, and not ask the SP to do anything else if it's already in the requested state. Closes #270

hawkw · 2024-08-23T22:18:03Z

Alternatively, we could just make the Hubris sequencer handle this case, but @jclulow suggested that the current behavior made sense at the level of the sequencer...I don't have a strong opinion on this. I'd be happy to change whatever.

jclulow · 2024-08-23T23:07:02Z

I like the property that the lowest level thing will report that you asked it to do something non-sensical, and that it's up to the higher level orchestration software to get both hands around the state of the thing. If the sequencer task also ignores a transition from A2 to A2, then it will be harder to detect unexpected things happening upstack in the future.

jgallagher · 2024-08-26T15:44:47Z

gateway-sp-comms/src/single_sp.rs

+        // First, let's make sure the system isn't already *in* the desired power
+        // state --- the SP will return an error if we try to set the power
+        // state to the state it's already in.
+        if power_state == self.power_state().await? {


Do we care that there's a TOCTOU issue here (i.e., some other request flips the power state to A2 in between this check and line 665)? It looks like the SP gives back a PowerStateError(u32) where one of the u32 values presumably maps to SeqError::IllegalTransition (at least for gimlet - I didn't check sidecar).

A couple options:

Merge this as is, maybe noting the TOCTOU and that we probably (?) don't care about it.

Add a more specific error variant that the SP can return for this; then this method could issue SetPowerState directly and choose to ignore if we get the "you're already in this state" error?

We just had an error performing a mupdate in the factory that I think means we do care about this TOCTOU:

I believe what happened here was:

Sled was in A2

wicket instructed MGS to change the sled's power state to A0

MGS sends that request to the SP, then a couple things happen:

The SP does receive this request, and transitions the sled to A0

MGS did not receive a response within 2 seconds

MGS's current config is to repeat requests every 2 seconds until success or timeout, so it resent the request to go to A0 after 2 seconds

The SP rejected this request because the sled was in A0

This is based on a couple things:

The sled was in A0 when we checked by hand after this failure

The error is at exactly 2 seconds, which matches how long MGS waits before resending its request

The change in this PR would not have prevented this error, I think? Because the initial check would have passed (the sled was in A2), but then the "operation not idempotent" failure happened later. I think the way MGS does retries is relying on individual requests/packets being idempotent, so I think we probably want something like my second suggestion above (teach MGS how to tell if the error is "I'm already in this state" and then treat that as success).

IMO, the potential TOCTOU issue was an argument in favor of making this change inside the SP instead...but, I think we could alternatively make MGS handle that specific error, instead.

I guess we have 3 places where we could choose to do that:

SP sequencer task

SP control-plane-agent task

MGS

I don't have any particularly strong preference. A weak preference is that right now, the specific error code control-plane-agent sends to MGS in this case is opaque and not good to match on, so we'd need SP changes anyway. Maybe we could put swallowing this error into control-plane-agent?

If we're going to swallow it in the SP I think we should at least include a bit in the result that one could use to reason about whether the action had any effect or was for some reason a no-op.

We can do that, but I'm not sure it makes sense to in the situation outlined above? If the sled is in A2 and MGS requests a transition to A0, we'll usually get back "success and I made the change", but if that response gets lost or MGS times out or whatever and resends the request, we could get back "success and it was already in A0", even though the only reason it was already in A0 was that we actually did receive and handle the first attempt of this same request.

hawkw requested review from jgallagher and andrewjstone August 23, 2024 22:18

jgallagher reviewed Aug 26, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `SingleSp::set_power_state` idempotent #271

Make `SingleSp::set_power_state` idempotent #271

hawkw commented Aug 23, 2024

hawkw commented Aug 23, 2024

jclulow commented Aug 23, 2024

jgallagher Aug 26, 2024

jgallagher Sep 17, 2024

hawkw Sep 17, 2024

jgallagher Sep 17, 2024

jclulow Sep 17, 2024

jgallagher Sep 19, 2024

Make SingleSp::set_power_state idempotent #271

Are you sure you want to change the base?

Make SingleSp::set_power_state idempotent #271

Conversation

hawkw commented Aug 23, 2024

hawkw commented Aug 23, 2024

jclulow commented Aug 23, 2024

jgallagher Aug 26, 2024

Choose a reason for hiding this comment

jgallagher Sep 17, 2024

Choose a reason for hiding this comment

hawkw Sep 17, 2024

Choose a reason for hiding this comment

jgallagher Sep 17, 2024

Choose a reason for hiding this comment

jclulow Sep 17, 2024

Choose a reason for hiding this comment

jgallagher Sep 19, 2024

Choose a reason for hiding this comment

Make `SingleSp::set_power_state` idempotent #271

Make `SingleSp::set_power_state` idempotent #271