Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SingleSp::set_power_state idempotent #271

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Aug 23, 2024

Presently, the SingleSp::set_power_state method can fail if the system
is already in the requested power state, which is a bit of a bummer.
This is because the Hubris sequencer task returns an IllegalTransition
error when the requested and current power states are the same.

This commit changes it to first check the power state, and not ask the
SP to do anything else if it's already in the requested state.

Closes #270

Presently, the `SingleSp::set_power_state` method can fail if the system
is already in the requested power state, which is a bit of a bummer.
This is because the Hubris sequencer task returns an `IllegalTransition`
error when the requested and current power states are the same.

This commit changes it to first check the power state, and not ask the
SP to do anything else if it's already in the requested state.

Closes #270
@hawkw
Copy link
Member Author

hawkw commented Aug 23, 2024

Alternatively, we could just make the Hubris sequencer handle this case, but @jclulow suggested that the current behavior made sense at the level of the sequencer...I don't have a strong opinion on this. I'd be happy to change whatever.

@jclulow
Copy link
Contributor

jclulow commented Aug 23, 2024

I like the property that the lowest level thing will report that you asked it to do something non-sensical, and that it's up to the higher level orchestration software to get both hands around the state of the thing. If the sequencer task also ignores a transition from A2 to A2, then it will be harder to detect unexpected things happening upstack in the future.

// First, let's make sure the system isn't already *in* the desired power
// state --- the SP will return an error if we try to set the power
// state to the state it's already in.
if power_state == self.power_state().await? {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care that there's a TOCTOU issue here (i.e., some other request flips the power state to A2 in between this check and line 665)? It looks like the SP gives back a PowerStateError(u32) where one of the u32 values presumably maps to SeqError::IllegalTransition (at least for gimlet - I didn't check sidecar).

A couple options:

  • Merge this as is, maybe noting the TOCTOU and that we probably (?) don't care about it.
  • Add a more specific error variant that the SP can return for this; then this method could issue SetPowerState directly and choose to ignore if we get the "you're already in this state" error?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just had an error performing a mupdate in the factory that I think means we do care about this TOCTOU:

Screenshot from 2024-09-17 13-28-27

I believe what happened here was:

  • Sled was in A2
  • wicket instructed MGS to change the sled's power state to A0
  • MGS sends that request to the SP, then a couple things happen:
    • The SP does receive this request, and transitions the sled to A0
    • MGS did not receive a response within 2 seconds
  • MGS's current config is to repeat requests every 2 seconds until success or timeout, so it resent the request to go to A0 after 2 seconds
  • The SP rejected this request because the sled was in A0

This is based on a couple things:

  • The sled was in A0 when we checked by hand after this failure
  • The error is at exactly 2 seconds, which matches how long MGS waits before resending its request

The change in this PR would not have prevented this error, I think? Because the initial check would have passed (the sled was in A2), but then the "operation not idempotent" failure happened later. I think the way MGS does retries is relying on individual requests/packets being idempotent, so I think we probably want something like my second suggestion above (teach MGS how to tell if the error is "I'm already in this state" and then treat that as success).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the potential TOCTOU issue was an argument in favor of making this change inside the SP instead...but, I think we could alternatively make MGS handle that specific error, instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we have 3 places where we could choose to do that:

  • SP sequencer task
  • SP control-plane-agent task
  • MGS

I don't have any particularly strong preference. A weak preference is that right now, the specific error code control-plane-agent sends to MGS in this case is opaque and not good to match on, so we'd need SP changes anyway. Maybe we could put swallowing this error into control-plane-agent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to swallow it in the SP I think we should at least include a bit in the result that one could use to reason about whether the action had any effect or was for some reason a no-op.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, but I'm not sure it makes sense to in the situation outlined above? If the sled is in A2 and MGS requests a transition to A0, we'll usually get back "success and I made the change", but if that response gets lost or MGS times out or whatever and resends the request, we could get back "success and it was already in A0", even though the only reason it was already in A0 was that we actually did receive and handle the first attempt of this same request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SingleSp::set_power_state should probably be a bit more idempotent
3 participants