Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SingleSp::set_power_state idempotent #271

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions gateway-sp-comms/src/single_sp.rs
Original file line number Diff line number Diff line change
Expand Up @@ -651,6 +651,17 @@ impl SingleSp {

/// Set the current power state.
pub async fn set_power_state(&self, power_state: PowerState) -> Result<()> {
// First, let's make sure the system isn't already *in* the desired power
// state --- the SP will return an error if we try to set the power
// state to the state it's already in.
if power_state == self.power_state().await? {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care that there's a TOCTOU issue here (i.e., some other request flips the power state to A2 in between this check and line 665)? It looks like the SP gives back a PowerStateError(u32) where one of the u32 values presumably maps to SeqError::IllegalTransition (at least for gimlet - I didn't check sidecar).

A couple options:

  • Merge this as is, maybe noting the TOCTOU and that we probably (?) don't care about it.
  • Add a more specific error variant that the SP can return for this; then this method could issue SetPowerState directly and choose to ignore if we get the "you're already in this state" error?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just had an error performing a mupdate in the factory that I think means we do care about this TOCTOU:

Screenshot from 2024-09-17 13-28-27

I believe what happened here was:

  • Sled was in A2
  • wicket instructed MGS to change the sled's power state to A0
  • MGS sends that request to the SP, then a couple things happen:
    • The SP does receive this request, and transitions the sled to A0
    • MGS did not receive a response within 2 seconds
  • MGS's current config is to repeat requests every 2 seconds until success or timeout, so it resent the request to go to A0 after 2 seconds
  • The SP rejected this request because the sled was in A0

This is based on a couple things:

  • The sled was in A0 when we checked by hand after this failure
  • The error is at exactly 2 seconds, which matches how long MGS waits before resending its request

The change in this PR would not have prevented this error, I think? Because the initial check would have passed (the sled was in A2), but then the "operation not idempotent" failure happened later. I think the way MGS does retries is relying on individual requests/packets being idempotent, so I think we probably want something like my second suggestion above (teach MGS how to tell if the error is "I'm already in this state" and then treat that as success).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the potential TOCTOU issue was an argument in favor of making this change inside the SP instead...but, I think we could alternatively make MGS handle that specific error, instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we have 3 places where we could choose to do that:

  • SP sequencer task
  • SP control-plane-agent task
  • MGS

I don't have any particularly strong preference. A weak preference is that right now, the specific error code control-plane-agent sends to MGS in this case is opaque and not good to match on, so we'd need SP changes anyway. Maybe we could put swallowing this error into control-plane-agent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to swallow it in the SP I think we should at least include a bit in the result that one could use to reason about whether the action had any effect or was for some reason a no-op.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, but I'm not sure it makes sense to in the situation outlined above? If the sled is in A2 and MGS requests a transition to A0, we'll usually get back "success and I made the change", but if that response gets lost or MGS times out or whatever and resends the request, we could get back "success and it was already in A0", even though the only reason it was already in A0 was that we actually did receive and handle the first attempt of this same request.

trace!(
&self.log,
"already in the desired power state, doing nothing";
"power_state" => ?power_state,
);
return Ok(());
}
self.rpc(MgsRequest::SetPowerState(power_state))
.await
.and_then(expect_set_power_state_ack)
Expand Down
Loading