Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SingleSp::set_power_state should probably be a bit more idempotent #270

Open
hawkw opened this issue Aug 23, 2024 · 0 comments · May be fixed by #271
Open

SingleSp::set_power_state should probably be a bit more idempotent #270

hawkw opened this issue Aug 23, 2024 · 0 comments · May be fixed by #271
Assignees

Comments

@hawkw
Copy link
Member

hawkw commented Aug 23, 2024

At present, if one attempts to set a SP to a power state that it's already in, one receives a 503 Service Unavailable. For instance, attempting to update Sled 14 on madrid failed when wicketd attempted to put it in A2, because the system was already in A2:

[         mupdate] [sled 14 00:00:00]   Running  7/19) Setting host power state to A2
[         mupdate] [sled 14 00:00:00]    Failed  7/19) Setting host power state to A2: after 2.38ms: updating power state failed
[         mupdate] [sled 14 00:00:00]             Caused by:
[         mupdate] [sled 14 00:00:00]             - Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "f64e6790-ccfa-43e0-a629-f1ee71d32c5b", "content-length": "233", "date": "Sun, 28 Dec 1986 00:05:40 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 14 }: Error response from SP: power state error (code 1))", request_id: "f64e6790-ccfa-43e0-a629-f1ee71d32c5b" }
[         mupdate] [sled 14 00:00:00]  Terminal Execution failed

This appears to be because the gimlet-seq task in Hubris returns an error for A2 -> A2 transitions. See this match:
https://github.com/oxidecomputer/hubris/blob/2f24dd4419634ca95742de901d18e52ff14b912b/drv/gimlet-seq-server/src/main.rs#L696-L918

gateway-sp-comms' SingleSp::set_power_state method just tries to perform the RPC, bubbling up the error from the SP without checking whether it's already in the desired state:

/// Set the current power state.
pub async fn set_power_state(&self, power_state: PowerState) -> Result<()> {
self.rpc(MgsRequest::SetPowerState(power_state))
.await
.and_then(expect_set_power_state_ack)
}

aaaaaaand finally, over in Omicron, the MGS HTTP API just bubbles up whatever error SingleSp::set_power_state returns:
https://github.com/oxidecomputer/omicron/blob/d96ea7ca8945b8ad78a53fd083850ea39789e5f0/gateway/src/http_entrypoints.rs#L623-L637

We should probably make this method succeed when the host is already in the desired state, instead. Although we could make the sequencer task in Hubris responsible for this, it seems to me that it might make more sense to do it here in MGS?

@hawkw hawkw self-assigned this Aug 23, 2024
hawkw added a commit that referenced this issue Aug 23, 2024
Presently, the `SingleSp::set_power_state` method can fail if the system
is already in the requested power state, which is a bit of a bummer.
This is because the Hubris sequencer task returns an `IllegalTransition`
error when the requested and current power states are the same.

This commit changes it to first check the power state, and not ask the
SP to do anything else if it's already in the requested state.

Closes #270
@hawkw hawkw linked a pull request Aug 23, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant