Skip to content

Commit

Permalink
gimlet-seq: Record why the power state changed (#1953)
Browse files Browse the repository at this point in the history
Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about *why* the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.

This commit resolves this as described in #1950 by adding a new field to
the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so
that the reason a power state change occurred can be recorded. A new IPC
function, `Sequencer.set_state_with_reason`, is added to the `cpu_seq`
IPC API. This is equivalent to `Sequencer.set_state` but with the
addition of a `StateChangeReason` argument in addition to the desired
power state, and the sequencer task will record the provided reason in
its ringbuffer. This way, we can distinguish between the various reasons
a power state change may have occurred when debugging such issues.

All Hubris-internal callers of `Sequencer.set_state` are updated to
instead use `Sequencer.set_state_with_reason`. In particular,
`host-sp-comms` will record a variety of different `StateChangeReason`s,
allowing us to indicate whether the host requested a normal
power-off/reboot, the host OS panicked or failed to boot, or the host
CPU reset itself. Other callers like `control-plane-agent` and `thermal`
are simpler and just say "it was the control plane" or "overheat",
respectively. For backwards compatibility with existing callers of
`Sequencer.set_state` via `hiffy`, the `set_state` IPC is left as-is,
and will be recorded in the ringbuffer with `StateChangeReason::Other`.
Since all Hubris tasks now use the new API, `Other` *basically* just
means `hiffy`.

The `StateChangeReason` enum also generates counters, so that the total
number of power state changes can be tracked.

Also, while I was here, I've changed the `Trace::SetState` entry in the
`drv-gimlet-seq-server` ringbuf from a tuple-like enum variant to a
struct-like enum variant with named fields. This entry includes two
`PowerState` fields, one recording the previous power state and the
other recording the new power state that has been set. IMHO, using a
tuple-like variant to represent this is a bit unfortunate, as in
Humility, we'll see two values of the same type and it's not immediately
obvious which is the previous state and which is the new state. This
must be determined based on the order of the fields in the ringbuf
entry, which requires referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two `PowerState`s
is actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.

Fixes #1950
  • Loading branch information
hawkw authored Dec 19, 2024
1 parent bd35d96 commit 11da9c0
Show file tree
Hide file tree
Showing 8 changed files with 196 additions and 50 deletions.
31 changes: 31 additions & 0 deletions drv/cpu-seq-api/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
use counters::Count;
use derive_idol_err::IdolError;
use userlib::{sys_send, FromPrimitive};
use zerocopy::AsBytes;

// Re-export PowerState for client convenience.
pub use drv_cpu_power_state::PowerState;
Expand All @@ -31,6 +32,36 @@ pub enum SeqError {
ServerRestarted,
}

#[derive(Copy, Clone, Debug, FromPrimitive, Eq, PartialEq, AsBytes, Count)]
#[repr(u8)]
pub enum StateChangeReason {
/// No reason was provided.
///
/// This indicates a legacy caller of `Sequencer.set_state`, rather than
/// `Sequencer.set_state_with_reason`. All Hubris-internal callers should
/// use `set_state_with_reason`, so this variant generally indicates that
/// the `Sequencer.set_state` IPC is being called via Hiffy.
Other = 1,
/// The system has just received power, so the sequencer has booted the
/// host CPU.
InitialPowerOn,
/// A power state change was requested by the control plane.
ControlPlane,
/// The host CPU reset while in A0, so the system has powered off to clear
/// hidden core state.
CpuReset,
/// The host OS failed to boot, so the system has powered off.
HostBootFailure,
/// The host OS panicked.
HostPanic,
/// The host OS requested that the system power off without rebooting.
HostPowerOff,
/// The host OS requested that the system reboot.
HostReboot,
/// The system powered off because a component has overheated.
Overheat,
}

// On Gimlet, we have two banks of up to 8 DIMMs apiece. Export the "two banks"
// bit of knowledge here so it can be used by gimlet-seq-server, spd, and
// packrat, all of which want to know at compile-time how many banks there are.
Expand Down
39 changes: 33 additions & 6 deletions drv/gimlet-seq-server/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ use userlib::{
sys_set_timer, task_slot, units, RecvMessage, TaskId, UnwrapLite,
};

use drv_cpu_seq_api::{PowerState, SeqError};
use drv_cpu_seq_api::{PowerState, SeqError, StateChangeReason};
use drv_hf_api as hf_api;
use drv_i2c_api as i2c;
use drv_ice40_spi_program as ice40;
Expand Down Expand Up @@ -90,7 +90,13 @@ enum Trace {
RailsOn,
UartEnabled,
A0(u16),
SetState(PowerState, PowerState, u64),
SetState {
prev: PowerState,
next: PowerState,
#[count(children)]
why: StateChangeReason,
now: u64,
},
UpdateState(#[count(children)] PowerState),
ClockConfigWrite,
ClockConfigSuccess,
Expand Down Expand Up @@ -482,7 +488,10 @@ impl<S: SpiServer + Clone> ServerImpl<S> {

// Power on, unless suppressed by the `stay-in-a2` feature
if !cfg!(feature = "stay-in-a2") {
_ = server.set_state_internal(PowerState::A0);
_ = server.set_state_internal(
PowerState::A0,
StateChangeReason::InitialPowerOn,
);
}

//
Expand Down Expand Up @@ -666,11 +675,17 @@ impl<S: SpiServer> ServerImpl<S> {
fn set_state_internal(
&mut self,
state: PowerState,
why: StateChangeReason,
) -> Result<(), SeqError> {
let sys = sys_api::Sys::from(SYS.get_task_id());

let now = sys_get_timer().now;
ringbuf_entry!(Trace::SetState(self.state, state, now));
ringbuf_entry!(Trace::SetState {
prev: self.state,
next: state,
why,
now
});

ringbuf_entry_v3p3_sys_a0_vout();

Expand Down Expand Up @@ -813,6 +828,7 @@ impl<S: SpiServer> ServerImpl<S> {

//
// Now wait for the end of Group C.

//
loop {
let mut status = [0u8];
Expand Down Expand Up @@ -1033,7 +1049,18 @@ impl<S: SpiServer> idl::InOrderSequencerImpl for ServerImpl<S> {
_: &RecvMessage,
state: PowerState,
) -> Result<(), RequestError<SeqError>> {
self.set_state_internal(state).map_err(RequestError::from)
self.set_state_internal(state, StateChangeReason::Other)
.map_err(RequestError::from)
}

fn set_state_with_reason(
&mut self,
_: &RecvMessage,
state: PowerState,
reason: StateChangeReason,
) -> Result<(), RequestError<SeqError>> {
self.set_state_internal(state, reason)
.map_err(RequestError::from)
}

fn send_hardware_nmi(
Expand Down Expand Up @@ -1403,7 +1430,7 @@ cfg_if::cfg_if! {
}

mod idl {
use super::SeqError;
use super::{SeqError, StateChangeReason};

include!(concat!(env!("OUT_DIR"), "/server_stub.rs"));
}
Expand Down
44 changes: 30 additions & 14 deletions drv/grapefruit-seq-server/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
#![no_std]
#![no_main]

use drv_cpu_seq_api::PowerState;
use drv_cpu_seq_api::{PowerState, StateChangeReason};
use drv_spi_api::{SpiDevice, SpiServer};
use drv_stm32xx_sys_api as sys_api;
use idol_runtime::{NotificationHandler, RequestError};
Expand All @@ -29,6 +29,7 @@ enum Trace {
ContinueBitstreamLoad(usize),
WaitForDone,
Programmed,

#[count(skip)]
None,
}
Expand Down Expand Up @@ -268,6 +269,20 @@ impl<S: SpiServer + Clone> ServerImpl<S> {
fn set_state_impl(&self, state: PowerState) {
self.jefe.set_state(state as u32);
}

fn validate_state_change(
&self,
state: PowerState,
) -> Result<(), drv_cpu_seq_api::SeqError> {
match (self.get_state_impl(), state) {
(PowerState::A2, PowerState::A0)
| (PowerState::A0, PowerState::A2)
| (PowerState::A0PlusHP, PowerState::A2)
| (PowerState::A0Thermtrip, PowerState::A2) => Ok(()),

_ => Err(drv_cpu_seq_api::SeqError::IllegalTransition),
}
}
}

// The `Sequencer` implementation for Grapefruit is copied from
Expand All @@ -286,19 +301,20 @@ impl<S: SpiServer + Clone> idl::InOrderSequencerImpl for ServerImpl<S> {
_: &RecvMessage,
state: PowerState,
) -> Result<(), RequestError<drv_cpu_seq_api::SeqError>> {
match (self.get_state_impl(), state) {
(PowerState::A2, PowerState::A0)
| (PowerState::A0, PowerState::A2)
| (PowerState::A0PlusHP, PowerState::A2)
| (PowerState::A0Thermtrip, PowerState::A2) => {
self.set_state_impl(state);
Ok(())
}
self.validate_state_change(state)?;
self.set_state_impl(state);
Ok(())
}

_ => Err(RequestError::Runtime(
drv_cpu_seq_api::SeqError::IllegalTransition,
)),
}
fn set_state_with_reason(
&mut self,
_: &RecvMessage,
state: PowerState,
_: StateChangeReason,
) -> Result<(), RequestError<drv_cpu_seq_api::SeqError>> {
self.validate_state_change(state)?;
self.set_state_impl(state);
Ok(())
}

fn send_hardware_nmi(
Expand Down Expand Up @@ -327,7 +343,7 @@ impl<S: SpiServer> NotificationHandler for ServerImpl<S> {
}

mod idl {
use drv_cpu_seq_api::SeqError;
use drv_cpu_seq_api::{SeqError, StateChangeReason};
include!(concat!(env!("OUT_DIR"), "/server_stub.rs"));
}

Expand Down
38 changes: 26 additions & 12 deletions drv/mock-gimlet-seq-server/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
#![no_std]
#![no_main]

use drv_cpu_seq_api::{PowerState, SeqError};
use drv_cpu_seq_api::{PowerState, SeqError, StateChangeReason};
use idol_runtime::{NotificationHandler, RequestError};
use task_jefe_api::Jefe;
use userlib::{FromPrimitive, RecvMessage, UnwrapLite};
Expand Down Expand Up @@ -44,6 +44,17 @@ impl ServerImpl {
fn set_state_impl(&self, state: PowerState) {
self.jefe.set_state(state as u32);
}

fn validate_state_change(&self, state: PowerState) -> Result<(), SeqError> {
match (self.get_state_impl(), state) {
(PowerState::A2, PowerState::A0)
| (PowerState::A0, PowerState::A2)
| (PowerState::A0PlusHP, PowerState::A2)
| (PowerState::A0Thermtrip, PowerState::A2) => Ok(()),

_ => Err(SeqError::IllegalTransition),
}
}
}

impl idl::InOrderSequencerImpl for ServerImpl {
Expand All @@ -59,17 +70,20 @@ impl idl::InOrderSequencerImpl for ServerImpl {
_: &RecvMessage,
state: PowerState,
) -> Result<(), RequestError<SeqError>> {
match (self.get_state_impl(), state) {
(PowerState::A2, PowerState::A0)
| (PowerState::A0, PowerState::A2)
| (PowerState::A0PlusHP, PowerState::A2)
| (PowerState::A0Thermtrip, PowerState::A2) => {
self.set_state_impl(state);
Ok(())
}
self.validate_state_change(state)?;
self.set_state_impl(state);
Ok(())
}

_ => Err(RequestError::Runtime(SeqError::IllegalTransition)),
}
fn set_state_with_reason(
&mut self,
_: &RecvMessage,
state: PowerState,
_: StateChangeReason,
) -> Result<(), RequestError<SeqError>> {
self.validate_state_change(state)?;
self.set_state_impl(state);
Ok(())
}

fn send_hardware_nmi(
Expand Down Expand Up @@ -99,7 +113,7 @@ impl NotificationHandler for ServerImpl {
}

mod idl {
use super::SeqError;
use super::{SeqError, StateChangeReason};

include!(concat!(env!("OUT_DIR"), "/server_stub.rs"));
}
19 changes: 18 additions & 1 deletion idl/cpu-seq.idol
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,29 @@ Interface(
idempotent: true,
),
"set_state": (
doc: "Set the power state without providing a reason (legacy).",
args: {
"state": (
type: "drv_cpu_power_state::PowerState",
recv: FromPrimitive("u8"),
),
},
reply: Result(
ok: "()",
err: CLike("SeqError"),
),
),
"set_state_with_reason": (
doc: "Set the power state",
args: {
"state": (
type: "drv_cpu_power_state::PowerState",
recv: FromPrimitive("u8"),
)
),
"reason": (
type: "StateChangeReason",
recv: FromPrimitive("u8"),
),
},
reply: Result(
ok: "()",
Expand Down
5 changes: 4 additions & 1 deletion task/control-plane-agent/src/mgs_compute_sled.rs
Original file line number Diff line number Diff line change
Expand Up @@ -728,7 +728,10 @@ impl SpHandler for MgsHandler {
};

self.sequencer
.set_state(power_state)
.set_state_with_reason(
power_state,
drv_cpu_seq_api::StateChangeReason::ControlPlane,
)
.map_err(|e| SpError::PowerStateError(e as u32))
}

Expand Down
Loading

0 comments on commit 11da9c0

Please sign in to comment.