Add kipc::find_faulted_task. #1931

cbiffle · 2024-11-22T18:39:31Z

This resolves a four-year-old TODO in El Jefe asking for a way to process faulted tasks without making so many kipcs. The original supervisor kipc interface was, by definition, designed before we knew what we were doing. Now that we have some miles on the system, some things are clearer:

1. The supervisor doesn't use the TaskState data to make its decisions.
2. The TaskState data is pretty expensive to serialize/deserialize, and produces code containing panic sites.
3. Panic sites in the supervisor are bad, since it is not allowed to panic.

The new find_faulted_task operation can detect all N faulted tasks using N+1 kipcs, instead of one per potentially faulted task, and the request and response messages are trivial to serialize (one four-byte integer each way). This has allowed me to write (out-of-tree) "minisuper," a supervisor in 256 bytes that cannot panic.

In-tree, this has the advantage of knocking 33% off Jefe's flash size and reducing statically-analyzable max stack depth by 20%.
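
The N+1 figure above can be checked with a small host-side sketch. Everything here is a stand-in for illustration: `MockKernel`, its `faulted` table, and the call counter are hypothetical and are not the Hubris kernel or userlib API. Only the shape of the loop follows the usage example in the kipc documentation.

```rust
use std::cell::Cell;
use std::num::NonZeroUsize;

/// Hypothetical stand-in for the kernel's task table: `true` means "faulted".
struct MockKernel {
    faulted: Vec<bool>,
    /// Counts how many simulated kipcs were issued.
    kipc_count: Cell<usize>,
}

impl MockKernel {
    /// Models the operation: first faulted task at or after `start`,
    /// or `None` (zero on the wire) if there is none.
    fn find_faulted_task(&self, start: usize) -> Option<NonZeroUsize> {
        self.kipc_count.set(self.kipc_count.get() + 1);
        self.faulted
            .get(start..)? // `None` only if `start` is past len, an assumption of this mock
            .iter()
            .position(|&is_faulted| is_faulted)
            .and_then(|relative| NonZeroUsize::new(relative + start))
    }
}

/// Supervisor-style scan: resume just past each fault until none remain.
fn collect_faults(kernel: &MockKernel) -> Vec<usize> {
    let mut found = Vec::new();
    let mut next_task = 1; // skip the supervisor at index 0
    while let Some(fault) = kernel.find_faulted_task(next_task) {
        let fault = usize::from(fault);
        found.push(fault);
        next_task = fault + 1;
    }
    found
}

fn main() {
    let kernel = MockKernel {
        faulted: vec![false, false, true, false, true, false],
        kipc_count: Cell::new(0),
    };
    let faults = collect_faults(&kernel);
    // Two faulted tasks are found with three calls: one per fault plus the
    // final call that reports "none left".
    println!("faulted: {:?}, kipcs: {}", faults, kernel.kipc_count.get());
}
```

With the previous interface, the same scan would have needed one status query for every task that might have faulted, regardless of how many actually had.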

mkeeter · 2024-11-25T15:48:00Z

doc/kipc.adoc

+
+==== Preconditions
+
+The `starting_index` must be a valid index for this system.


Nit: we also allow the last valid index + 1, which is guaranteed to return 0.
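
A rough sketch of that boundary behavior, using a plain slice in place of the kernel's task table. `find_faulted` and `ScanError` are hypothetical names, and returning an error for an index past the end is an assumption of this sketch; the thread does not show how the kernel rejects such an index.

```rust
/// Hypothetical error for an index beyond "last valid index + 1".
#[derive(Debug, PartialEq)]
enum ScanError {
    IndexOutOfRange,
}

/// Returns the index of the first faulted entry at or after `starting_index`,
/// or 0 if there is none.
fn find_faulted(faulted: &[bool], starting_index: usize) -> Result<u32, ScanError> {
    // `get(n..)` yields an empty slice when n == len, and `None` only when
    // n > len, so "last valid index + 1" is accepted and finds nothing.
    let tail = faulted
        .get(starting_index..)
        .ok_or(ScanError::IndexOutOfRange)?;
    let index = tail
        .iter()
        .position(|&is_faulted| is_faulted)
        .map(|relative| relative + starting_index)
        .unwrap_or(0);
    Ok(index as u32)
}

fn main() {
    let table = [false, true, true];
    println!("{:?}", find_faulted(&table, 2)); // last valid index
    println!("{:?}", find_faulted(&table, 3)); // last valid index + 1
    println!("{:?}", find_faulted(&table, 4)); // past the allowed range
}
```

Accepting one-past-the-end is what lets a caller resume at `fault + 1` even when the last task in the table is the one that faulted.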

mkeeter · 2024-11-25T15:53:21Z

sys/kern/src/kipc.rs

+    for i in index..tasks.len() {
+        if let TaskState::Faulted { .. } = tasks[i].state() {
+            let response_len =
+                serialize_response(&mut tasks[caller], response, &(i as u32))?;
+            tasks[caller]
+                .save_mut()
+                .set_send_response_and_length(0, response_len);
+            return Ok(NextTask::Same);
+        }
+    }
+
+    let response_len =
+        serialize_response(&mut tasks[caller], response, &0_u32)?;
+    tasks[caller]
+        .save_mut()
+        .set_send_response_and_length(0, response_len);
+    Ok(NextTask::Same)


Take it or leave it, but this could be written more concisely as

Suggested change

-    for i in index..tasks.len() {
-        if let TaskState::Faulted { .. } = tasks[i].state() {
-            let response_len =
-                serialize_response(&mut tasks[caller], response, &(i as u32))?;
-            tasks[caller]
-                .save_mut()
-                .set_send_response_and_length(0, response_len);
-            return Ok(NextTask::Same);
-        }
-    }
-
-    let response_len =
-        serialize_response(&mut tasks[caller], response, &0_u32)?;
-    tasks[caller]
-        .save_mut()
-        .set_send_response_and_length(0, response_len);
-    Ok(NextTask::Same)
+    let i = tasks[index..]
+        .iter()
+        .position(|task| matches!(task.state(), TaskState::Faulted { .. }))
+        .map(|i| i + index) // relative -> global task index
+        .unwrap_or(0);
+    let response_len =
+        serialize_response(&mut tasks[caller], response, &(i as u32))?;
+    tasks[caller]
+        .save_mut()
+        .set_send_response_and_length(0, response_len);
+    return Ok(NextTask::Same);

Into it, thanks.

Rejiggered this a little to use enumerate and find, to ensure that we don't get a silly overflow check added to the i+index expression.

Heh, amusingly, that totally broke it, due to a brain-o on my part. Switched back to your method. :-)
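
The thread does not show the enumerate/find attempt or say how it broke, so the following is only an illustration of one easy way for such a rewrite to go wrong, not a reconstruction of the actual bug. `State` and all three function names are hypothetical. The first two functions return a global index; the third enumerates after slicing and so returns an index relative to the slice.

```rust
#[derive(Clone, Copy, PartialEq)]
enum State {
    Healthy,
    Faulted,
}

/// The reviewer's shape: search the tail, then convert relative -> global.
fn via_position(tasks: &[State], index: usize) -> usize {
    tasks[index..]
        .iter()
        .position(|s| *s == State::Faulted)
        .map(|i| i + index)
        .unwrap_or(0)
}

/// Enumerate the whole table first, then skip: indices stay global and no
/// addition is needed.
fn via_enumerate_skip(tasks: &[State], index: usize) -> usize {
    tasks
        .iter()
        .enumerate()
        .skip(index)
        .find(|(_, s)| **s == State::Faulted)
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Pitfall: enumerating *after* slicing restarts the count at zero, so the
/// result is relative to `index` rather than a global task index.
fn enumerate_after_slice(tasks: &[State], index: usize) -> usize {
    tasks[index..]
        .iter()
        .enumerate()
        .find(|(_, s)| **s == State::Faulted)
        .map(|(i, _)| i)
        .unwrap_or(0)
}

fn main() {
    let tasks = [State::Healthy, State::Healthy, State::Healthy, State::Faulted];
    println!("position + offset:     {}", via_position(&tasks, 2));
    println!("enumerate, then skip:  {}", via_enumerate_skip(&tasks, 2));
    println!("slice, then enumerate: {}", enumerate_after_slice(&tasks, 2));
}
```

Note that the two correct variants still differ at the edges: slicing with `tasks[index..]` panics when `index` exceeds the table length, while `skip(index)` simply yields nothing.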

hawkw

Overall, this is lovely, but it looks like the binding in userlib is actually incorrect! It's using the KIPC number for ReadTaskStatus instead of for the new IPC!

Should be easy to fix but I wanted to flag it now before this merges :)

hawkw · 2024-11-25T21:11:42Z

doc/kipc.adoc

+The "task ID or zero" return value is represented as an `Option<NonZeroUsize>`
+in the Rust API, so a typical use of this kipc looks like:
+
+[source,rust]
+----
+let mut next_task = 1; // skip supervisor
+while let Some(fault) = kipc::find_faulted_task(next_task) {
+    let fault = usize::from(fault);
+    // do things with the faulted task
+
+    next_task = fault + 1;
+}
+----


This is lovely :)
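
As a small aside on the "task ID or zero" representation described in the quoted documentation: the mapping from the raw response to `Option<NonZeroUsize>` can be sketched as below. `decode_response` is a hypothetical helper written for this illustration, not the userlib function.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

/// Hypothetical decoder for the four-byte response: zero means "no faulted
/// task found", anything else is a task index.
fn decode_response(raw: u32) -> Option<NonZeroUsize> {
    NonZeroUsize::new(raw as usize)
}

fn main() {
    assert_eq!(decode_response(0), None);
    assert_eq!(decode_response(3).map(usize::from), Some(3));

    // The `None` case reuses the all-zero bit pattern, so the Option adds
    // no space over a bare usize.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
    println!("ok");
}
```

Zero works as the sentinel because index 0 is the supervisor itself, which the loop in the documentation skips by starting at 1.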

hawkw · 2024-11-25T21:16:08Z

sys/userlib/src/kipc.rs

+    let mut response = 0_u32;
+    let (_, _) = sys_send(
+        TaskId::KERNEL,
+        Kipcnum::ReadTaskStatus as u16,


uhh...this looks like the wrong Kipcnum variant?

Suggested change

-        Kipcnum::ReadTaskStatus as u16,
+        Kipcnum::FindFaultedTask as u16,

Good catch! I've mostly tested this with exhubris where I used the right number. The perils of a fork!

sys/userlib/src/kipc.rs

hawkw

looks great!

cbiffle requested a review from mkeeter November 22, 2024 18:39

cbiffle force-pushed the cbiffle/find-faulted-task branch 2 times, most recently from 47735cb to d6a6e67 Compare November 22, 2024 23:09

mkeeter reviewed Nov 25, 2024

mkeeter approved these changes Nov 25, 2024

hawkw self-requested a review November 25, 2024 21:10

hawkw requested changes Nov 25, 2024

cbiffle force-pushed the cbiffle/find-faulted-task branch from d6a6e67 to 38771da Compare November 25, 2024 22:16

hawkw approved these changes Nov 25, 2024

cbiffle force-pushed the cbiffle/find-faulted-task branch 3 times, most recently from 45ae34f to 2b1ddde Compare November 25, 2024 22:31

cbiffle force-pushed the cbiffle/find-faulted-task branch from 2b1ddde to 123b749 Compare November 25, 2024 22:48

cbiffle enabled auto-merge (rebase) November 25, 2024 22:52

cbiffle merged commit 3408448 into oxidecomputer:master Nov 25, 2024
125 checks passed

cbiffle deleted the cbiffle/find-faulted-task branch November 25, 2024 22:58
