Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kipc::find_faulted_task. #1931

Merged
merged 1 commit into from
Nov 25, 2024

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Nov 22, 2024

This resolves a four-year-old TODO in El Jefe asking for a way to process faulted tasks without making so many kipcs. The original supervisor kipc interface was, by definition, designed before we knew what we were doing. Now that we have some miles on the system, some things are more clear:

  1. The supervisor doesn't use the TaskState data to make its decisions.
  2. The TaskState data is pretty expensive to serialize/deserialize, and produces code containing panic sites.
  3. Panic sites in the supervisor are bad, since it is not allowed to panic.

The new find_faulted_task operation can detect all N faulted tasks using N+1 kipcs, instead of one per potentially faulted task, and the request and response messages are trivial to serialize (one four-byte integer each way). This has allowed me to write (out-of-tree) "minisuper," a supervisor in 256 bytes that cannot panic.

In-tree, this has the advantage of knocking 33% off Jefe's flash size and reducing statically-analyzable max stack depth by 20%.

@cbiffle cbiffle requested a review from mkeeter November 22, 2024 18:39
@cbiffle cbiffle force-pushed the cbiffle/find-faulted-task branch 2 times, most recently from 47735cb to d6a6e67 Compare November 22, 2024 23:09
doc/kipc.adoc Outdated

==== Preconditions

The `starting_index` must be a valid index for this system.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we also allow the last valid index + 1, which is guaranteed to return 0.

Comment on lines 494 to 504
for i in index..tasks.len() {
if let TaskState::Faulted { .. } = tasks[i].state() {
let response_len =
serialize_response(&mut tasks[caller], response, &(i as u32))?;
tasks[caller]
.save_mut()
.set_send_response_and_length(0, response_len);
return Ok(NextTask::Same);
}
}

let response_len =
serialize_response(&mut tasks[caller], response, &0_u32)?;
tasks[caller]
.save_mut()
.set_send_response_and_length(0, response_len);
Ok(NextTask::Same)
Copy link
Collaborator

@mkeeter mkeeter Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take it or leave it, but this could be written more concisely as

Suggested change
for i in index..tasks.len() {
if let TaskState::Faulted { .. } = tasks[i].state() {
let response_len =
serialize_response(&mut tasks[caller], response, &(i as u32))?;
tasks[caller]
.save_mut()
.set_send_response_and_length(0, response_len);
return Ok(NextTask::Same);
}
}
let response_len =
serialize_response(&mut tasks[caller], response, &0_u32)?;
tasks[caller]
.save_mut()
.set_send_response_and_length(0, response_len);
Ok(NextTask::Same)
let i = tasks[index..]
.iter()
.position(|task| matches!(task.state(), TaskState::Faulted { .. }))
.map(|i| i + index) // relative -> global task index
.unwrap_or(0);
let response_len =
serialize_response(&mut tasks[caller], response, &(i as u32))?;
tasks[caller]
.save_mut()
.set_send_response_and_length(0, response_len);
return Ok(NextTask::Same);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Into it, thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rejiggered this a little to use enumerate and find, to ensure that we don't get a silly overflow check added to the i+index expression.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, amusingly, that totally broke it, due to a brain-o on my part. Switched back to your method. :-)

@hawkw hawkw self-requested a review November 25, 2024 21:10
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is lovely, but it looks like the binding in userlib is actually incorrect! It's using the KIPC number for ReadTaskStatus instead of for the new IPC!

Should be easy to fix but I wanted to flag it now before this merges :)

Comment on lines +335 to +354
The "task ID or zero" return value is represented as an `Option<NonZeroUsize>`
in the Rust API, so a typical use of this kipc looks like:

[source,rust]
----
let mut next_task = 1; // skip supervisor
while let Some(fault) = kipc::find_faulted_task(next_task) {
let fault = usize::from(fault);
// do things with the faulted task

next_task = fault + 1;
}
----
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is lovely :)

let mut response = 0_u32;
let (_, _) = sys_send(
TaskId::KERNEL,
Kipcnum::ReadTaskStatus as u16,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhh...this looks like the wrong Kipcnum variant?

Suggested change
Kipcnum::ReadTaskStatus as u16,
Kipcnum::FindFaultedTask as u16,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've mostly tested this with exhubris where I used the right number. The perils of a fork!

sys/userlib/src/kipc.rs Show resolved Hide resolved
@cbiffle cbiffle force-pushed the cbiffle/find-faulted-task branch from d6a6e67 to 38771da Compare November 25, 2024 22:16
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@cbiffle cbiffle force-pushed the cbiffle/find-faulted-task branch 3 times, most recently from 45ae34f to 2b1ddde Compare November 25, 2024 22:31
This resolves a four-year-old TODO in El Jefe asking for a way to
process faulted tasks without making so many kipcs. The original
supervisor kipc interface was, by definition, designed before we knew
what we were doing. Now that we have some miles on the system, some
things are more clear:

1. The supervisor doesn't use the TaskState data to make its decisions.
2. The TaskState data is pretty expensive to serialize/deserialize, and
   produces code containing panic sites.
3. Panic sites in the supervisor are bad, since it is not allowed to
   panic.

The new find_faulted_task operation can detect all N faulted tasks using
N+1 kipcs, instead of one per potentially faulted task, and the request
and response messages are trivial to serialize (one four-byte integer
each way). This has allowed me to write (out-of-tree) "minisuper," a
supervisor in 256 bytes that cannot panic.

In-tree, this has the advantage of knocking 33% off Jefe's flash size
and reducing statically-analyzable max stack depth by 20%.
@cbiffle cbiffle force-pushed the cbiffle/find-faulted-task branch from 2b1ddde to 123b749 Compare November 25, 2024 22:48
@cbiffle cbiffle enabled auto-merge (rebase) November 25, 2024 22:52
@cbiffle cbiffle merged commit 3408448 into oxidecomputer:master Nov 25, 2024
125 checks passed
@cbiffle cbiffle deleted the cbiffle/find-faulted-task branch November 25, 2024 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants