Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve context switch / MPU loading performance #1960

Merged
merged 2 commits into from
Dec 19, 2024

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Dec 17, 2024

After discovering that MPU loading was a significant fraction of the time we were spending in the kernel, I took a pass over it. The primary changes here are:

  1. The existing functions we use to compute the RASR/RBAR/RLAR and other MPU registers, all of which sound suspiciously like a mumbling tyrannosaur, have been made const fn so they can be evaluated at compile time.
  2. The build system now generates code for the region tables that call these const fns, so that the MPU code is no longer being evaluated at runtime.
  3. In support of this, the RegionDesc internal kernel structure has grown a new field to hold unspecified architecture specific data, which for our current architectures happens to be a pair of words to be loaded into the MPU (RASR/RBAR or RBAR/RLAR depending on tyrannosaur version).
  4. Finally, the actual code for loading the MPU was underperforming on v6m/v7m. I simplified it. Now we turn off the MPU before loading just like we do on v8m, and don't have to do multiple writes with bit manipulation anymore.

@cbiffle
Copy link
Collaborator Author

cbiffle commented Dec 17, 2024

Looks like I get to run around adjusting all the kernel size allocations. I'll push a commit for that here in a bit.

Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments but LGTM

#[cfg(armv8m)]
#[derive(Copy, Clone, Debug, Serialize, Deserialize)]
pub struct RegionDescExt {
rlar_disabled: u32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add add comment explaining this is the RLAR with everything except the enable bit set? It took me a bit to figure out that was what rlar_disabled meant

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had to go convince myself that we need rlar_disabled so I could explain it ... I think we actually don't, but I'll leave it in in the interest of not breaking things. I will add an explanatory comment and a TODO.

Comment on lines +462 to +467
// Turn off the MPU.
//
// Safety: this has no actual memory safety implications, except for
// potentially exposing the kernel to a NULL dereference that succeeds.
unsafe {
mpu.ctrl.write(0);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish ARMv7m gave a sample sequence for the MPU. IIRC the reason we turn off/on on v8m is because that was the sequence specified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed.

For those following along, @labbott did manage to track down this: https://github.com/ARM-software/m-profile-user-guide-examples/tree/main/Memory_model/rtos_context_switch

That is, unfortunately, wrong, and were we to follow that example we would recreate the crash previously fixed by #1193.

@@ -73,6 +73,7 @@
use core::arch::{self, global_asm};
use core::sync::atomic::{AtomicBool, AtomicPtr, AtomicU32, Ordering};

use serde::{Deserialize, Serialize};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a little weird to be bringing these into the kernel, mostly because those seem much higher level. I don't have a better idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me remind myself where this is actually used nowadays, maybe we can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it's required here because the (cross-platform) RegionDesc in kern::descs derives Serialize and Deserialize, and the RegionDescExt struct must also to not break that. I don't actually see any obvious places where we serialize or deserialize RegionDescs --- I wonder if this is a historical artifact. Did we once use these types in the build code when deserializing an application's config file? It looks like we now use a different type in build_kconfig:

/// Description of one memory region.
///
/// A memory region spans a range of physical addresses, and applies access
/// permissions to whatever lies in that range. Despite our use of the term
/// "memory" here, the region may not describe RAM -- ROM and memory-mapped
/// peripherals are also described by memory regions.
///
/// A memory region can be used by multiple tasks. This is mostly used to have
/// tasks share a no-access region (often index 0) in unused region slots, but
/// you could also use it for shared peripheral or RAM access.
#[derive(Copy, Clone, Debug, Serialize, Deserialize)]
pub struct RegionConfig {
/// Address of start of region. The platform likely has alignment
/// requirements for this; it must meet them. (For example, on ARMv7-M, it
/// must be naturally aligned for the size.)
pub base: u32,
/// Size of region, in bytes. The platform likely has alignment requirements
/// for this; it must meet them. (For example, on ARMv7-M, it must be a
/// power of two greater than 16.)
pub size: u32,
/// Flags describing what can be done with this region.
pub attributes: RegionAttributes,
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news! I can remove it, and will push the commit shortly.

// else but this routine. Here we must convert between the two -- and
// quickly, because this is called on every context switch.
//
// The image-generation tools check at build time that region sizes are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth unifying where this check occurs later to remove some code from big scary xtask

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could actually move the checks into this function, now that we expect it to be evaluated only in const context.....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We could actually do the check from the const fn nowadays if we wanted. Followup maybe.

@aapoalas
Copy link
Contributor

Just to bring visibility, I took a look at splitting the RegionDescExt data being added here away from the existing RegionDesc data, and I think I got a pretty nice solution out of it in the end. I opened it as a PR against this branch in Cliff's fork: cbiffle#1 (It seemed like a good idea at the time.)

I'm not convinced it is the best choice, but I'm fairly convinced it's not an asinine choice. It may not be worth bringing into this PR, but maybe it would be worth a follow-up?

Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cbiffle cbiffle enabled auto-merge (rebase) December 18, 2024 23:53
@cbiffle cbiffle force-pushed the cbiffle/mpu-perf branch 2 times, most recently from 1ea8043 to 1dbd28c Compare December 19, 2024 00:08
@cbiffle
Copy link
Collaborator Author

cbiffle commented Dec 19, 2024

Just to bring visibility, I took a look at splitting the RegionDescExt data being added here away from the existing RegionDesc data, and I think I got a pretty nice solution out of it in the end. I opened it as a PR against this branch in Cliff's fork: cbiffle#1 (It seemed like a good idea at the time.)

Neat, I'm happy to take a look at that @aapoalas. It may be a little while, schedules are starting to get weird due to the encroaching holidays. But I can test the cache effects of that pretty easily, I think.

This reduces the cost of the `apply_memory_protection` phase of context
switches by computing the descriptors in a const fn and sneaking them
into the RegionDesc struct via a new field.

This causes MPU load on ARMv7-M (Cortex-M4) to finish in 45% the time it
previously took, which in turn knocks about 266 cycles off IPC-related
syscalls.
This alters the v6/v7 routine to switch off the MPU before updating it,
and switch it back on after, like v8 does. This has allowed me to use
cheaper code to do the update. That, combined with some field ordering
changes to optimize for access order in the generated code, cuts the
cycles spent in this routine by roughly half.
@cbiffle cbiffle merged commit bd35d96 into oxidecomputer:master Dec 19, 2024
125 checks passed
@cbiffle cbiffle deleted the cbiffle/mpu-perf branch December 19, 2024 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants