-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Improved State Machine Codegen #3720
base: master
Are you sure you want to change the base?
Conversation
70ae186
to
72ce3d9
Compare
72ce3d9
to
afac5d4
Compare
In the given example, Concerning the state machines implementation mentioned as a motivation, wouldn't that also be covered as part of the explicit tail call RFC? Both RFC propose to change the language syntax, but the "explicit tail call RFC" seems more generic as it covers any case of tail call optimization. However, it requires the state machine to be in a standalone function, or at the beginning of a function. |
Pre-RFC discussion: Zulip |
You're right, in this case the break is there because I needed some way to highlight that
The section on guaranteed tail calls has some detail on why tail calls are useful (rust should have them!) but not always the best solution:
Also in terms of ergonomics labeled match and guaranteed tail calls make very different tradeoffs, that work well in some cases and less well in others. |
I'll be honest, I really don't like this. Having a match that can implicitly act like a loop is very confusing, although I understand the desire to simplify this kind of construct. I would much prefer if this were some kind of dedicated construct like And just adding onto this: my main issue is that any statement should be immediately obvious whether it loops. So, seeing a large |
I agree that "a match can be a loop now" is the most unintuitive part of this proposal. I think any proposal tackling this problem will not be entirely obvious at first though. This feature is unlikely to show up early on in a beginner's journey (e.g. how often do we really see a labeled block in the wild?). The requirement of a label on both the The RFC specifically uses labeled loop/block syntax and intuition to make the feature as non-invasive as possible. Expanding the language further is something I'd like to avoid if possible, but I'm open to it if the lang team prefers that route. |
See, the thing is that labelled blocks can't loop. The only thing that loops are loops. I get not wanting to "change the language too much" but I think that subtle syntax with massive changes in functionality are a much bigger change than obvious syntax. It requires people to expand their mental model substantially more when reading Rust code, whereas something like Plus, it's very easy to imagine making |
I agree with @clarfonthey that this syntax looks unecessary, including that Yet if folks want this then the systax should be
|
I am not so sure this actually proposes to just add labeled matches. I think it proposes much more: to ascribe a very specific evaluation order to patterns. Otherwise I don't think it works, because without the order being very strict, I don't think the jump-to-block-like control flow works correctly. And currently we don't have a well-specified order: rust-lang/reference#1665 @folkertdev "This makes the language more complex" is "free space" in terms of drawbacks. You may wish to include, rather, that it is possible that adopting this proposal may require us to forfeit possible ways to evaluate patterns that may be more optimal in some cases. Especially ones that are not zlib's use-case. In other words, there is not yet evidence that this cannot negatively impact the optimization of programs that don't use this feature. |
I continue to think that adding labels in more places with special behaviour like this is overall bad. Rust syntax is best when there's a "here's what's coming" warning up-front, which is why it's good that If the goal is
then my suggestion is still that we should have a dedicated state machine construct, not try to force it into a
and thus there's no need to have particularly-terse notation for it. Spitballing: what about some builtin macro like
where the blocks can, like EDIT: Aha, finally found one of the recent zulip threads about this, https://rust-lang.zulipchat.com/#narrow/channel/213817-t-lang/topic/Fallthrough.20in.20Match.20Statements/near/474070351 (That example is an uninteresting use of the feature, but it's the same as the one in the RFC summary.) Then it's clear up-front that this is weird stuff that you have to read differently from normal blocks, there's a obvious wall we can use to say which labels you're allowed to use for this, etc. And there's no I think I'm basically echoing what @clarfonthey said above, so I'll repeat it for emphasis
New syntax is fine. We have identifier space reserved to do it non-breakingly in current editions, then change to something else later if needed. But if it's rare, we don't necessarily even need to do that. So I completely disagree with the RFC's phrasing of
as an advantage. It's much easier to learn something with a EDIT: For clarity, I do think something in this space would be a good thing for Rust to have. I just think that |
text/3720-labeled-match.md
Outdated
|
||
The labeled match proposal combines existing rust features of `match` and labeled blocks/loops. It is just the interaction between these concepts that has to be learned, no new keywords or syntactic constructions are needed. Occurrences of labeled match will be rare, and true beginners are unlikely to encounter them early on. | ||
|
||
Labeled match does not introduce arbitrary control flow (like general `goto`) or surprising implicit control flow (like `switch` fallthrough in C and descendants). The mechanism fits nicely into how rust works today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does introduce irreducible control flow though, right? Which is otherwise impossible in surface rust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s possible to semantically understand the construct as a reducible CFG similar to loop { match { … } }
, and not just in the trivial sense that any CFG can technically be made reducible. This would give a reasonable way to apply static analyses that are more naturally expressed in terms of structured control flow (with some loss of precision wrt possible paths through the CFG). If this is not the case, then that’s another drawback in my opinion.
However, it’s true that the sensible MIR lowering for achieving the desired codegen will usually be irreducible from the start. Today you can only(?) write programs that start out as reducible CFG and possibly become irreducible during optimization. Although I don’t know if any current MIR optimizations actually do that (jump threading is the usual suspect) or whether the ball is completely in LLVM’s court for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get the borrowck to understand this then this will really be irreducible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, maybe this isn’t as novel as it seems. With async/await you can also describe what is conceptually irreducible control flow inside a state machine. For example, resuming loop { g().await; }
after suspending at the await
jumps into the middle of the loop body (after calling g().into_future()
, before calling poll()
) so there isn’t a single header block that dominates the loop. Here, too, it’s possible to pretend the state machine is a big loop + match to make it reducible but this loses precision about which state transitions are possible (e.g., you can’t go “backwards” within one iteration of the loop, but in the loop+match CFG it appears possible unless you do the equivalent of jump threading wrt the state variable). If I’m reading the lowered MIR correctly then this already generates irreducible MIR today without optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get the borrowck to understand this then this will really be irreducible
Yes that’s one analysis that would have to choose between precision and dealing with irreducibility. How does it deal with irreducible async/await state machines (see my last comment) today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect they're borrow-checked before the coroutine transform but I could be wrong.
|
||
The idea was first introduced in [this issue](https://github.com/ziglang/zig/issues/8220) which has a fair amount of background on how LLVM is not able to optimize certain cases, reasoning about not having a general `goto` in zig, and why tail calls do not cover all cases. | ||
|
||
[This PR](https://github.com/ziglang/zig/pull/21257) implements the feature, and provides a nice summary of the feature and what guarantees zig makes about code generation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that Rust is not going to give any guarantees about code generation. That's a QoI issue, not a semantic one, and not something with stable guarantees for basically anything.
(A spec would describe how this is lowered to MIR, perhaps, but not to machine codegen.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, the RFC text is (or should be, I tried to be) careful to not provide any hard guarantees about the generated code. The property I really want is that continue 'label value
is lowered to a goto
in a predictable way: expert programmers should be able to be fairly confident that that desugaring kicks in when they write their code in a specific way.
Even when the continue is desugared to a goto
, MIR optimalizations and LLVM can change the structure. That is fine, the assumption being that an unconditional jump is actually unlikely to be touched except if it can be removed entirely.
Zig has its own code generation backends, and hence can make hard guarantees about the assembly that they output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zig says:
"This lowering can inflate code size compared to a simple "switch in a loop" lowering, and any Zig implementation is, of course, free to lower this syntax however it wishes provided the language semantics are obeyed. However, the official ZSF compiler implementation will attempt to match the lowering described above, particularly in the ReleaseFast build mode."
That doesn't sound like a hard guarantee at all, it sounds like a QoI promise.
Where are you getting the idea from that Zig makes hard guarantees here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant "can" in the sense of "could": because they literally put the bytes of the final program in the right order, they could guarantee things about the shape of that generated assembly. I'm not sure they would actually want to. Based on the OP of that PR the support is different depending on the backend right now.
text/3720-labeled-match.md
Outdated
The semantics of `'label: match`, `continue 'label value` and `break 'label value`, including any edge cases or error messages that have been missed so far. | ||
|
||
The RFC text provides background on why this feature is needed for improved code generation, but from the language perspective, only the above three elements are required. | ||
The exact details of the HIR to MIR lowering can be figured out through the implemenation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the reasons I'm not a fan of the match
phrasing of this. The lowering of match
to MIR is already the most complicated part of that lowering -- including referencing papers about how to do it avoiding exponential blowup -- so if it's not obvious enough to be written here, I'm inclined to say that it shouldn't happen period. match
is not C++'s switch
, and needs to handle way more things than just integers.
OTOH, lowering a block to MIR is easy, and that lowering already has Goto
terminators to other MIR blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a summary, more detail is provided later on. I think the two open questions are
- what should the lowering of
continue 'label value
be if we can't determine which patternvalue
matches at compile time - how do we handle nested pattern matches
I don't think either of these questions are hard to answer: it's just work. Work that we did not yet do for the PoC and are hesitant to do if it's unlikely to be used at some point.
This has nothing to do with the desugaring of pattern matching itself: we're just evaluating the pattern match at compile time if the necessary information is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ask me if you need concrete details on how we lower matches to MIR. What I've proposed in the discussion is something I am fairly confident is feasible.
text/3720-labeled-match.md
Outdated
|
||
State machines require flexible control flow. However, the unstructured control flow of C is in many ways too flexible: it is hard for programmers to follow and for tools to reason about and give good errors for. Ideally, there is a middle ground between code that is easy to understand (by human and machine), interacts well with other rust features, and is flexible enough to efficiently express state machine logic. | ||
|
||
Today there is no good way to translate C code that uses implicit fallthroughs or similar control flow to rust while preserving both the ergonomics (in particular, the number of levels of indentation) and the performance (due to LLVM using jump tables instead of an unconditional jump, see the next section). If we wanted to translate this C code to Rust: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: phrase this as "a consistent level of indentation" or "a number of level of indentations that's sub-linear in the number of states", rather than just "the number of levels".
"Rust has two levels more indentation than C" isn't good motivation, but "I shouldn't have to ski-jump off the page" is.
(Though even there, it'd be good to show you can't fix that issue by having a macro_rules
macro that does the loop
translation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is such a macro, it was proposed on Zulip.
EDIT: Ah, but that's not a full loop, just fallthrough semantics.
The idea is that we do at compile time what would be done at runtime anyway: compute the next location to jump to. The process is short-circuited, but not changed. What suggests to you that evaluation order would change? Also strictly speaking the specifics of the codegen approach are not part of the RFC itself, but they seemed important to talk about anyway because it's at least half of the justification for the feature (the other half being improved ergonomics for state machines and translating C-style code)
Again, I don't think the evaluation order is changed at all. The only change is that a double jump between blocks
Any code not using labeled match would be unaffected. When you use labeled match, you get the codegen it has. Also I can't think of any cases where an unconditional jump is worse for performance (though there might be cases where other optimizations get missed? Seems unlikely to me but theoretically possible). |
Hm, kinda. If the new state is fully specified, then jumping to the right block is a straightforward optimization on top of the obvious desugaring. If the scrutinee is partially known then optimizing the desugaring may indeed not work reliably: loop match (&0, 0) {
(0, 1..) => continue (foo(), 0),
(_, 0) => break,
_ => unreachable!()
}
// desugars+compiles to:
let mut state = (&0, 0);
loop {
if *state.0 == 0 {
if state.1 >= 1 {
state = (foo(), 0);
continue;
}
}
if state.1 == 0 {
break;
}
unreachable!()
} we can't optimize the But we don't need to compile it to a desugaring: we could say "if the What I don't know is how to specify "the |
We discussed this a bit with @traviscross, and found a sensible line: Then we can lint jumps that aren't of this form, either because the expression is not const-promotable or because it doesn't suffice to know which arm will be taken (e.g. if there's an arm guard). I didn't see it discussed, but if we implement this as a MIR lowering property the borrow-checker would understand these direct jumps. This makes the feature pull more weight imo. |
I don't think that's good enough for applications, I would want |
so then if I were to try to use this to write an interpreter it would have to be something like: macro_rules! make_interpreter {
(
@variants $variants:tt $label:lifetime: loop match ($next_insn:expr) {
$(
$Variant:path => $block:block
)+
}
) => {
$label: loop match $next_insn {
$(
$Variant => {
$block
make_interpreter! {
@variants $variants continue $label $next_insn
}
}
)+
}
};
(
$label:lifetime: loop match ($next_insn:expr) {
$(
$Variant:path => $block:block
)+
}
) => {
make_interpreter! {
@variants($($Variant,)+) $label: loop match ($next_insn) {
$(
$Variant => $block
)+
}
}
};
(
@variants($($Variant:path,)+) continue $label:lifetime $next_insn:expr
) => {
match $next_insn {
$($Variant => continue $label $Variant,)+
}
};
}
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
#[repr(u8)]
pub enum Opcode {
Add,
Const,
Ret,
JumpIfNotEq,
}
#[repr(C)]
pub struct Insn {
pub opcode: Opcode,
pub args: [u8; 3],
}
pub unsafe fn interpreter(mut pc: *const Insn, regs: &mut [u64; 16]) {
macro_rules! reg_a {
() => {
regs[((*pc).args[0] >> 4) as usize]
};
}
macro_rules! reg_b {
() => {
regs[((*pc).args[0] & 0xF) as usize]
};
}
macro_rules! reg_c {
() => {
regs[((*pc).args[1] >> 4) as usize]
};
}
macro_rules! imm_i16 {
() => {
i16::from_ne_bytes([(*pc).args[1], (*pc).args[2]])
};
}
unsafe {
make_interpreter! {
'a: loop match ((*pc).opcode) {
Opcode::Add => {
reg_a!() = reg_b!().wrapping_add(reg_c!());
pc = pc.add(1);
}
Opcode::Const => {
reg_a!() = imm_i16!() as u64;
pc = pc.add(1);
}
Opcode::Ret => {
return;
}
Opcode::JumpIfNotEq => {
if reg_a!() == reg_b!() {
pc = pc.offset(imm_i16!() as isize);
} else {
pc = pc.add(1);
}
}
}
}
}
} |
Just to clarify since I think it's worth pointing out: I think that a looping match construct in general would be useful, and not just some dedicated state machine syntax. Sure, state machines would be the most common usage, but I can think of others too, like simplifying expressions or traversing trees. I've had plenty of cases where pattern matching is the top level part of a loop and I think it's a useful tool to have. |
Why doesn't an MIR optimization for the pattern suffice. This looks like something which could be guaranteed for any For example:
I think we could even guaranteed that it compiles into a computed goto on supported platforms, via MIR optimizations. So why does this need a new language construct instead? Why can't this be done as a MIR optimization. If you want guarantees, we could also expose an allow by default lint which would fire if this optimization wasn't done on a loop containing a match and nothing else. |
Concern re: expressed optimization improvements: how is a user supposed to know if their code is better in a If there are specific On to the ergonomics part: Another way to write the example today is use std::ops::ControlFlow::{Break, Continue};
fn emulate_labeled_match() -> Option<u8> {
let mut state = 1u8;
'foo: loop {
let next = match state {
1 => Continue(2),
2 => Continue(3),
3 => Break(Some(42)),
_ => unreachable!(),
};
match next {
Continue(c) => { state = c; continue; }
Break(b) => break 'foo b,
}
}
} That achieves the ergonomics that this proposal brings: there is no update to the state needed in every arm, and choosing the control flow is terse. The second match statement could even be packaged as a macro The compiler also already knows about I think that an alternative to this RFC could be adding an easy way to turn an instance of |
Oh yeah. Hm. Then something like "we take into account any tuple/struct/enum constructors and literals syntactically present in the
For a bunch of reasons. One is that we don't have these sort of guarantees in the language today. Also as explained previously, this is harder (and sometimes wrong) to do as an optimization if we want reliable guarantees. And doing as an optimization means we can't get borrow checker benefits which is too bad imo.
That's why I think it stands more on its own as a specific control-flow construct (with borrowck benefits) rather than an optimization. People will use it when it better expresses their intent, just like they use
Having guarantees in your example seems even harder than in the original proposal. How would you phrase the guarantees we want in your view? |
Another point in favor of |
text/3720-labeled-match.md
Outdated
While this is a natural way to express a state machine, it is well-known that when translated to machine code in a straightforward way, this approach is inefficient on modern CPUs: | ||
|
||
- The match is an unpredictable branch, causing many branch misses. Reducing the number of branch misses is crucial for good performance on modern hardware. | ||
- The "loop + match" approach contains control flow paths (so, sequences of branches) that will never be taken in practice. The stack can be smaller if the control flow paths are known more precisely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off: having a prototype implementation and benchmarks to accompany the RFC is great, kudos for that.
Unfortunately what's written here doesn't seem to fit with the linked benchmark results. The reported branch mispredicts (absolute number!) don't change at all in any of the comparisons in the gist. Instead, the most drastic changes in wall clock time seem to correlate mostly with executing fewer instructions and having fewer cache misses. Branch misprediction rate even appears to be lower for the loop+match variant if it executes more instructions (including more branches) but has the same absolute number of mispredicts. The fact that cache misses sometimes differ drastically between loop+match and the labeled-match variants suggests these variants differ in more fundamental ways than what this RFC discusses. I don't understand the code / workload well enough to even speculate why this could be, but it really doesn't line up with the RFC motivation, so something's fishy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text you cite makes a point based on a very simple example. The straightforward code generation would create an unpredictable branch, and its performance would be bad. The text there already mentions the other benefit of unconditional jumps: simpler CFGs for later compiler passes.
In practice, LLVM already does not use the completely naive approach (at least on x86_64, and I believe most modern platforms). Instead, a jump table is used (this is discussed in the LLVM section). Modern branch predictors are able to deal with those quite well.
The benchmark results are from:
- a program that uses tail calls
- a program that uses labeled match
I think it's expected that these have roughly the same number of branch misses, because the majority of jumps in the state machine logic are unconditional in both versions.
Next the detailed benchmarks (linked in the RFC text) https://gist.github.com/folkertdev/977183fb706b7693863bd7f358578292 include a "loop + match" version. Here again we see that branch predictions are roughly the same on my machine, with a barely significant difference on bjorn's machine. I'm honestly not completely sure why that is, my assumption is that modern branch predictors are that good, and that the branches in the benchmark are predictable enough in practice that we only see a minimal increase in branch mispredictions.
So, what explains the performance difference? I believe it is the second point mentioned: a simpler CFG has many downstream benefits: stack slots can be reused sooner, fewer instructions are pushed out of the instruction cache, etc. Diffing the assembly is a mess, the labeled match version is slightly shorter (~40 lines, so roughly that number of instructions). I guess that matters if you can get rid of instructions in tight inner loops, though I can't say for sure that that is where the instructions are saved.
So, with modern branch predictors, with the input I chose for the benchmark (which is a fairly standard input file for benchmarking zlib performance), it looks like branch mispredictions are not a significant factor in explaining the difference in performance in that benchmark. But the difference we see is real, and crucial for beating C.
Hopefully that clears things up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just be be sure, is it accurate to say: all of the variants essentially implement the same state machine and do the same work except for different representations of the state machine transitions plus whatever ripple effects that has for compiler optimizations? At the source level, there are no differences in algorithms or memory access patterns that could explain different numbers of instructions or cache misses?
This doesn't really clear things up, except confirming my suspicion that nobody actually understands where the performance improvement is coming from. That isn't necessarily a problem when one is trying to make a specific program faster: the speedups you can get for representative workloads are what matters in the end, even if they're different from what your mental performance model would have predicted. However, for a language feature intended to be used in countless yet-unwritten programs, I would hope for less hand-wavy understanding of how they'll be impacted.
You're right to point out that presenting the state machine transitions directly in the CFG can have impact throughout the entire optimization and codegen pipeline (not all of them positive, by the way). Comparing the generated assembly code of the hot loops would be a good first step to better understanding. Could you please share the relevant parts along with some profiling data such as perf record
+ perf annotate
? Due to target-cpu=native
and the big differences between CPU models, it may look very different on your machine than on mine.
That said, if the data access patterns are the same then I don't see how this could explain the big difference in cache misses. While instruction cache misses are very important for applications with huge code footprints, we're talking about tight inner loops here, and on my machine the entire text section of each binary is only about half the size of the compressed input file. Similarly, it's usually very hard for less optimal stack frame layouts to cause any measurable difference for cache misses because all the "important" stack slots will virtually always stay in L1 cache. But we don't have to limit ourselves to idle speculation: Linux perf
can show you which exact instructions often cause which kinds of misses, at least on bare metal (e.g., not under WSL on a Windows machine like mine).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, here is more detail https://gist.github.com/folkertdev/0926f620e41509e2bda0da2c252921fe
This commit trifectatechfoundation/zlib-rs@85cf9cc really only turns continue 'label
into continue 'label value
s. So now the only difference really is that labeled match in our PoC gets the improved mir lowering.
You can just search for the number of jmp rax
s in the assembly to confirm that we generate fewer (not 0 because in our PoC, the initial pattern match does not become a direct jump, and we call a helper function after which we have to do a jump that is not statically known).
I don't really have a satisfying explanation for the increase in cache misses for a chunk size of 4. The cache misses appear to occur at reasonable locations.
Also, I'd hope that the results in general replicate on your machine, and that the specific architecture/target features don't matter too much.
Let me know if there is anything else you'd like to see
You're right to point out that presenting the state machine transitions directly in the CFG can have impact throughout the entire optimization and codegen pipeline (not all of them positive, by the way).
Are you thinking of anything specific here? I don't see how providing a more accurate CFG could give worse results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the data. With the len-as-match
branch I got notably different results than in the original gist in some metrics. In particular, I see no significant changes to cache_misses
with the labeled match variants in several cases where your results have enormous changes, so I thought it would be better to zoom in one one machine than to wrangle that variation as well. Similarly, I can't reproduce most of the results you report for the split-out-len
branch except the broad strokes of wall time, instructions, and clock cycles improving by similar amounts for chunk size 4. For reference, my CPU is Intel i7-6700K @ 4GHz.
I haven't found the time to really dig into the assembly and perf annotate (which don't seem to match up, by the way). There seem to be very few samples recorded. The only really interesting thing I've gleaned from the assembly you shared is that the stack frame is the exact same size for both (256 bytes), and that both have almost the same number of instruction referencing [rsp + $offset]
. This makes me even more skeptical that any correct explanation for the performance difference involves the stack.
Are you thinking of anything specific here? I don't see how providing a more accurate CFG could give worse results.
In general any tweak to the input program, even if it only gives "more information", can result in worse code generation due to incomplete and greedy heuristics. But there are some factors specific to irreducible control flow. A lot of analyses and transformations like to work in terms of reducible/structured control flow, or are less well-tuned for irreducible CFGs. Some examples:
- LLVM's LoopInfo can only represent natural loops, so any optimization that works with loops won't work as well or at all on irreducible representations of control flow. There is a more general
Cycle
concept in LLVM nowadays, but it's less widely used, and I don't think it's realistically possible for it to replace the "natural loop" concept entirely. - It's not just transformations targeted at loops that rely on LoopInfo, it's also used in many heuristic cost models including (but not limited to) at the machine code level. For example, loop information is used to guide register allocation (spill weights), instruction scheduling, basic block placement, the x86-specific pass that converts between branches and
cmov
s, and many more. Of course irreducible control flow must be handled correctly, but the cost model will usually be less accurate (that doesn't have to result in worse codegen, but it can). - Sometimes these problems are solved by duplicating code ("node splitting"). For example, LLVM's ordinary jump threading pass deliberately limits itself w.r.t. loop+match state machines to avoid introducing irreducible control flow. The newer "DFA jump threading" pass tries to do optimize such code without making the CFG irreducible by duplicating the successor basic block instead of directly jumping to the existing copy. But such duplication can easily cause massive code size increases, so any pass that does this has a hard limit on how much code it can duplicate, so it'll still miss optimizations for larger state machines.
- Occasionally, the CFG must be made reducible for correctness. This is more common with research-y prototypes than with the default pipeline of industry compilers, but it comes up when compiling to WebAssembly. Any irreducible CFG will be made reducible at some point before emitting wasm, either by code duplication or by (re-)introducing explicit state variables and branches on them. I don't want to make any bets whether you'll get better results by carrying structured control flow through the entire optimization pipeline or showing the "middle end" irreducible control flow and letting the backend make it structured somehow.
Note that this RFC does not get borrow checker benefits. (Nor does it claim to, based on my reading of the RFC.) This is actually one of the reasons I'm not a big fan of this particular proposal, although I very desperately want rust to gain some primitive for doing irreducible control flow. To get borrow checker benefits I think you really need a syntactic form that connects the match arm to the jump, and the proposed syntax of |
Yeah, we should probably have a different syntactic expression to get the "direct jump" versus "do the match normally" behaviors, because if it affects borrowck this needs to be visible from reading the code. |
This is not something I considered (so it's not currently part of the RFC), but you get the borrow checker changes for free, because of how the MIR is generated. e.g. this errors: enum State { Foo, Bar, Baz, }
fn main() {
let owned = Box::new(1);
let mut state = State::Foo;
loop {
match state {
State::Foo => {
state = State::Bar;
}
State::Bar => {
// or any function that moves the value
drop(owned); // ERROR use of moved value: `owned`
state = State::Baz;
}
State::Baz => {
break;
}
}
}
} While if written using labeled match (or 'label: match State::Foo {
State::Foo => {
continue 'label State::Bar;
}
State::Bar => {
// or any function that moves the value
drop(owned); // all good now!
continue 'label State::Baz;
}
State::Baz => {
break 'label;
}
} And hence this program would be accepted by the borrow checker: cool! But also whether the optimization fires can determine whether a program passes the borrow checker. |
Other proposals in this space have suggested to use multiple labels, one per block, and I think that makes a lot more sense than marking a match expression and then having to describe an arm in it.
Right, I think this is a non-starter, optimizations cannot affect borrow checker behavior or else they are not optimizations. So either the real implementation will have to go to some extra work specifically to prevent this behavior from leaking, or else the RFC will have to "own it" and define precisely when the optimization which isn't an optimization actually kicks in. |
what if you had to write |
I never like these proposals because labels feel like a last-resort kind of feature in rust, not a first-class one. Plus with a match you get to pass values around without having to define a bunch of mutable variables.
Yes I like this! Or |
but can you express computed goto using That is needed for interpreters because CPUs can predict branches much better if every interpreted instruction kind has a separate computed goto since branch predictors use the address of the branch instruction to differentiate which predictions to give, allowing the branch predictor to learn more complex patterns since it gets more information. |
I tend to agree with this, I think it would look better with function-call syntax rather than using labels. But to be clear I did mean blocks with parameters here (something which would be its own RFC needing appropriate syntax). There are plenty of design proposals for this linked in the previous discussions.
Didn't a certain @programmerjake demonstrate above how to do this using a macro? It seems kind of orthogonal, given that there is no first class support for computed goto in this proposal. |
not really, because the macro I wrote would produce N separate jump tables for the N match arms and take O(N^2) space/time. computed goto can do all of that without any jump tables at all and doesn't need a crazy O(N^2) macro. |
I agree with the sentiments on macro + label solutions. Specifically, I think the connection between control flow and data flow is important. E.g. with the label solutions it is more laborious to store the current state (so that you can later resume there), because labels are not first-class.
This is great! I think I slightly prefer |
I beg to differ. Being able to write Moreover, I believe that the vast majority of When your intention is to write a state machine, a macro called |
updatesI've pushed a number of updates to the RFC text. The main changes are summarized below syntaxThe syntax has now been changed to borrow checker changesThe standard loop match is like a let x: u64;
loop const match true {
true => {
x = 42;
const continue false;
}
false => {
dbg!(x)
}
} Because the compiler can now know that all paths that lead to the restrictions on when a state transition can be
|
I don't think this is a serious problem. We already have |
|
||
The advantages of having more accurate borrow checking, and accepting more valid programs, are compelling to me, but this more limited solution here could absolutely work from just a performance perspective. | ||
|
||
## Syntax Squables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you still put #3720 (comment) / #3720 (comment) as an alternative? Basically using a dedicated macro rather than introducing a new syntax for this feature in the first place, so we can decide later a dedicated syntax and whether it is worth it (like addr_of!(x)
→ &raw x
or matches!(x, p)
↛ x is p
).
finite_state_machine! {
goto dyn count % 4;
0 => {
one();
goto 3;
}
3 => {
one();
goto 2;
}
2 => {
one();
goto 1;
}
1 => {
one();
n -= 1;
if n > 0 {
goto 0;
} else {
break;
}
}
_ => unreachable!(),
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I added it.
I quite like goto dyn
actually, that's good.
I am, at first glance, worried about the ergonomics of such a macro. To get the goto
target validation and lowering to a MIR goto you need serious compiler support. You'd also need custom error messages for this special rust-like dialect that is valid within the macro, and it needs to be parsed. Maybe there is actually good tooling for this, but my experience with the asm!
macros is that the parsing is very manual and kind of tedious.
```rust | ||
loop match read_next_byte() { | ||
0 => continue 1, | ||
_ => continue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this fairly surprising. If I think of loop match ...
as just a shortcut for loop { match ... }
, then it makes sense in isolation. But when it's mixed with the continue value
form, it means the read_next_byte()
part is only executed in some subset of loop iterations, depending on which state transitions are taken. As a reader I have to (1) remember this every time I see a plain continue
, and (2) scroll back up to the start of the loop match
to see the code that runs to determine the next state.
If plain continue
s were disallowed in loop match
, I could instead read loop match read_next_byte()
as: now we're entering a state machine, the initial state is determined by read_next_byte()
, and to understand the state machine I just chase the transitions from one state to the next. This seems conceptually simpler to me, and doesn't fundamentally lose any expressive power, does it?
I think the proposed syntax is confusing because the argument to If I see In other words, I think the syntax should hint at the fact that there is a hidden state variable being introduced. For example: loop match = true {
...
} Indicates that some kind of state initialization is happening. |
1 => continue 2, | ||
2 => continue 3, | ||
3 => break Some(42), | ||
_ => None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop match
syntax is advertised as a loop
and match
can be combined into a loop match
, but the proposed semantics is clearly not the case:
// This will loop forever.
let a = loop { match true {
true => {},
false => unreachable!(),
}};
// This will assign `()` to `b`
let b = loop match true {
true => {}, // a non-diverging value `x` implicitly becomes `break x;`
false => unreachable!(),
};
For clarity it is better to force every block inside the loop match
to be diverging (have type !
), similar to the else
block in a let else
declaration. Yes this makes it more tedious to write but it will clear much confusion to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity it is better to force every block inside the
loop match
to be diverging (have type!
), similar to theelse
block in alet else
declaration.
I'm a big fan of this, but then think that the loop match
fiction once again starts fitting worse, so think it should just make its own syntax that's not trying to look like a match
(#3720 (comment)).
it can always be k#finite_state_machine
or something on existing editions initially. This is not so common that a few more characters is a problem.
Rendered
This RFC adds
loop match
:loop
andmatch
can be fused into aloop match <scrutinee> { /* ... */ }
loop match
can be targeted by acontinue <value>
. Thevalue
is treated as a replacement operand to thematch
expression.The state transitions (going from one branch of the match to another) can be annotated with the
const
keyword, providing more accurate CFG information to the backend. That means:The goal of
loop match
is improved ergonomics and codegen for state machines. Rust, being a systems language, should make it possible to write efficient state machines, and currently falls short. Complex state machines are niche, but foundational to many programs (parsers, interpreters, networking protocols).This RFC follows in part from work on zlib-rs. Today, we simply cannot achieve the same codegen as C implementations. This limitation actively harms the adoption of rust in high-performance areas like compression.
Basic example
A
loop match
is similar to amatch
inside of aloop
, with a mutable variable being updated to move to the next state. For instance, these two functions are semantically equivalent:Interesting example
The real power of
loop match
lies in giving the compiler more accurate information about the control flow of a program. ConsiderReading the code, it is obvious that state moves from states
Foo
toBar
toBaz
: no other path is possible. Specifically, we cannot end up inState::Bar
twice, and hence the generated "use of moved value" error is not a problem in practice. This program is valid, but nonetheless rejected by the rust compiler.With
loop const match
andconst continue
the compiler now understands the control flow:Rendered
Tracking: