fd_tpool_wait should have FD_COMPILER_MFENCE() at end of function #3710
Comments
No objection here to having fd_tpool_wait use pre- and post-compiler mfences under the hood. Good catch.
Now that I am looking at the code once more; fd_tpool_exec However, fd_tpool_private_worker https://github.com/firedancer-io/firedancer/blob/main/src/util/tpool/fd_tpool.cxx#L55 does not have a corresponding compiler fence between reading EXEC as worker state and reading the worker args. Pretty sure that is not right theoretically either. Simple enough to address this fence site as well as part of this issue. Some other codebases (e.g. some sites in the Linux kernel) for acquire-release patterns explicitly link (through a comment reference) the acquire side to the release side (or matching load acquire/store release). An example is I wonder if it is worth doing so in the firedancer codebase. That would help catch these kinds of unpaired barrier cases.
Did a quick linting pass through tpool compiler memory fencing shortly after your original note. Re the worker state to EXEC: FD_SPIN_PAUSE expands to __builtin_ia32_pause when FD_HAS_X86 is set. The compiler expands this under the hood into a compiler memory fence and a "pause" (a.k.a. "rep nop") instruction (e.g. https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/x86-Built-in-Functions.html). So the code "works" ... but ... FD_SPIN_PAUSE's documentation in fd_util_base currently doesn't make an explicit guarantee that it is a compiler memory fence. So this is less than obvious. Worse, this could cause an unpleasant surprise when, say, porting to non-x86 if the target's FD_SPIN_PAUSE equivalent didn't act as a compiler fence (in addition to any other more explicit hardware fencing that might be needed for architectures without TSO-style memory ordering as you noted). Currently all the routine CI testing / linting without FD_HAS_X86 uses single threaded processes. So doubt anybody would notice this until trying out tpool-style code on multithreaded processes on non-x86. Then the existing tpool unit test would fail when the compiler mis-optimizes the worker spin loop into an infinite loop. (And maybe this is how you found this ... regardless ... good catch!) Without an explicitly documented compiler memory fence requirement for FD_SPIN_PAUSE, this is a very easy mistake in porting. In fact, it already happened ... when the code is compiled without FD_HAS_X86, FD_SPIN_PAUSE is currently a fence-free no-op (and running test_tpool on 64-cores under stock gcc for the noarch64 target yields the miscompiled worker spin loop). The two choices here are:
These choices aren't exclusive. Strongly leaning toward doing the first on the grounds of clarity. Open to the second on the grounds of defensiveness. Re comments: As per the above, no objection to comments like that anywhere they improve clarity. In any case, in the aforementioned linting, added this fence, removed some other superfluous fences (e.g. the leading compiler fence in exec doesn't hurt but the actual requirement is that worker task writes become visible to the worker before the worker's state write), and gave a couple of minor APIs stricter semantics as to when they happen from the caller's point of view.
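To make the spin loop concern above concrete, here is a minimal sketch of a wait loop that carries its own compiler fence instead of relying on whatever the pause primitive expands to. `DEMO_COMPILER_MFENCE`, `DEMO_SPIN_PAUSE` and `demo_spin_until` are hypothetical stand-ins invented for illustration; they are not the firedancer macros or the actual fix, and the sketch assumes a GCC or Clang style toolchain.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only (not firedancer's macros).
   An empty asm with a "memory" clobber is a compiler-only fence: it emits
   no instruction but stops the compiler from caching or reordering memory
   accesses across it. */
#define DEMO_COMPILER_MFENCE() __asm__ __volatile__( "" ::: "memory" )

#if defined(__x86_64__) || defined(__i386__)
#define DEMO_SPIN_PAUSE() __builtin_ia32_pause()
#else
#define DEMO_SPIN_PAUSE() ((void)0) /* fence-free no-op on other targets */
#endif

/* Spin until *state equals want.  The explicit fences force a fresh load
   of *state on every iteration, whatever DEMO_SPIN_PAUSE expands to, so
   the loop cannot be hoisted into an unconditional infinite loop.
   Returns the number of iterations that observed a different value. */
static unsigned long
demo_spin_until( int const * state,
                 int         want ) {
  unsigned long spins = 0UL;
  for(;;) {
    DEMO_COMPILER_MFENCE();
    int seen = *(int const volatile *)state;
    DEMO_COMPILER_MFENCE();
    if( seen==want ) break;
    DEMO_SPIN_PAUSE();
    spins++;
  }
  return spins;
}
```

In a real tpool-style setup a second thread would write `*state`; called on a value that already matches, the loop exits on its first check.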
GCC treating & documenting __builtin_ia32_pause as also providing a compiler fence is insane; I was not aware of this and can’t really think of any reason why this would be a sensible choice. I also can’t find any info that clang guarantees something similar from a brief google search. If firedancer used _mm_pause() instead from the intrinsics header directly as many codebases do, there is no documented guarantee of this embedded compiler fence for pause that I can find. Instructions for spin loop delays (pause on x86 or isb on arm) really should not be coupled to compiler fences or memory fences for the purposes of preserving memory ordering (or maybe this is only my own bias). So I would not support adding the compiler fence semantics to FD_SPIN_PAUSE. Nevertheless I think what we need from a fencing pov for the exec -> worker flow is (again only considering x86 for now)
FD_SPIN_PAUSE behaving as a compiler fence results in the equivalent of
which doesn't do what I expect is needed. So option 1.
As per the discussion in issue #3710, gave fd_tpool API stronger compiler memory fencing guarantees and updated FD_SPIN_PAUSE documentation to be explicit about its current (lack of) fencing guarantees. In the process, this fixes a latent bug caused by a missing compiler memory fence on targets without FD_HAS_X86. The missing fence could cause the compiler to "optimize" the tpool worker task dispatch outer loop into an infinite loop for worker threads in tpools containing 2 or more tiles within the same process on targets without FD_HAS_X86. With this, in the vast majority of circumstances (e.g. not writing low level concurrency tests/benchmarks), normal code (i.e. single-threaded non-conflicting) should "just work" without use of volatile, FD_VOLATILE, FD_COMPILER_MFENCE, FD_ATOMIC, etc when thread parallelized via fd_tpool / FD_FOR_ALL / FD_MAP_REDUCE (and the unit test was updated accordingly).
Though I've never been known to be particularly nice about programming language and compiler issues, as far as insanities are concerned, ia32_pause acting as a compiler memory fence isn't the worst thing I've seen (given the main use case for pause is waiting in a nice way for a new value in shared memory, I can see their rationale ... it does seem overly optimistic that other tool chains and intrinsic implementations will be similarly defensive though, so it isn't that much help to the developer practically). Now the fairly recent optimizer practice of treating undefined behaviors in code as __builtin_unreachable ... that's way way up there on my insanity list. /s Just love the exciting innovation of the optimizer pruning explicit tests to verify unsanitized inputs don't tickle undefined behaviors /s. In any case, just submitted fixes and generally stricter compiler memory fencing guarantees in fd_tpool. Thanks for the feedback!
Commented in the PR that I am happy with the changes made; thanks for taking the time to look at this and address it so promptly. P.S. It is rare to find people who have discipline around documentation and testing, have mechanical sympathy (basically knowing computer arch in sufficient detail + validating through measurement / benchmarks that code performs as expected), combined with strong communication skills for conveying these ideas to a broad audience.
Summarizing a bunch of omitted philosophical musings, code is meant to be read. But it is seldom appreciated that authors (and not just for code) rarely know if anybody is meaningfully reading their work. I'm very happy to see that people like you are reading the code, understanding it, learning from it, thinking critically about it, constructively challenging it, and thus ultimately improving it (and, by extension, my coding, communications, etc). It's quite rare to get a concrete signal that thoughtful people are paying serious attention. Feedback like this (the whole thread) is more appreciated than you might realize. Least I could do was respond in kind. Thanks again!
fd_tpool_private_worker(..) has an FD_COMPILER_MFENCE between executing the task and setting state to IDLE.
https://github.com/firedancer-io/firedancer/blob/main/src/util/tpool/fd_tpool.cxx#L74
This prevents compiler reordering of writes done within the task with setting of worker state.
This should be matched with a FD_COMPILER_MFENCE after the read of worker state observing IDLE in fd_tpool_wait; that would ensure that reads done after observing the worker state are not re-ordered by the compiler with the read of the worker state. However no such compiler fence exists in the impl today which is likely just an oversight.
firedancer/src/util/tpool/fd_tpool.h
Line 813 in fcacb50
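The pairing described above can be sketched as follows. This is an illustrative reduction under stated assumptions, not the fd_tpool code: `demo_worker_t`, `demo_worker_finish`, `demo_wait` and the `DEMO_*` macros are hypothetical names, the fence is a compiler-only fence, and the ordering argument holds only on TSO-style hardware such as x86. The "pairs with" comments also show the kind of cross-referencing between the release side and the acquire side suggested elsewhere in this thread.

```c
#include <assert.h>

/* Hypothetical compiler-only fence (no hardware fence is emitted). */
#define DEMO_COMPILER_MFENCE() __asm__ __volatile__( "" ::: "memory" )

#define DEMO_STATE_EXEC 0
#define DEMO_STATE_IDLE 1

typedef struct {
  int state;  /* DEMO_STATE_EXEC or DEMO_STATE_IDLE */
  int result; /* written by the task, read by the waiter */
} demo_worker_t;

/* Worker side: publish the task's writes, then mark the worker idle. */
static void
demo_worker_finish( demo_worker_t * w,
                    int             result ) {
  w->result = result;
  DEMO_COMPILER_MFENCE(); /* release side: pairs with the trailing fence in demo_wait */
  *(int volatile *)&w->state = DEMO_STATE_IDLE;
}

/* Waiter side: observe IDLE, then read what the task wrote. */
static int
demo_wait( demo_worker_t * w ) {
  DEMO_COMPILER_MFENCE(); /* leading fence, mirroring the query-style pattern */
  while( *(int const volatile *)&w->state!=DEMO_STATE_IDLE ) DEMO_COMPILER_MFENCE();
  DEMO_COMPILER_MFENCE(); /* acquire side: pairs with the fence in demo_worker_finish */
  return w->result;       /* cannot be hoisted above the state read by the compiler */
}
```

Without the trailing fence in `demo_wait`, the compiler would be free to load `w->result` before the loop observes IDLE, which is the hazard this issue describes.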
Although unnecessary (??) for many use patterns, it is likely simplest (and also mirrors other uses of the compiler fence in many other places; e.g. fd_fseq_query) to also add one at the start of fd_tpool_wait.
An example of this bug potentially affecting correctness is the recent checkpoint restore_v2 impl which reads err and dirty after calling fd_tpool_wait.
https://github.com/firedancer-io/firedancer/blob/main/src/util/wksp/fd_wksp_restore_v2.c#L421
Also as hinted by kbowers; compiler fences like these are sufficient (for acquire-release patterns or seq-lock style patterns) only on TSO architectures (ex. x86); on arm you would need actual fences or use instructions that have embedded ordering (load acquire; store release etc). But fixing that can/should be part of a later holistic rework to support non-x86 deployments.
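For the non-TSO case mentioned above, one portable way to express the same handoff is with C11 atomics, where the store-release and load-acquire carry both the compiler and the hardware ordering. This is only a sketch of the general technique; the names `demo_port_worker_t`, `demo_port_finish` and `demo_port_wait` are hypothetical, and nothing here is taken from how firedancer does or will handle non-x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
  atomic_int state;  /* 0 = executing, 1 = idle */
  int        result; /* plain data published through state */
} demo_port_worker_t;

/* Worker side: the store-release orders the write of result before the
   write of state, for both the compiler and the hardware. */
static void
demo_port_finish( demo_port_worker_t * w,
                  int                  result ) {
  w->result = result;
  atomic_store_explicit( &w->state, 1, memory_order_release );
}

/* Waiter side: the load-acquire that observes 1 synchronizes with the
   store-release above, so the following read of result sees the task's
   write even on weakly ordered targets such as arm. */
static int
demo_port_wait( demo_port_worker_t * w ) {
  while( atomic_load_explicit( &w->state, memory_order_acquire )!=1 ) { /* spin */ }
  return w->result;
}
```

On x86 the acquire load and release store typically compile to plain loads and stores, so this form costs little there while staying correct on weaker memory models.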
cc: @kbowers-jump