
fd_tpool_wait should have FD_COMPILER_MFENCE() at end of function #3710

Open
GaneshRapolu opened this issue Dec 16, 2024 · 7 comments

GaneshRapolu commented Dec 16, 2024

fd_tpool_private_worker( ... ) has an FD_COMPILER_MFENCE() between executing the task and setting the worker state to IDLE.
https://github.com/firedancer-io/firedancer/blob/main/src/util/tpool/fd_tpool.cxx#L74

This prevents compiler reordering of writes done within the task with setting of worker state.

This should be matched by an FD_COMPILER_MFENCE() after the read of the worker state that observes IDLE in fd_tpool_wait; that would ensure reads done after observing the worker state are not reordered by the compiler with the read of the worker state itself. However, no such compiler fence exists in the implementation today, which is likely just an oversight.


Although possibly unnecessary for many use patterns, it is likely simplest (and also mirrors other uses of the compiler fence in many other places, e.g. fd_fseq_query) to also add one at the start of fd_tpool_wait.
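
For concreteness, a minimal sketch of what fd_tpool_wait's wait loop could look like with both proposed fences (the loop body and field names here are illustrative, based on the description above, not the actual implementation):

FD_COMPILER_MFENCE(); /* proposed leading fence; mirrors e.g. fd_fseq_query */
for(;;) {
  int state = FD_VOLATILE_CONST( worker->state );
  if( FD_LIKELY( state==FD_TPOOL_WORKER_STATE_IDLE ) ) break;
  FD_SPIN_PAUSE();
}
FD_COMPILER_MFENCE(); /* proposed trailing fence; pairs with the worker's release
                         fence so reads after fd_tpool_wait are not hoisted above
                         the observation of IDLE by the compiler */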

An example of this bug potentially affecting correctness is the recent checkpoint restore_v2 implementation, which reads err and dirty after calling fd_tpool_wait:

https://github.com/firedancer-io/firedancer/blob/main/src/util/wksp/fd_wksp_restore_v2.c#L421

Also, as hinted at by kbowers: compiler fences like these are sufficient (for acquire-release patterns or seqlock-style patterns) only on TSO architectures (e.g. x86); on ARM you would need actual hardware fences or instructions with embedded ordering (load-acquire, store-release, etc.). But fixing that can/should be part of a later holistic rework to support non-x86 deployments.
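
For illustration only, the portable version of this pairing would use explicit acquire/release operations rather than TSO-plus-compiler-fence assumptions; a minimal C11 sketch (assuming worker->state were declared _Atomic int, which it is not today):

#include <stdatomic.h>

/* worker (release side): all task writes become visible before IDLE does */
atomic_store_explicit( &worker->state, FD_TPOOL_WORKER_STATE_IDLE,
                       memory_order_release );

/* waiter (acquire side): reads after the loop cannot be hoisted above it */
while( atomic_load_explicit( &worker->state, memory_order_acquire )
       != FD_TPOOL_WORKER_STATE_IDLE ) FD_SPIN_PAUSE();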

cc: @kbowers-jump


kbowers-jump commented Dec 16, 2024

No objection here to having fd_tpool_wait use pre- and post-compiler mfences under the hood. Good catch.


GaneshRapolu commented Dec 17, 2024

Now that I am looking at the code once more:

fd_tpool_exec
https://github.com/firedancer-io/firedancer/blob/main/src/util/tpool/fd_tpool.h#L801
is doing the right thing: it has pre and post compiler fences as well as a compiler fence between setting the worker args and setting the worker state to EXEC.
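
That is, the exec side already follows the release pattern along these lines (an illustrative sketch of the sequence just described, not verbatim code from fd_tpool.h):

FD_COMPILER_MFENCE();
FD_VOLATILE( worker->task ) = task; /* publish the task and its args ...   */
/* ... other worker arg writes ... */
FD_COMPILER_MFENCE();               /* ... strictly before the state write */
FD_VOLATILE( worker->state ) = FD_TPOOL_WORKER_STATE_EXEC;
FD_COMPILER_MFENCE();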

However, fd_tpool_private_worker (https://github.com/firedancer-io/firedancer/blob/main/src/util/tpool/fd_tpool.cxx#L55) does not have a corresponding compiler fence between reading EXEC as the worker state and reading the worker args.

Pretty sure that is not right theoretically either.

It should be simple enough to address this fence site as well as part of this issue.

Some other codebases (e.g. some sites in the Linux kernel) explicitly link the acquire side to the release side (or the matching load-acquire/store-release) through a comment reference for acquire-release patterns. An example:
https://github.com/torvalds/linux/blob/f44d154d6e3d633d4c49a5d6aed8a0e4684ae25e/kernel/sched/fair.c#L4641

I wonder if it is worth doing so in the firedancer codebase. That would help catch these kinds of unpaired barrier cases.
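
As a sketch, such a convention applied to the pair discussed above might look like (hypothetical comments, not current code):

/* fd_tpool.cxx, worker (release side) */
FD_COMPILER_MFENCE(); /* pairs with the acquire fence in fd_tpool_wait */
FD_VOLATILE( worker->state ) = FD_TPOOL_WORKER_STATE_IDLE;

/* fd_tpool.h, waiter (acquire side) */
FD_COMPILER_MFENCE(); /* pairs with the release fence in fd_tpool_private_worker */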

kbowers-jump commented:

Did a quick linting pass through tpool compiler memory fencing shortly after your original note.

Re the worker state transition to EXEC:

FD_SPIN_PAUSE expands to __builtin_ia32_pause when FD_HAS_X86 is set. The compiler expands this under the hood into a compiler memory fence and a "pause" (a.k.a. "rep nop") instruction (e.g. https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/x86-Built-in-Functions.html).

So the code "works" ... but ...

FD_SPIN_PAUSE's documentation in fd_util_base currently doesn't make an explicit guarantee that it is a compiler memory fence. So this is less than obvious.

Worse, this could cause an unpleasant surprise when, say, porting to non-x86 if the target's FD_SPIN_PAUSE equivalent didn't act as a compiler fence (in addition to any other more explicit hardware fencing that might be needed for architectures without TSO-style memory ordering as you noted).

Currently all the routine CI testing / linting without FD_HAS_X86 uses single-threaded processes, so I doubt anybody would notice this until trying out tpool-style code in multithreaded processes on non-x86. Then the existing tpool unit test would fail when the compiler mis-optimizes the worker spin loop into an infinite loop. (And maybe this is how you found this ... regardless ... good catch!)

Without an explicitly documented compiler memory fence requirement for FD_SPIN_PAUSE, this is a very easy mistake to make when porting. In fact, it already happened: when the code is compiled without FD_HAS_X86, FD_SPIN_PAUSE is currently a fence-free no-op (and running test_tpool on 64 cores under stock gcc for the noarch64 target yields the miscompiled worker spin loop).
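
In other words, the current dispatch behaves roughly like this (a paraphrase of the behavior described above, not verbatim from fd_util_base.h):

#if FD_HAS_X86
#define FD_SPIN_PAUSE() __builtin_ia32_pause() /* GCC also treats this as a compiler memory fence */
#else
#define FD_SPIN_PAUSE() ((void)0)              /* generic fallback: fence-free no-op */
#endif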

The two choices here are:

  • Add an explicit compiler fence to the worker spin.

  • Document FD_SPIN_PAUSE as a compiler fence (in addition to any other hardware fencing required for the target for TSO-like behavior) and update the generic fallback accordingly.

These choices aren't exclusive. Strongly leaning toward doing the first on the grounds of clarity. Open to the second on the grounds of defensiveness.

Re comments:

As per the above, no objection to comments like that anywhere they improve clarity.

In any case, in the aforementioned linting, I added this fence, removed some other superfluous fences (e.g. the leading compiler fence in exec doesn't hurt, but the actual requirement is that worker task writes become visible to the worker before the worker's state write), and gave a couple of minor APIs stricter semantics as to when they happen from the caller's point of view. Expect to have a PR soonish.


GaneshRapolu commented Dec 19, 2024

GCC treating & documenting __builtin_ia32_pause as also providing a compiler fence is insane; I was not aware of this and can't really think of any reason why this would be a sensible choice. From a brief Google search, I also can't find any indication that clang guarantees something similar.

If firedancer instead used _mm_pause() directly from the intrinsics header, as many codebases do, there is no documented guarantee of this embedded compiler fence for pause that I can find:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_pause&ig_expand=4897

Instructions for spin-loop delays (pause on x86 or isb on arm) really should not be coupled to compiler fences or memory fences for the purposes of preserving memory ordering (or maybe this is just my own bias). So I would not support adding compiler fence semantics to FD_SPIN_PAUSE.

Nevertheless, I think what we need from a fencing point of view for the exec -> worker flow is (again, only considering x86 for now):

int state = FD_VOLATILE_CONST( worker->state );
if( FD_UNLIKELY( state!=FD_TPOOL_WORKER_STATE_EXEC ) ) {
  if( FD_UNLIKELY( state!=FD_TPOOL_WORKER_STATE_IDLE ) ) break;
  FD_SPIN_PAUSE();
  continue;
}

FD_COMPILER_MFENCE(); // ensure reads of the task args are not reordered before observing EXEC

/* We are EXEC ... do the task and then transition to IDLE */

fd_tpool_task_t task = FD_VOLATILE_CONST( worker->task );

FD_SPIN_PAUSE behaving as a compiler fence instead results in the equivalent of:

int state = FD_VOLATILE_CONST( worker->state );
if( FD_UNLIKELY( state!=FD_TPOOL_WORKER_STATE_EXEC ) ) {
  if( FD_UNLIKELY( state!=FD_TPOOL_WORKER_STATE_IDLE ) ) break;
  FD_COMPILER_MFENCE();
  FD_SPIN_PAUSE();
  FD_COMPILER_MFENCE();
  continue;
}

/* We are EXEC ... do the task and then transition to IDLE */

fd_tpool_task_t task = FD_VOLATILE_CONST( worker->task );

which doesn't do what I expect is needed. So option 1, "Add an explicit compiler fence to the worker spin", is my preference, as it is the only one that works (unless I am misunderstanding things).

kbowers-jump added a commit that referenced this issue Dec 19, 2024
As per the discussion in issue #3710, gave fd_tpool API stronger
compiler memory fencing guarantees and updated FD_SPIN_PAUSE
documentation to be explicit about its current (lack of) fencing
guarantees.

In the process, this fixes a latent bug caused by missing compiler
memory fence on targets without FD_HAS_X86.  The missing fence could
cause the compiler to "optimize" the tpool worker task dispatch
outer loop into an infinite loop for worker threads in tpools containing
2 or more tiles within the same process on targets without FD_HAS_X86.

With this, in the vast majority of circumstances (e.g. not writing low
level concurrency tests/benchmarks), normal code (i.e. single-threaded
non-conflicting) should "just work" without use of volatile,
FD_VOLATILE, FD_COMPILER_MFENCE, FD_ATOMIC, etc when thread parallelized
via fd_tpool / FD_FOR_ALL / FD_MAP_REDUCE (and the unit test was updated
accordingly).
kbowers-jump commented:

Though I've never been known to be particularly nice about programming language and compiler issues, as far as insanities are concerned, ia32_pause acting as a compiler memory fence isn't the worst thing I've seen (given the main use case for pause is waiting in a nice way for a new value in shared memory, I can see their rationale ... it does seem overly optimistic that other toolchains and intrinsic implementations will be similarly defensive though, so it isn't that much help to the developer practically). Now the fairly recent optimizer practice of treating undefined behaviors in code as __builtin_unreachable ... that's way, way up there on my insanity list. /s Just love the exciting innovation of the optimizer pruning explicit tests that verify unsanitized inputs don't tickle undefined behaviors. /s

In any case, I just submitted fixes and generally stricter compiler memory fencing guarantees in fd_tpool. Thanks for the feedback!

GaneshRapolu commented:

Commented in the PR that I am happy with the changes made; thanks for taking the time to look at this and address it so promptly.

P.S.
I am sure you have heard this many times, but I am personally learning a lot from the talks/interviews you have given and from reading various parts of the firedancer codebase (especially the contributions you have made).

It is rare to find people who have discipline around documentation and testing, have mechanical sympathy (basically, knowing computer architecture in sufficient detail and validating through measurement/benchmarks that code performs as expected), and combine these with strong communication skills for conveying those ideas to a broad audience.

kbowers-jump commented:

Summarizing a bunch of omitted philosophical musings: code is meant to be read. But it is seldom appreciated that authors (and not just of code) rarely know if anybody is meaningfully reading their work. I'm very happy to see that people like you are reading the code, understanding it, learning from it, thinking critically about it, constructively challenging it, and thus ultimately improving it (and, by extension, my coding, communications, etc.). It's quite rare to get a concrete signal that thoughtful people are paying serious attention. Feedback like this (the whole thread) is more appreciated than you might realize. The least I could do was respond in kind. Thanks again!

github-merge-queue bot pushed a commit that referenced this issue Dec 21, 2024