sim_asynch_check data race #329

bscottm · 2023-12-08T18:52:15Z

The Clang thread sanitizer on Linux/Ubuntu detects a data race (v.i., "Traceback:") on the sim_asynch_check variable, which Mark has previously said is a false positive. I'll explain why this isn't the case more carefully than the last time. N.B.: This isn't a discussion of AIO's design, it's a technical discussion with respect to variable synchronization across threads, i.e., implementation.

Here's what's going on:

The main thread reads and writes sim_asynch_check inside the AIO_CHECK_EVENT macro, which is generally expanded at the bottom of simulator instruction loops to process AIO events when --sim_asynch_check < 0.
The AIO thread also executes sim_asynch_check = 0 (in sim_aio_activate, scp.c:439) so that the next time the main thread executes the AIO_CHECK_EVENT macro's code, AIO events are (supposed to be) immediately processed.

Unlike other variables used across threads by AIO, sim_asynch_check is not protected: no mutex, no synchronized access primitives. Consequently, the thread sanitizer correctly identifies an inconsistent read-after-write hazard across threads. This can be a problem for several reasons, one of which is memory store ordering (Intel and ARM do it differently) and another of which is the intent of the AIO code:

sim_asynch_check = 0;                             /* try to force check */

The important thing to remember is that memory writes are not guaranteed to be sequential or consistent across cores without thread synchronization primitives. Just because the AIO thread writes 0 to sim_asynch_check does not mean that the main thread will read 0 when AIO_EVENT_CHECK decrements sim_async_check (On Intel, there's a high likelihood that the write and reads are more strongly ordered via total store ordering (TSO), whereas ARM has weaker guarantees because ARM does fairly aggressive load/store reordering.)

Looking at the AIO_CHECK_EVENT macro, with extra annotations:

#define AIO_CHECK_EVENT                                                \
   /* case 1 */                                                        \
    if (0 > --sim_asynch_check) {                                      \
       /* case 2 */                                                    \
      AIO_UPDATE_QUEUE;                                                \
      sim_asynch_check = sim_asynch_inst_latency;                      \
       /* case 3 */                                                    \
      } else (void)0
       /* case 4 */                                                    \

Case 1: If AIO's write is successfully read by the main thread, then the intent of scp.c:439 is satisfied. Otherwise, AIO_UPDATE_QUEUE is delayed for at least one simulator instruction cycle, possibly delayed by the remaining count in sim_asynch_check depending on the host processor's load/store ordering -- the AIO sim_asynch_check = 0 precedes AIO_CHECK_EVENT and can effectively be ignored or removed from the host processor's write buffers.
Case 2: Main thread wins, delaying AIO_UPDATE_QUEUE by sim_async_inst_latency simulated instruction cycles.
Cases 3, 4: Similar to case 1, where the intent of forcing the AIO check is likely satisfied because main thread's write to sim_asynch_check precedes AIO's write and the host processor can effectively ignore the main thread's write.

Previously, Mark has said (and I'm paraphrasing) that there's no real problem if AIO_UPDATE_QUEUE gets delayed. If that's the case, then scp.c:439 can be removed because there's never a reason to force AIO_EVENT_CHECK to execute immediately.

There is a performance impact if sim_asynch_check is synchronized across threads because synchronization will occur during each simulated instruction when AIO_EVENT_CHECK executes. This should only be a memory fence/barrier around sim_asynch_check, but one can't make guarantees with respect to how host processors or compilers choose to implement memory fences or barriers.

Inserting Clang-specific annotation or a preprocessor conditional forest to detect the Clang thread sanitizer is a non-starter. It makes the code harder to read and maintain. Fixing the race by coming to a consensus as to what the proper behavior is infinitely more preferable.

Traceback:

WARNING: ThreadSanitizer: data race (pid=8206)
  Write of size 4 at 0x55f42a1f26b8 by thread T4 (mutexes: write M0):
    #0 sim_aio_activate /home/runner/work/open-simh/open-simh/scp.c:439:18 (vax8600+0x1b66ac) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #1 _sim_activate /home/runner/work/open-simh/open-simh/scp.c:12013:1 (vax8600+0x1e978f) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #2 sim_activate /home/runner/work/open-simh/open-simh/scp.c:12005:8 (vax8600+0x1dd1ff) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #3 _disk_io /home/runner/work/open-simh/open-simh/sim_disk.c:266:5 (vax8600+0x206b49) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)

  Previous write of size 4 at 0x55f42a1f26b8 by main thread:
    #0 sim_instr /home/runner/work/open-simh/open-simh/VAX/vax_cpu.c:658:5 (vax8600+0x1033ee) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #1 run_cmd /home/runner/work/open-simh/open-simh/scp.c:9325:13 (vax8600+0x1e38dd) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #2 vax860_boot /home/runner/work/open-simh/open-simh/VAX/vax860_abus.c:700:8 (vax8600+0x144459) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #3 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c (vax8600+0x1c61a5) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #4 call_cmd /home/runner/work/open-simh/open-simh/scp.c:5636:8 (vax8600+0x1d0ea0) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #5 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c (vax8600+0x1c61a5) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #6 do_cmd /home/runner/work/open-simh/open-simh/scp.c:4040:8 (vax8600+0x1c53e0) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)
    #7 main /home/runner/work/open-simh/open-simh/scp.c:2944:16 (vax8600+0x1b7d9e) (BuildId: 219a3a64523a85072939433b7104158fba55d2ba)

  Location is global 'sim_asynch_check' of size 4 at 0x55f42a1f26b8 (vax8600+0x177f6b8)

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sim_asynch_check data race #329

sim_asynch_check data race #329

bscottm commented Dec 8, 2023

sim_asynch_check data race #329

sim_asynch_check data race #329

Comments

bscottm commented Dec 8, 2023