-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SCP: Fix scp_asynch_check cross-thread interference #333
base: master
Are you sure you want to change the base?
Conversation
|
@markpizz Then please explain why when I run TOPS20 Panda under current SCP it appears to hang after a couple hours. If I disable async I/O it runs for hours. If I use Scott's patch it also runs for hours. |
I'll be glad to track down and then precisely explain what's happening if you point at a configuration and disk setup and hopefully some minimal activities' needed to be performed to reproduce the hang. |
Download the simulator from sims repository. `Start addr=040000 BOOT V11.0(315) BOOT>tops20: [BOOT: Loading] [OK] System structure not found. What is its name? TOPS20 [TOPS20 mounted] System res then if you leave it idle for a while it will eventually get hung waiting on a lock. |
Thanks for the config. Meanwhile, where can I find:
and:
|
Download Panda distribution. |
As I'm sure you know, I'm not a user of this simulator nor do I know anything about the panda distribution. A search for "panda distribution" shows various stuff about a species of bear from China. Specific URL pointers to the 2 relevant files will help me get to debugging and will help anyone else reviewing this issue to also reproduce the failure and be comfortable with the ultimate fix. |
Absence of evidence is not evidence of absence. TSan's warning is a good reason to re-evaluate the code, even if specific errant cases haven't been reported or just to verify that it isn't a false positive.
The concern you expressed in a previous thread was a performance hit due to synchronized Here's a rough estimate using vax730 running the diagnostics as the baseline, which regularly trips the TSan warning. For AArch64 (Raspberry Pi 4, Debian), there's a 2.6% performance penalty to check The Win32 rough estimate is probably pessimistic because MS doesn't make it easy to measure process execution statistics (vs. the
Feel free to propose better test cases (NetBSD on VAXen?) 3b2 and pdp10-kl both call |
Not that hard to find on panda.trailing-edge.com. |
I'd still characterize these results as rough (3x BIN/vax730 warmup executions, 50x BIN/vax730 time executions) for AArch64, x86_64 and Win32 x86. Summary:
timings.ods spreadsheet with collected timing (Libreoffice Calc or Excel.) |
Refined the x86 Win32 results by wrangling C# and Powershell to get user and system times vs. total elapsed as reported by
timings.ods spreadsheet with collected timing (Libreoffice Calc or Excel.) |
Thanks. I've got it. Testing now. So far:
On Windows I tried to attached the NI device to the local system LAN and to another interface without local traffic. No change in result. Again, same results whether SET ASYNC or SET NOASYNC. The above Windows case is with Scott's changes to scp.c and sim_defs.h applied to the code I tested with. I got the same behavior before applying his change and again with or without asynch enabled. I don't think the Windows hang has anything to do with the point raised in this PR. Building that simulator on Windows emits these warnings:
|
It is very hard to reproduce.... it gets stuck waiting for the AIO lock. |
What about the windows case? With or without Async enabled and with or without Scott's patch? |
I don't do a lot of testing on Windows. I tried it with async with Scott's patch and did not see the error. I was also experiencing similar issues with B5500 emulator. They went away with Scott's patch. |
but you also say:
Given that, as you say "It is very hard to reproduce", and my lack of being able to reproduce it, maybe your test didn't create the hang... The Ubuntu test has been running without issue for coming up on a day... The windows test can't even reproduce the working case with or without scott's patches and with or without asynch being enabled. So I'm still looking for a case where this PR fixes an observable problem.
Scott's patch had absolutely no chance of changing anything in the B5500. That simulator doesn't use sim_disk, sim_tape or sim_ether which are the only cases that could potentially use asynch I/O. The B5500 doesn't even have any asynch code (which his changes are part of) compiled in. What platform (OS and architecture) were you experiencing the problems on? What compiler was used to build the failing cases? @bscottm :
They do call AIO_CHECK_EVENT due to potential activities in their simulated ethernet devices. When tested with no ethernet traffic, the tests provide very little or no functional test of your changes.
Doing OS installs to test this doesn't involve any activity on the only devices in these simulators that can potentially do Asynch I/O (their network devices are the only cases in these simulators).
The setting of scp_asynch_check = 0 provides a signal that says "it would be nice if your asynch queue check happened on the next instruction execution". If it doesn't happen, no big deal since it will end up happening within a few instructions anyway. A several percent simulation slowdown for ALL workloads just to achieve this seems to me to be unjustified. That several percent performance gain was a deliberate design decision when this functionality was added. I had originally just believed there would be an advantage with this approach. Thanks for measuring the gain.
They do call AIO_CHECK_EVENT due to potential activities in their simulated ethernet devices. When tested with no ethernet traffic, the tests provide very little or no functional test of your changes. A rigorous test would involve activity that deals with much network AND much disk activity, where the disk activity uses the pdp11_rq device which is the only disk that does asynch disk I/O. |
Perhaps try telneting into TOPS20 or telneting out of it. Then waiting for it to get hung in bug. All I know is that I was seeing several of my simulators getting stuck in waits for AIO lock. I applied his patch and the waits disappeared. I also don't see any performance hit from his patches. So I am ok with them. |
@markpizz: Richard is not wrong and I think this PR fixes a long time issue with the IBM 1130 and SEL simulators. These two simulators have spurious failures on the CI/CD builds. IIRC, you implemented an instruction limit and timeout so that the GH CI/CD wouldn't hang indefinitely when these simulators' tests went astray. I experimented with Bottom line: I haven't been able to get the IBM 1130 or SEL32 simulators to hang indefinitely using this patch, whereas the hangs do happen from time to time with the existing AIO code. (The punchline for the Win32 code is: use compare-exchange, there is no atomic load.) |
117a996
to
c3a60fb
Compare
That you can't get these to fail is proof of exactly what? I never said that my inability to see failure of @rcornwell's pdp10-kl was proof of anything. If he sees failures, I want to see one so I can precisely know what is actually happening. Meanwhile, as to your "proof" ... If you look carefully the code you are changing in your most recent update can't possibly have any effect on the IBM 1130 since the change you're making (or the original code) is only visible to a simulator when the simulator is compiled with SIM_ASYNCH_IO defined. The makefile and the Visual Studio project builds DO NOT define this when building the IBM1130. Maybe you are defining it in your cmake builds... @rcornwell's sims repo only has makefile and visual studio project build support. As for the SEL32 build: That simulator is built with SIM_ASYNC_IO defined (as it should be since it, in theory, leverages sim_ether's asynch support). Meanwhile, the build, even when running the author's tests, does not exercise any network activities, and as such wouldn't be traversing the code you're messing with. Likewise for the vax running its instruction diagnostics. To quote someone earlier in this discussion: "Absence of evidence is not evidence of absence."
This may indeed be appropriate. The comments in your most recent change have language that says: "Do Not Ever use InterlockedExchangePointer() to read sim_asynch_queue. This results in failures across the IBM1130, SEL32 and VAXen simulators." It is truly amazing that I have had a VAX simulator running continuously (whenever my Windows Server is up) for at least the last 12 years and it has never had any sort of a failure like this. This system does tons of network and disk I/O and doesn't merely run instruction diagnostics. It would be best if you created a specific test demonstrating interactions with scp, sim_instr() and devices that perform asynch I/O that proves (or disproves) your assertion and that can readily demonstrate the problem you're fixing on one or more supported platforms. Without a hard proof of failure and some change that unambiguously fixes it, making changes like this are at best wishful thinking. Under real simulator conditions, devices insert events into the queue on the order of dozen's of times a second, while sim_instr() examines the queue likely tens to hundreds of thousands of times per second while executing 10's of millions of instructions per second. The usual case would involve at most 2 or 3 devices slightly busy at the same time with exceedingly rare concurrent queue insertions.
I did implement the optional CPU instruction/time limit, but the failures that were happening all related to errant test behaviors or less than perfect test completion checks. |
I can, however, cause SEL32 and IBM 1130 to fail repeatedly with a slight tweak, lending credence that the existing AIO code as an issue that needs further investigation, because the failure is in the AIO code. How the existing code ends up in a deadlock might prove fruitful; having a working patch that solves several simulators' issue is the proverbial "bird in hand."
CMake build defines
And yet it ends up waiting for some AIO request to complete because
To the contrary, I think I found evidence. :-)
Maybe I should push a commit that ensures that |
I'm digging into some of what you've explored. Meanwhile, these relevant facts should influence your thinking: 1 - You are always building simulators with SIM_ASYNC_IO defined. The makefile builds VERY SPECIFICALLY only compile with SIM_ASYNC_IO defined when code in the specific simulator has devices which leverage asynchronous I/O. The reason being the performance impact of the thread synchronization overhead when it absolutely isn't needed by these simulators. My measurements of the cost of this overhead is between 300% and 400% of the time spent in the various simh event management routines (sim_activate*, sim_cancel, etc.). You should either adjust the cmake builds for these simulators to avoid building with SIM_ASYNC_IO defined, or folks should be advised to build with the makefile to get the fastest simulator results. 2 - Your conclusion that whatever problem the IBM1130 simulator has in the CI/CD testing is related to the changes in this PR is wrong. This simulator does ABSOLUTELY no asynchronous I/O and is completely single threaded. When built with the makefile (without SIM_ASYNC_IO) in the earlier makefile based CI/CD, the test runs intermittently failed there AND NONE of the concerns you raise about InterlockedCompareExchangePointer are a factor since it is never used, If the PR code seems like it "happens" to have fixed the problem, you have just concealed it, or the intermittent activity merely hasn't occurred yet. @rcornwell |
Evidence?
Or it's a Heisenbug related to grabbing |
@markpizz I am not going to argue with you over this. I was having issues with Panda on pdp10-kl, I set noasync and the issue went away. @bscottm ask me (and others) to test his patch. I tried it and my issue with pdp10-kl went away. I do know then under gdb I was able to see that the simulator was stuck waiting for one of the AIO locks. PDP10-kl does use async traffic for networking. |
If you just use your eyes and look at the additional lines of code executed in the sim_activate*, sim_cancel, and sim_aio_check_event and compare them to the simple code paths without any asynch stuff, my seat of the pants would have been more than the 3-4X I measured. I've been busy but am still digging into this.
Well, "Heisenbug" isn't in the picture since there is no sim_asynch_lock when compiled without SIM_ASYNC_IO. Build without asynch support and the IBM1130 intermittently failed the original CI/CD activities. |
Again, this is specifically for the case where simulators DO NOT want or need to be build with SIM_ASYNC_IO defined since everything they do is done completely within one thread! |
I'm starting to wonder why SYM_ASYNC_IO should be around in the first place. |
@pkoning2 I agree here... the only places where I can see an I/O callback are for ethernet or tmxr data receiving. Generally for the level of simH simulators asynchronous I/O does not buy much. Also for most (other then perhaps VAX) they can't really take advantage of async disk I/O. |
Here're the abbreviated timings from While AIO is infrequently used in most simulators, what counts is the overhead from checking a synchronized variable in a simulator's With respect to defining
|
It used to be, and it certainly isn't necessary and is not the case in the simh/simh repo OR in the Visual Studio Projects whenever Asynch I/O isn't appropriate. USE_READER_THREAD is only meaningful for simulators that actively use sim_ether. SIM_ASYNCH_IO is appropriate for simulators that use sim_ether or use RQ and TQ devices. All other simulators are completely single threaded. As for your rigorous tests: Most simulator testing merely validates the register definitions and does essentially nothing related to simulator execution or event processing. Additionally, the cases of more extended testing (VAX and others) are designed to rigorously exercise the instruction execution activities of the particular simulator and deliberately do nothing special to touch devices that may actually perform asynchronous I/O. None of these aggressively exercise anything related to event activities. I'll provide more precise evidence of the performance difference later today. |
The ONLY things which can be cherry picked from simh/simh are changes made by others. ABSOLUTELY ANY CHANGE I have made IS OFF LIMITS. |
But changes made to open-simh are ok to be pulled into your repository? |
Just go ahead and make the changes to the open-simh makefile - as appropriate to the issues reported here. |
I occasionally (today included) submit changes directly to open-simh/simh. Some of my PRs are merged, some are not. The more that are not merged, the less likely I am to submit future ones, and the less likely I am to participate in discussions about code I wrote. Therefore the approximately 200 commits I've made since the open-simh fork are no longer part of open-simh. These include fixing bugs, adding functionality and generally extending some simulators and the platform in general.
Different licenses,, different rules. You guys motivated the different rules based on the idea that my solution to a user problem wasn't good enough and required a different solution. This varied from the historical approach always taken in the simh project for the prior 20 years where the author of possibly problematic code is primarily responsible for fixes. I submitted today's PR due to concern about the problems raised by this PR, which a later message will provide details about. |
Any simulated disk or tape device could be written to provide asynchronous operations. Writing with that in mind requires restructuring the traditional simulated device to take advantage of that approach. When Sergey Oboguev implemented his multi-processor (32 processor) VAX simulator back in 2012 he started from the then current simh code which at that time included asynch RQ, TQ and ethernet devices. Pausing for I/O operations "in between instructions" would have significantly impacted the behaviors of the simulated processors. Meanwhile, the only simulators which might need SIM_ASYNCH_IO are ones with Ethernet and then it only exists to avoid polling for input which then lets things be well behaved while idling. The cmake build infrastructure should only build with SIM_ASYNCH_IO for simulators that need it. Evidence:
Same host using simh/simh repo with SIM_ASYNCH_IO defined and DONT_USE_AIO_INTRINSICS defined, thus avoiding any theoretical issues with InterlockedCompareExchangePointer
Same host using simh/simh repo with SIM_ASYNCH_IO defined, not avoiding anything with InterlockedCompareExchangePointer.
All 3 cases had the same events scheduled and completed. This test represents a basic simulator that has no Ethernet devices. In these cases, it seems that a 5X advantage in the SCP event management APIs to compiling without SIM_ASYNC_IO defined. On Windows compiled with VS2008 (Debug build), we get:
This is a 4.3X advantage in the SCP event management APIs to compiling without SIM_ASYNC_IO defined. Apart from this basic processing advantage when there are no asynchronous threads in the picture, I've got another test which very aggressively exercises the asynchronous event system with a varying number of threads. Processing 100 Million asynchronous events. This exercise runs equivalently with BOTH the locked based asynchronous system as well as the Lock Free system. Performance is better with the lock free case. Equivalently, means that all thread activity completes as expected, but due to dynamic scheduling of threads by the OS, the order changes slightly - this is fully expected. The exact same results are observed on: Windows Intel, WSL Intel, Ubuntu Intel, macOS Intel, macOS M2, Raspberry Pi. BTW, the VAX diagnostic tests aggressively exercise the instruction execution, but actually only perform at most several dozen asynchronous events reading the diagnostics into memory. Not much of a test of the asynchronous event system. Given that the above mentioned aggressive testing has not uncovered any actual problems to be solved, I suggest that before this PR is considered further, a demonstrable failure should be provided that the changes actually fix. @pkoning2: |
Your fix for AIO seems to be to remove it. @bscottm fix appears to solve the problems with it. |
No; the rule is no bringing over stuff from simh/simh because of the mismatching license. What does work is for the author of some change (who can license code at will as the owner) to contribute stuff to open-simh whether it was also submitted to simh/simh or not. |
As a rule they get merged, but there may be delays to to busy work schedules. |
You opened this PR specifically with the assertion that there was a problem that needed fixing in simh asynchronous event queueing without any proof of that assertion. You should be able to write your own test case which demonstrates the problem you are actually fixing. The concept of how to write a demonstration which exercised the original code and produces a failure should be simple for you given the language you use in the PR code comments which claims various specific failures and deep knowledge of how things work. Using my code, which proves there are no problems, doesn't solve your requirement to demonstrate any actual failure case. The theoretical example cases that have been suggested aren't actually reproducible after many dozens of tries on various platforms. The assertion that, these problems don't happen when run with the PR code, doesn't really prove anything since they aren't producible beforehand and even if these rare problems actually occur the timing of the code changes in the different implementations could readily change the timing between threads and that changed timing could affect the true cause of the problem which I now am sure isn't related to asynchronous simh queueing. The cases that were "rumored" to be fixed by the PR: The only bug I found was the fact that SET NOASYNC didn't disable asynchronous event queueing for devices using sim_ether. I've provided a fix for that, This doesn't fix anything actually related to this code in this PR, but it was discovered while exploring the theoretical problem space (as was the missing SIM_ASYNC_IO in the open-simh/simh makefile). |
I once asked a relatively simple question, "Is If one is going to use threads, one has to take into consideration read/write consistency across cores. Using mutexes does not automatically ensure consistency; it only ensures sequential ordering between threads. Consistency is an issue on ARM, which underlies the Apple M1 processors. So, one has to pay attention to this problem if forcing the AIO check is necessary. Windows' Windows does not have an
This measures an amortized I could write a trivial test that exercises the |
c3a60fb
to
ff11516
Compare
Could you handle the merge conflicts? |
Please hold off on committing this PR. The author hasn't actually provided anything which demonstrates that this PR fixes any problem. He asks this question which has been repeatedly answered in the past as the motivation for the PR, due to some external threading analysis tool generating a false positive:
The answer, once again, is that the setting of that variable (and it's reference elsewhere) is deliberately not synchronized since it might change the delivery of an event into the simulator a little sooner that it might otherwise happen, but the cost of synchronizing that check on every potential reference would significantly slow down simulator execution by the overhead of the synchronization with no measurable gain in the behavior of the simulator. Further explanation and details provided later today. |
It is interesting that, while teaching your undergraduates and interacting with defense contractors, you never suggested that they should have unit tests to exercise and prove problems exist. Maybe you suggested it for them, but you're absolved of any such utilities and can just blow ahead and rewrite things which aren't actually broken.
That is very wise, and in fact precisely what was done.
When building with VS2022, the event APIs are actually faster than the VS2008 compiler, which is great AND it proves that the newer compiler improves things without breaking code that has always worked.
Your exercise of the changed code using the VAX730 simulator which reasonably exercises the range of simulator's ability to execute all of its instruction with some vigor, however, the whole simulator test exercises some 143k lines of code across some 148 files. During this vigorous exercise of the simulator, it runs for almost 9 minutes of simulator elapsed time and executes some 350 million instructions and interacts with the SCP event system only some 125K events with a grand total of 53 asynchronous events. This relatively small number of events and essentially no asynchronous events hardly would be considered by anyone as a test of the code you've changed in this PR. Repeating that test really doesn't add any value to exercise the changed code. The point of exercising the event system and more specifically the asynchronous event code should be to assure that the code you're changing is robust to cover all theoretical multi-threaded interactions. The total asynchronous event processing code is some 200ish lines of code. A reasonable unit test of these 200+ lines of code that 1) demonstrates the failure of the prior code, and 2) demonstrates the proper behavior of the changed code would be required by most folks in order to adopt this changed code.
The examples measurements I provided earlier proved that simulators which never leverage the SCP asynchronous APIs were burdened with significant extra cost during event processing, and therefore should be built without USE_ASYNCH_IO defined. The real test and point of your changes is the behavior of how multiple threads behave when the SCP asynchronous event APIs are rigorously engaged including all the potential ways and timing that separate threads can interact. Such a test should demonstrate failure when run with the unmodified asynchronous even APIs and success with modified code. These results should demonstrate success results on all platforms that SIMH supports. Getting back to the earlier question about whether "sim_asynch_check = 0" is needed , as I've said earlier, "it would be nice" if it was noticed every time it is done, but as demonstrated by the full vax730 test, asynchronous events happen very rarely relative to other event interactions and instruction execution, so adding overhead to assure that the asynchronous event is always delivered as soon as possible vs slipping a few instruction times is not worth any extra overhead to assure that. Some of your comments in the earlier commits suggested things about acquire and release semantics. These might be relevant on some platforms, but the total changes required to achieve this comprises 2 lines removed and 4 lines added to the existing code which results in code which reliably works on all current platforms AS WELL as IA64 VMS which is then a distinct improvement for that rare platform. At the beginning of this PR, you asserted in the commit comments and comments in the code is that the existing code causes synchronization problems on various simulators without anything that actually proved the code being changed here was actually due to the observed problems. Now, after my earlier comments, you remove those assertions, so now you've got changes that are merely "gee if I had to solve this problem, I'd do it this way". This is another of your "gee if" code changes that don't actually fix anything that is broken. The Open SIMH Steering Group is welcome to take any and all such changes rather than merely focusing on adopting changes which actually fix problems or enhance functionality. |
Fixing a bug is not a predicate to simplifying complex code. The LLVM thread sanitizer brought an issue to light that led to an opportunity to simplify the AIO thread code. LLVM isn't some random tool: it's the compiler toolchain for macOS and widely used. Were it the case that the sanitizers produced consistent false positives, both the Google and the LLVM teams would have addressed them. Modifying Asymmetry: There is one Lock-free: The original code surrounds the lock-free access to Ensuring original code intent: The intent behind setting Inlines vs. Mutex semantic overloading: Possible Heisenbug: @rcornwell observed that the the |
c28c074
to
9593560
Compare
Looking into why CMake 3.28.9 doesn't like "Visual Studio 17 2022". Works fine locally for me for CMake 3.28.8 but borks for 3.28.9. Ubuntu and macOS builds use an older version as well, so this appears to be an issue with the latest, 3.28.9. |
@bscottm said:
Your code is no less complex than the original code, in fact it merely moves things around mostly to create your needed inline setup. Of course, digging through things and moving things around got you familiar with it so now it doesn't seem complex.
You've removed the "other thread" modifying sim_asynch_check so with the code in this PR, the original role of setting sim_asynch_check to 0 isn't happening at all, nor does it matter since all references to sim_asynch_check happen in the main simulator thread. This is NOT what was intended. Meanwhile, your amazing thread sanitizer tool somehow doesn't identify at least one other variable which is referenced without explicit synchronization in multiple threads.
It might be harder to maintain if there was actually a need to maintain it since it had actual failure modes. You haven't demonstrated anything specific which actually fails. Certainly, if a problem actually existed it would need to be addressed by someone who correctly considered all of the original intended design goals. Your solution to your PR #327 shows that you actually can create a unit test to exercise some internal code specifically, but you silently ignore that any such unit testing would be useful and provide any number of insights. Your single test case which is absolutely not reproducible is no proof that there is a problem in the asynch event code. A rumor has been stated that a hang was sitting a mutex wait when examined with a debugger. You could readily take such a case and dig around the rest of the memory state to get a handle on the actual conditions which caused the hang. Knowing that you could then write a unit test which specifically created the conditions reliably so it wouldn't be hard to reproduce. Once you had such a test case, you could drop in different code for the asynch event queuing system and fix the problem. I believe that you won't actually find such a problem.
I can't find any AIO_I_LOCK or AIO_I_UNLOCK. I do use AIO_ILOCK and AIO_IUNLOCK, which do something when a lock free solution isn't available (the mutex case), and otherwise are defined as nothing. This strategy (of #defines sometimes being blank) is precisely how all of the SIM_ASYNCH_IO enabled macros are only defined when needed. You don't seem to want to change that approach.
That is certainly true and addresses the point I raised above. Extending that logic (testing only the sim_asynch_queue) one inch further removes the concept of having sim_asynch_check at all. MEANWHILE, as I've explained previously, referencing sim_asynch_queue (which sometimes requires a LOCK, or at least various memory flushes on some platforms) is more expensive than the unsynchronized value reference to sim_asynch_check. That needs to happen on every instruction execution in sim_instr(), while actual asynchronous events are very very rare in a running simulator and actually occur maybe once every 100's of thousands of instructions (or maybe millions of instructions). Avoiding the continuous overhead of mutex access or memory flushing makes sense to me, but not to you or your thread sanitizer.
#defines (if properly defined) also ensure precise type safety, but step through debugging is generally not available. The "#ifdef forest" was already in a single place. Now, with your "Inline code", you need to actually have a compiler situation which in fact actually inlines this. Otherwise, you are burdened with procedure call overhead for every reference, you get automatic inline for free in both debug and release builds with the "#ifdef forest".
We're back to "gee if I were to write this". Please demonstrate any functional problem with the existing code as well as the debug case.
It merely moves the bar to a different Heisenbug place which you haven't stumbled into yet. As I advised above, capture the hanging case and dig through the rest of the state to find the necessary conditions for the hang, then create a unit test which reliably recreates the hang conditions. FYI, several years back I dug through prior apparent hangs in at least one of @rcornwell's simulators and the problem was not in any asynchronous code, but an oversight in some aspect of the simulator/device code. The simulator code is large and complicated as is the rest of the SCP code which supports it. A unit test which specifically exercises the presumed failing routines (the asynch event queue management) will naturally remove all of this extra simulator and SCP code that somehow stumbled into the failing case. |
@mark: I don't have the time tonight to respond fully. Please refrain from referring to the Google-produced, LLVM compiler team-maintained thread sanitizer as "my thread sanitizer."
And always remember the Roman admonition, "Memento mori." |
9593560
to
314192d
Compare
@pkoning2: This patch set is a refactoring and cleanup of the AIO "lock-free" implementation that fixes a few issues that I've documented and for which I've provided rationale in earlier parts of the thread. It may have also unintentionally fixed a latent Heisenbug. The refactoring uses static inline functions instead of preprocessor macro code. In my experience, static inlines provide better argument hygiene and type safety, vice preprocessor macro code. Fundamentally, I've striven to provide correct cross-thread communication based on my experience of writing, teaching and reviewing multicore threaded code. It's a matter of doing it intentionally and not mixing paradigms. Lock-free code, if implemented correctly, should not have to lock a mutex. To @markpizz's point of waiting for synchronization primitives, his concern is overblown. On Intel,
What does the assembly look like? GCC (left) and Clang (right) produce:
See the For ARMv8 (AARCH64 and a proxy for Apple M1), GCC and Clang emit ARMv8 assembly output for the same C code as above:
The The claim that I've rearranged the complexity is true, to a limited extent. My aim was to make the code more readable and that should make it more maintainable in the long run. @markpizz obviously disagrees. It should be fairly easy to determine the preprocessor conditional flow for the lock-free, lock-based and no-AIO-at-all paths. Heisenbugs: Hoping that one can find a way to consistently reproduce a Heisenbug is a fool's errand. That's why they're called "Heisenbugs". Perturbing the code is one of the few ways to get a Heisenbug to identify itself. It might be true that I've pushed the Heisenbug elsewhere or deeper. It might also be true that I've quashed a long term latent bug. Only time will tell. The vigor with which @markpizz derides this code seems to be a perverse sort of admiration. |
@pkoning2: Latest push eliminates the |
Zlib is no longer packaged as part of macOS, now needs to be installed by HomeBrew.
- Periodic "python cmake/generate.py" to sync with makefile changes. - Ensure that AIO_CCDEFS (makefile) and USES_AIO (CMake generate.py) are present for PDP-10 simulators. Otherwise, user gets a warning message when the simulator starts about network support and asynchronous I/O.
Address a Clang/LLVM sanitizer warning that scp_asynch_check is written by both the AIO and main threads, one thread potentially clobbering the other's value. The "scp_asynch_check = 0" at scp.c:239 is where the thread sanitizer detects the issue. This assignment is supposed to cause the main thread to process I/O updates on the next AIO_CHECK_EVENT code's execution. To preserve that behavior, AIO_CHECK_EVENT now executes AIO_UPDATE_QUEUE when sim_asynch_queue != QUEUE_HEAD, i.e., there is work to be done in the queue. Code refactoring: - Convert preprocessor macro code to inline functions unless impractical. - Eliminate the asymmetry between the lock-based (mutex) and lock-free implementations. - Lock-free: AIO_ILOCK/AIO_IUNLOCK do not reacquire sim_asynch_lock when compiler intrinsics are detected (GCC, Clang, MS C and DEC C on Itanium.) - Lock-based: If DONT_USE_AIO_INTRINSICS is defined or intrinsics are not detected, the AIO implementation becomes lock-based via mutexes and AIO_ILOCK/AIO_IUNLOCK recursively acquire/release sim_asynch_lock. - AIO defaults to the lock-based implementation if compiler intrinsics are not detected. - GCC, Clang: Prefer the __atomic intrinsics over the deprecated __sync intrinsics. The __sync intrinics still exist for older GCC compilers. - sim_asynch_lock is a recursive mutex for both lock-based and lock-free implementations. Eliminates implementation asymmetry. - AIO_CHECK_EVENT invokes AIO_ILOCK and AIO_IUNLOCK so that the lock-based code cannot alter sim_asynch_queue when checking for pending I/O work. - sim_debug_io_lock: Debug output serialization lock. sim_asynch_lock was semantically overloaded to serialize output from _sim_debug_write_flush. This lock provides better semantic clarity. - Removed sim_asynch_check. It is no longer necessary. - New builder script flag to disable AIO lock-free, force AIO lock-based code: - cmake-builder.ps1 -noaiointrinsics ... - cmake-builder.sh -no-aio-intrinics ... "-DDONT_USE_AIO_INTRINSICS:Bool=On" is the command line CMake configuration option.
As I mentioned, I reacted to the message produced by your forced push to this PR which now does have 3 commits. I don't think it is appropriate to have the same commits present in multiple independent PRs. |
@markpizz: If/when the other PR gets merged, the extra commits go away as they will be part of master. Which will mean a re-push to this branch, which is in no hurry to get merged anyway, so it's not really as big a deal as it seems. (And if you're looking at any of my other branches, you'll see the same thing, until the other PR gets merged.) |
[Mark: I'm prepared for the tirade. There never was a "false positive" detected by the Clang/LLVM thread sanitizer.
AIO_ILOCK
andAIO_IUNLOCK
were set toAIO_LOCK
andAIO_UNLOCK
, respectively. They are now set to an empty preprocessor define. No net performance impact.AIO_LOCK
andAIO_UNLOCK
in_sim_debug_write_flush
to serialize debug output is a bit sketchy.#ifdef
/#endif
forest and static inlines, rather than#define
-ing compare-and-swap in terms of Microsoft'sInterlockedCompareExchangePointer
. This makes the code more maintainable and (arguably) more readable.AIO_INIT
is asymmetric between the lock-free and lock-based code paths, notably,sim_asynch_lock
is recursive for lock-based, but not for lock-free. I didn't change the asymmetry, althoughAIO_INIT
could be hoisted out of the lock-free/lock-based preprocessor conditionals and use a recursive mutex in both cases.][Note: I've tried this branch against the 3b2/400 SVR3 and PDP10-KL TOPS 20 v7 installs, both of which are I/O heavy and exercise multiple devices. Haven't experienced any lockups or issues.
Also: Does anyone actually compile SIMH for Intel Itanium on a DEC platform currently?]
Fix a Clang/LLVM sanitizer warning that
scp_asynch_check
is written by both the AIO and main threads, one thread potentially clobbering the other's value. Thescp_asynch_check = 0
at scp.c:239 is where the thread sanitizer detects the issue.scp_asynch_check = 0
is supposed to cause the main thread to process I/O updates on the nextAIO_CHECK_EVENT
code's execution. To preserve that behavior,AIO_CHECK_EVENT
now executesAIO_UPDATE_QUEUE
when either sim_asynch_check decrements below 0 or there is work to be done on the AIO queue (sim_asynch_queue != QUEUE_HEAD
.)Code refactoring:
AIO_ILOCK/AIO_IUNLOCK: As documented in sim_defs.h, these two macros are empty in the lock-free code case. Minor performance improvement.
AIO_QUEUE_VAL and AIO_QUEUE_SET are inlined static functions and directly use compiler intrinsics for the lock-free implementation, when available.
GCC/Clang: Use the __atomic intrinsics. __sync intrinsics have been deprecated since the C++11 standard's publication. The __sync intrinsics are still available and provide a fallback if GCC or Clang is really so old that the __atomic intrinsics are unavailable (insert Wallace Shawn's iconic "That's totally inconceivable!" here.)
AIO_QUEUE_VAL: Leverages the processor's read fence when available. Ensures that sim_asynch_queue's value is consistently read across cores and threads in the lock-free implementation for platforms without total store ordering or strong consistency guarantees (i.e., anything other than Intel and SPARC.)
AIO_QUEUE_SET: Compare-exchange to set sim_asynch_queue's head. The return value is a boolean (1/0) to reflect the success (1) or failure (0) of the compare-exchange.
AIO_CHECK_EVENT invokes AIO_ILOCK and AIO_IUNLOCK so that the lock-based code cannot alter sim_asynch_queue when checking to see if there is pending I/O work to be done. sim_asynch_lock is a recursive mutex in the lock-based case, so AIO_ILOCK/AIO_IUNLOCK will succeed for the lock-based implementation.
New builder script flag(s) to disable AIO lock-free, force AIO lock-based code: