runtime: distinguish two kind of mutexes #13716

gasche · 2025-01-08T09:59:32Z

Summary

This PR distinguishes two kind of mutexes:

'runtime' mutexes are for blocking critical sections which do not access the runtime.
'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system.

This refactoring comes from the discussions in #13227, it tries to avoid a class of bug where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlock situations.

More details

The runtime has a caml_plat_lock_blocking function that takes a mutex in the obvious way. This function should be used very carefully, because it in blocks a domain without transferring control to its backup thread or otherwise listening to STW interruptions, and it can easily cause deadlocks if the critical section itself contains an STW poll point. In #13063, @gadmm introduced a different mutex-taking function, caml_plat_lock_non_blocking that releases the domain lock when it needs to block, and should be used in any critical section that could be long or needs to use the runtime system.

We are still learning about what's a correct usage discipline for these two functions (on Monday I temporarily introduced a bug in trunk, detected reported on Tuesday morning by @jmid in #13713 and fixed on Tuesday evening by @gadmm in #13714 ). In #13714 we realized that it is incorrect to mix uses of lock_blocking and lock_non_blocking on the same mutex -- except in very specific use-cases that are not currently used in the runtime. The current PR proposes to separate the two APIs so that there is no risk to make this mistake again in the future. The goal is to have a system that is simpler to reason about and to use correctly for non-experts such as myself.

gasche · 2025-01-08T10:18:34Z

(I intend to enable multicoretests CI on this PR -- it should only be a refactoring that does not change the runtime behavior, but I wrote new code for mutator mutexes instead of just moving code around, so there may be mistakes. But I will wait first for the usual CI to be green, I have only tested it lightly locally and there is no need to unleash automated testing yet.)

- 'runtime' mutexes are for blocking critical sections which do not access the runtime. - 'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system. This refactoring comes from the discussions in ocaml#13227, it tries to avoid a class of bug where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlock situations.

gasche · 2025-01-08T10:24:15Z

Note: I am afraid there is some overlap and potential conflicts between this PR and #13416, as the PR exploits the definition of sync_mutex in sync_posix.h as a pointer to a caml_plat_mutex. There is not much I can do in particular about this (I don't have the expertise to review #13416 to help it move forward), but I tried to not touch sync_posix.h to minimize conflicts.

MisterDA · 2025-01-10T13:21:21Z

use correctly for non-experts such as myself

Speaking as a non-expert myself, I often have to remind myself that blocking/non-blocking in this context refers to the domain lock and not the mutex itself (blocking on the mutex until it's released). What I mean is that caml_plat_lock_non_blocking will wait on the mutex and is not a wrapper of, e.g., pthread_mutex_trylock.
It's sort of similar to caml_enter_blocking_section and caml_leave_blocking_section. Maybe there's a case for renaming the lock functions to something like caml_lock_release_runtime_system and caml_lock_acquire_runtime_system?

gasche · 2025-01-10T14:02:05Z

I see your point, but acquire_runtime_system does not work as it suggests taking the domain lock, while we expect it to be held and keep it.

(I suggested "cooperative" (and we could use "non_cooperative") before but @gadmm was not convinced.)

Maybe caml_plat_lock_blocking and just caml_mutex_lock_polling could work? The problem with caml_plat_lock is that it really is blocking on the mutex, and therefore it blocks the domain; it's nice that the name calls attention to this problem and makes people think twice about using it. On the other hand caml_mutex_lock{,_polling} is a somewhat standard mutator function, which is unsafe in contexts that don't own the runtime system but this is a safety property common with other functions of the runtime, so it may be handled in more standard ways that don't require a specific naming convention.

(It may be that some lightweight static annotations could help detect issues where mutator functions are called unsafely; there is already a dynamic debug-time system with CAMLnoalloc and CAMLalloc_point_here that I'm not familiar with and therefore do not use.)

gadmm · 2025-01-14T18:43:20Z

Just a quick comment without reading the code in-depth, the "non blocking" variants of plat functions are also "non raising" variants of the caml_ml_mutex_* functions from sync.c, that can be used wherever you cannot reasonably handle an exception. So an option would be "caml_mutex_lock_noexc" and "caml_plat_lock_blocking".

I do not see what is the use for the current caml_mutex_lock. This one seems like it would easily be misused and cause deadlocks. Before removing it, though, are we sure no-one uses it?

gasche mentioned this pull request Jan 8, 2025

Audit and fix caml_plat_lock_blocking usage #13227

Merged

gasche added the runtime-system label Jan 8, 2025

gasche force-pushed the caml_plat_lock_non_blocking5 branch from 476e163 to b839f44 Compare January 8, 2025 10:22

dra27 assigned gasche Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: distinguish two kind of mutexes #13716

runtime: distinguish two kind of mutexes #13716

gasche commented Jan 8, 2025

gasche commented Jan 8, 2025

gasche commented Jan 8, 2025 •

edited

Loading

MisterDA commented Jan 10, 2025

gasche commented Jan 10, 2025

gadmm commented Jan 14, 2025

runtime: distinguish two kind of mutexes #13716

Are you sure you want to change the base?

runtime: distinguish two kind of mutexes #13716

Conversation

gasche commented Jan 8, 2025

Summary

More details

gasche commented Jan 8, 2025

gasche commented Jan 8, 2025 • edited Loading

MisterDA commented Jan 10, 2025

gasche commented Jan 10, 2025

gadmm commented Jan 14, 2025

gasche commented Jan 8, 2025 •

edited

Loading