[WIP RFC] OS-specific shared memory #89
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is an alternative to the changes to
File
in #80. You still need the changes toBuffer
from #80.In #80, you have a OS-agnostic memory mapping which sits on a OS-specific tmpfs, which is available in Linux only.
Lifecycle management of the memory is guaranteed by the Nanny, also in case of sudden death of the worker.
In this PR, you have OS-specific access to non-POSIX shared memory API, available on Windows and Linux but not on POSIX (crucially for dask, not on MacOSX).
Unlike
multiprocessing.shared_memory
, which is a thin wrapper around the POSIXshm_open
on all OSes except Windows, this API crucially performs reference counting, automatically releasing a shared memory buffer when all the processes holding a reference to it die (gracefully or not).Notes
1 You can straightforwardly calculate total shared memory, without duplication, if you know the PIDs of all the workers on the host. Which in turn is something you can straightforwardly figure out without info from the scheduler as long as all worker processes were forked/spawned from the same parent and didn't secede (e.g. like in
dask worker
CLI). This requires kernel calls costing O(n), where n is the total number of replicated shared memory buffers on the host, but from early benchmarking it looks fast enough not to be of concern. This is not implemented in this PR (yet?)2 You could implement OS-agnostic tracking of the total shared memory through a bespoke service (
distributed.core.Server
) that is informed by the various workers every time they acquire/release a buffer. This service would then communicate directly to the scheduler via a heartbeat. Since it's just a meter and not what actually holds the references to the memory, you need not worry about race conditions and leaks - workers would asynchronously inform the tracker of any new events, when time allows. This feels like a clean design although there's legwork involved around the deployment (dask worker
CLI,LocalCluster
, etc. would need to spawn a new Server and inform all workers of the server's address).