Stream-to-stream synchronization at the buffer level
Highlights from README:
- Support for CUDA events, and `Condition`-based events for CPU backends (see the sketch after this list).
- Overhaul of the backend interfaces, both user-facing and, especially, internal: full code sharing.
- Automatic stream-to-stream synchronization on a per-tensor-node basis.
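The CPU-side events can be pictured as a flag guarded by a mutex and a condition variable. A minimal sketch, assuming nothing about OCANNL's actual event API (all names here are illustrative):

```ocaml
type event = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable triggered : bool;
}

let create () =
  { mutex = Mutex.create (); cond = Condition.create (); triggered = false }

(* The producing stream signals that the work preceding the event finished. *)
let record ev =
  Mutex.lock ev.mutex;
  ev.triggered <- true;
  Condition.broadcast ev.cond;
  Mutex.unlock ev.mutex

(* A consuming stream blocks until the event has been recorded. *)
let sync ev =
  Mutex.lock ev.mutex;
  while not ev.triggered do
    Condition.wait ev.cond ev.mutex
  done;
  Mutex.unlock ev.mutex
```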
Details from the changelog:
Added
- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization (sketched after this list).
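To illustrate the lazy flavor: a hedged sketch, building on the event sketch above, where a writer records an event per tensor node and a consumer (e.g. a device-to-device copy) waits only when it actually reads that node. `writer_events`, `note_write`, and `sync_before_read` are hypothetical names, not OCANNL's actual API:

```ocaml
(* Maps a tensor-node id to the event recorded after its latest write. *)
let writer_events : (int, event) Hashtbl.t = Hashtbl.create 16

(* After scheduling a write to a node, remember the event that fires when
   the write completes. *)
let note_write ~node_id ev = Hashtbl.replace writer_events node_id ev

(* Before reading the node from another stream: wait lazily, only if some
   stream wrote the node and the event has not been observed yet. *)
let sync_before_read ~node_id =
  match Hashtbl.find_opt writer_events node_id with
  | Some ev ->
      sync ev;
      (* Dropping the entry is just event book-keeping cleanup: the event
         stays triggered, so later waiters would return immediately anyway. *)
      Hashtbl.remove writer_events node_id
  | None -> ()
```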
Changed
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single-device model.
- Got rid of `unsafe_cleanup`.
- Renamed `subordinal` to `stream_id`.
- Removed the dependency on `core`; broke up the dependency on `ppx_jane`.
- Huge refactoring of backend internal interfaces and API (no longer repeating the same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization; it is now just a slight optimization, as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
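The last point can be made concrete with a generative functor: each application yields a structurally distinct module with its own private state, so one backend instance cannot observe another's caches. A sketch under that assumption; `BACKEND` and `Fresh` are illustrative, not the actual interface:

```ocaml
module type BACKEND = sig
  val compile : string -> unit
end

module Fresh () : BACKEND = struct
  (* Per-instance compilation cache, created anew on every [Fresh ()]. *)
  let cache : (string, unit) Hashtbl.t = Hashtbl.create 16

  let compile source =
    if not (Hashtbl.mem cache source) then
      (* ... real compilation would go here ... *)
      Hashtbl.add cache source ()
end

(* Two fresh backends share no mutable state, so caches cannot leak: *)
module B1 = Fresh ()
module B2 = Fresh ()
```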
Fixed
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval (sketched below).
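The shape of that check might look as follows; a sketch only, with `buffer` and `is_released` as illustrative stand-ins. The point is that the GC keeping a weak-array entry alive does not imply the underlying buffer is still valid:

```ocaml
type buffer = { label : string; mutable is_released : bool }

(* Retrieve an entry only if it is both still reachable and not released. *)
let get_live (arr : buffer Weak.t) i =
  match Weak.get arr i with
  | Some b when not b.is_released -> Some b
  | _ -> None (* collected by the GC, or alive but already released *)
```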