Stream-to-stream synchronization at the buffer level
Highlights from README:
- Support for CUDA events, and `Condition`-based events for CPU backends (see the sketch after this list).
- Overhaul of the backend interfaces, both user-facing and, especially, internal: full code sharing.
- Automatic stream-to-stream synchronization on a per-tensor-node basis.
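The CPU-side events can be pictured as a flag guarded by a mutex and a condition variable. A minimal sketch, assuming nothing about OCANNL's actual event API (all names here are illustrative):

```ocaml
type event = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable triggered : bool;
}

let create () =
  { mutex = Mutex.create (); cond = Condition.create (); triggered = false }

(* The producing stream signals that the work preceding the event finished. *)
let record ev =
  Mutex.lock ev.mutex;
  ev.triggered <- true;
  Condition.broadcast ev.cond;
  Mutex.unlock ev.mutex

(* A consuming stream blocks until the event has been recorded. *)
let sync ev =
  Mutex.lock ev.mutex;
  while not ev.triggered do
    Condition.wait ev.cond ev.mutex
  done;
  Mutex.unlock ev.mutex
```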
Details from the changelog:
Added
- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization (sketched after this list).
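To illustrate the lazy flavor: a hedged sketch, building on the event sketch above, where a writer records an event per tensor node and a consumer (e.g. a device-to-device copy) waits only when it actually reads that node. `writer_events`, `note_write`, and `sync_before_read` are hypothetical names, not OCANNL's actual API:

```ocaml
(* Maps a tensor-node id to the event recorded after its latest write. *)
let writer_events : (int, event) Hashtbl.t = Hashtbl.create 16

(* After scheduling a write to a node, remember the event that fires when
   the write completes. *)
let note_write ~node_id ev = Hashtbl.replace writer_events node_id ev

(* Before reading the node from another stream: wait lazily, only if some
   stream wrote the node and the event has not been observed yet. *)
let sync_before_read ~node_id =
  match Hashtbl.find_opt writer_events node_id with
  | Some ev ->
      sync ev;
      (* Dropping the entry is just event book-keeping cleanup: the event
         stays triggered, so later waiters would return immediately anyway. *)
      Hashtbl.remove writer_events node_id
  | None -> ()
```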
Changed
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single-device model.
- Got rid of `unsafe_cleanup`.
- Renamed `subordinal` to `stream_id`.
- Removed the dependency on `core`; broke up the dependency on `ppx_jane`.
- Huge refactoring of backend internal interfaces and API (no longer repeating the same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization; it is now just a slight optimization, as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
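The last point can be made concrete with a generative functor: each application yields a structurally distinct module with its own private state, so one backend instance cannot observe another's caches. A sketch under that assumption; `BACKEND` and `Fresh` are illustrative, not the actual interface:

```ocaml
module type BACKEND = sig
  val compile : string -> unit
end

module Fresh () : BACKEND = struct
  (* Per-instance compilation cache, created anew on every [Fresh ()]. *)
  let cache : (string, unit) Hashtbl.t = Hashtbl.create 16

  let compile source =
    if not (Hashtbl.mem cache source) then
      (* ... real compilation would go here ... *)
      Hashtbl.add cache source ()
end

(* Two fresh backends share no mutable state, so caches cannot leak: *)
module B1 = Fresh ()
module B2 = Fresh ()
```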
Fixed
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval (sketched below).
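The shape of that check might look as follows; a sketch only, with `buffer` and `is_released` as illustrative stand-ins. The point is that the GC keeping a weak-array entry alive does not imply the underlying buffer is still valid:

```ocaml
type buffer = { label : string; mutable is_released : bool }

(* Retrieve an entry only if it is both still reachable and not released. *)
let get_live (arr : buffer Weak.t) i =
  match Weak.get arr i with
  | Some b when not b.is_released -> Some b
  | _ -> None (* collected by the GC, or alive but already released *)
```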