Stream-to-stream synchronization at the buffer level

@lukstafi released this 20 Dec 20:30
· 10 commits to master since this release

Highlights from README:

  • Support for CUDA events, and Condition-based events for CPU backends.
  • Overhaul of the backend interfaces, both user-facing and, especially, internal: full code sharing.
  • Automatic stream-to-stream synchronization on a per-tensor-node basis.
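
To illustrate what a Condition-based event for a CPU backend might look like, here is a minimal sketch of a one-shot event built from a mutex and a condition variable. It assumes OCaml 5, where `Mutex`, `Condition` and `Domain` are part of the standard library. The names `event`, `create_event`, `record`, `sync` and `is_done` are hypothetical and are not taken from the library's interface.

```ocaml
(* Hypothetical sketch, not the library's API: a one-shot event that one
   stream records and other streams can block on. *)
type event = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable is_done : bool;
}

let create_event () =
  { mutex = Mutex.create (); cond = Condition.create (); is_done = false }

(* Called by the producing stream once its work is finished. *)
let record ev =
  Mutex.lock ev.mutex;
  ev.is_done <- true;
  Condition.broadcast ev.cond;
  Mutex.unlock ev.mutex

(* Called by a consuming stream: blocks until the event has been recorded. *)
let sync ev =
  Mutex.lock ev.mutex;
  while not ev.is_done do
    Condition.wait ev.cond ev.mutex
  done;
  Mutex.unlock ev.mutex

(* Non-blocking query of the event's state. *)
let is_done ev =
  Mutex.lock ev.mutex;
  let d = ev.is_done in
  Mutex.unlock ev.mutex;
  d

let () =
  let ev = create_event () in
  let result = ref 0 in
  (* The producer writes its result, then records the event. *)
  let producer = Domain.spawn (fun () -> result := 42; record ev) in
  sync ev;
  assert (!result = 42);
  Domain.join producer
```

The `while` loop around `Condition.wait` guards against spurious wake-ups, and `broadcast` rather than `signal` lets several waiting streams proceed at once.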

Details from the changelog:

Added

  • Interface files for Backends and Low_level.
  • Fixed #245: tracking of used memory, though there is still room for improvement.
  • Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.
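
One way to picture lazy per-tensor-node synchronization is the single-threaded simulation below. It is only a sketch under assumptions: the `tracker` type and the functions `record_write` and `before_read` are hypothetical, and "waiting" is simulated by flipping a flag rather than blocking on a real backend event. The idea it shows is that a stream waits only when it consumes a node last written by a different stream, and only for that node.

```ocaml
(* Hypothetical sketch, not the library's API. *)
type event = { mutable completed : bool }

type tracker = {
  (* tensor-node id -> (writing stream id, event recorded after the write) *)
  last_write : (int, int * event) Hashtbl.t;
  (* Number of cross-stream waits that actually happened. *)
  mutable waits : int;
}

let create () = { last_write = Hashtbl.create 16; waits = 0 }

(* A stream has scheduled a write to a node: remember the writer and event. *)
let record_write t ~stream_id ~tnode_id =
  Hashtbl.replace t.last_write tnode_id (stream_id, { completed = false })

(* Stand-in for blocking on a real backend event. *)
let wait t ev =
  if not ev.completed then begin
    t.waits <- t.waits + 1;
    ev.completed <- true
  end

(* Before a stream reads a node, synchronize only if another stream wrote it. *)
let before_read t ~stream_id ~tnode_id =
  match Hashtbl.find_opt t.last_write tnode_id with
  | Some (writer, ev) when writer <> stream_id -> wait t ev
  | _ -> ()

let () =
  let t = create () in
  record_write t ~stream_id:0 ~tnode_id:1;
  record_write t ~stream_id:0 ~tnode_id:2;
  before_read t ~stream_id:0 ~tnode_id:1; (* same stream: no wait *)
  before_read t ~stream_id:1 ~tnode_id:1; (* other stream: one wait *)
  before_read t ~stream_id:1 ~tnode_id:1; (* already synchronized *)
  assert (t.waits = 1)
```

Node 2 is never read by another stream in this example, so no synchronization is ever performed for it.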

Changed

  • Migrated to cudajit 0.6.1.
  • Verifying that code is linked with the right contexts, by tracking embedded_nodes with assignments.
  • Renaming: (virtual) device -> stream, physical_device -> device.
  • New files: split out backend_intf.ml, backend_impl.ml, schedulers.ml from backends.ml; moved Tnode.task to task.ml; renamed backend_utils.ml to c_syntax.ml.
  • Removed half-static verification of merge buffer nodes inside device_to_device.
  • Fixed #286: cross-stream-sharing incorporated into Tnode.memory_mode.
  • Moved the multicore backend from a device = stream model to a single device model.
  • Got rid of unsafe_cleanup.
  • Renamed subordinal to stream_id.
  • Removed dependency on core, broke up dependency on ppx_jane.
  • Huge refactoring of backend internal interfaces and API to eliminate duplicated code.
  • Built per-tensor-node stream-to-stream synchronization into copying functions.
  • Re-introduced whole-device blocking synchronization, which is now just a slight optimization, as it also cleans up event book-keeping.
  • Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
  • Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
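
The "fresh backends are fresh modules" point can be sketched with a generative functor: each application creates new module-level state, so a cache cannot be shared between two backend instances by accident. The `Backend` signature, `Make_backend` and `fresh_backend` below are hypothetical stand-ins and do not reflect the library's real interfaces.

```ocaml
(* Hypothetical sketch, not the library's API. *)
module type Backend = sig
  val compile : string -> int
  val cache_size : unit -> int
end

(* Generative functor: every application allocates its own cache. *)
module Make_backend () : Backend = struct
  let cache : (string, int) Hashtbl.t = Hashtbl.create 16

  (* Return a cached handle for the source, or create and cache a new one. *)
  let compile source =
    match Hashtbl.find_opt cache source with
    | Some handle -> handle
    | None ->
        let handle = Hashtbl.length cache in
        Hashtbl.add cache source handle;
        handle

  let cache_size () = Hashtbl.length cache
end

(* Each call yields a backend whose state is structurally separate. *)
let fresh_backend () : (module Backend) =
  let module B = Make_backend () in
  (module B : Backend)

let () =
  let (module A : Backend) = fresh_backend () in
  let (module B : Backend) = fresh_backend () in
  ignore (A.compile "x + y");
  assert (A.cache_size () = 1);
  assert (B.cache_size () = 0)
```

Because the cache lives inside the module body rather than in a shared top-level value, obtaining a fresh backend is enough to guarantee an empty cache.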

Fixed

  • Validating merge nodes for the CUDA backend.
  • Checking is_released on weak array retrieval.
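
As a rough illustration of the last fix, the sketch below retrieves an entry from a weak array and treats it as absent if it has already been released. The `buffer` record and the `retrieve` function are hypothetical; only `Weak.create`, `Weak.set` and `Weak.get` are standard library calls.

```ocaml
(* Hypothetical sketch, not the library's API. *)
type buffer = { id : int; mutable is_released : bool }

(* An entry is usable only if it is still alive and has not been released. *)
let retrieve (cache : buffer Weak.t) i =
  match Weak.get cache i with
  | Some buf when not buf.is_released -> Some buf
  | _ -> None

let () =
  let cache = Weak.create 4 in
  let buf = { id = 0; is_released = false } in
  Weak.set cache 0 (Some buf);
  assert (retrieve cache 0 <> None);
  (* After release, the entry must no longer be handed out. *)
  buf.is_released <- true;
  assert (retrieve cache 0 = None)
```

Checking the flag matters because the garbage collector may keep a released value reachable through the weak array for some time after the release.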