Releases: ahrefs/ocannl

Automatic synchronization and transfers between host and devices

01 Jan 21:39

From the changelog:

Added

  • Automatic transfers to host from the context that most recently updated a node.
  • Automatic transfers of a routine's inputs from the host to the routine's context if the host array modification has not yet been transferred (see the sketch below).
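
A rough sketch of the bookkeeping this implies, for illustration only (the types and functions below are invented for the sketch and are not OCaNNL's API): each node remembers which context most recently updated it and whether the host copy has an untransferred modification, so transfers in either direction happen only when needed.

```ocaml
(* Conceptual sketch only, not OCaNNL's implementation: lazy host <-> device
   transfers driven by per-node bookkeeping. *)
type context = { name : string }

type node = {
  mutable last_writer : context option;  (* context that most recently updated the node *)
  mutable host_dirty : bool;             (* host array modified since the last transfer *)
}

(* Before running a routine, bring an input up to date in the routine's context,
   but only if the host modification was not yet transferred. *)
let prepare_input (n : node) ~copy_host_to_device =
  if n.host_dirty then (copy_host_to_device n; n.host_dirty <- false)

(* After a routine runs, remember which context wrote the node; the transfer
   back to the host happens only when the host actually reads it. *)
let note_update (ctx : context) (n : node) = n.last_writer <- Some ctx

let read_on_host (n : node) ~copy_device_to_host =
  match n.last_writer with
  | Some ctx -> copy_device_to_host ~from:ctx n; n.last_writer <- None
  | None -> ()  (* the host copy is already current *)
```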

Fixed

  • Added # as an alternative to ~~ for comment lines in ocannl_config files, and fixed a bug in their parsing (see the example below).
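
For illustration, a small ocannl_config file using both comment prefixes. The ~~ and # comment syntax is what this entry describes; the key=value layout and the particular settings shown (log_level, randomness_lib, which appear in other entries below) are assumptions made for the sake of the example, not a specification.

```
~~ Comment lines can start with the original ~~ prefix...
# ...or, as of this release, with #.
log_level=1
randomness_lib=stdlib
```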

Stream-to-stream synchronization at the buffer level

20 Dec 20:30

Highlights from README:

  • Support for CUDA events, and Condition-based events for CPU backends.
  • Overhaul of the backend interfaces, both user-facing and, especially, internal: full code sharing.
  • Automatic stream-to-stream synchronization on a per-tensor-node basis (sketched below).
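
A minimal sketch of the per-tensor-node scheme, with invented names throughout (this is not OCaNNL's backend API, and OCaml 5's Domain stands in for a stream): the stream that last updated a node records an event for it, and any other stream that reads the node waits on that event first. As noted above, CPU backends use Condition-based events and the CUDA backend uses CUDA events.

```ocaml
(* Illustrative only: Condition-based events and per-tensor-node waiting. *)
type event = { mutable done_ : bool; m : Mutex.t; c : Condition.t }

let make_event () = { done_ = false; m = Mutex.create (); c = Condition.create () }

let record ev =
  Mutex.lock ev.m;
  ev.done_ <- true;
  Condition.broadcast ev.c;
  Mutex.unlock ev.m

let await ev =
  Mutex.lock ev.m;
  while not ev.done_ do Condition.wait ev.c ev.m done;
  Mutex.unlock ev.m

(* Each tensor node remembers the event of the work that last updated it. *)
type tnode = { mutable last_update : event option }

(* The producing stream runs its work asynchronously and records the event. *)
let schedule_write (tn : tnode) (work : unit -> unit) =
  let ev = make_event () in
  tn.last_update <- Some ev;
  ignore (Domain.spawn (fun () -> work (); record ev))

(* Another stream that reads the node (e.g. a copy routine) waits first. *)
let schedule_read (tn : tnode) (work : unit -> unit) =
  Option.iter await tn.last_update;
  work ()
```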

Details from the changelog:

Added

  • Interface files for Backends and Low_level.
  • Fixed #245: tracking of used memory (there is still room for improvement).
  • Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.

Changed

  • Migrated to cudajit 0.6.1.
  • Verifying that code is linked with the right contexts by tracking embedded_nodes with assignments.
  • Renamed: (virtual) device -> stream, physical_device -> device.
  • New files: split out backend_intf.ml, backend_impl.ml, schedulers.ml from backends.ml; moved Tnode.task to task.ml; renamed backend_utils.ml to c_syntax.ml.
  • Removed half-static verification of merge buffer nodes inside device_to_device.
  • Fixed #286: cross-stream-sharing incorporated into Tnode.memory_mode.
  • Moved the multicore backend from a device = stream model to a single device model.
  • Got rid of unsafe_cleanup.
  • Renamed subordinal to stream_id.
  • Removed dependency on core, broke up dependency on ppx_jane.
  • Huge refactoring of backend internal interfaces and API (not repeating same code).
  • Built per-tensor-node stream-to-stream synchronization into copying functions.
  • Re-introduced whole-device blocking synchronization, which is now just a slight optimization, as it also cleans up event book-keeping.
  • Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
  • Fresh backends are now fresh modules to structurally prevent any potential cache leaking.

Fixed

  • Validating merge nodes for the CUDA backend.
  • Checking is_released on weak array retrieval.

Half precision, mixed precision, CUDA virtual devices

17 Sep 13:07

Release 0.4.1 offers half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

From the CHANGELOG:

Added

  • Implemented the previously-mocked support for half precision (FP16).
    • We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
    • We check FP16 constants for overflow.
    • We output half precision specific code from the CUDA backend.
  • Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
  • A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
  • A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
  • Slides for the Fun OCaml meetup: docs/Fun OCaml.
  • New syntax: inline tensor declarations with a literal float as initial value.

Changed

  • Removed the pipes_cc and pipes_gccjit backends (Pipes_multicore_backend). Pipes_multicore_backend had been fixed by using the poll library instead of Unix.select, but it turned out to be very slow.
  • Changed the %cd block comment syntax ~~ to allow detailed structuring. Rewrote Train.grad_update to use the %cd syntax.
  • Made Train.sgd_one slightly more thrifty: p =- learning_rate *. sgd_delta became p =- learning_rate * sgd_delta ~logic:"." without the inline tensor expression.

Fixed

  • Log levels related de-confusion:
    • Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
    • Properly restore log_level and inform about its setting.
    • By default do not log from tests.
    • debug_log_from_routines should only happen when log_level > 1.
  • Bugs in Multicore_backend: await was not checking queue emptiness; the worker's Condition.broadcast was non-atomically guarded (it doesn't need to be); a possible infinite loop due to the lockfree queue, which is now replaced with saturn_lockfree.
  • Reduced busy-waiting inside c_compile_and_load; compilation errors are now propagated instead of looping forever on error.
  • Fixed loss of significant digits for small numbers when outputting files.
  • Added missing mixed-precision conversions in the C_syntax backend builder.
  • Restored the functionality of debug logging from the cuda backend.
  • Always reinitialize global state at the beginning of let%expect_test, to make the tests more deterministic.

Half precision, mixed precision, CUDA virtual devices

13 Sep 22:39

Release 0.4.1 offers half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

The non-beta release is blocked on getting cudajit 0.4.1 into the opam-repository.

From the CHANGELOG:

Added

  • Implemented the previously-mocked support for half precision (FP16).
    • We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
    • We check FP16 constants for overflow.
    • We output half precision specific code from the CUDA backend.
  • Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
  • A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
  • A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend.
    • It fixes the CUDA backend behavior in the data parallelism benchmark.

Changed

  • Removed the pipes_cc and pipes_gccjit backends (Pipes_multicore_backend). Pipes_multicore_backend had been fixed by using the poll library instead of Unix.select, but it turned out to be very slow.

Fixed

  • Log levels related de-confusion:
    • Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
    • Properly restore log_level and inform about its setting.
    • By default do not log from tests.
    • debug_log_from_routines should only happen when log_level > 1.
  • Bugs in Multicore_backend: await was not checking queue emptiness; the worker's Condition.broadcast was non-atomically guarded (it doesn't need to be); a possible infinite loop due to the lockfree queue, which is now replaced with saturn_lockfree.
  • Reduced busy-waiting inside c_compile_and_load; compilation errors are now propagated instead of looping forever on error.
  • Fixed loss of significant digits for small numbers when outputting files.
  • Added missing mixed-precision conversions in the C_syntax backend builder.
  • Restored the functionality of debug logging from the cuda backend.

Merge buffers, C-syntax backend builder, improved syntax extensions

05 Sep 15:12

From the CHANGELOG:

Added

  • A new backend "cc": emits C and compiles it with a configurable C compiler command, defaulting to cc.
  • Merge buffers representational abstraction (one per virtual device):
    • backends just need to support device-to-device transfers,
    • merging gets implemented in "user space" (see the sketch after this list).
  • CUDA streaming multiprocessor parallelism via streams <-> virtual devices.
  • Support for cuda-gdb and compute-sanitizer (pass the right arguments to cudajit).
  • Inline declarations for (non-differentiable) tensors in the %cd syntax.
  • A minimal wrapper Sync_backend creating CPU backends with a single device only, where all calls are synchronous. (It's a baseline and helps debugging.)
  • In progress: a proper (condition-variable-based) scheduler. The legacy (pipes-based) scheduler is kept for now as a baseline and to help debugging.
  • Documentation for the syntax extensions.
  • %op syntax: when under a ~config parameter, refine the inline declared params' labels with config.label.
  • %op syntax: incorporate the input tensor's (if any) label in the resulting tensor's label.
  • Comments in config files using the line prefix ~~.
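
A sketch of the division of labor behind the merge-buffer abstraction (illustrative names only, not OCaNNL's actual interface): the backend's obligation is limited to a device-to-device copy into the destination device's merge buffer, while accumulating the merge buffer into the destination node is ordinary compiled code, i.e. it happens in "user space".

```ocaml
(* Illustrative only: the minimal backend obligation for merging. *)
type device = Device of int
type tnode = Tnode of string

module type BACKEND_MERGE_SUPPORT = sig
  (* Copy the node's array from [src] into [dst]'s (single) merge buffer. *)
  val device_to_merge_buffer : dst:device -> src:device -> tnode -> unit
end

(* "User space" merging: pull the peer's copy into the merge buffer, then run
   an ordinary accumulation routine that adds the merge buffer into the node. *)
let merge (module B : BACKEND_MERGE_SUPPORT) ~dst ~src node
    ~(accumulate_merge_buffer : dst:device -> tnode -> unit) =
  B.device_to_merge_buffer ~dst ~src node;
  accumulate_merge_buffer ~dst node
```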

Changed

  • Terminology in the API: renamed almost all uses of "jit" to "compile" and/or "link".
  • Split the compile-to-ptx phase from the build-module and build-kernel-launcher phase.
  • Migrated the CUDA backend to ppx_minidebug-based execution tracing.
  • Fixes for mixed precision computations.
  • Further terminology refactoring: Renamed Low_level.compile to Low_level.lower;
    • and Low_level.compiled to Low_level.optimized, making it a record.
  • Further refactoring of the Backends API:
    • split the device type into virtual device and physical_device,
    • removed the direct support for merge, instead relying on merge buffers.
  • Updated to cudajit 0.4.
  • A template for C-syntax backends, refactoring CC and CUDA backends.
  • Improvements to handling of tensor node labels, and to the Tnode.debug_name function.
  • Output files generated by backends, and files generated by logging, in separate subdirectories.
  • C-syntax logging: also output the pre-assignment value when logging an assignment.
  • Migrated to ppx_minidebug 2.0 with the benefits it brings: no runtime passing, Utils.settings.log_level unified with ppx_minidebug's log levels.

Fixed

  • Allow verifying that non-embedded tensor nodes of the tensor(s) associated with a linked code are already in the context passed to link (resp. link_batch), since they won't get introduced into the context. It is the responsibility of helper functions (such as those in Train) to ensure the check.
  • Fixed both known and newly discovered shortcomings of the syntax extensions.
  • In particular, %op syntax: lift ~config applications out of (tensor) functions.
  • Multiple other tiny fixes.

Continuous integration

24 Apr 11:10

From the changelog:

Added

  • GitHub workflow for continuous integration and API docs.
  • Randomness plug-ins via global config randomness_lib: currently only stdlib and for_tests.

Fixed

  • A bit of code rot in the Cuda backend mock cuda_backend.missing.ml.
  • NPY: Compatibility with OCaml 5.2.0.
  • Renamed the main package from ocannl to neural_nets_lib, to prevent the opam linter from complaining about a confusing name.

Complete shape inference for splicing, einsum with ellipsis-in-the-middle

22 Apr 22:00

From the changelog:

Added

  • let%cd _ = (and let%op _ =?) do not affect root tracking (intended for adding shape constraints).
  • More expressive shape constraints: allowing row variables to be sandwiched between leftmost axes beg_dims and rightmost axes dims.
  • Einsum notation support for leftmost axes.

Changed

  • Cleaned up "user-facing" API by moving IDX and CDSL to Train, and Tensor.O to more precise Operation.At.
  • Added interface Tensor.mli to reduce "the user learning surface".
  • Improved documentation and layout of Shape.mli.
  • A more reasonable syntax for label specifications and einsum notation; in particular, it is whitespace-insensitive (except that whitespace is not allowed inside identifiers).
  • Vendored the npy package while we wait for a PR.

Fixed

  • Moved cudajit to depopts.
  • Slice shape inference is now complete, by using leftmost axes beg_dims in constraints.

Package visibility, sanitizing code inclusion (rootness checks)

15 Apr 10:01

This is a small incremental release:

  • making the project API visible from outside the package,
  • providing saving and restoring tensors,
  • preventing some user bugs by "rootness checks" (regarding when code pieces are included with tensor references).

From the changelog:

Added

  • Tensor parameters saving and restoring, Ndarray saving and restoring.
  • An operation outer_sum: like einsum but simpler, with addition everywhere (illustrated below).
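
To make "addition everywhere" concrete (plain OCaml over arrays, not the OCaNNL operation itself): where an outer product would form c.(i).(j) = a.(i) *. b.(j), the outer sum forms c.(i).(j) = a.(i) +. b.(j).

```ocaml
(* Plain-OCaml illustration of an outer sum of two vectors. *)
let outer_sum (a : float array) (b : float array) : float array array =
  Array.map (fun ai -> Array.map (fun bj -> ai +. bj) b) a

(* Example:
   outer_sum [| 1.; 2. |] [| 10.; 20.; 30. |]
   = [| [| 11.; 21.; 31. |]; [| 12.; 22.; 32. |] |] *)
```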

Changed

  • Tweaks to make the project usable as a package (external library).
  • Sanitizing code inclusion via code roots management: Tensor.consume_forward_code and consume_backprop_code, used from Train (optionally, but by default).

Fixed

  • Shape inference in the presence of non-0 fixed indexing inside einsums was broken (it was actually not implemented).
  • Incomplete shape inference for slicing was leading to inferred shapes with no axes; constraint generation was intended to raise a shape error instead. A proper fix coming in 0.3.2 will make slice shape inference complete.

Shape inference, jitted routines, gccjit backend

01 Apr 16:49

Major rewrite. Abandoning the design choices of 0.1 and 0.2.

Added:

  • Optionally, inferring or checking tensor (batch) sizes from data (e.g. file) sizes.
  • Static indexing. A "slice" operator to select individual batches.
  • Established the backends API with first-class modules.
  • The Train module as an optimization "frontend".
  • Parallel optimization across devices.
  • Global settings configurable via config files, environment variables, and commandline flags.
  • Integration of backend logging with ppx_minidebug (the debug_log_from_routines setting).

Changed:

  • The Cuda backend is not supported for now. It is (optionally) buildable to reduce code rot.
  • Dynamic indexing is not supported anymore (to reduce complexity). It might be reintroduced if needed.
  • Factored out the arrayjit library / package containing compilation (former Ndarray, Node, Code).
  • Renamed Formula -> Tensor.
  • No more "form vs. non-form" formulas / tensors.
  • Formula/tensor roots are split into forward roots and backprop roots.
  • No more %nn_rs, %nn_dt syntaxes and Synthetic fetch primitive.
  • Renamed %nn_op to %op and %nn_cd to %cd.
  • Migrated gccjit into a separate repository.
  • Migrated cudajit into a separate repository.
  • Massive rewrite of shape inference in a declarative style.
  • Generalized zero_out to initialize_neutral, to prepare for arbitrary accumulation operations (see the illustration after this list).
  • Renamed Node -> Lazy_array -> Tnode (tensor node).
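
As a generic illustration of what "initialize to the neutral element" means (not the actual Low_level code, and the listed operations are just examples): zeroing is the special case for addition, while other accumulation operations call for their own neutral starting values.

```ocaml
(* Illustrative: each accumulation operation starts from its neutral element. *)
type accum = Add | Mul | Max | Min

let neutral_elem = function
  | Add -> 0.                   (* the zero_out special case *)
  | Mul -> 1.
  | Max -> Float.neg_infinity
  | Min -> Float.infinity

let initialize_neutral (op : accum) (arr : float array) =
  Array.fill arr 0 (Array.length arr) (neutral_elem op)
```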

And more.

Naive Cuda (tagged for archival purposes)

21 Jul 08:56

CUDA FFI; a naive, not particularly functional CUDA backend where a "parallel" axis is mapped across blocks and a "minibatch" axis is mapped across threads within a block.

This does not really work because it lacks synchronization across blocks. Also, the "parallel axis" / "minibatch axis" approach is not really usable (neither for the CUDA nor for the gccjit backend).

When using too many total threads, CUDA hangs or takes too long compiling to PTX. Where the CUDA backend works, the gccjit backend is much faster.

Other meaningful improvements include low-level code optimization and simplification, and refactorings.