Half precision, mixed precision, CUDA virtual devices

@lukstafi lukstafi released this 17 Sep 13:07

Release 0.4.1 offers half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

From the CHANGELOG:

Added

  • Implemented the previously-mocked support for half precision (FP16).
    • We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
    • We check FP16 constants for overflow.
    • We output half precision specific code from the CUDA backend.
  • Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
  • A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
  • A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
  • Slides for the Fun OCaml meetup: docs/Fun OCaml.
  • New syntax: inline tensor declarations with a literal float as initial value.
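
As an illustration of the FP16 constant overflow check mentioned above, here is a minimal sketch. It does not reproduce OCANNL's actual code: the names `fp16_max`, `overflows_fp16` and `check_fp16_constant` are hypothetical, and the only fact relied on is that the largest finite IEEE 754 half-precision value is 65504.

```ocaml
(* Largest finite IEEE 754 binary16 value: (2 - 2^-10) * 2^15 = 65504. *)
let fp16_max = 65504.0

(* Hypothetical helper, not OCANNL's API: true when the constant is finite
   but its magnitude exceeds the largest finite half-precision value.
   This is conservative: values slightly above 65504 would still round
   down to 65504 under round-to-nearest, but are flagged here anyway. *)
let overflows_fp16 (c : float) : bool =
  Float.is_finite c && Float.abs c > fp16_max

(* Hypothetical wrapper that reports the problem instead of silently
   producing an infinity in the generated code. *)
let check_fp16_constant (c : float) : (float, string) result =
  if overflows_fp16 c then
    Error (Printf.sprintf "constant %g does not fit in FP16" c)
  else Ok c

let () =
  List.iter
    (fun c ->
      match check_fp16_constant c with
      | Ok v -> Printf.printf "%g: ok\n" v
      | Error msg -> Printf.printf "error: %s\n" msg)
    [ 1.0; 65504.0; 70000.0; -1e6 ]
```

Returning a `result` rather than raising is just one possible design; a real backend might prefer an exception carrying source location information.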

Changed

  • Removed the pipes_cc, pipes_gccjit backends (Pipes_multicore_backend) -- I had fixed Pipes_multicore_backend by using the poll library instead of Unix.select, but it turned out to be very slow.
  • Changed the %cd block comment syntax ~~ to allow detailed structuring. Rewrote Train.grad_update to use the %cd syntax.
  • Made Train.sgd_one slightly more thrifty: p =- learning_rate *. sgd_delta --> p =- learning_rate * sgd_delta ~logic:"." without the inline tensor expression.

Fixed

  • Log levels related de-confusion:
    • Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
    • Properly restore log_level and inform about its setting.
    • By default do not log from tests.
    • debug_log_from_routines should only happen when log_level > 1.
  • Bugs in Multicore_backend: await was not checking queue emptiness, worker's Condition.broadcast was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with saturn_lockfree.
  • Reduced busy-waiting inside c_compile_and_load; compilation errors are now propagated instead of causing an infinite loop.
  • Fixed loss of significant digits for small numbers when outputting files.
  • Added missing mixed-precision conversions in the C_syntax backend builder.
  • Restored the functionality of debug logging from the CUDA backend.
  • Always reinitialize global state at the beginning of let%expect_test, to make them more deterministic.
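
The fix for loss of significant digits when outputting small numbers can be pictured with the sketch below. The changelog does not say which format OCANNL used before or after the fix, so the choice of `%f` versus `%.17g` is an assumption made purely for illustration, and the helper names `fixed` and `round_trip` are hypothetical.

```ocaml
(* "%f" always prints 6 digits after the decimal point, so small
   magnitudes lose most or all of their significant digits. *)
let fixed (x : float) : string = Printf.sprintf "%f" x

(* 17 significant digits are enough to round-trip any finite
   double-precision value through text. *)
let round_trip (x : float) : string = Printf.sprintf "%.17g" x

let () =
  let x = 1.2345678e-7 in
  Printf.printf "%%f    -> %s\n" (fixed x);
  Printf.printf "%%.17g -> %s\n" (round_trip x);
  (* Reading the text back recovers the original value exactly. *)
  assert (float_of_string (round_trip x) = x)
```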