v0.18.0
Highlights
- Speed improvements:
  - Up to 2x faster I/O (benchmarks)
  - Faster transposed copies, unary, and binary ops
- Transposed convolutions
- Improvements to mx.distributed (send/recv/average_gradients); a short send/recv sketch follows this list
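A minimal sketch of the new point-to-point primitives, assuming the MPI backend for mx.distributed; the ranks, shapes, and script name are illustrative:

```python
# Run with e.g.: mpirun -np 2 python send_recv.py
import mlx.core as mx

world = mx.distributed.init()
rank = world.rank()

if rank == 0:
    x = mx.ones((4, 4))
    # send returns its input; evaluating it performs the send.
    mx.eval(mx.distributed.send(x, 1))
else:
    # recv takes the expected shape, dtype, and source rank.
    y = mx.distributed.recv((4, 4), mx.float32, 0)
    mx.eval(y)
    print(rank, y.sum())
```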
Core
- New features (a usage sketch follows this list):
  - mx.conv_transpose{1,2,3}d
  - Allow mx.take to work with integer index
  - Add std as method on mx.array
  - mx.put_along_axis
  - mx.cross_product
  - int() and float() work on scalar mx.array
  - Add optional headers to mx.fast.metal_kernel
  - mx.distributed.send and mx.distributed.recv
  - mx.linalg.pinv
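A minimal sketch of a few of the new ops, not exhaustive; shapes are illustrative, and routing mx.linalg.pinv to the CPU stream is an assumption based on other mx.linalg ops:

```python
import mlx.core as mx

a = mx.arange(12).reshape(3, 4)

# mx.take now accepts a plain integer index.
row = mx.take(a, 1, axis=0)          # second row, shape (4,)

# std is now available as a method on mx.array.
s = a.astype(mx.float32).std()

# int()/float() work on scalar arrays.
n = int(mx.array(7))
f = float(mx.array(2.5))

# mx.put_along_axis writes values at the given indices along an axis.
idx = mx.array([[0], [2], [1]])
b = mx.put_along_axis(a, idx, mx.zeros((3, 1), dtype=a.dtype), axis=1)

# Transposed convolution; MLX convolutions are channels-last, so the
# input is (batch, length, channels), and the weight layout below
# assumes the same (out_channels, kernel, in_channels) convention as mx.conv1d.
x = mx.random.normal((1, 10, 3))
w = mx.random.normal((8, 4, 3))
y = mx.conv_transpose1d(x, w, stride=2)

# Pseudo-inverse (CPU stream, see the note above).
m = mx.random.normal((4, 6))
p = mx.linalg.pinv(m, stream=mx.cpu)

mx.eval(row, s, b, y, p)
print(n, f)
```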
- Performance:
  - Up to 2x faster I/O
  - Much faster CPU convolutions
  - Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
  - Put reduction ops in default stream with async for faster comms
  - Overhead reductions in mx.fast.metal_kernel
  - Improve donation heuristics to reduce memory use
- Misc:
  - Support Xcode 16.0
NN
- Faster RNN layers
- nn.ConvTranspose{1,2,3}d
- mlx.nn.average_gradients: data parallel helper for distributed training (see the sketch below)
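A minimal sketch of the new NN additions in a training step; the model, loss, and optimizer setup are illustrative, and average_gradients reduces to a no-op when no distributed group is active (launch with mpirun for actual data parallelism):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# New transposed-conv layer (channels-last input: batch, length, channels).
model = nn.ConvTranspose1d(in_channels=3, out_channels=8, kernel_size=4, stride=2)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

optimizer = optim.SGD(learning_rate=1e-2)
x = mx.random.normal((2, 10, 3))
y = mx.random.normal((2, 22, 8))   # L_out = (10 - 1) * 2 + 4 = 22

loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)

# Average gradients across the distributed group before the update.
grads = nn.average_gradients(grads)

optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```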
Bug Fixes
- Fix boolean all reduce bug
- Fix extension metal library finding
- Fix ternary for large arrays
- Make eval just wait if all arrays are scheduled
- Fix CPU softmax by removing redundant coefficient in neon_fast_exp
- Fix JIT reductions
- Fix overflow in quantize/dequantize
- Fix compile with byte sized constants
- Fix copy in the sort primitive
- Fix reduce edge case
- Fix slice data size
- Throw for certain cases of non captured inputs in compile
- Fix copying scalars by adding fill_gpu
- Fix bug in module attribute set, reset, set
- Ensure io/comm streams are active before eval
- Fix mx.clip
- Override class function in Repr so mx.array is not confused with array.array
- Avoid using find_library to make install truly portable
- Remove fmt dependencies from MLX install
- Fix for partition VJP
- Avoid command buffer timeout for IO on large arrays