Releases: ml-explore/mlx
v0.17.0
Highlights
- mx.einsum: PR
- Big speedups in reductions: benchmarks
- 2x faster model loading: PR
- mx.fast.metal_kernel for custom GPU kernels: docs
Core
- Faster program exits
- Laplace sampling
- mx.nan_to_num
- nn.tanh gelu approximation
- Fused GPU quantization ops
- Faster group norm
- bf16 winograd conv
- vmap support for mx.scatter
- mx.pad "edge" padding
- More numerically stable mx.var
- mx.linalg.cholesky_inv / mx.linalg.tri_inv
- mx.isfinite
- Complex mx.sign now mirrors NumPy 2.0 behaviour
- More flexible mx.fast.rope
- Update to nanobind 2.1
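For the complex mx.sign item above: under the NumPy 2.0 convention, the sign of a complex number z is z / |z|, with sign(0) = 0. The following is a minimal plain-Python sketch of that rule. The helper name complex_sign is hypothetical, and this is not MLX's implementation.

```python
def complex_sign(z: complex) -> complex:
    """Sign of a complex number under the z / |z| convention (sign(0) = 0)."""
    magnitude = abs(z)
    if magnitude == 0:
        # Zero has no direction, so its sign is defined as zero.
        return 0j
    # Any non-zero input is projected onto the unit circle.
    return z / magnitude


if __name__ == "__main__":
    print(complex_sign(3 + 4j))   # unit-magnitude value pointing the same way as 3+4j
    print(complex_sign(-2 + 0j))  # a real negative input still maps to -1
    print(complex_sign(0j))       # zero stays zero
```

Every non-zero result has magnitude 1, so the output keeps only the phase of the input.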
Bug Fixes
- gguf zero initialization
- expm1f overflow handling
- bfloat16 hadamard
- large arrays for various ops
- rope fix
- bf16 array creation
- preserve dtype in nn.Dropout
- nn.TransformerEncoder with norm_first=False
- excess copies from contiguity bug
v0.16.3
v0.16.2
v0.16.1
v0.16.0
Highlights
- @mx.custom_function for custom vjp/jvp/vmap transforms
- Up to 2x faster Metal GEMV and fast masked GEMV
- Fast hadamard_transform
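As background for the hadamard_transform item, here is a plain-Python sketch of the classic fast Walsh-Hadamard butterfly, which computes the transform in O(n log n) rather than through an explicit n x n matrix product. The function name fwht, the power-of-two restriction, and the default 1/sqrt(n) scale are assumptions made for this illustration. It does not describe how the Metal kernel is written.

```python
import math


def fwht(x, scale=None):
    """Fast Walsh-Hadamard transform of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if scale is None:
        # Assumed default: orthonormal scaling, so applying fwht twice is the identity.
        scale = 1.0 / math.sqrt(n)
    out = list(x)
    h = 1
    while h < n:
        # Combine blocks of size h pairwise with the (a + b, a - b) butterfly.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v * scale for v in out]


if __name__ == "__main__":
    print(fwht([1, 1, 1, 1], scale=1.0))  # constant input concentrates in the first bin
    print(fwht(fwht([1.0, 2.0, 3.0, 4.0])))  # round trip with orthonormal scaling
```

With the orthonormal scale the transform is its own inverse, which is a convenient sanity check for any implementation.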
Core
- Metal 3.2 support
- Reduced CPU binary size
- Added quantized GPU ops to JIT
- Faster GPU compilation
- Added grads for bitwise ops + indexing
Bug Fixes
- 1D scatter bug
- Strided sort bug
- Reshape copy bug
- Seg fault in mx.compile
- Donation condition in compilation
- Compilation of accelerate on iOS
v0.15.2
v0.15.1
v0.15.0
Highlights
- Fast Metal GPU FFTs
  - On average ~30x faster than CPU
  - More benchmarks
- mx.distributed with all_sum and all_gather
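To make the two collectives concrete, the sketch below simulates them in a single process, with one Python list standing in for each rank's array. It assumes that all_sum gives every rank the elementwise sum and that all_gather gives every rank the concatenation in rank order. The function bodies are illustrative only and make no MLX or MPI calls.

```python
def all_sum(per_rank_arrays):
    """Simulated all_sum: every rank ends up with the elementwise sum over ranks."""
    summed = [sum(values) for values in zip(*per_rank_arrays)]
    # Each rank receives its own copy of the same reduced result.
    return [list(summed) for _ in per_rank_arrays]


def all_gather(per_rank_arrays):
    """Simulated all_gather: every rank ends up with all arrays concatenated in rank order."""
    gathered = [value for array in per_rank_arrays for value in array]
    return [list(gathered) for _ in per_rank_arrays]


if __name__ == "__main__":
    ranks = [[1, 2], [10, 20], [100, 200]]  # three hypothetical ranks
    print(all_sum(ranks)[0])     # what rank 0 sees after the reduction
    print(all_gather(ranks)[0])  # what rank 0 sees after the gather
```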
Core
- Added dlpack device __dlpack_device__
- Fast GPU FFTs benchmarks
- Add docs for mx.distributed
- Add mx.view op
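As a rough illustration of what a view-style op does, the sketch below reinterprets the same bytes under a different element type using Python's struct module. It assumes mx.view performs a bit-level reinterpretation in the spirit of NumPy's ndarray.view. The helper names are hypothetical and little-endian byte order is chosen for the example.

```python
import struct


def view_float32_as_uint32(values):
    """Reinterpret float32 bit patterns as uint32 without changing any bits."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return list(struct.unpack(f"<{len(values)}I", raw))


def view_uint32_as_uint8(values):
    """Reinterpret uint32 values as bytes; the element count grows by 4x."""
    raw = struct.pack(f"<{len(values)}I", *values)
    return list(raw)


if __name__ == "__main__":
    bits = view_float32_as_uint32([1.0, -2.0])
    print([hex(b) for b in bits])           # ['0x3f800000', '0xc0000000']
    print(view_uint32_as_uint8(bits[:1]))   # [0, 0, 128, 63]
```

When the two element types differ in size, the number of elements changes accordingly, as the uint32 to uint8 case shows.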
NN
- softmin, hardshrink, and hardtanh activations
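For reference, the three activations can be written out in plain Python using their usual textbook definitions. The function signatures and default thresholds below (lambd=0.5, bounds of -1 and 1) are assumptions chosen for this sketch. It works on scalars and lists rather than MLX arrays.

```python
import math


def softmin(xs):
    """Softmax of the negated inputs: smaller values receive larger weights."""
    smallest = min(xs)
    # Shift by the minimum so every exponent is <= 0 and exp cannot overflow.
    exps = [math.exp(-(x - smallest)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hardshrink(x, lambd=0.5):
    """Zero out values whose magnitude is at most lambd; pass the rest through."""
    return x if abs(x) > lambd else 0.0


def hardtanh(x, min_val=-1.0, max_val=1.0):
    """Clamp x to the interval [min_val, max_val]."""
    return max(min_val, min(max_val, x))


if __name__ == "__main__":
    print(softmin([1.0, 2.0, 3.0]))                   # weights decrease as inputs grow
    print([hardshrink(v) for v in (-0.7, 0.3, 0.9)])  # the small middle value is zeroed
    print([hardtanh(v) for v in (-3.0, 0.25, 2.5)])   # the extremes are clamped
```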
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stable cumprod grad at 0
- Bug fix with race condition in scan
v0.14.1
v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend which results in a binary <4MB
- mx.gather_qmm: quantized equivalent for mx.gather_mm, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
Core
- mx.conjugate
- mx.conv3d and nn.Conv3d
- List based indexing
- Started mx.distributed which uses MPI (if installed) for communication across machines
  - mx.distributed.init
  - mx.distributed.all_gather
  - mx.distributed.all_reduce_sum
- Support conversion to and from dlpack
- mx.linalg.cholesky on CPU
- mx.quantized_matmul sped up for vector-matrix products
- mx.trace
- mx.block_masked_mm now supports floating point masks!
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions