Releases: oneapi-src/oneDNN

v3.6-rc

20 Sep 23:29
Pre-release

Performance Optimizations

Intel Architecture Processors

  • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
  • Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
  • Improved performance of group normalization primitive.
  • Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
  • Improved performance of fp8 matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
  • Improved fp32 RNN primitive performance on processors with Intel AVX2 instruction set support.
  • Improved performance of the following subgraphs with Graph API:
    • Convolution and binary operation fusions with improved layout selection.
    • fp8 convolution fused with unary or binary operations on processors with Intel AMX instruction set support.
    • Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
    • LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

Intel Graphics Products

  • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
  • Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
  • Introduced support for Intel Arc Graphics for future Intel Core Ultra Processor (code name Arrow Lake-H).
  • Improved performance of fp8_e5m2 primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
  • Improved int8 convolution performance with weight zero points.
  • Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
  • Improved performance of the following subgraphs with Graph API:
    • SDPA without scale, MQA, and GQA patterns. f16 variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
    • fp8 convolution and unary or binary on Intel Data Center GPU Max Series.
    • LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.

AArch64-based Processors

  • Improved fp32 convolution backpropagation performance on processors with SVE support.
  • Improved reorder performance for blocked format on processors with SVE support.
  • Improved bf16 softmax performance on processors with SVE support.
  • Improved batch normalization performance on processors with SVE support.
  • Improved matmul performance on processors with SVE support.
  • Improved fp16 convolution with Arm Compute Library (ACL).
  • Improved matmul performance with ACL.
  • Switched matmul and convolution implementations with ACL to the stateless API, significantly improving primitive creation time and increasing caching efficiency and performance for these operators.

Functionality

  • Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
  • Extended functionality supported on NVIDIA and AMD GPUs with SYCL-based implementations.
  • Enabled support for int8 activations with grouped scales and int8 or int4 compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
  • Introduced support for stochastic rounding for the fp8 data type.
  • [experimental] Extended microkernel API:
    • Introduced int8 quantization support.
    • Extended transform microkernel with transposition support and support for arbitrary strides.
    • Introduced verbose diagnostics support.
  • [experimental] Extended sparse API:
    • Introduced support for sparse memory with coordinate (COO) storage format.
    • Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
  • Introduced int8 support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
  • Graph API:
    • Introduced GroupNorm operation and fusions in Graph API.
    • Introduced support for standalone StaticReshape and StaticTranspose operations.
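
The new int8 support in the eltwise primitive with the 'clip' algorithm can be sketched as follows. This is a minimal, illustrative example assuming a CPU engine and an oneDNN 3.x build; the tensor shape and values are made up for demonstration.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // A small int8 tensor; shape and contents are illustrative.
    memory::dims dims = {1, 8};
    auto md = memory::desc(dims, memory::data_type::s8, memory::format_tag::nc);
    auto src = memory(md, eng);
    auto dst = memory(md, eng);

    // Fill the source buffer directly (valid for CPU engines).
    std::vector<std::int8_t> in = {-100, -5, -4, -1, 0, 1, 4, 100};
    std::memcpy(src.get_data_handle(), in.data(), in.size());

    // eltwise_clip clamps each element to [alpha, beta] = [-4, 4].
    auto pd = eltwise_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::eltwise_clip,
            md, md, /*alpha=*/-4.f, /*beta=*/4.f);
    eltwise_forward(pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    strm.wait();
    // dst now holds the clipped values {-4, -4, -4, -1, 0, 1, 4, 4}.
    return 0;
}
```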

Usability

  • Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
  • Added an example for deconvolution primitive.
  • Added examples for Vanilla RNN and LBR GRU RNN cells.
  • Introduced support for Intel DPC++/C++ Compiler 2025.0.
  • Introduced interoperability with SYCL Graph record/replay mode.
  • Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
  • [experimental] Introduced logging mechanism based on spdlog library.
  • Introduced support for ONEDNN_ENABLE_WORKLOAD build knob for Graph API.
  • Improved performance of get_partitions() function in Graph API.
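
The ONEDNN_ENABLE_WORKLOAD build knob, now honored by Graph API as well, can be used to trim the library to a single workload type at build time. A hedged sketch of the build commands (source and build paths are assumed):

```shell
# ONEDNN_ENABLE_WORKLOAD limits the build to one workload type;
# valid values are TRAINING (default) and INFERENCE.
cmake -S . -B build -DONEDNN_ENABLE_WORKLOAD=INFERENCE
cmake --build build -j
```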

Validation

  • Introduced protection from out of memory scenarios in benchdnn Graph API driver.

Breaking Changes

Thanks to these Contributors

This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich, Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.

v3.5.3

02 Aug 22:26

This is a patch release containing the following changes to v3.5.2:

  • Fixed correctness issue in convolution weight gradient for small shapes on Intel GPUs (49eee6a, 281dd3b)
  • Extended MLP patterns supported by experimental Graph Compiler to cover cases relevant to ChatGLM model (ff680fc)
  • Fixed performance regression in bf16 depthwise convolution on Intel CPUs (d6c216a)

v3.5.2

26 Jul 23:34

This is a patch release containing the following changes to v3.5.1:

  • Fixed performance regression for some Graph API subgraphs with LayerNorm operation (82f629c)
  • Fixed runtime error for Graph API subgraphs including 6D LayerNorm operation (f704f09)
  • Fixed an issue with host compiler version detection in SYCL configurations (730b976)
  • Fixed an issue with missing DNNL_TARGET_ARCH define for builds not relying on CMake (87848b9)
  • Fixed a test issue for matmul with low-precision scales and/or zero-points (91c35d8)
  • Fixed segfault issue in bfloat16 shuffle on AArch64 processors (9116681)
  • Fixed runtime issue in quantized layer normalization pattern with Graph API (0013e8c)

v3.4.4

22 Jul 16:03

This is a patch release containing the following changes to v3.4.3:

  • Fixed an issue with host compiler version detection in SYCL configurations (fcaa1b4)

v3.5.1

16 Jul 15:39

This is a patch release containing the following changes to v3.5:

  • Fixed potential page fault in matmul on Intel Data Center GPU Max Series (a9c525d)
  • Fixed potential stack overflow issue in convolution implementation for Intel GPUs (0fb7e6e)
  • Added test cases for matmul with compressed weights (015ccb1)
  • Extended Graph API LayerNorm operation with zero points support (dc2701a)
  • Fixed primitive creation error for depthwise convolution backpropagation on Intel GPUs (4a045e4, b529d22)

v3.5

11 Jun 23:26

Performance Optimizations

Intel Architecture Processors

  • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
  • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
  • Improved performance of group normalization primitive.
  • Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
  • Improved performance of the following subgraphs with Graph API:
    • Multi-Query Attention (MQA).
    • Scaled Dot Product Attention (SDPA), including the variant with select operation.
    • LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
    • Convolution + Sigmoid + Multiply with mixed precisions.

Intel Graphics Products

  • Improved performance for Processor Graphics based on Xe2 architecture.
  • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
  • Improved RNN primitive performance for LSTM cell case.
  • Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).

AArch64-based Processors

  • Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
  • Improved bf16 matmul, convolution, and reorder primitives performance with Arm Compute Library (ACL).
  • Improved eltwise primitive performance with gelu_erf algorithm with ACL.

Functionality

  • Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
  • Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
  • Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
  • Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point mode is supported in the following configurations:
    • bfloat16 matmul with int8 weights on Intel CPUs.
    • float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
  • [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
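
The weight decompression scenario above can be sketched as follows. This is a hedged, illustrative example assuming a CPU engine and oneDNN 3.5+: the shapes are made up, and the second argument to `set_fpmath_mode` requests that the relaxed math mode also apply to integer weight tensors, i.e. on-the-fly decompression.

```cpp
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // bf16 activations, int8 compressed weights; dimensions are illustrative.
    const memory::dim M = 4, K = 64, N = 16;
    auto src_md = memory::desc({M, K}, memory::data_type::bf16,
                               memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::s8,
                               memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::bf16,
                               memory::format_tag::ab);

    primitive_attr attr;
    // fpmath_mode::bf16 with apply_to_int=true enables decompressing the
    // int8 weights to bf16 inside the matmul computation.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // Per-output-channel scales map quantized weights back to real values.
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);
    (void)prim; // execution would follow the usual execute(...) pattern
    return 0;
}
```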

Usability

  • Extended error messages for engine and memory objects creation errors.
  • Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
  • Introduced support for clang++ host compiler in SYCL builds.
  • Introduced API for tensor serialization and deserialization.
  • Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
  • Introduced OpenCL runtime support for Graph API.
  • Added support for building oneDNN with installed Arm Compute Library (ACL).
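
The extended verbose diagnostics on dispatching decisions are enabled through the existing verbose mechanism. A brief sketch (the application name is assumed):

```shell
# Dispatch-level verbose output reports why candidate implementations
# were skipped before one was selected.
ONEDNN_VERBOSE=dispatch ./my_app
```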

Validation

  • Extended benchdnn with support for tensor tags in RNN primitive validation.

Breaking Changes

  • Updated minimal supported ACL version to 24.04 (was 23.11).

Thanks to these Contributors

This release contains contributions from the project core team as well as Abdel @quickwritereader, @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, David Svantesson @davsva01, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Fadi Arafeh @fadara01, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

v3.4.3

28 May 23:16

This is a patch release containing the following changes to v3.4.2:

  • Fixed GPU detection issues on systems with several different Intel GPUs (0fb7e6e)

v3.5-rc

28 May 19:36
Pre-release

This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
    • Improved performance of group normalization primitive.
    • Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
    • Improved performance of the following subgraphs with Graph API:
      • Multi-Query Attention (MQA).
      • Scaled Dot Product Attention (SDPA), including the variant with select operation.
      • LayerNorm + Multiply + Quantize produced by SmoothQuant algorithm.
      • Convolution + Sigmoid + Multiply with mixed precisions.
  • Intel Graphics Products:

    • Improved performance for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved RNN primitive performance for LSTM cell case.
    • Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • AArch64-based Processors:

    • Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
    • Improved bf16 matmul performance with Arm Compute Library (ACL).
    • Improved eltwise primitive performance with gelu_erf algorithm with ACL.

Functionality

  • Introduced sum and binary post-ops support for layer normalization primitive. This functionality is currently implemented on CPUs only.
  • Introduced support for int4 data type and extended quantization model with support for grouped scales and zero points.
  • Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
  • Extended floating point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started. The new floating point mode is supported in the following configurations:
    • bfloat16 matmul with int8 weights on Intel CPUs.
    • float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
  • [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.

Usability

  • Extended error messages for engine and memory objects creation errors.
  • Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
  • Introduced support for clang++ host compiler in SYCL builds.
  • Introduced API for tensor serialization and deserialization.
  • Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
  • Introduced OpenCL runtime support for Graph API.
  • Added support for building oneDNN with installed Arm Compute Library (ACL).

Validation

  • Extended benchdnn with support for tensor tags in RNN primitive validation.

Thanks to these Contributors

This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.

v3.4.2

10 May 22:02

This is a patch release containing the following changes to v3.4.1:

  • Fixed performance regression in deconvolution on processors with Intel AVX-512 instruction set (307b35b, f46fffb)
  • Improved performance of batched matmul with binary post-op on processors with Intel AVX-512 instruction set (d39e1b7)
  • Fixed performance regression in softmax with destination memory format set to any on processors with Intel AVX-512 instruction set (756d3cf)
  • Fixed incorrect results in int8 deconvolution with source zero points on processors with Intel AMX instruction set (d5ddbc8)
  • Fixed performance regression in convolution on processors with Intel AVX2 instruction set (2968c89)
  • Improved f8_e4m3 matmul performance on Intel Data Center GPU Max Series (068f850, 668abae, c3972ef, ad94382)
  • Fixed sporadic accuracy issues in bf16 depthwise convolution backpropagation on processors with Intel AVX-512 instruction set (0184044)
  • Fixed primitive creation issue for fp16 pooling backpropagation on Intel GPUs (e4737d9)
  • Fixed failure for subgraphs with int8 matmul operation with experimental Graph Compiler on processors with Intel AMX instruction set (5ebde2e)
  • Fixed assert in experimental Graph Compiler on Windows (f53fbd1, fd903ae)
  • Fixed incorrect results for subgraphs with shuffle operation with experimental Graph Compiler (aef5023)
  • Improved performance of subgraphs involving int8 matmul with experimental Graph Compiler on processors with Intel AMX support (0ca5bc5)
  • Fixed page fault in fp16 matmul primitive on Intel Data Center GPU Max Series (5587f08)
  • Fixed incorrect results in fp32 deconvolution with Arm Compute Library on AArch64 processors (b7694a0)
  • Fixed performance regression in deconvolution on processors with Intel AVX2 instruction set (6f452e2)

v3.4.1

29 Mar 22:27

This is a patch release containing the following changes to v3.4:

  • Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a)
  • Introduced memory descriptor serialization API (4cad420, 929a27a, 9b848c8)
  • Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b5, 0b399ac, d748d64, 9f4f3d5, 21a8cae)
  • Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e, 4b72361, 74a343b)
  • Reduced creation time for deconvolution primitive on Intel CPUs (bec487e, 1eab005)
  • Fixed performance regression in deconvolution on Intel CPUs (fbe5b97, 1dd3c6a)
  • Removed dangling symbols from static builds (e92c404, 6f5621a)
  • Fixed crash during platform detection on some AArch64-based systems (406a079)
  • Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e15)
  • Fixed handling of zero points for matmul in verbose logs converter (15c7916)