Releases: oneapi-src/oneDNN
v2.4-rc
This is a release candidate for oneDNN v2.4. Please provide feedback and submit defect reports via GitHub issues.
Performance Optimizations
- Improved primitive cache performance for Intel Graphics products.
- Intel Architecture Processors
- Improved performance for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved reorder primitive performance for memory formats with padding and/or zero points.
- Intel Graphics Products
- Introduced initial optimizations for future Intel Arc graphics (code name Alchemist and DG2).
- AArch64-based Processors
- Improved inner product and eltwise primitives performance with ACL.
- Introduced support for sum and for indirect and Winograd convolution implementations with ACL.
- NVIDIA Graphics
- Improved convolution performance with eltwise post-op.
Functionality
- Introduced PReLU post-op support in convolution and matmul.
- Extended maximum allowed post-ops chain for compute primitives (convolution, deconvolution, inner product, and matmul) to 32.
- Introduced support for zero points in sum post-op for convolution and matmul. The functionality is implemented only for CPUs.
- Extended binary primitive with support for mixed data types for input tensors. The functionality is implemented only for CPUs.
- Extended sum post-op for convolution and matmul primitives with support for mixed data types. The functionality is implemented only for CPUs.
- Added USM support for OpenCL GPU runtime.
Usability
- Added compile-time options to manage the set of supported primitives and workload types. See DNNL_ENABLE_WORKLOAD and DNNL_ENABLE_PRIMITIVE in build options for more details. This feature allows reducing the binary footprint of the library for specialized applications.
- Reduced overall library size by trimming down the use of templates, OpenCL headers, and TBB headers. The configurations that benefited the most are the CPU-only configuration with TBB threading and the GPU-only configuration. Note that the binary footprint depends on the compiler used to build the library and on the build options.
- Introduced floating-point math mode API. The API allows the library to use bfloat16 or float16 hardware acceleration in fp32 operations. Currently no implementation takes advantage of this mode.
- Added a build option DNNL_LIBRARY_NAME to change the library name and CMake target. This feature helps projects that use multiple oneDNN configurations.
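The build options introduced above can be combined in a single CMake configure step. A hypothetical sketch (the option values shown are illustrative; see the oneDNN build options documentation for the accepted sets):

```shell
# Hypothetical configure line trimming oneDNN down to an inference-only
# build that ships only the primitives this application needs, with a
# custom library name so multiple oneDNN configurations can coexist.
cmake .. \
  -DDNNL_ENABLE_WORKLOAD=INFERENCE \
  -DDNNL_ENABLE_PRIMITIVE="CONVOLUTION;MATMUL;REORDER" \
  -DDNNL_LIBRARY_NAME=dnnl_inference
```

With these options, primitives outside the listed set return unimplemented at creation time, which is what yields the smaller binary.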
Breaking Changes
- Updated minimal supported ACL version to 21.08 (was 21.05).
Deprecated functionality
- Intel MKL-DNN compatibility API is deprecated and will be removed in the next update. See the Transition from Intel MKL-DNN to oneDNN page for instructions on moving to the new API.
Thanks to the Contributors
This release contains contributions from the project core team as well as
Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Diana Bite @diaena, Jing Xu @jingxu10, Kentaro Kawakami @kawakami-k, Kevin Putnam @intelkevinputnam, MITSUNARI Shigeo @herumi, Nathan John Sircombe @nSircombe, Nicolas Chauvet (kwizart) @kwizart, Peter Caday @petercad. We would also like to thank everyone who asked questions and reported issues.
graph-v0.2
This is a technical preview for oneDNN Graph API based on oneDNN v2.3.2.
oneDNN Graph API extends oneDNN with a unified, high-level graph API for multiple AI hardware classes (CPU, GPU, accelerators). The graph interface integrates with the deep learning frameworks and inference engines to maximize opportunities for performance optimizations across a variety of hardware targets. This preview has full support for the oneAPI Graph programming model and partial support of the operations in oneDNN Graph API specification v0.7.
Learn more about oneDNN Graph API:
Supported Functionality
- C++ and DPC++ API.
- Graph partition and compilation API.
- Operations and fusions targeting fp32 inference for CNNs, MLPs, and transformer neural networks.
Performance Optimizations
The backend implementation relies on oneDNN and includes performance optimizations for Intel Architecture processors with Intel SSE4.1, Intel AVX, Intel AVX2, or Intel AVX-512 instruction set support.
Validation
- Gtest suite is available for basic functional testing.
- Comprehensive functional and performance validation is covered by the extended version of benchdnn.
Known Issues and Limitations
- Some subgraphs might not be recognized as a partition even if they match the general pattern description, due to internal implementation limitations.
- The weights' opaque layout can be queried only from a compiled partition, which requires tensor shapes to be known at compilation time.
- Binary operation with scalar and tensor inputs is not optimized.
Thanks to the Contributors
This release contains contributions from the project core teams as well as Jiong Gong, Pinzhen Xu, Chunyuan Wu, Jianping Chen, Scott Cyphers, Nishant Patel, Yiqiang Li, Yang Sheng, Kiefer Kuah, Adam Straw, Tim Zerrell, Namrata Choudhury and others.
v2.3.2
v2.3.1
This is a patch release containing the following changes to v2.3:
- Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support (f5c071b)
- Fixed integer overflow for inner product implementation on CPUs (66971b5)
- Fixed out-of-bounds access in GEMM implementation for Intel SSE4.1 (4e81df0)
- Fixed correctness issue for depthwise convolution post-op with non-default scales on CPUs (783e1d6, 066c832)
- Fixed crash for s8 binary primitive on Windows (d9fd397)
- Fixed performance regression in fp32 to u8 reorder for Intel AMX specific memory formats (97f40cf, 532648a)
- Fixed correctness issue for bfloat16 convolution weight gradient on processors with Intel AMX support (053406d, 6649b75)
- Fixed correctness issue for bfloat16 inner product backpropagation on processors with Intel AMX support (a2e6c55)
- Fixed correctness issue for bfloat16 convolution with padded memory formats on GEN9 GPUs (c0aea07)
- Fixed correctness issue for int8 matmul primitive with zero points on processors with Intel AMX support (55cb716)
- Fixed segfault in depthwise convolution post-op on CPUs (ad46635)
v2.3
Performance Optimizations
- Extended primitive cache to improve primitive descriptor creation performance.
- Improved primitive cache performance in multithreaded configurations.
- Intel Architecture Processors
- Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
- Improved performance of reduction primitive.
- Improved performance of depthwise convolution primitive with NHWC activations for training cases.
- Intel Graphics Products
- Improved fp32 and fp16 Winograd convolution performance.
- Introduced support for automatic selection between direct and Winograd convolution algorithms.
- Improved int8 depthwise convolution performance.
- Improved performance of reorder, shuffle, concat, binary, and batch normalization primitives.
- Improved layer normalization performance for blocked formats.
- AArch64-based Processors
- Improved reorder primitive performance for systems with SVE 128 and SVE 256 support.
- Improved eltwise primitive performance for systems with SVE 512 support.
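The CPU dispatcher control referenced in the bfloat16 bullet above can be exercised at runtime through an environment variable, among other mechanisms. A minimal sketch, assuming an application linked against oneDNN (variable name and value per the oneDNN CPU dispatcher control documentation):

```shell
# Raise the ISA cap oneDNN dispatches to at runtime. AVX512_CORE_AMX
# opts in to the Sapphire Rapids instructions that the library keeps
# disabled by default; lower values restrict dispatch for debugging.
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX
echo "DNNL_MAX_CPU_ISA=$DNNL_MAX_CPU_ISA"
```

The same control is also available programmatically, so applications can make the choice once at startup before any primitive is created.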
Functionality
- Extended batch normalization and layer normalization primitives API to take separate scale and shift arguments.
- Extended resampling primitive with post-ops support and mixed source and destination data types.
Usability
- Introduced binary distribution in conda-forge. Supported configurations cover Linux, Windows, and macOS operating systems and Intel64/AMD64, AArch64, and PPC64 architectures.
- Introduced support for GPU-only build. This configuration helps to reduce binary footprint for applications targeting GPU.
- Introduced an option to use GNU OpenMP as CPU runtime for DPC++ configuration.
- Introduced verbose log converter. This tool processes oneDNN verbose logs and generates test cases for benchdnn.
Breaking Changes
- Updated minimal supported CMake version to 2.8.12 (was 2.8.11).
- Updated minimal supported ACL version to 21.05 (was 21.02).
Thanks to the Contributors
This release contains contributions from the project core team as well as Alexandre Truong @aletru01, Arthur Mitrano @aaraujom, fitchbe @fitchbe, Isuru Fernando @isuruf, Joe Ramsay @joeramsay, Kentaro Kawakami @kawakami-k, leizheng1 @leizheng1, Nomoto Kazuhiro @NomotoKazuhiro, Peter Caday @petercad, Pablo Romero @pablocum, Takumi-H @Takumi-Honda, Uwe L. Korn @xhochy, Vasily Rubtsov @vasilyru. We would also like to thank everyone who asked questions and reported issues.
v2.3-rc2
This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via GitHub issues.
v2.2.4
v2.3-rc
This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via GitHub issues.
v2.2.3
This is a patch release containing the following changes to v2.2.2:
- Fixed a bug in int8 depthwise convolution primitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support (8a784c6, f0e4af9)
- Fixed correctness issue for PReLU primitive on Intel Processor Graphics (f3c3daf)
- Fixed correctness issue in reorder for blocked layouts with zero padding (68f05d0, d51616b, fd2c642)
- Improved performance of weights reorders used by BRGEMM-based convolution primitive for processors with Intel AVX-512 support (23b2ec0, 10f8187, 4c0819c)
- Added -fp-model=precise build flag for DPC++ code (3e40e5e)
- Fixed potential memory leak in matmul primitive (36dba73)
- Fixed performance of matmul primitive when fused with bias update and sum (f993b25)
- Fixed a bug in matmul primitive when writing to non-contiguous destination buffer (36d25d4)
v2.2.2
This is a patch release containing the following changes to v2.2.1:
- Fixed performance regression in fp32 forward inner product for shapes with number of output channels equal to 1 for processors with Intel AVX-512 support (714b1fd)
- Fixed performance regression in forward convolutions with groups for processors with Intel AVX-512 support (3555d4a)
- Removed -std=c++11 build flag for DPC++ headers (1fcb867)
- Fixed buffer access in initializing workspace in RNN implementation on GPU (9b03091)
- Fixed a bug in convolution with 1x1 kernel and mixed strides on processors with Intel AVX-512 support (d0b3e3f)
- Used getauxval on Linux to get CPU features for AArch64 systems (25c4cea)
- Added -fp-model=precise build flag for DPC++ code (3e40e5e)
- Fixed out-of-bounds writes in elementwise primitive on Intel Processor Graphics (bcf823c)