Add support for E5M2 and E4M3 floating point types #2312
Thanks for the heads-up! FYI we are using a hybrid of E5 and E4 called SFP, which gives M3 for larger numbers and M2 for smaller ones, while retaining ~24-bit dynamic range. This is also fast to convert (code) to bf16 via two permutex2var, enabling fast FMA into f32 via _mm512_dpbf16_ps. It also avoids having to choose between the two formats. By contrast, conversions to f16 seem less useful, given the lack of precision and (last I checked) low-throughput f16 <-> f32 conversions on Intel. Do you have a use case for these specific conversions to f16?
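For illustration, here is a rough sketch of that lookup-then-FMA pattern with raw AVX-512 intrinsics. Everything in it is hypothetical rather than the actual SFP codec: the 128-entry bf16 table split across four registers, the assumption that the 8-bit codes have already been widened to 16-bit lanes, and the lack of any sign handling are all placeholders. The `__m512bh` casts rely on GCC/Clang vector-type casts.

```cpp
// Hypothetical sketch only: decode 32 8-bit codes (assumed already widened to
// 16-bit lanes, e.g. via _mm512_cvtepu8_epi16) into bf16 via two permutex2var
// table lookups, then FMA against bf16 activations into an f32 accumulator.
// Requires AVX-512BW plus AVX512_BF16.
#include <immintrin.h>

// tbl0..tbl3 together hold a hypothetical 128-entry bf16 decode table:
// one permutex2var indexes 64 of the 16-bit entries across a register pair.
static inline __m512 DecodeDotAccumulate(__m512i codes, __m512i act_bf16,
                                         __m512 acc, __m512i tbl0,
                                         __m512i tbl1, __m512i tbl2,
                                         __m512i tbl3) {
  // Index bits [5:0] select among the 64 entries of each table pair.
  const __m512i lo = _mm512_permutex2var_epi16(tbl0, codes, tbl1);
  const __m512i hi = _mm512_permutex2var_epi16(tbl2, codes, tbl3);
  // Index bit 6 chooses between the two 64-entry halves of the table.
  const __mmask32 use_hi =
      _mm512_test_epi16_mask(codes, _mm512_set1_epi16(0x40));
  const __m512i weights_bf16 = _mm512_mask_blend_epi16(use_hi, lo, hi);
  // Reinterpret as bf16 vectors (GCC/Clang vector cast) and accumulate:
  // each f32 lane gains the sum of two adjacent bf16 products.
  return _mm512_dpbf16_ps(acc, (__m512bh)weights_bf16, (__m512bh)act_bf16);
}
```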
There are some open-source libraries that can work with E5M2 and E4M3 floating-point types, including CUTLASS (https://github.com/NVIDIA/cutlass), JAX (https://github.com/google/jax), ONNX Runtime (https://github.com/microsoft/onnxruntime), OpenVINO (https://github.com/openvinotoolkit/openvino), PyTorch (https://github.com/pytorch/pytorch), and TensorFlow (https://github.com/tensorflow/tensorflow). One of the most common use cases for E5M2/E4M3 to F16 conversions is tensor arithmetic.
I agree E5/E4 are currently used in ML frameworks.
The upcoming Intel AVX10.2 instruction set extension will add support for conversions from F16 to the E5M2 (BF8) and E4M3 (HF8) 8-bit floating-point types, along with conversions from E4M3 (HF8) back to F16.
The E5M2 and E4M3 floating-point formats are described in the Open Compute Project 8-bit Floating Point Specification (OFP8), available at https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1.
The E5M2 (BF8) floating-point format has 1 sign bit, 5 exponent bits, and 2 mantissa bits. Its bit representation is equivalent to the upper 8 bits of a hwy::float16_t (16-bit IEEE 754 half-precision) value, much as the bit representation of hwy::bfloat16_t is equivalent to the upper 16 bits of a 32-bit IEEE 754 single-precision value.
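To make that bit relationship concrete, here is a minimal scalar sketch operating on raw bit patterns (a plain uint8_t stands in for an E5M2 type, which Highway does not yet provide): widening is a left shift by 8, and narrowing rounds the discarded byte to nearest-even, the same trick commonly used for f32 <-> bf16.

```cpp
#include <cstdint>

// Widening E5M2 -> binary16 is exact: the E5M2 byte is the upper byte.
inline uint16_t E5M2ToF16Bits(uint8_t e5m2) {
  return static_cast<uint16_t>(e5m2) << 8;
}

// Narrowing binary16 -> E5M2 with round-to-nearest-even on the discarded
// 8 mantissa bits (NaN payloads are not preserved; values above the E5M2
// maximum round to infinity).
inline uint8_t F16BitsToE5M2(uint16_t f16) {
  uint16_t upper = f16 >> 8;
  if ((f16 & 0x7FFF) > 0x7C00) {  // NaN: keep a nonzero mantissa
    return static_cast<uint8_t>(upper | 0x02);
  }
  const uint16_t lower = f16 & 0xFF;
  if (lower > 0x80 || (lower == 0x80 && (upper & 1))) ++upper;
  return static_cast<uint8_t>(upper);
}
```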
The E4M3 (HF8) floating-point format has 1 sign bit, 4 exponent bits, and 3 mantissa bits. E4M3 has no infinities and only two NaN bit patterns (0x7F and 0xFF). Unlike most floating-point formats, which reserve the largest exponent for infinities and NaNs, E4M3 treats non-NaN values with the largest exponent as normal finite values whose absolute value lies between 256 and 448.
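As an illustration of those special cases, here is a minimal scalar decoder, assuming the OCP OFP8 E4M3 encoding with exponent bias 7 (the function name is made up for this sketch):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decodes one OCP E4M3 (HF8) byte to float: bias 7, exponent 0 encodes
// zero/subnormals, 0x7F and 0xFF are the only NaNs, no infinities.
inline float E4M3ToF32(uint8_t bits) {
  const float sign = (bits & 0x80) ? -1.0f : 1.0f;
  const int exponent = (bits >> 3) & 0xF;
  const int mantissa = bits & 0x7;
  if (exponent == 15 && mantissa == 7) {  // the only NaN encodings
    return std::numeric_limits<float>::quiet_NaN();
  }
  if (exponent == 0) {  // zero or subnormal: mantissa * 2^-9
    return sign * std::ldexp(static_cast<float>(mantissa), -9);
  }
  // Normal, including the largest exponent: (1 + mantissa/8) * 2^(exponent-7).
  return sign * std::ldexp(1.0f + mantissa / 8.0f, exponent - 7);
}
```

For example, E4M3ToF32(0x7E) returns 448, the largest finite E4M3 value, while E4M3ToF32(0x7F) is NaN.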
The AVX10.2 VCVTNEPH2BF8 instruction converts an F16 vector to an E5M2 (BF8) vector (VCVTNE2PH2BF8 converts two F16 source vectors at once), and the AVX10.2 VCVTNEPH2HF8 instruction converts an F16 vector to an E4M3 (HF8) vector.
The AVX10.2 VCVTHF82PH instruction converts an E4M3 (HF8) vector to an F16 vector.
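Because E4M3's largest finite value (448) and smallest subnormal (2^-9) are both well within binary16's range, the HF8 -> F16 direction is exact. Here is a scalar reference sketch of that widening, again assuming the OCP encoding (an illustration of the semantics, not the AVX10.2 intrinsic):

```cpp
#include <cstdint>

// Widens one OCP E4M3 (HF8) byte to binary16 bits. The result is exact
// because E4M3's range and precision are strict subsets of binary16's.
inline uint16_t E4M3ToF16Bits(uint8_t bits) {
  const uint16_t sign = static_cast<uint16_t>((bits & 0x80) << 8);
  const int exponent = (bits >> 3) & 0xF;  // bias 7
  int mantissa = bits & 0x7;
  if (exponent == 15 && mantissa == 7) return sign | 0x7E00;  // quiet NaN
  if (exponent == 0) {
    if (mantissa == 0) return sign;  // +/- zero
    // Subnormal (mantissa * 2^-9): normalize, since binary16 can represent
    // it as a normal value.
    int e = -6;
    while ((mantissa & 0x8) == 0) { mantissa <<= 1; --e; }
    mantissa &= 0x7;  // drop the implicit leading 1
    return sign | static_cast<uint16_t>(((e + 15) << 10) | (mantissa << 7));
  }
  // Normal: rebias the exponent (7 -> 15) and left-align the 3 mantissa bits
  // within binary16's 10-bit mantissa field.
  return sign | static_cast<uint16_t>(((exponent + 8) << 10) | (mantissa << 7));
}
```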
Arm has already added the FP8 AArch64 extension, which supports conversions from F16/F32 to the E5M2/E4M3 floating-point types, along with conversions from E5M2/E4M3 to F16/BF16.