An implementation of generic floating-point encode/decode logic, handling various current and proposed floating-point types:
- IEEE 754: Binary16, Binary32
- OCP Float8: E5M2, E4M3
- IEEE WG P3109: P3109_{K}p{P} for K > 2, and 1 <= P < K.
- OCP MX Formats: E2M1, E2M3, E3M2, E8M0, INT8, and the MX block formats.
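
As a rough illustration of what generic decode logic involves, the following self-contained sketch decodes a code point for a sign/exponent/significand format described by its total width `k` and precision `p`. It does not use this library's API: `decode_finite` and its parameters are hypothetical names, the bias `2**(e-1) - 1` is the IEEE 754 and OCP convention (an assumption that does not hold for every format listed above), and Inf/NaN encodings are deliberately ignored. It is therefore only exact for formats without special values, such as the OCP MX E2M1, E2M3 and E3M2 element types.

```python
def decode_finite(code: int, k: int, p: int) -> float:
    """Decode a k-bit code point with precision p (p - 1 stored significand bits).

    Illustrative only: assumes an IEEE-style bias and ignores Inf/NaN encodings.
    """
    t = p - 1                  # stored (trailing) significand bits
    e = k - p                  # exponent field width (one bit is the sign)
    bias = 2 ** (e - 1) - 1    # ASSUMPTION: IEEE 754 / OCP-style bias

    sign = -1.0 if (code >> (k - 1)) & 1 else 1.0
    exp_field = (code >> t) & ((1 << e) - 1)
    frac_field = code & ((1 << t) - 1)

    if exp_field == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias
        return sign * frac_field * 2.0 ** (1 - bias - t)
    # Normal: implicit leading one
    return sign * (1 + frac_field / 2 ** t) * 2.0 ** (exp_field - bias)


print(decode_finite(0b0111, k=4, p=2))    # E2M1, largest code point: 6.0
print(decode_finite(0b011111, k=6, p=3))  # E3M2, largest code point: 28.0
```

Formats with special values (for example E5M2 or the P3109 family) need extra cases on top of this, which is where a table-driven description of each format becomes useful.
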
The library favours readability and extensibility over speed, although the `*_ndarray` functions are reasonably fast for large arrays (see the benchmarking notebook). For implementations of these datatypes that focus more on speed, see, for example, ml_dtypes, bitstring, and the MX PyTorch Emulation Library.
See https://gfloat.readthedocs.io for documentation, or dive into the notebooks to explore the formats.
For example, here's a table from the 02-value-stats notebook:
name | B: Bits in the format | P: Precision in bits | E: Exponent field width in bits | 0<x<1 | 1<x<Inf | Exact in float16? | maxFinite | minFinite | maxNormal | minNormal | minSubnormal | maxSubnormal |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ocp_e2m1 | 4 | 2 | 2 | 1 | 5 | True | 6 | -6 | 6 | 1 | 0.5 | 0.5 |
ocp_e2m3 | 6 | 4 | 2 | 7 | 23 | True | 7.5 | -7.5 | 7.5 | 1 | 0.125 | 0.875 |
ocp_e3m2 | 6 | 3 | 3 | 11 | 19 | True | 28 | -28 | 28 | 0.25 | 0.0625 | 0.1875 |
ocp_e4m3 | 8 | 4 | 4 | 55 | 70 | True | 448 | -448 | 448 | 0.015625 | 1*2^-9 | 7/4*2^-7 |
ocp_e5m2 | 8 | 3 | 5 | 59 | 63 | True | 57344 | -57344 | 57344 | 1*2^-14 | 1*2^-16 | 3/2*2^-15 |
p3109_8p1 | 8 | 1 | 7 | 62 | 63 | False | 1*2^63 | -1*2^63 | 1*2^63 | 1*2^-62 | nan | nan |
p3109_8p2 | 8 | 2 | 6 | 63 | 62 | False | 1*2^31 | -1*2^31 | 1*2^31 | 1*2^-31 | 1*2^-32 | 1*2^-32 |
p3109_8p3 | 8 | 3 | 5 | 63 | 62 | True | 49152 | -49152 | 49152 | 1*2^-15 | 1*2^-17 | 3/2*2^-16 |
p3109_8p4 | 8 | 4 | 4 | 63 | 62 | True | 224 | -224 | 224 | 0.0078125 | 1*2^-10 | 7/4*2^-8 |
p3109_8p5 | 8 | 5 | 3 | 63 | 62 | True | 15 | -15 | 15 | 0.125 | 0.0078125 | 15/8*2^-4 |
p3109_8p6 | 8 | 6 | 2 | 63 | 62 | True | 3.875 | -3.875 | 3.875 | 0.5 | 0.015625 | 31/16*2^-2 |
bfloat16 | 16 | 8 | 8 | 16255 | 16383 | False | 255/128*2^127 | -255/128*2^127 | 255/128*2^127 | 1*2^-126 | 1*2^-133 | 127/64*2^-127 |
ocp_int8 | 8 | 8 | 0 | 63 | 63 | True | 127/64*2^0 | -2 | nan | nan | 0.015625 | 127/64*2^0 |
ocp_e8m0 | 8 | 1 | 8 | 127 | 127 | False | 1*2^127 | 1*2^-127 | 1*2^127 | 1*2^-127 | nan | nan |
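
The first three rows can be cross-checked by brute force, because those MX element formats have no Inf or NaN encodings. The sketch below is not the notebook's code and does not use the library API: `positive_values` and `summarize` are hypothetical helpers, and the bias `2**(E-1) - 1` is an assumed IEEE-style convention that happens to fit these three formats.

```python
def positive_values(k: int, p: int) -> list:
    """All positive values of a k-bit, precision-p format with no special values."""
    t = p - 1                  # stored significand bits
    e = k - p                  # exponent field width
    bias = 2 ** (e - 1) - 1    # ASSUMPTION: IEEE-style bias
    values = []
    for code in range(1, 1 << (k - 1)):  # sign bit clear, zero skipped
        exp_field = code >> t
        frac_field = code & ((1 << t) - 1)
        if exp_field == 0:
            values.append(frac_field * 2.0 ** (1 - bias - t))
        else:
            values.append((1 + frac_field / 2 ** t) * 2.0 ** (exp_field - bias))
    return values


def summarize(k: int, p: int) -> dict:
    """Recompute a few of the table's columns by enumeration."""
    vals = positive_values(k, p)
    return {
        "0<x<1": sum(v < 1 for v in vals),
        "1<x<Inf": sum(v > 1 for v in vals),
        "maxFinite": max(vals),
        "minSubnormal": min(vals),
    }


for name, k, p in [("ocp_e2m1", 4, 2), ("ocp_e2m3", 6, 4), ("ocp_e3m2", 6, 3)]:
    print(name, summarize(k, p))
```

For `ocp_e3m2` this gives 11 values in (0, 1), 19 values above 1, a maximum of 28 and a smallest subnormal of 0.0625, matching the row above.
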
All NaNs are treated as the same value: no distinction is made between signalling and quiet NaNs, or between NaNs with different encodings.
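
To make this concrete, here is a minimal sketch for OCP E5M2, which follows the IEEE 754 convention that an all-ones exponent field with a nonzero significand field encodes NaN. The function `decode_e5m2` is a hypothetical stand-in written for this example, not the library's API; it maps every NaN code point to the same Python `nan`, discarding the sign bit and payload.

```python
import math


def decode_e5m2(code: int) -> float:
    """Decode an 8-bit OCP E5M2 code point (illustrative, not the library API)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp_field = (code >> 2) & 0x1F
    frac_field = code & 0x3

    if exp_field == 0x1F:
        if frac_field == 0:
            return sign * math.inf
        # Every NaN encoding collapses to one value: sign and payload are dropped
        return math.nan
    if exp_field == 0:
        return sign * frac_field * 2.0 ** -16  # subnormal: 2^(1 - 15 - 2)
    return sign * (1 + frac_field / 4) * 2.0 ** (exp_field - 15)


nan_codes = [c for c in range(256) if math.isnan(decode_e5m2(c))]
print([hex(c) for c in nan_codes])
# ['0x7d', '0x7e', '0x7f', '0xfd', '0xfe', '0xff']
```

All six code points decode to an indistinguishable `nan`, so a round trip through such a decoder cannot recover which NaN encoding was originally stored.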