IEEE precisions
This document collects useful information about the IEEE 754 floating-point standard.
The relative rounding error depends on the number of bits used for the significand, $s$, and on the rounding mode:
- with round-to-nearest mode, the result is effectively accurate to one additional bit, and the error is at most $2^{-(s+1)}$
- with round-to-zero mode, the error is at most $2^{-s}$
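
The two bounds above can be sketched in a few lines of Python. The function name `rounding_error` and its `mode` strings are illustrative choices, not part of any library; `math.ldexp(1.0, k)` is used only as an exact way to form $2^k$.

```python
import math


def rounding_error(s, mode="nearest"):
    """Relative rounding error bound for a significand with s stored bits.

    `mode` is a hypothetical switch introduced for this sketch:
    "nearest" for round-to-nearest, "zero" for round-to-zero.
    """
    if mode == "nearest":
        return math.ldexp(1.0, -(s + 1))  # 2^-(s+1)
    if mode == "zero":
        return math.ldexp(1.0, -s)  # 2^-s
    raise ValueError(f"unknown rounding mode: {mode!r}")


if __name__ == "__main__":
    # (e, s) layouts taken from the table in this document.
    for name, s in [("double", 52), ("single", 23), ("half", 10)]:
        print(name, rounding_error(s, "nearest"), rounding_error(s, "zero"))
```

Since Python's `float` is an IEEE 754 double on common platforms, `rounding_error(52, "zero")` should coincide with `sys.float_info.epsilon` there.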
The range depends on the number of exponent bits $e$. These bits encode an unsigned integer $u$, and the fixed-point base-2 number $1.s \ldots s$ (where $s \ldots s$ are the significand bits) is multiplied by $2^{u - \mathrm{bias}}$, where $\mathrm{bias} = 2^{e-1} - 1$. The largest and the smallest values of $u$ are reserved (the largest for infinity and NaN, the smallest for zero and denormalized numbers), so the largest usable exponent is reached for $u = 2^e - 2$, which gives $u - \mathrm{bias} = 2^e - 2^{e-1} - 1 = 2^{e-1} - 1$. The largest representable value is therefore:

$$\max(e, s) = 2^{2^{e-1} - 1} \cdot 1.1 \ldots 1 = 2^{2^{e-1} - 1} \cdot (2 - 2^{-s})$$
The smallest exponent is reached for $u = 1$, so $u - \mathrm{bias} = 1 - 2^{e-1} + 1 = -(2^{e-1} - 2)$. Thus, the smallest (normalized) representable value is:

$$\min(e, s) = 2^{-(2^{e-1} - 2)} \cdot 1.0 \ldots 0 = 2^{-(2^{e-1} - 2)}$$
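
As a rough check of the range formulas, the sketch below evaluates them exactly with Python's `fractions.Fraction`, so that no rounding enters before the final conversion to `float`. The helper names `bias`, `max_value` and `min_normal` are made up for this example.

```python
from fractions import Fraction


def bias(e):
    """Exponent bias for e exponent bits: 2^(e-1) - 1."""
    return 2 ** (e - 1) - 1


def max_value(e, s):
    """Largest finite value: 2^(2^(e-1) - 1) * (2 - 2^-s), computed exactly."""
    largest_exponent = (2 ** e - 2) - bias(e)  # u = 2^e - 2
    return Fraction(2) ** largest_exponent * (2 - Fraction(1, 2 ** s))


def min_normal(e):
    """Smallest normalized value: 2^-(2^(e-1) - 2), computed exactly."""
    smallest_exponent = 1 - bias(e)  # u = 1, a negative exponent
    return Fraction(1, 2 ** (-smallest_exponent))


if __name__ == "__main__":
    # (e, s) pairs for double, single and half precision.
    for name, e, s in [("double", 11, 52), ("single", 8, 23), ("half", 5, 10)]:
        print(name, float(min_normal(e)), float(max_value(e, s)))
```

For half precision (`e = 5`, `s = 10`) the exact maximum is the integer 65504, and for double precision the results should match `sys.float_info.min` and `sys.float_info.max` on platforms where `float` is an IEEE 754 double.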
| name | #bits | e | s | R2N error | R2N digits | R2Z error | R2Z digits | min | max |
|---|---|---|---|---|---|---|---|---|---|
| double | 64 | 11 | 52 | 1.11e-16 | 15.95 | 2.22e-16 | 15.65 | 2.23e-308 | 1.80e+308 |
| | 32 | 11 | 20 | 4.77e-7 | 6.32 | 9.54e-7 | 6.02 | 2.23e-308 | 1.80e+308 |
| | 16 | 11 | 4 | 0.03125 | 1.51 | 0.0625 | 1.20 | 2.23e-308 | 1.74e+308 |
| single | 32 | 8 | 23 | 5.96e-8 | 7.22 | 1.19e-7 | 6.92 | 1.18e-38 | 3.40e+38 |
| | 16 | 8 | 7 | 0.00391 | 2.41 | 0.0078125 | 2.11 | 1.18e-38 | 3.39e+38 |
| half | 16 | 5 | 10 | 0.00048828125 | 3.31 | 0.0009765625 | 3.01 | 6.10e-5 | 6.55e+4 |
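
The "digits" columns are not defined in the text; they appear to be the number of decimal digits corresponding to each error bound, that is, $-\log_{10}$ of the error. The sketch below works under that assumption and recomputes the values for the three named formats. `decimal_digits` is a name chosen for this example.

```python
import math


def decimal_digits(error):
    """Decimal digits matching a relative error bound (assumed: -log10(error))."""
    return -math.log10(error)


if __name__ == "__main__":
    # Significand widths s taken from the table above.
    for name, s in [("double", 52), ("single", 23), ("half", 10)]:
        r2n = decimal_digits(2.0 ** -(s + 1))  # round-to-nearest bound
        r2z = decimal_digits(2.0 ** -s)  # round-to-zero bound
        print(f"{name}: R2N {r2n:.2f} digits, R2Z {r2z:.2f} digits")
```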