
IEEE precisions

Goran Flegar edited this page Dec 5, 2018 · 7 revisions

This document collects useful information about the IEEE 754 floating-point standard.

Round-off error

The relative error introduced by rounding depends on the number of bits $s$ used for the significand and on the rounding mode:

  • with round-to-nearest mode, rounding effectively gains one additional bit, and the relative error is at most $2^{-(s+1)}$
  • with round-to-zero mode, the relative error is at most $2^{-s}$
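The two bounds above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `roundoff_error` and `decimal_digits` are illustrative and are not part of any library.

```python
import math
import sys


def roundoff_error(s, mode="nearest"):
    """Relative round-off error bound for s stored significand bits.

    Hypothetical helper: mode is "nearest" (round-to-nearest) or
    "zero" (round-to-zero), matching the two cases listed above.
    """
    if mode == "nearest":
        return 2.0 ** -(s + 1)
    if mode == "zero":
        return 2.0 ** -s
    raise ValueError("mode must be 'nearest' or 'zero'")


def decimal_digits(error):
    """Number of decimal digits corresponding to a relative error."""
    return -math.log10(error)


# Python floats are IEEE doubles (s = 52), so the round-to-zero bound
# coincides with the machine epsilon reported by the interpreter.
print(roundoff_error(52, "zero") == sys.float_info.epsilon)
print(decimal_digits(roundoff_error(23)))  # single precision, round-to-nearest
```

For `s = 52`, the round-to-nearest bound is half of `sys.float_info.epsilon`, which is the usual definition of the unit round-off in double precision.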

Range

The range depends on the number of exponent bits $e$. The bits encode an unsigned integer $u$, and the fixed-point base-2 number $1.s \ldots s$ (where $s \ldots s$ is the significand) is multiplied by $2^{u - \mathrm{bias}}$, where $\mathrm{bias} = 2^{e-1} - 1$. The largest and the smallest values of $u$ are reserved: the largest ($u = 2^e - 1$) for infinity and NaN, the smallest ($u = 0$) for zero and denormalized numbers. The largest usable exponent is therefore reached for $u = 2^e - 2$, which gives $u - \mathrm{bias} = 2^e - 2 - (2^{e-1} - 1) = 2^e - 2^{e-1} - 1 = 2^{e-1} - 1$. The largest representable value is thus:

$$\max(e, s) = 2^{2^{e-1} - 1} \cdot 1.1 \ldots 1 = 2^{2^{e-1} - 1} \cdot (2 - 2^{-s})$$

The smallest exponent is reached for $u = 1$, so $u - \mathrm{bias} = 1 - (2^{e-1} - 1) = 1 - 2^{e-1} + 1 = -(2^{e-1} - 2)$. Thus, the smallest (normalized) representable value is:

$$\min(e, s) = 2^{-(2^{e-1} - 2)} \cdot 1.0 \ldots 0 = 2^{-(2^{e-1} - 2)}$$
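As a sanity check, the formulas for the bias, the largest value and the smallest normalized value can be evaluated directly. This is a sketch under the assumption that the results fit in a Python float (an IEEE double), so it only works for $e \le 11$; the function names are illustrative.

```python
import sys


def bias(e):
    """Exponent bias for e exponent bits: 2^(e-1) - 1."""
    return 2 ** (e - 1) - 1


def max_value(e, s):
    """Largest finite value: 2^(2^(e-1) - 1) * (2 - 2^-s)."""
    return 2.0 ** bias(e) * (2.0 - 2.0 ** -s)


def min_normal(e):
    """Smallest normalized value: 2^-(2^(e-1) - 2)."""
    return 2.0 ** -(bias(e) - 1)


# For double precision (e = 11, s = 52) the formulas reproduce the
# limits reported by the interpreter.
print(max_value(11, 52) == sys.float_info.max)
print(min_normal(11) == sys.float_info.min)

# Half precision (e = 5, s = 10).
print(max_value(5, 10), min_normal(5))
```

Note that `min_normal` depends only on `e`, since the significand of the smallest normalized number is exactly 1.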

Table of useful properties

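| name   | #bits | e  | s  | R2N error     | R2N digits | R2Z error    | R2Z digits | min       | max       |
|--------|-------|----|----|---------------|------------|--------------|------------|-----------|-----------|
| double | 64    | 11 | 52 | 1.11e-16      | 15.95      | 2.22e-16     | 15.65      | 2.23e-308 | 1.80e+308 |
|        | 32    | 11 | 20 | 4.77e-7       | 6.32       | 9.54e-7      | 6.02       | 2.23e-308 | 1.80e+308 |
|        | 16    | 11 | 4  | 0.03125       | 1.51       | 0.0625       | 1.20       | 2.23e-308 | 1.74e+308 |
| single | 32    | 8  | 23 | 5.96e-8       | 7.22       | 1.19e-7      | 6.92       | 1.18e-38  | 3.40e+38  |
|        | 16    | 8  | 7  | 0.00391       | 2.41       | 0.0078125    | 2.11       | 1.18e-38  | 3.39e+38  |
| half   | 16    | 5  | 10 | 0.00048828125 | 3.31       | 0.0009765625 | 3.01       | 6.10e-5   | 6.55e+4   |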
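The rows of the table can be re-derived from `e` and `s` alone. The sketch below assumes one sign bit in addition to the exponent and significand bits (which matches the #bits column in every row); `properties` is an illustrative name, and R2N / R2Z stand for round-to-nearest and round-to-zero.

```python
import math


def properties(e, s):
    """Return the table columns for e exponent bits and s significand bits."""
    r2n = 2.0 ** -(s + 1)      # round-to-nearest error bound
    r2z = 2.0 ** -s            # round-to-zero error bound
    emax = 2 ** (e - 1) - 1    # largest exponent, equal to the bias
    return {
        "bits": 1 + e + s,     # sign bit + exponent bits + significand bits
        "r2n_error": r2n,
        "r2n_digits": -math.log10(r2n),
        "r2z_error": r2z,
        "r2z_digits": -math.log10(r2z),
        "min": 2.0 ** -(emax - 1),
        "max": 2.0 ** emax * (2.0 - r2z),
    }


# Print the same (e, s) combinations as in the table.
for e, s in [(11, 52), (11, 20), (11, 4), (8, 23), (8, 7), (5, 10)]:
    p = properties(e, s)
    print(
        f"{p['bits']:>2} {e:>2} {s:>2} "
        f"{p['r2n_error']:.2e} {p['r2n_digits']:.2f} "
        f"{p['r2z_error']:.2e} {p['r2z_digits']:.2f} "
        f"{p['min']:.2e} {p['max']:.2e}"
    )
```

Because the computation is carried out in Python floats, the sketch is limited to formats whose range does not exceed that of an IEEE double.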