Raffaello Giulietti, v2023-10-30-02
In terms of running time, division is the most expensive of the integer arithmetical/logical operations on contemporary CPUs.
Given an integer compile-time constant `d`, it is possible to replace later computations of `x / d` with faster code producing the same outcome, where the integer `x` is not known at compile-time.
Preparatory work, done at compile-time and depending on `d` alone, leads to faster code for the division later on at run-time.
Detailed proofs of the core algorithm presented here can be found elsewhere, in §10.1.
- We write $-2^k$ for $-(2^k)$, although the former really means $(-2)^k$ (negation binds tighter than binary operations like exponentiation).
- $W$ denotes the word size (e.g., $32$ or $64$).
- $/$ denotes usual division over the real numbers (when not appearing in code as `/`).
- $\div$ denotes truncating division over the real numbers: $x \div d = \lfloor x / d\rfloor$ if $x / d \ge 0$, $x \div d = \lceil x / d\rceil$ otherwise, where $d \ne 0$. In Java code, with `int x, d`, and absent overflows, `x / d` ${}= x \div d$.
Let integer
Compute
Here,
so
Then, for integer
and for integer
All computations can be done in signed
For integer
The remainder
and can be computed using
To check whether an unsigned `int d` (an `int` interpreted as an unsigned value) is a power of $2$:
(d & (d - 1)) == 0; // holds iff d is a power of 2, or if d = 0
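As a minimal sketch, the check can be wrapped in a helper and tried on a few values. The class and method names here are hypothetical and chosen only for illustration.

```java
class PowerOfTwoCheck {
    // True iff d, read as an unsigned 32-bit value, is a power of 2 or is 0.
    static boolean isPowerOfTwoOrZero(int d) {
        return (d & (d - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isPowerOfTwoOrZero(8));                 // true
        System.out.println(isPowerOfTwoOrZero(12));                // false
        System.out.println(isPowerOfTwoOrZero(Integer.MIN_VALUE)); // true: 2^31 when read as unsigned
        System.out.println(isPowerOfTwoOrZero(0));                 // true: the d = 0 special case
    }
}
```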
int m = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(d);
long c = (1L << m) / d + 1; // the division only happens at compile-time
No overflows occur.
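To make the two lines above concrete, the following sketch evaluates them for a few sample divisors. The class name, the method names and the choice of divisors are assumptions made for illustration. In a real compiler these values would be computed once, at compile-time.

```java
class MagicConstants {
    // Exponent m for a positive divisor d that is not a power of 2.
    static int exponent(int d) {
        return (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(d);
    }

    // Multiplier c = floor(2^m / d) + 1, computed in long arithmetic.
    static long multiplier(int d) {
        return (1L << exponent(d)) / d + 1;
    }

    public static void main(String[] args) {
        // Prints:
        //   d = 3, m = 33, c = 0xAAAAAAAB
        //   d = 7, m = 34, c = 0x92492493
        //   d = 10, m = 35, c = 0xCCCCCCCD
        for (int d : new int[] {3, 7, 10}) {
            System.out.printf("d = %d, m = %d, c = 0x%X%n", d, exponent(d), multiplier(d));
        }
    }
}
```

All three multipliers exceed `Integer.MAX_VALUE`, which is why `c` is kept in a `long` here.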
Here, `c` could fit in an unsigned `int`.
If so, below it must be masked with `0xFFFF_FFFFL` before multiplication.
Since about half of the admitted divisors are even, one can apply a reduction step as follows:
int k = Integer.numberOfTrailingZeros((int) c);
c >>>= k;
m -= k;
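As a worked illustration of the reduction step, the sketch below applies it to the divisor 9 and then uses the reduced constants in the multiply-and-shift scheme. The class and method names are hypothetical, and the constants are recomputed at run-time only to keep the example self-contained.

```java
class ReducedConstants {
    // Returns {c, m} for a positive divisor d that is not a power of 2,
    // after stripping the trailing zero bits of c.
    static long[] reduced(int d) {
        int m = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(d);
        long c = (1L << m) / d + 1;
        int k = Integer.numberOfTrailingZeros((int) c);
        c >>>= k;
        m -= k;
        return new long[] {c, m};
    }

    // Signed division by d using the reduced constants (illustration only).
    static int divide(int x, int d) {
        long[] cm = reduced(d);
        long p = x * cm[0];
        int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
        return (int) (p >> (int) cm[1]) + s;
    }

    public static void main(String[] args) {
        long[] cm = reduced(9);
        // For d = 9: c shrinks from 0xE38E38E4 (m = 35) to 0x38E38E39 (m = 33).
        System.out.printf("c = 0x%X, m = %d%n", cm[0], cm[1]);
        System.out.println(divide(100, 9));  // 11
        System.out.println(divide(-100, 9)); // -11
    }
}
```

For this divisor the reduced `c` is below $2^{31}$, so it fits in a signed `int`.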
As alluded to, in about half of the cases this leads to a `c` that fits in an `int`.
For int
divisors, the exponent
It turns out that, in the int
case,
long p = x * c; // to reduce latency, schedule the product before computing s
int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
int q = (int) (p >> m) + s;
again without overflows.
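The three lines above can be exercised against Java's built-in division. The sketch below fixes the divisor to 7 as an arbitrary example. The class name, the constant names and the sample values are assumptions for illustration, not part of the original text.

```java
class DivideByConstant {
    static final int D = 7; // example divisor: positive, not a power of 2
    static final int M = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(D);
    static final long C = (1L << M) / D + 1;

    // Computes x / D without a division instruction.
    static int divide(int x) {
        long p = x * C;                   // 64-bit product, cannot overflow
        int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
        return (int) (p >> M) + s;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 6, 7, 100, -1, -6, -7, -100,
                Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (divide(x) != x / D) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("all samples agree with x / " + D);
    }
}
```

The correction term `s` is what turns the flooring shift into truncation toward zero for negative `x`.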
If accessing the high `int` half of a `long`, and shifting it, is faster than just shifting a `long`, the line for `q` can be replaced by
int q = (high_half(p) >> (m - Integer.SIZE)) + s;
where the shift distance `m - Integer.SIZE` is a compile-time constant.
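A sketch of this variant follows. Here `highHalf` is a hypothetical stand-in for the `high_half` operation, modelled as an arithmetic shift by 32 followed by a narrowing cast. Real code generators would read the upper register of the product directly. The divisor 7 is again an arbitrary example.

```java
class HighHalfDivision {
    static final int D = 7; // example divisor: positive, not a power of 2
    static final int M = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(D);
    static final long C = (1L << M) / D + 1;

    // Hypothetical model of high_half: the upper 32 bits of p as a signed int.
    static int highHalf(long p) {
        return (int) (p >> Integer.SIZE);
    }

    // Computes x / D by shifting only the high half of the product.
    static int divide(int x) {
        long p = x * C;
        int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
        return (highHalf(p) >> (M - Integer.SIZE)) + s;
    }

    public static void main(String[] args) {
        System.out.println(divide(100));  // 14
        System.out.println(divide(-100)); // -14
        System.out.println(divide(-7));   // -1
    }
}
```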
Of course, in compile-time contexts where `x` is known to be non-negative, the division can be simplified to the one-liner
int q = (int) (x * c >> m);
respectively
int q = high_half(x * c) >> (m - Integer.SIZE);
Similarly when `x` is known to be negative.
Computing `-d` overflows to itself when `d = Integer.MIN_VALUE`, but this is not an issue: indeed, when seen as an unsigned value, this is a power of $2$.
int m = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(-d);
long c = (-1L << m) / d + 1; // the division only happens at compile-time
long p = x * c; // to reduce latency, schedule the product before computing s
int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
int q = -((int) (p >> m) + s);
with similar variations as above.
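The negative-divisor code can be checked the same way. The sketch below uses `-7` as an arbitrary example divisor, and the class and constant names are hypothetical.

```java
class DivideByNegativeConstant {
    static final int D = -7; // example divisor: negative, -D not a power of 2
    static final int M = (2 * Integer.SIZE - 1) - Integer.numberOfLeadingZeros(-D);
    static final long C = (-1L << M) / D + 1;

    // Computes x / D: divide by |D| as in the positive case, then negate.
    static int divide(int x) {
        long p = x * C;
        int s = x >>> (Integer.SIZE - 1); // 0 if x >= 0; 1 if x < 0
        return -((int) (p >> M) + s);
    }

    public static void main(String[] args) {
        System.out.println(divide(100));               // -14
        System.out.println(divide(-100));              // 14
        System.out.println(divide(Integer.MIN_VALUE)); // 306783378
    }
}
```

For this divisor, `C` has the same value as the multiplier for `7`, so only the final negation differs from the positive case.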