This document discusses various methods of reducing PSGD's memory and compute overhead, as well as the trade-offs involved.
`triu_as_line` is an argument that reduces the preconditioner (`Q`) storage overhead by storing only the upper triangle of the triangular `Q` as a 1D array, roughly halving its memory usage.
This comes at the cost of having to remap the 1D array back to a 2D array every time the preconditioner is used, which requires significant memory bandwidth.
`triu_as_line` is enabled by default and can be disabled by setting it to `False`.
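As a rough illustration of the trade-off (a minimal sketch, not the library's actual implementation), the idea is to pack the upper triangle of `Q` into a flat buffer and rebuild the 2D matrix whenever the preconditioner is applied; the function names below are made up for this example:

```python
import torch

def pack_triu(q: torch.Tensor) -> torch.Tensor:
    """Store only the upper-triangular entries of q as a flat 1D tensor."""
    n = q.shape[0]
    rows, cols = torch.triu_indices(n, n)
    return q[rows, cols]  # ~n*(n+1)/2 values instead of n*n

def unpack_triu(line: torch.Tensor, n: int) -> torch.Tensor:
    """Rebuild the full 2D triangular matrix; this copy is what costs memory bandwidth."""
    q = line.new_zeros(n, n)
    rows, cols = torch.triu_indices(n, n)
    q[rows, cols] = line
    return q

n = 4
q = torch.triu(torch.randn(n, n))
line = pack_triu(q)                      # roughly half the storage of the dense matrix
assert torch.equal(unpack_triu(line, n), q)
```

The unpack step runs on every use of the preconditioner, which is why the savings in storage show up as extra step time.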
A high-overhead test case (`python3 xor_digit.py --batch 16 --size 1024 --length 4 --depth 1`) showed that the total step time may increase by up to ~58% when training with `triu_as_line=True`.
Larger batch sizes help amortize this overhead.
For `PSGDKron`, there is an alternative variant, `CachedPSGDKron`.
`PSGDKron` computes the preconditioning matrix on the fly from the triangular `Q`. However, this preconditioner can also be precomputed and reused across steps. This reduces the per-step cost of computing the preconditioner, but doubles the memory overhead.
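The following simplified, single-factor sketch (not the actual Kronecker-factored implementation) shows where the trade-off comes from, assuming the PSGD-style preconditioner is formed from `Q` as `P = Qᵀ Q`:

```python
import torch

def precondition_on_the_fly(q: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # PSGDKron-style: rebuild the preconditioner every step.
    # Extra matmul per step, but no extra persistent storage.
    p = q.T @ q
    return p @ grad

class CachedPreconditioner:
    # CachedPSGDKron-style: keep the preconditioner around and refresh it
    # only when Q changes. Faster steps, but roughly doubles Q's memory footprint.
    def __init__(self, q: torch.Tensor):
        self.q = q
        self.p = q.T @ q

    def refresh(self) -> None:
        # Called after Q has been updated.
        self.p = self.q.T @ self.q

    def apply(self, grad: torch.Tensor) -> torch.Tensor:
        return self.p @ grad
```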
If the doubled memory cost of `CachedPSGDKron` is too high, it can be combined with `triu_as_line=True`, which reduces the total memory cost from 2x `Q` to 1.5x `Q`.
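A hypothetical usage sketch of the three configurations, assuming the optimizers are exposed as standard PyTorch optimizers that accept the `triu_as_line` keyword described above (the import path, learning rate, and other constructor arguments are assumptions; check the library's API):

```python
import torch
from torch import nn

# from <your PSGD package> import PSGDKron, CachedPSGDKron  # import path is an assumption

model = nn.Linear(1024, 1024)

# Lowest memory: recompute the preconditioner each step, store Q as a 1D line (default).
# opt = PSGDKron(model.parameters(), lr=1e-3, triu_as_line=True)

# Fastest steps, highest memory: cache the preconditioner, store Q densely.
# opt = CachedPSGDKron(model.parameters(), lr=1e-3, triu_as_line=False)

# Middle ground: cached preconditioner plus packed Q (~1.5x Q total).
# opt = CachedPSGDKron(model.parameters(), lr=1e-3, triu_as_line=True)
```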