Make unary ops more efficient with non-contiguous inputs #192

robertknight · 2024-05-20T06:18:18Z

Unary operators (e.g. sigmoid, tanh) are much less efficient with non-contiguous inputs. The problem is two-fold:

- For SIMD-vectorized operators (e.g. tanh), the fast path calls a SIMD function that applies the operator to the entire contiguous buffer at once. For non-contiguous inputs, it falls back to iterating over the input and applying the operator to one element at a time. Even worse, the fast path is parallel whereas the fallback is not.
- The slow path for `TensorBase::apply` uses an iterator, which is much less efficient than iterating over contiguous inputs. See also Replace all usage of TensorBase::broadcast_iter #189.
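
To make the two paths concrete, here is a minimal sketch of the dispatch described above. It does not use the library's real types: the flat `data` / `shape` / `strides` representation and the names `is_contiguous` and `apply_in_place` are assumptions made for illustration, and the fast path is a plain loop rather than a parallel SIMD kernel.

```rust
/// Returns true if `strides` describe a dense row-major layout for `shape`.
/// (Hypothetical helper, not part of the library's API.)
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    let mut expected = 1;
    for (&size, &stride) in shape.iter().zip(strides.iter()).rev() {
        // A dimension of size 1 is never stepped, so its stride is irrelevant.
        if size != 1 && stride != expected {
            return false;
        }
        expected *= size;
    }
    true
}

/// Apply `op` in place to a 2D view over `data`.
/// Returns true if the contiguous fast path was taken.
fn apply_in_place(
    data: &mut [f32],
    shape: [usize; 2],
    strides: [usize; 2],
    op: impl Fn(f32) -> f32,
) -> bool {
    if is_contiguous(&shape, &strides) {
        // Fast path: a single pass over one dense slice. This is the place
        // where a vectorized (and possibly parallel) kernel would be called.
        let len = shape[0] * shape[1];
        for x in &mut data[..len] {
            *x = op(*x);
        }
        true
    } else {
        // Fallback: compute the offset of every element individually.
        for i in 0..shape[0] {
            for j in 0..shape[1] {
                let offset = i * strides[0] + j * strides[1];
                data[offset] = op(data[offset]);
            }
        }
        false
    }
}
```

For example, a `[2, 2]` view with strides `[3, 1]` over a six-element buffer (the first two columns of a 2x3 matrix) takes the fallback, even though each row is itself a dense run of two elements.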

A better implementation would be something like:

- Sort dimensions into maximally-contiguous order.
- If the size of the longest contiguous chunks is above a threshold, iterate over the chunks and apply the operator to each one.
- If the size is below the threshold, use nested loops instead of an iterator, as in Replace all usage of TensorBase::broadcast_iter #189.
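
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `sort_dims`, `split_contiguous`, `for_each_offset`, `apply_unary` and `MAX_DIMS` are hypothetical names, views are limited to four dimensions, strides are assumed non-negative and non-overlapping, and a chunk length equal to the threshold is treated as large enough (the text above leaves that case open).

```rust
const MAX_DIMS: usize = 4; // assumption: the sketch handles at most 4 dims

/// Pair each size with its stride and order by descending stride, so the
/// innermost loop walks memory in the most contiguous order available.
fn sort_dims(shape: &[usize], strides: &[usize]) -> Vec<(usize, usize)> {
    let mut dims: Vec<(usize, usize)> =
        shape.iter().copied().zip(strides.iter().copied()).collect();
    dims.sort_by(|a, b| b.1.cmp(&a.1));
    dims
}

/// Merge trailing dims that form one dense run.
/// Returns the remaining outer dims and the length of each contiguous chunk.
fn split_contiguous(dims: &[(usize, usize)]) -> (Vec<(usize, usize)>, usize) {
    let mut chunk_len = 1;
    let mut n_outer = dims.len();
    while n_outer > 0 {
        let (size, stride) = dims[n_outer - 1];
        if size != 1 && stride != chunk_len {
            break;
        }
        chunk_len *= size;
        n_outer -= 1;
    }
    (dims[..n_outer].to_vec(), chunk_len)
}

/// Visit every offset using a fixed number of nested loops, not an iterator.
fn for_each_offset(dims: &[(usize, usize)], mut f: impl FnMut(usize)) {
    assert!(dims.len() <= MAX_DIMS, "sketch only handles up to 4 dims");
    // Left-pad with (size 1, stride 0) so there are always four loops.
    let mut d = [(1usize, 0usize); MAX_DIMS];
    d[MAX_DIMS - dims.len()..].copy_from_slice(dims);
    for i0 in 0..d[0].0 {
        for i1 in 0..d[1].0 {
            for i2 in 0..d[2].0 {
                for i3 in 0..d[3].0 {
                    f(i0 * d[0].1 + i1 * d[1].1 + i2 * d[2].1 + i3 * d[3].1);
                }
            }
        }
    }
}

/// Apply `op` in place to a strided view. Returns the chunk length found.
fn apply_unary(
    data: &mut [f32],
    shape: &[usize],
    strides: &[usize],
    threshold: usize,
    op: impl Fn(f32) -> f32,
) -> usize {
    let dims = sort_dims(shape, strides);
    let (outer, chunk_len) = split_contiguous(&dims);
    if chunk_len >= threshold {
        // Large chunks: process each dense slice in one go. A vectorized
        // kernel would be called on the slice here.
        for_each_offset(&outer, |offset| {
            for x in &mut data[offset..offset + chunk_len] {
                *x = op(*x);
            }
        });
    } else {
        // Small chunks: plain nested loops over individual elements.
        for_each_offset(&dims, |offset| data[offset] = op(data[offset]));
    }
    chunk_len
}
```

With this ordering, a transposed view of a dense 2x3 buffer (shape `[3, 2]`, strides `[1, 3]`) collapses into a single chunk of six elements, so an element-wise operator can treat it exactly like a contiguous input. A column slice of a 3x4 buffer (shape `[3, 2]`, strides `[4, 1]`) yields three chunks of two elements, which falls below a threshold of four and goes through the nested loops.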

Once this is done, the RNN operators (e.g. GRU, LSTM) can apply activations to the gates in place instead of copying the gates first, which reduces copying.

Change TensorBase::apply to avoid using an iterator (Improve slow-path performance for unary ops #223)
Add better slow path for SIMD-vectorized unary ops

robertknight added the performance Issues that affect model inference or loading performance label May 20, 2024

robertknight added a commit that referenced this issue May 20, 2024

Copy hidden gate in GRU op before applying activation

f0d7b9c

This is a workaround needed because `tanh_in_place` is very slow with non-contiguous inputs. See #192.

robertknight mentioned this issue May 20, 2024

Combine matmuls for GRU gates #188

Merged

2 tasks

robertknight added a commit that referenced this issue May 20, 2024

Copy hidden gate in GRU op before applying activation

53d9e41

This is a workaround needed because `tanh_in_place` is very slow with non-contiguous inputs. See #192.

robertknight added a commit that referenced this issue May 20, 2024

Copy update and reset gates in GRU op before applying sigmoid activation

cf790fa

This is a workaround until #192 is solved more generally.

robertknight mentioned this issue May 20, 2024

Copy gates in LSTM, GRU ops before applying activations #193

Merged

robertknight added a commit that referenced this issue May 20, 2024

Copy update and reset gates in GRU op before applying sigmoid activation

e5c6a7f

This is a workaround until #192 is solved more generally.

robertknight added a commit that referenced this issue May 31, 2024

Improve slow-path performance for unary ops

146ef2a

Replace iterators with a pattern that uses a fixed number of nested loops. The same approach was previously applied to binary and ternary ops. Part of #192.

robertknight mentioned this issue May 31, 2024

Improve slow-path performance for unary ops #223

Merged
