XLA loops without a kernel launch on each iteration #16186

carlosgmartin · 2024-08-17T18:10:23Z

Copied from here, since the XLA repo might be a better place to discuss this.

My understanding is that, currently, each iteration of jax.lax.scan requires a kernel launch on GPU backends. This causes an appreciable performance penalty.

For context, consider the following comments:

July 2020:

Perhaps the best thing XLA:GPU could do would be to lower it into a single kernel, since that would minimize overheads and maximize optimization opportunities. But XLA:GPU can't (yet) generate a single kernel for whole loops.
[...]
The upshot of all this is that XLA:GPU doesn't (yet) do the best with some loops. There could be some fundamental limits based on the GPU programming model or the tools NVIDIA provides for generating GPU programs, but I suspect we're not at those limits yet and more can be done with more investment in XLA:GPU. So the best policy is to send love and support towards XLA:GPU developers (both on Google compiler teams and in open source, including at NVIDIA) so we can make this thing we love even better!

March 2021:

Based on some comments from @hawkinsp, I think the issue is that on GPU each iteration of the loop gets turned into a separate kernel launch, which is a current limitation of XLA:GPU. Our options for improving this include:

help XLA:GPU folks make loops faster (by gaining the ability to generate a single kernel launch for some loops);

write a custom GPU kernel (like we do for PRNG sampling, because there is a bad compile time / execution time tradeoff on GPU);

partially unroll this loop at the JAX Python level.

The first fix seems like the right one for the long term, but I don't know when it'll happen. The last one is more of a mitigation, but it seems like quite a good one, especially since we can just write the fori_loop as a scan, and use scan's built-in unroll value.
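To make the mitigation above concrete, here is a minimal sketch of rewriting a fori_loop as a scan with partial unrolling. The loop body and the names `body`, `with_fori_loop`, and `with_scan` are hypothetical, chosen only for illustration; `unroll` is the existing keyword argument of jax.lax.scan.

```python
import jax
import jax.numpy as jnp

def body(i, x):
    # Hypothetical cheap per-iteration update, for illustration only.
    return x * 0.5 + i

def with_fori_loop(x0, n):
    # Baseline: the loop expressed with fori_loop.
    return jax.lax.fori_loop(0, n, body, x0)

def with_scan(x0, n, unroll=4):
    def step(x, i):
        # scan's body takes (carry, per-step input) and
        # returns (new carry, per-step output).
        return body(i, x), None
    # Scan over the iteration indices; `unroll` asks scan to place
    # several iterations inside each step of the underlying loop.
    final, _ = jax.lax.scan(step, x0, jnp.arange(n), unroll=unroll)
    return final

x0 = jnp.ones(8)
a = with_fori_loop(x0, 16)
b = with_scan(x0, 16, unroll=4)
```

Both functions should compute the same values. Whether the unrolled version is actually faster depends on the backend and on how cheap each iteration is, so it is worth measuring.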

July 2021:

You could try using lax.scan along with its unroll parameter. That should save on kernel launches, though on GPU the kernel launch overheads might still be high. (On TPU the whole jitted computation is always compiled into one 'kernel', so there are no analogous overheads there.)

May 2023:

On GPU scans can significantly degrade execution performance relative to expressing the same thing with a Python for loop. The reason is that the Python loop gets unrolled into the staged-out and compiled computation, effectively inlining its operations and allowing all XLA optimizations. In contrast, the scan computation gets staged out to a (rolled) loop operation, and loops have high overhead on GPU because each iteration corresponds to a kernel launch. (On CPU there are no kernels being launched, and on TPU the entire program is compiled into one device program, so neither of those backends have the same issue; it's specific to GPU.)
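The difference between the two staging behaviours described above can be inspected with jax.make_jaxpr. The sketch below is illustrative only: `step`, `python_loop`, `rolled_scan`, and `N_STEPS` are hypothetical names, and the jaxpr shows what JAX stages out, not the kernels XLA:GPU finally emits.

```python
import jax
import jax.numpy as jnp

N_STEPS = 4  # hypothetical small trip count, for readability of the jaxprs

def step(x):
    # Hypothetical cheap loop body.
    return x * 0.9 + 1.0

def python_loop(x):
    # The Python loop runs at trace time, so its operations are
    # inlined N_STEPS times into the staged-out computation.
    for _ in range(N_STEPS):
        x = step(x)
    return x

def rolled_scan(x):
    def body(carry, _):
        return step(carry), None
    # The scan is staged out as a single (rolled) loop operation.
    final, _ = jax.lax.scan(body, x, None, length=N_STEPS)
    return final

x = jnp.ones(3)
loop_jaxpr = str(jax.make_jaxpr(python_loop)(x))
scan_jaxpr = str(jax.make_jaxpr(rolled_scan)(x))
print(loop_jaxpr)  # repeated multiply/add equations, no loop primitive
print(scan_jaxpr)  # a single scan equation wrapping the body
```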

September 2023:

On GPU, XLA's 'scan' (fori_loop) implementation launches multiple calls to the body_fun GPU kernel, whereas a fully unrolled scan can be fused into a single kernel launch.

February 2024:

I think the issue you're running into is related to the execution model of loops (fori_loop, while_loop, and scan) on GPU. For GPU backends, each iteration effectively requires a kernel launch, so if you have very cheap iterations it can lead to a lot of overhead.
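One rough way to check whether cheap iterations are dominated by per-iteration overhead is to time the same scan at several unroll factors. This is only a sketch: the body, the trip count of 1000, the unroll values, and the helper names `make_runner` and `time_once` are arbitrary assumptions, and a single timed call is a coarse measurement.

```python
import time
import jax
import jax.numpy as jnp

def make_runner(unroll):
    @jax.jit
    def run(x):
        def body(carry, _):
            # Deliberately cheap iteration, so that loop overhead
            # is a large share of the total run time.
            return carry * 0.999 + 0.001, None
        final, _ = jax.lax.scan(body, x, None, length=1000, unroll=unroll)
        return final
    return run

def time_once(fn, x):
    fn(x).block_until_ready()  # warm-up call, includes compilation
    start = time.perf_counter()
    fn(x).block_until_ready()  # timed call, waits for the device to finish
    return time.perf_counter() - start

x = jnp.ones(16)
timings = {u: time_once(make_runner(u), x) for u in (1, 10, 100)}
for u, t in timings.items():
    print(f"unroll={u}: {t * 1e3:.3f} ms")
```

If the timings drop noticeably as the unroll factor grows, that is consistent with the per-iteration overhead described above. Larger unroll factors also increase compile time, so the comparison is a tradeoff, not a free win.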

My question is this: Is this a fundamental limitation of XLA and/or GPU hardware? Can it be resolved? The first two comments above suggest it's possible. If so, is this currently being discussed or worked on somewhere?
