As in #3, I've tried reproducing the demo.py benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned that this is due to the software stack bottlenecking performance for the low-precision kernels, and that a real speedup would only be seen through integration with a serving engine.
Could you explain further which CPU-side processes are bottlenecking performance, and how TensorRT-LLM overcomes the issue? @ilil96
I’m not entirely sure where the overhead occurs, but it’s clear that single-core CPU performance matters for our demo. The CPU software stack wrapping the GPU kernels appears to slow down on less performant CPUs, delaying the launch of GPU kernels as they are needed. If you profile our demo code with nsys on a platform with low single-core performance, which is common for server-grade CPUs, you’ll likely see large gaps between kernel executions that nullify the benefits of the optimized kernels. LLM inference engines like TensorRT-LLM address this issue with techniques like CUDA Graphs.
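To make the CUDA Graph point concrete, here is a minimal sketch (not from our codebase; the shapes, layer count, and iteration counts are arbitrary) comparing eager launches of a chain of small fp16 matmuls against replaying the same chain from a captured graph. On a machine with slow single-core launch overhead, the gap between the two timings should be large:

```python
import torch

# Chain of small fp16 matmuls; each eager call pays CPU-side launch overhead,
# which dominates when the kernels themselves finish quickly.
layers = [torch.nn.Linear(4096, 4096, bias=False).half().cuda() for _ in range(8)]
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)

def step(t):
    for layer in layers:
        t = layer(t)
    return t

# Warmup on a side stream, as required before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step(x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole chain; replaying it costs a single CPU-side launch,
# which removes the per-kernel gaps visible in an nsys timeline
# (e.g., `nsys profile -o trace python demo.py`).
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = step(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    step(x)        # eager: one CPU launch per kernel
end.record()
torch.cuda.synchronize()
print("eager  :", start.elapsed_time(end), "ms")

start.record()
for _ in range(100):
    g.replay()     # graphed: one CPU launch per iteration
end.record()
torch.cuda.synchronize()
print("graphed:", start.elapsed_time(end), "ms")
```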
I've tried using the PyTorch CUDA graph API to capture the AnyPrecisionForCausalLM module, but haven't had any success. Would it be possible to release the version of the code used to obtain the benchmark results in the paper, even if it's incomplete?
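For reference, the capture pattern I've been attempting looks roughly like the sketch below, with a stand-in nn.Sequential in place of AnyPrecisionForCausalLM (the real model would also need a preallocated, fixed-shape KV cache, which I suspect is where my attempts break down):

```python
import torch

# Stand-in model used purely for illustration. CUDA graph capture requires
# every tensor shape and address to stay constant across replays, so the real
# model would also need a static, preallocated KV cache.
model = torch.nn.Sequential(
    torch.nn.Embedding(32000, 512),
    torch.nn.Linear(512, 32000, bias=False),
).cuda()

static_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda")  # static input buffer

# Warmup on a side stream is required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_logits = model(static_ids)
torch.cuda.current_stream().wait_stream(s)

# Capture one fixed-shape decode step.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_logits = model(static_ids)

# Per decoded token: write the new token into the static input buffer, replay
# the graph (a single CPU-side launch), and read from the static output.
static_ids.copy_(torch.tensor([[42]], device="cuda"))
g.replay()
next_token = static_logits.argmax(dim=-1)
print(next_token)
```

My understanding is that generate() changes tensor shapes every step as the KV cache grows, which violates the fixed-shape requirement of graph capture, and that engines like TensorRT-LLM work around this by capturing fixed-shape decode steps over preallocated caches — but I haven't been able to replicate that here.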