As in #3, I've tried reproducing the demo.py benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned that this is due to the software stack bottlenecking performance for the low-precision kernels, and that a real speedup would only be seen through integration with a serving engine.
Could you explain further which CPU-side processes are bottlenecking performance, and how TensorRT-LLM overcomes the issue? @ilil96
I’m not entirely sure where the overhead occurs, but it’s clear that single-core CPU performance matters for our demo. The CPU software stack wrapping the GPU kernels appears to slow down on less performant CPUs, delaying the launch of GPU kernels as they are needed. If you profile our demo code with nsys on a platform with low single-core performance, which is common for server-grade CPUs, you’ll likely see large gaps between kernel executions that nullify the benefits of the optimized kernels. LLM inference engines like TensorRT-LLM address this issue with techniques like CUDA Graphs.
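To make the CUDA Graph point concrete, here is a minimal sketch (not from our codebase; the shapes, layer count, and iteration counts are arbitrary) comparing eager launches of a chain of small fp16 matmuls against replaying the same chain from a captured graph. On a machine with slow single-core launch overhead, the gap between the two timings should be large:

```python
import torch

# Chain of small fp16 matmuls; each eager call pays CPU-side launch overhead,
# which dominates when the kernels themselves finish quickly.
layers = [torch.nn.Linear(4096, 4096, bias=False).half().cuda() for _ in range(8)]
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)

def step(t):
    for layer in layers:
        t = layer(t)
    return t

# Warmup on a side stream, as required before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step(x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole chain; replaying it costs a single CPU-side launch,
# which removes the per-kernel gaps visible in an nsys timeline
# (e.g., `nsys profile -o trace python demo.py`).
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = step(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    step(x)        # eager: one CPU launch per kernel
end.record()
torch.cuda.synchronize()
print("eager  :", start.elapsed_time(end), "ms")

start.record()
for _ in range(100):
    g.replay()     # graphed: one CPU launch per iteration
end.record()
torch.cuda.synchronize()
print("graphed:", start.elapsed_time(end), "ms")
```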
I've tried using the PyTorch CUDA graph API to capture the AnyPrecisionForCausalLM module, but haven't had any success. Would it be possible to release the version of the code used to obtain the benchmark results in the paper, even if it's incomplete?
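For reference, the capture pattern I've been attempting looks roughly like the sketch below, with a stand-in nn.Sequential in place of AnyPrecisionForCausalLM (the real model would also need a preallocated, fixed-shape KV cache, which I suspect is where my attempts break down):

```python
import torch

# Stand-in model used purely for illustration. CUDA graph capture requires
# every tensor shape and address to stay constant across replays, so the real
# model would also need a static, preallocated KV cache.
model = torch.nn.Sequential(
    torch.nn.Embedding(32000, 512),
    torch.nn.Linear(512, 32000, bias=False),
).cuda()

static_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda")  # static input buffer

# Warmup on a side stream is required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_logits = model(static_ids)
torch.cuda.current_stream().wait_stream(s)

# Capture one fixed-shape decode step.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_logits = model(static_ids)

# Per decoded token: write the new token into the static input buffer, replay
# the graph (a single CPU-side launch), and read from the static output.
static_ids.copy_(torch.tensor([[42]], device="cuda"))
g.replay()
next_token = static_logits.argmax(dim=-1)
print(next_token)
```

My understanding is that generate() changes tensor shapes every step as the KV cache grows, which violates the fixed-shape requirement of graph capture, and that engines like TensorRT-LLM work around this by capturing fixed-shape decode steps over preallocated caches — but I haven't been able to replicate that here.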