
No real speedup from any-precision-llm kernels #7

Open
pgimenes opened this issue Sep 20, 2024 · 2 comments

pgimenes commented Sep 20, 2024

Hello,

Similar to #3, I've tried reproducing the demo.py benchmark on an H100 and an A6000, and I'm also seeing no speedup at lower precisions on either platform.

It was mentioned that this is due to the software stack bottlenecking the low-precision kernels, and that real speedup would only be seen through integration with a serving engine.

Could you explain further which CPU processes are bottlenecking the performance, and how TensorRT-LLM overcomes this issue? @ilil96

ilil96 (Collaborator) commented Sep 21, 2024

I’m not entirely sure where the overhead occurs, but it’s clear that single-core CPU performance is important for our demo. It seems that the CPU software stack wrapping around the GPU kernels may slow down on less performant CPUs, causing delays in launching the GPU kernels as needed. If you profile the execution of our demo code using nsys on a platform with low single-core performance, which is a common case for server-grade CPUs, you’ll likely see large gaps between kernel executions, which nullify the benefits of optimized kernels. LLM inference engines like TensorRT-LLM address this issue with techniques like CUDA Graph.
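
For example, one way to make those gaps visible is to wrap each decode step in an NVTX range and run the demo under nsys (`nsys profile -o demo_profile python profile_demo.py`). The sketch below uses a generic HF-style forward that returns `.logits`; it is a placeholder for illustration, not our exact demo code:

```python
import torch

@torch.inference_mode()
def profile_decode(model, input_ids, num_new_tokens=32):
    # Annotate each greedy decode step with an NVTX range so gaps between
    # GPU kernels (CPU launch overhead) stand out in the nsys timeline.
    tokens = input_ids
    for step in range(num_new_tokens):
        torch.cuda.nvtx.range_push(f"decode_step_{step}")
        logits = model(tokens).logits              # placeholder HF-style forward
        next_token = logits[:, -1:].argmax(dim=-1)  # greedy next-token choice
        tokens = torch.cat([tokens, next_token], dim=1)
        torch.cuda.nvtx.range_pop()
    torch.cuda.synchronize()                        # wait for all queued kernels
    return tokens
```

If the per-step NVTX ranges are much longer than the kernel time inside them, the bottleneck is on the CPU side rather than in the kernels themselves.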

pgimenes (Author) commented

> I’m not entirely sure where the overhead occurs, but it’s clear that single-core CPU performance is important for our demo. It seems that the CPU software stack wrapping around the GPU kernels may slow down on less performant CPUs, causing delays in launching the GPU kernels as needed. If you profile the execution of our demo code using nsys on a platform with low single-core performance, which is a common case for server-grade CPUs, you’ll likely see large gaps between kernel executions, which nullify the benefits of optimized kernels. LLM inference engines like TensorRT-LLM address this issue with techniques like CUDA Graph.

I've tried using the PyTorch CUDA Graphs API to capture the AnyPrecisionForCausalLM module, but haven't had any success. Would it be possible to release the version of the code used to obtain the benchmarking results in the paper, even if it's incomplete?
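
In case it helps, this is roughly the capture pattern I've been attempting, shown with a toy stand-in model rather than AnyPrecisionForCausalLM (untested against this repo). The usual failure points are dynamic shapes, CPU-side control flow, and memory allocation inside the captured region:

```python
import torch

# Stand-in for the model's single-token forward pass; in practice this would
# wrap the quantized model with a pre-allocated, shape-static KV cache.
embed = torch.nn.Embedding(32000, 4096).cuda()
proj = torch.nn.Linear(4096, 32000).cuda()

def decode_step(input_ids):
    return proj(embed(input_ids))  # shape-static: (1, 1) -> (1, 1, vocab)

static_input = torch.zeros(1, 1, dtype=torch.long, device="cuda")

# Warm up on a side stream so lazy initialization isn't recorded into the graph.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        static_output = decode_step(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture a single decode step; subsequent tokens replay the same graph,
# so per-kernel CPU launch overhead is only paid once at capture time.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input)

def run_step(next_token):
    static_input.copy_(next_token)  # update the static input buffer in place
    graph.replay()                  # re-launch all captured kernels at once
    return static_output
```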
