[QST] Is it possible that a kernel's perf differs between a unit test and a real model? #1888

Open
foreverlms opened this issue Oct 21, 2024 · 4 comments

Comments


foreverlms commented Oct 21, 2024

I encountered a problem when using the int8 GEMM CUTLASS kernel: NVIDIA/TensorRT-LLM#2351

For shape [16, 6144, 4096], I measure 14 us in my unit-test benchmark, but in the real model I measure 25 us.
I compared the template parameters from the two Nsight Systems traces (unit test and real model); they are identical.
Could you please give me some suggestions?
All the computation runs on the same stream, sequentially.

foreverlms (Author) commented:

@thakkarV Hi, could you please help?

foreverlms (Author) commented Oct 22, 2024

OK, I have a clue: if I run the kernel only once in the model, the perf degrades to about half. But if I run the kernel 10 times in the model, the perf of every launch after the first is back to normal. That's weird.

[screenshot attached]

thakkarV (Collaborator) commented:

Hi, this probably has to do with your benchmarking methodology. Make sure you are benchmarking in a tight loop of a few hundred to several thousand iterations, discarding the first 10 or so runs and averaging the times of the rest. A warm L2 cache can also lead to L2 camping, which may or may not be realistic for your workload. It is hard to give exact suggestions here, since this depends entirely on the expected deployment scenario for this kernel.
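For reference, a minimal timing-loop sketch along these lines (launch_int8_gemm is a hypothetical placeholder for the actual CUTLASS GEMM launch, and the iteration counts are only illustrative):

```cpp
// Warm-loop timing sketch: discard warm-up runs, average the rest.
// launch_int8_gemm is a placeholder; substitute the real CUTLASS GEMM call.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}
static void launch_int8_gemm(cudaStream_t stream) { dummy_kernel<<<1, 1, 0, stream>>>(); }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  const int warmup = 10;    // discarded runs
  const int iters  = 1000;  // timed runs

  // Warm-up: discard the first launches.
  for (int i = 0; i < warmup; ++i) {
    launch_int8_gemm(stream);
  }
  cudaStreamSynchronize(stream);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time the tight loop and report the average per-launch time.
  cudaEventRecord(start, stream);
  for (int i = 0; i < iters; ++i) {
    launch_int8_gemm(stream);
  }
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg kernel time: %.2f us\n", 1000.f * ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaStreamDestroy(stream);
  return 0;
}
```

Note that back-to-back launches in a loop like this keep L2 warm, so the measured time is typically more optimistic than a single isolated launch inside a larger model.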

foreverlms (Author) commented:

> Hi, this probably has to do with your benchmarking methodology. Make sure you are benchmarking in a tight loop of a few hundred to several thousand iterations, discarding the first 10 or so runs and averaging the times of the rest. A warm L2 cache can also lead to L2 camping, which may or may not be realistic for your workload. It is hard to give exact suggestions here, since this depends entirely on the expected deployment scenario for this kernel.

Yes, I loop the benchmark exactly as you describe: 20 warm-up iterations, then 300 timed iterations to get the baseline perf. What I find strange is that the perf degrades to half inside the model, but if I launch the kernel 10 times per call, the perf comes back.
Obviously, I can't launch it 10 times… it's an LLM, and this kernel is executed 32×2 times for each generated token. There is also a model warm-up prompt at the very beginning. So that's why it seems weird: it doesn't look like a warm-up or L2 cache eviction issue, does it?
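One way to check whether L2 reuse in the tight benchmark loop explains the gap is to evict L2 between timed iterations in the unit test, for example by writing a scratch buffer larger than L2. This is only a sketch under that assumption; launch_int8_gemm is again a hypothetical placeholder for the real kernel launch:

```cpp
// Cold-L2 timing sketch: evict L2 before each timed launch so the unit test
// sees conditions closer to a single launch inside the model.
// launch_int8_gemm is a placeholder; substitute the real CUTLASS GEMM call.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}
static void launch_int8_gemm(cudaStream_t stream) { dummy_kernel<<<1, 1, 0, stream>>>(); }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int device = 0, l2_bytes = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, device);

  // Scratch buffer comfortably larger than L2; memsetting it evicts cached lines.
  size_t scratch_bytes = 4ull * static_cast<size_t>(l2_bytes);
  void* scratch = nullptr;
  cudaMalloc(&scratch, scratch_bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int iters = 300;
  float total_ms = 0.f;
  for (int i = 0; i < iters; ++i) {
    cudaMemsetAsync(scratch, 0, scratch_bytes, stream);  // flush L2 before each run
    cudaEventRecord(start, stream);
    launch_int8_gemm(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;
  }
  printf("avg cold-L2 kernel time: %.2f us\n", 1000.f * total_ms / iters);

  cudaFree(scratch);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaStreamDestroy(stream);
  return 0;
}
```

If the cold-L2 average lands near the in-model 25 us, the gap is most likely a cache-residency effect rather than anything specific to the kernel itself.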
