[QST] Is it possible that a kernel's perf differs between a unit test and a real model? #1888

Open
foreverlms opened this issue Oct 21, 2024 · 4 comments

Comments


foreverlms commented Oct 21, 2024

I encountered a problem when using the int8 GEMM CUTLASS kernel: NVIDIA/TensorRT-LLM#2351

For shape [16, 6144, 4096], I measure 14 us in my unit-test benchmark, but in the real model I measure 25 us.
I compared the template parameters from the two Nsight Systems traces (unit test and real model); they are identical.
Could you please give me some suggestions?
All the computation runs on the same stream, sequentially.

foreverlms (Author) commented:

@thakkarV Hi, could you please help?

foreverlms (Author) commented Oct 22, 2024

OK, I have a clue: if I run the kernel only once in the model, the perf degrades to about half. But if I run the kernel 10 times in the model, the perf of every launch after the first is back to normal. That's weird.

[screenshot attached]

thakkarV (Collaborator) commented:

Hi, this probably has to do with your benchmarking methodology. Make sure you are benchmarking in a tight loop of a few hundred to several thousand iterations, discarding the first 10 or so runs and averaging the times of the rest. A warm L2 cache can also lead to L2 camping, which may or may not be realistic for your workload. It is hard to give exact suggestions here, since this depends entirely on the expected deployment scenario for this kernel.
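For reference, a minimal timing-loop sketch along these lines (launch_int8_gemm is a hypothetical placeholder for the actual CUTLASS GEMM launch, and the iteration counts are only illustrative):

```cpp
// Warm-loop timing sketch: discard warm-up runs, average the rest.
// launch_int8_gemm is a placeholder; substitute the real CUTLASS GEMM call.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}
static void launch_int8_gemm(cudaStream_t stream) { dummy_kernel<<<1, 1, 0, stream>>>(); }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  const int warmup = 10;    // discarded runs
  const int iters  = 1000;  // timed runs

  // Warm-up: discard the first launches.
  for (int i = 0; i < warmup; ++i) {
    launch_int8_gemm(stream);
  }
  cudaStreamSynchronize(stream);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time the tight loop and report the average per-launch time.
  cudaEventRecord(start, stream);
  for (int i = 0; i < iters; ++i) {
    launch_int8_gemm(stream);
  }
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg kernel time: %.2f us\n", 1000.f * ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaStreamDestroy(stream);
  return 0;
}
```

Note that back-to-back launches in a loop like this keep L2 warm, so the measured time is typically more optimistic than a single isolated launch inside a larger model.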

foreverlms (Author) commented:

> Hi, this probably has to do with your benchmarking methodology. Make sure you are benchmarking in a tight loop of a few hundred to several thousand iterations, discarding the first 10 or so runs and averaging the times of the rest. A warm L2 cache can also lead to L2 camping, which may or may not be realistic for your workload. It is hard to give exact suggestions here, since this depends entirely on the expected deployment scenario for this kernel.

Yes, I loop the benchmark exactly as you describe: 20 warm-up iterations, then 300 timed iterations to get the baseline perf. What I find strange is that the perf degrades to half inside the model, but if I launch the kernel 10 times per call, the perf comes back.
Obviously, I can't launch it 10 times… it's an LLM, and this kernel is executed 32×2 times for each generated token. There is also a model warm-up prompt at the very beginning. So that's why it seems weird: it doesn't look like a warm-up or L2 cache eviction issue, does it?
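One way to check whether L2 reuse in the tight benchmark loop explains the gap is to evict L2 between timed iterations in the unit test, for example by writing a scratch buffer larger than L2. This is only a sketch under that assumption; launch_int8_gemm is again a hypothetical placeholder for the real kernel launch:

```cpp
// Cold-L2 timing sketch: evict L2 before each timed launch so the unit test
// sees conditions closer to a single launch inside the model.
// launch_int8_gemm is a placeholder; substitute the real CUTLASS GEMM call.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() {}
static void launch_int8_gemm(cudaStream_t stream) { dummy_kernel<<<1, 1, 0, stream>>>(); }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int device = 0, l2_bytes = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, device);

  // Scratch buffer comfortably larger than L2; memsetting it evicts cached lines.
  size_t scratch_bytes = 4ull * static_cast<size_t>(l2_bytes);
  void* scratch = nullptr;
  cudaMalloc(&scratch, scratch_bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int iters = 300;
  float total_ms = 0.f;
  for (int i = 0; i < iters; ++i) {
    cudaMemsetAsync(scratch, 0, scratch_bytes, stream);  // flush L2 before each run
    cudaEventRecord(start, stream);
    launch_int8_gemm(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;
  }
  printf("avg cold-L2 kernel time: %.2f us\n", 1000.f * total_ms / iters);

  cudaFree(scratch);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaStreamDestroy(stream);
  return 0;
}
```

If the cold-L2 average lands near the in-model 25 us, the gap is most likely a cache-residency effect rather than anything specific to the kernel itself.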
