For shape [16,6144,4096], I got a perf of 14us in my unit-test benchmark, but in real models I got 25us.
I compared the template parameters from the two nsys profiles (unit test and real model); they are identical.
Could you please give me some suggestions?
All the computation is on the same stream, running sequentially.
OK, I got a clue: if I run the kernel once in the model, the perf degrades to half. But if I run the kernel 10 times in the model, every launch after the first is back to normal. That's weird.
Hi, this probably has to do with your benchmarking methodology. Make sure you are benchmarking in a tight loop 100s to 10000s of times, discarding the first 10 or so runs, and averaging the time of the rest. A warm L2 cache can also lead to L2 camping, which may or may not be realistic for your workload. It's hard to give exact suggestions here since this totally depends on the expected deployment scenario for this kernel.
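For concreteness, a minimal sketch of that kind of timing loop using CUDA events (`launch_int8_gemm` is a hypothetical stand-in for the kernel under test; note that timing back-to-back launches like this measures the warm-L2 case):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

void launch_int8_gemm(cudaStream_t stream);  // hypothetical wrapper around the kernel under test

void benchmark(cudaStream_t stream) {
    const int warmup = 10;
    const int iters  = 1000;

    // Discard the first runs: they absorb one-time costs
    // (lazy module loading, clock ramp-up, cold caches).
    for (int i = 0; i < warmup; ++i)
        launch_int8_gemm(stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a tight loop and report the average per launch,
    // amortizing launch overhead across all iterations.
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i)
        launch_int8_gemm(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.2f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```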
Yes, I loop the benchmark exactly as you said: warm up 20 times, then run 300 times to get the baseline perf. What I find weird is that the perf degrades to half in the model, but if I launch the kernel 10 times per call, the perf comes back.
Obviously I can't launch it 10 times... it's an LLM, and this kernel is executed 32x2 times for each token generated. And there is indeed a model warm-up prompt at the very beginning. So that's why I said it's weird; this doesn't seem to be a warm-up or L2 cache eviction issue?
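One way to check the L2 hypothesis directly is to flush L2 between timed launches, so each run sees a cold cache, closer to what the kernel sees inside the full model. A minimal sketch (the 256 MB scratch size is an assumption; it just needs to exceed the GPU's L2 capacity):

```cpp
#include <cuda_runtime.h>

// Overwriting a scratch buffer larger than L2 evicts any cached GEMM operands,
// so the next timed launch starts from a cold cache.
void flush_l2(void* scratch, size_t bytes, cudaStream_t stream) {
    cudaMemsetAsync(scratch, 0, bytes, stream);
}

// Usage sketch: allocate the scratch once (e.g. 256 MB), call flush_l2()
// before each timed launch, and time launches individually with cudaEvents.
```

If the per-launch time with a flushed L2 lands near the 25us seen in the model, the gap is explained by cache residency rather than the kernel itself.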
I encountered a problem when using the int8 GEMM CUTLASS kernel: NVIDIA/TensorRT-LLM#2351