Extremely low running time when profiling transpiled muGraphs #97

wmdi · 2024-10-04T20:26:18Z

When profiling transpiled muGraphs, some of the reported running times are extremely low, close to the kernel launch time. For example, in the gated_mlp example, some muGraphs take only ~0.004 ms on the catalyst cluster. This suggests that the generated CUDA programs may be hitting kernel launch errors.
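One way to confirm whether a suspiciously fast measurement comes from a failed launch is to check the CUDA error state right after each launch, before trusting the timing. The sketch below is only an illustration: `check_cuda` and `last_launch_ok` are hypothetical helper names, not part of the transpiler or its generated code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: logs and returns false when `err` is not cudaSuccess.
bool check_cuda(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Hypothetical helper: call right after `kernel<<<...>>>(...)`.
bool last_launch_ok(const char *kernel_name) {
  // Launch-time failures (invalid configuration, too much shared memory, ...).
  if (!check_cuda(cudaGetLastError(), kernel_name)) return false;
  // Failures that only surface once the kernel actually runs.
  return check_cuda(cudaDeviceSynchronize(), kernel_name);
}
```

If `last_launch_ok` returns false for a muGraph, its measured time reflects only launch overhead and should be discarded rather than reported.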

jiazhihao · 2024-10-15T14:31:43Z

I suspect this is because some of the generated CUDA kernels fail to execute on A5000 GPUs. One possibility is that the required shared memory (smem) size exceeds the hardware limit. @xinhaoc Can you take a look at the gated_mlp issue on A5000?
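The shared memory hypothesis can be tested by comparing each kernel's dynamic smem request against the device's opt-in per-block limit. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, static shared memory is ignored, and it is not known whether the generated code already performs this opt-in. Note that a kernel requesting more than 48 KB of dynamic shared memory must opt in through `cudaFuncSetAttribute`, otherwise the launch fails.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does a dynamic shared-memory request fit under a limit?
bool smem_request_fits(size_t requested_bytes, size_t limit_bytes) {
  return requested_bytes <= limit_bytes;
}

// Hypothetical helper: queries the opt-in per-block shared-memory limit.
// Returns false if the query itself fails (for example, no GPU is present).
bool query_smem_optin_limit(int device, size_t *limit_bytes) {
  int value = 0;
  cudaError_t err = cudaDeviceGetAttribute(
      &value, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "attribute query: %s\n", cudaGetErrorString(err));
    return false;
  }
  *limit_bytes = static_cast<size_t>(value);
  return true;
}

// Hypothetical helper: validates the request, then opts the kernel in.
// Static shared memory used by the kernel is not accounted for here.
template <typename Kernel>
bool prepare_dynamic_smem(Kernel kernel, size_t requested_bytes, int device) {
  size_t limit = 0;
  if (!query_smem_optin_limit(device, &limit)) return false;
  if (!smem_request_fits(requested_bytes, limit)) {
    std::fprintf(stderr, "smem request %zu B exceeds limit %zu B\n",
                 requested_bytes, limit);
    return false;
  }
  cudaError_t err = cudaFuncSetAttribute(
      kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
      static_cast<int>(requested_bytes));
  return err == cudaSuccess;
}
```

Calling `prepare_dynamic_smem(my_kernel, smem_bytes, 0)` before the launch (with `my_kernel` standing in for a generated kernel) would report an oversized request explicitly instead of letting it show up as an implausibly short running time.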

jiazhihao assigned xinhaoc Oct 15, 2024

jiazhihao added the bug Something isn't working label Oct 15, 2024
