Add HIP Performance Guidelines #3455
Conversation
docs/reference/performance.rst
Outdated
:ref:`synchronization functions`) within the same kernel invocation. If they
belong to different blocks, they must use global memory with two separate
kernel invocations. The latter should be minimized as it adds overhead.
Hmm, e.g. SYCL guarantees that the block with the lowest global index that is still executing is making progress. This allows some limited use of synchronization for later blocks in the same invocation. However, I don't remember whether that is the case for HIP, and I cannot google it quickly. Should we ask Young about it?
As far as I know, no. But I cannot find it either. It would be a good idea.
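For illustration, a minimal sketch of the distinction under discussion, assuming a HIP toolchain (the kernel names and the block size of 256 are illustrative, not taken from the PR): threads of one block synchronize with `__syncthreads()` inside a single kernel invocation, while blocks can only exchange results through global memory across two separate launches.

```cpp
#include <hip/hip_runtime.h>

// Intra-block: threads of one block stage data in shared memory and
// synchronize with __syncthreads() within a single kernel invocation.
// Assumes blockDim.x == 256.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                              // barrier for this block only
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];  // per-block result to global memory
}

// Inter-block: a second kernel launch consumes the per-block results,
// because blocks of the first launch cannot synchronize with each other.
__global__ void finalSum(const float* partial, float* out, int numBlocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < numBlocks; ++i) sum += partial[i];
        *out = sum;
    }
}
```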
docs/reference/performance.rst
Outdated
and is generally reduced when addresses are more scattered, especially in
global memory.

Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned. Maximizing memory throughput involves coalescing memory
I think a short glossary at the beginning could be very valuable. For example, I'm not sure here whether "device memory" and "global memory" mean the same thing in these sentences or are different concepts. And again, "coalescing" is used without explanation.
A reference for coalescing has been added.
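To make the "coalescing" reference concrete, a sketch under the usual assumption that consecutive threads of a warp should touch consecutive, naturally aligned addresses (kernel and parameter names are illustrative):

```cpp
#include <hip/hip_runtime.h>

// Coalesced: thread i reads in[i], so a warp's loads map onto a few
// naturally aligned 32-, 64- or 128-byte transactions.
__global__ void scaleCoalesced(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Scattered: a large stride spreads the warp's accesses over many
// transactions, so most of each fetched line is wasted bandwidth.
__global__ void scaleStrided(const float* in, float* out, float a,
                             int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = a * in[i];
}
```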
docs/reference/performance.rst
Outdated
An alternative way to synchronize is using streams. Different streams
can execute commands out of order with respect to one another or concurrently.
This allows for more fine-grained control over the execution order of
commands, which can be beneficial in certain scenarios.
Hmm, using streams for intra-block synchronization is definite overkill. I suggest extending this paragraph to explain what level of synchronization streams provide, and adding a link to their description.
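To illustrate the level of synchronization streams actually provide (ordering of whole commands, not of threads within a block), a minimal sketch; the kernel and buffer names, launch sizes, and the use of pinned host memory are illustrative assumptions, not part of the PR:

```cpp
#include <hip/hip_runtime.h>

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Commands issued to the same stream run in issue order; commands in
// different streams may overlap or run out of order with respect to
// each other. For the copies to overlap, hA/hB should be pinned memory.
void runTwoIndependentPipelines(float* dA, float* dB,
                                const float* hA, const float* hB, int n) {
    size_t bytes = n * sizeof(float);
    dim3 block(256), grid((n + 255) / 256);

    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    hipMemcpyAsync(dA, hA, bytes, hipMemcpyHostToDevice, s0);
    hipMemcpyAsync(dB, hB, bytes, hipMemcpyHostToDevice, s1);
    addOne<<<grid, block, 0, s0>>>(dA, n);
    addOne<<<grid, block, 0, s1>>>(dB, n);

    hipStreamSynchronize(s0);   // waits for every command queued on s0
    hipStreamSynchronize(s1);
    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
}
```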
Force-pushed from c4390c2 to 29a7ac3
Most of these sections are very close to the performance guidelines of the CUDA programming guide, sometimes almost quoting it directly. I don't think that's a good practice, especially as some parts don't apply to AMD's GPUs at all, and on top of that the CUDA programming guide does not have a permissive license from what I can tell.
A better place for inspiration might be GPUOpen, which already has some performance guides, e.g. for RDNA: https://gpuopen.com/learn/rdna-performance-guide/
docs/index.md
Outdated
@@ -17,6 +17,7 @@ portable applications for AMD and NVIDIA GPUs from single source code.

:::{grid-item-card} Reference

* {doc}`/reference/performance_guidelines`
I would argue this document fits better in the "understand" section than in "reference".
It has been moved to How-to.
Optimizing memory access: The efficiency of memory access can impact the speed
of arithmetic operations. Coalesced memory access, where threads in a warp
access consecutive memory locations, can improve memory throughput and thus
the speed of arithmetic operations.
Pedantic: arithmetic operations can't be "sped up"; the time during which they can't be scheduled for execution, however, can depend on memory accesses. I would add a reference to the memory optimizations here.
Added
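To make the scheduling point concrete, a sketch of one common way to give the scheduler independent work while loads are in flight; the two-elements-per-thread split and the kernel name are illustrative, and n is assumed even for brevity:

```cpp
#include <hip/hip_runtime.h>

// Each thread issues two independent loads, so the FMA that depends on
// one load can be scheduled while the other load is still in flight.
// The arithmetic itself is not faster; it simply waits less.
__global__ void axpyTwoPerThread(const float* x, float* y, float a, int n) {
    int half = n / 2;                       // assumes n is even
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) {
        float x0 = x[i];                    // both loads are coalesced
        float x1 = x[i + half];
        y[i]        = a * x0 + y[i];
        y[i + half] = a * x1 + y[i + half];
    }
}
```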
Force-pushed from 29a7ac3 to 147036e
BTW, it should be mentioned somewhere that to fully utilize all SIMD lanes/possible threads in the block, the block size should be a multiple of the warp size.
When being pedantic: the block size (the product of all its dimensions, not just x) is what should be a multiple of the warp size.
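A small sketch of the block-size point above, querying the warp (wavefront) size at runtime instead of hard-coding 32 or 64; the factor of four warps per block and the problem size are illustrative choices:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);

    // A block size that is a multiple of the warp size keeps every SIMD
    // lane of every warp in the block busy; a non-multiple leaves lanes
    // of the last, partial warp idle in every block.
    int blockSize = 4 * props.warpSize;     // e.g. 128 (warp 32) or 256 (warp 64)
    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;

    printf("warpSize=%d blockSize=%d gridSize=%d\n",
           props.warpSize, blockSize, gridSize);
    return 0;
}
```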
I am not sure that is the best strategy either, but the concept was accepted as a first version. It does not quote the mentioned document directly, but there is overlap in the content. Personally, I would have appreciated every piece of recommendation, both in format and in content. Nonetheless, we always have the opportunity to improve it and make the documentation better for the satisfaction of the developers.
Most of this PR's changes have been merged in, while the leftover is here: #3483