Add HIP Performance Guidelines #3455
Conversation
docs/reference/performance.rst
Outdated
:ref:`synchronization functions`) within the same kernel invocation. If they
belong to different blocks, they must use global memory with two separate
kernel invocations. The latter should be minimized as it adds overhead.
Hmm, e.g. SYCL guarantees that the block with the lowest global index that is still executing is making progress. This allows some limited use of synchronization for later blocks in the same invocation. However, I don't remember whether that is the case for HIP, and I cannot google it quickly. Should we ask Young about it?
As far as I know, no. But I cannot find it either. It would be a good idea.
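For illustration, a minimal sketch of the distinction under discussion, assuming a HIP toolchain (the kernel names and the block size of 256 are illustrative, not taken from the PR): threads of one block synchronize with `__syncthreads()` inside a single kernel invocation, while blocks can only exchange results through global memory across two separate launches.

```cpp
#include <hip/hip_runtime.h>

// Intra-block: threads of one block stage data in shared memory and
// synchronize with __syncthreads() within a single kernel invocation.
// Assumes blockDim.x == 256.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                              // barrier for this block only
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];  // per-block result to global memory
}

// Inter-block: a second kernel launch consumes the per-block results,
// because blocks of the first launch cannot synchronize with each other.
__global__ void finalSum(const float* partial, float* out, int numBlocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < numBlocks; ++i) sum += partial[i];
        *out = sum;
    }
}
```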
docs/reference/performance.rst
Outdated
and is generally reduced when addresses are more scattered, especially in
global memory.

Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned. Maximizing memory throughput involves coalescing memory
I think a short glossary at the beginning could be very valuable. For example, I'm not sure here whether "device memory" and "global memory" mean the same thing in these sentences or are different concepts. And again, "coalescing" is used without explanation.
A reference for coalescing has been added.
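To make the "coalescing" reference concrete, a sketch under the usual assumption that consecutive threads of a warp should touch consecutive, naturally aligned addresses (kernel and parameter names are illustrative):

```cpp
#include <hip/hip_runtime.h>

// Coalesced: thread i reads in[i], so a warp's loads map onto a few
// naturally aligned 32-, 64- or 128-byte transactions.
__global__ void scaleCoalesced(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Scattered: a large stride spreads the warp's accesses over many
// transactions, so most of each fetched line is wasted bandwidth.
__global__ void scaleStrided(const float* in, float* out, float a,
                             int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = a * in[i];
}
```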
docs/reference/performance.rst
Outdated
An alternative way to synchronize is using streams. Different streams
can execute commands out of order with respect to one another or concurrently.
This allows for more fine-grained control over the execution order of
commands, which can be beneficial in certain scenarios.
Hmm, using streams for intra-block synchronization is definite overkill. I suggest extending this paragraph to explain what level of synchronization streams provide, and adding a link to their description.
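To illustrate the level of synchronization streams actually provide (ordering of whole commands, not of threads within a block), a minimal sketch; the kernel and buffer names, launch sizes, and the use of pinned host memory are illustrative assumptions, not part of the PR:

```cpp
#include <hip/hip_runtime.h>

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Commands issued to the same stream run in issue order; commands in
// different streams may overlap or run out of order with respect to
// each other. For the copies to overlap, hA/hB should be pinned memory.
void runTwoIndependentPipelines(float* dA, float* dB,
                                const float* hA, const float* hB, int n) {
    size_t bytes = n * sizeof(float);
    dim3 block(256), grid((n + 255) / 256);

    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    hipMemcpyAsync(dA, hA, bytes, hipMemcpyHostToDevice, s0);
    hipMemcpyAsync(dB, hB, bytes, hipMemcpyHostToDevice, s1);
    addOne<<<grid, block, 0, s0>>>(dA, n);
    addOne<<<grid, block, 0, s1>>>(dB, n);

    hipStreamSynchronize(s0);   // waits for every command queued on s0
    hipStreamSynchronize(s1);
    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
}
```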
Force-pushed from c4390c2 to 29a7ac3
Most of these sections are very close to the performance guidelines of the CUDA programming guide, sometimes almost quoting it directly. I don't think that's a good practice, especially as some parts don't apply to AMD's GPUs at all, and on top of that the CUDA programming guide does not have a permissive license from what I can tell.
A better place for inspiration might be GPUOpen, which already has some performance guides, e.g. for RDNA: https://gpuopen.com/learn/rdna-performance-guide/
docs/index.md
Outdated
@@ -17,6 +17,7 @@ portable applications for AMD and NVIDIA GPUs from single source code.

:::{grid-item-card} Reference

* {doc}`/reference/performance_guidelines`
I would argue this document fits better in the "understand" section than in "reference".
It has been moved to How-to.
Optimizing memory access: The efficiency of memory access can impact the speed
of arithmetic operations. Coalesced memory access, where threads in a warp
access consecutive memory locations, can improve memory throughput and thus
the speed of arithmetic operations.
Pedantic: arithmetic operations can't be "sped up"; the time during which they can't be scheduled for execution, however, can depend on memory accesses. I would add a reference to the memory optimizations here.
Added
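To make the scheduling point concrete, a sketch of one common way to give the scheduler independent work while loads are in flight; the two-elements-per-thread split and the kernel name are illustrative, and n is assumed even for brevity:

```cpp
#include <hip/hip_runtime.h>

// Each thread issues two independent loads, so the FMA that depends on
// one load can be scheduled while the other load is still in flight.
// The arithmetic itself is not faster; it simply waits less.
__global__ void axpyTwoPerThread(const float* x, float* y, float a, int n) {
    int half = n / 2;                       // assumes n is even
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) {
        float x0 = x[i];                    // both loads are coalesced
        float x1 = x[i + half];
        y[i]        = a * x0 + y[i];
        y[i + half] = a * x1 + y[i + half];
    }
}
```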
Force-pushed from 29a7ac3 to 147036e
BTW, it should be mentioned somewhere that to fully utilize all SIMD lanes/possible threads in the block, the block size should be a multiple of the warp size.
When being pedantic: the block size (the product of all its dimensions, not just x) is what should be a multiple of the warp size.
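A small sketch of the block-size point above, querying the warp (wavefront) size at runtime instead of hard-coding 32 or 64; the factor of four warps per block and the problem size are illustrative choices:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);

    // A block size that is a multiple of the warp size keeps every SIMD
    // lane of every warp in the block busy; a non-multiple leaves lanes
    // of the last, partial warp idle in every block.
    int blockSize = 4 * props.warpSize;     // e.g. 128 (warp 32) or 256 (warp 64)
    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;

    printf("warpSize=%d blockSize=%d gridSize=%d\n",
           props.warpSize, blockSize, gridSize);
    return 0;
}
```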
I am not sure that is the best strategy either, but the concept was accepted as a first version. It does not quote the mentioned document directly, but there is overlap in the content. Personally, I would have appreciated every piece of recommendation, both in format and in content. Nonetheless, we always have the opportunity to improve it and make the documentation better for the satisfaction of the developers.
Most of this PR's changes have been merged in, while the leftover is here: #3483