SWDEV-354898 - HIP documents patch for 5.4 release.
Change-Id: I52a20f69775ad06672321fdaa114dbee815a9838
jujiang-del committed Nov 4, 2022
1 parent b6ec0c8 commit 35d0c73
Showing 4 changed files with 130 additions and 128 deletions.
2 changes: 1 addition & 1 deletion docs/markdown/hip_debugging.md
@@ -262,7 +262,7 @@ The following is the summary of the most useful environment variables in HIP.
| AMD_SERIALIZE_COPY <br><sub> Serialize copies. </sub> | 0 | 1: Wait for completion before enqueue. <br> 2: Wait for completion after enqueue. <br> 3: Both. |
| HIP_HOST_COHERENT <br><sub> Coherent memory in hipHostMalloc. </sub> | 0 | 0: memory is not coherent between host and GPU. <br> 1: memory is coherent with host. |
| AMD_DIRECT_DISPATCH <br><sub> Enable direct kernel dispatch. </sub> | 1 | 0: Disable. <br> 1: Enable. |
| GPU_MAX_HW_QUEUES <br><sub> The maximum number of hardware queues allocated per device. </sub> | 4 | Controls how many independent hardware queues the HIP runtime can create per process, per device. If an application allocates more HIP streams than this number, the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner. Note that this maximum does not apply to hardware queues created for CU-masked HIP streams, or to the cooperative queue for HIP Cooperative Groups (there is only a single such queue per device). |

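The round-robin reuse described for GPU_MAX_HW_QUEUES can be pictured with a small host-side model. This is only a sketch: `queue_for_stream` is a hypothetical helper, and it assumes streams are assigned to queues strictly in creation order, which the table does not spell out. It is not the HIP runtime's code.

```cpp
#include <cassert>

// Hypothetical model (not a HIP API): returns the index of the hardware
// queue that would serve the stream_index-th stream, assuming streams are
// mapped to queues in creation order and wrap around in round-robin fashion.
int queue_for_stream(int stream_index, int max_hw_queues) {
    assert(stream_index >= 0);
    assert(max_hw_queues > 0);
    return stream_index % max_hw_queues;
}
```

Under this model, with the default value of 4, the first four streams each get their own hardware queue, and a fifth stream shares queue 0 with the first one.
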
## General Debugging Tips
- 'gdb --args' can be used to conveniently pass the executable and arguments to gdb.
16 changes: 14 additions & 2 deletions docs/markdown/hip_kernel_language.md
@@ -455,9 +455,9 @@ Following is the list of supported integer intrinsics. Note that intrinsics are
| unsigned int __popcll ( unsigned long long int x )<br><sub>Count the number of bits that are set to 1 in a 64 bit integer.</sub> |
| int __mul24 ( int x, int y )<br><sub>Multiply two 24bit integers.</sub> |
| unsigned int __umul24 ( unsigned int x, unsigned int y )<br><sub>Multiply two 24bit unsigned integers.</sub> |
<sub><b id="f3"><sup>[1]</sup></b>
The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format.
For the cases where this overhead is not acceptable and the programmer is willing to specialize for the platform,
HIP-Clang provides __lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input).
The index returned by __lastbit_ instructions starts at -1, while for ffs the index starts at 0.

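The relationship between the two index conventions in the footnote can be illustrated on the host. The sketch below is a plain C++ model, not the device intrinsics: `model_lastbit_u32` and `model_ffs_u32` are hypothetical names, and the model assumes the footnote means that the `__lastbit_` result is -1 when no bit is set, where `__ffs` returns 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the __lastbit_u32_u32 convention:
// zero-based index of the least significant set bit, or -1 if x is 0.
int model_lastbit_u32(std::uint32_t x) {
    if (x == 0) {
        return -1;
    }
    int index = 0;
    while ((x & 1u) == 0u) {
        x >>= 1;
        ++index;
    }
    return index;
}

// Hypothetical host-side model of the __ffs convention: the same index
// plus the constant +1, so that 0 means "no bit set".
int model_ffs_u32(std::uint32_t x) {
    int result = model_lastbit_u32(x) + 1;
    assert(result >= 0 && result <= 32);
    return result;
}
```

In this model the two results always differ by exactly one, which is the constant addition the footnote refers to.
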
@@ -496,6 +496,18 @@ long long int clock64()
```
Returns the value of a counter that is incremented every clock cycle on the device. The difference between two returned values gives the number of cycles elapsed.

```
long long int wall_clock64()
```
Returns the wall clock count on the device. The counter runs at a constant frequency, which HIP application code can query through the HIP API using the device attribute hipDeviceAttributeWallClockRate, for example:
```
int wallClkRate = 0; //in kilohertz
HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
```
Here, hipDeviceAttributeWallClockRate is a device attribute, so the wall clock frequency is specific to each device.

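Since the rate is reported in kilohertz, it is also the number of wall clock ticks per millisecond, so two wall_clock64() readings can be turned into elapsed time with a single division. The helper below is a minimal host-side sketch. `wall_ticks_to_ms` is a hypothetical name, not part of HIP, and it assumes both readings were taken on the same device the rate was queried for.

```cpp
#include <cassert>

// Hypothetical helper (not a HIP API): converts the difference between two
// wall_clock64() readings into milliseconds. wall_clk_rate_khz is the value
// returned for hipDeviceAttributeWallClockRate, in kilohertz, which equals
// the number of ticks per millisecond.
double wall_ticks_to_ms(long long start_ticks, long long end_ticks,
                        int wall_clk_rate_khz) {
    assert(wall_clk_rate_khz > 0);
    assert(end_ticks >= start_ticks);
    return static_cast<double>(end_ticks - start_ticks) /
           static_cast<double>(wall_clk_rate_khz);
}
```

For example, with a rate of 100000 kHz, a difference of 250000 ticks corresponds to 2.5 ms.
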

## Atomic Functions

Atomic functions execute as read-modify-write operations on values residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
3 changes: 0 additions & 3 deletions docs/markdown/hip_programming_guide.md
@@ -102,9 +102,6 @@ A stronger system-level fence can be specified when the event is created with hi
- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded.  This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing.  Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
- hipEventDisableTiming: Events created with this flag do not record profiling data and provide the best performance when used for synchronization.

Note: for HIP events used in kernel dispatch through hipExtLaunchKernelGGL/hipExtLaunchKernel, the events passed to the API are not explicitly recorded and should only be used to get the elapsed time for that specific launch.
If events are used across multiple dispatches, for example start and stop events from different hipExtLaunchKernelGGL/hipExtLaunchKernel calls, they are treated as invalid unrecorded events, and hipEventElapsedTime returns the error "hipErrorInvalidHandle".

### Summary and Recommendations:

- Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently.