Review fixes
MKKnorr committed Nov 28, 2024
1 parent 0cf8414 commit ef46f97
Showing 2 changed files with 23 additions and 21 deletions.
docs/how-to/hip_cpp_language_extensions.rst: 37 changes (20 additions & 17 deletions)
@@ -68,8 +68,9 @@ There are some restrictions on the parameters of kernels. Kernels can't:
Kernels can have variadic template parameters, but only one parameter pack,
which must be the last item in the template parameter list.

-Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels can
-not be called from the device.
+.. note::
+    Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels
+    can not be called from the device.

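To illustrate the parameter-pack rule above, here is a minimal sketch of a
kernel template with a single, trailing parameter pack; the kernel name
``store_sum`` and its body are illustrative only:

.. code-block:: cpp

    #include <hip/hip_runtime.h>

    // One parameter pack, placed last in the template parameter list.
    template <typename T, typename... Args>
    __global__ void store_sum(T* out, Args... args)
    {
        // Fold expression (C++17): sum all pack arguments into *out.
        *out = (T{} + ... + static_cast<T>(args));
    }

    // Example instantiation: store_sum<<<1, 1>>>(d_out, 1.0f, 2.0f, 3.0f);
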
Calling __global__ functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -79,8 +80,8 @@ need an additional configuration, that specifies the grid and block dimensions
(i.e. the amount of threads to be launched), as well as specifying the amount of
shared memory per block and which stream to execute the kernel on.

-Kernels are called using the ``<<<>>>`` syntax known from CUDA, but HIP also
-supports the ``hipLaunchKernelGGL`` macro.
+Kernels are called using the triple chevron ``<<<>>>`` syntax known from CUDA,
+but HIP also supports the ``hipLaunchKernelGGL`` macro.

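As a quick orientation, the following sketch shows the same launch written with
both forms; the kernel ``scale`` and its arguments are illustrative and not part
of the original text:

.. code-block:: cpp

    #include <hip/hip_runtime.h>

    __global__ void scale(float* data, float factor, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    void launch_scale(float* d_data, size_t n)
    {
        dim3 grid((n + 255) / 256);
        dim3 block(256);

        // Triple chevron: <<<grid dim, block dim, shared memory bytes, stream>>>
        scale<<<grid, block, 0, 0>>>(d_data, 2.0f, n);

        // The same launch expressed with the macro.
        hipLaunchKernelGGL(scale, grid, block, 0, 0, d_data, 2.0f, n);
    }
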
When using ``hipLaunchKernelGGL``, the first five configuration parameters must
be:
@@ -167,8 +168,8 @@ launched with more threads than ``MAX_THREADS_PER_BLOCK``.

If no ``__launch_bounds__`` are specified, ``MAX_THREADS_PER_BLOCK`` is
the maximum block size supported by the device (see
-:doc:`reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK`` allows
-the compiler to use more resources per thread than an unconstrained
+:doc:`../reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK``
+allows the compiler to use more resources per thread than an unconstrained
compilation. This might however reduce the amount of blocks that can run
concurrently on a CU, thereby reducing occupancy and trading thread-level
parallelism for instruction-level parallelism.
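
A short sketch of the qualifier in use; the bound of 256 threads and the kernel
itself are illustrative:

.. code-block:: cpp

    // Promise to the compiler: this kernel is never launched with more than
    // 256 threads per block, so more resources may be assigned per thread.
    __global__ void __launch_bounds__(256) bounded_copy(const float* in, float* out, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }
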
@@ -187,9 +188,9 @@ When launching kernels HIP will validate the launch configuration to make sure
the requested block size is not larger than ``MAX_THREADS_PER_BLOCK`` and
return an error if it is exceeded.

-If :doc:`AMD_LOG_LEVEL <how-to/logging>` is set, detailed information will be
-shown in the error log message, including the launch configuration of the
-kernel and the specified ``__launch_bounds__``.
+If :doc:`AMD_LOG_LEVEL <./logging>` is set, detailed information will be shown
+in the error log message, including the launch configuration of the kernel and
+the specified ``__launch_bounds__``.

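A host-code fragment sketching how an exceeded block size can be detected at
runtime, reusing the illustrative ``bounded_copy`` kernel from above; the device
pointers are assumed to exist and the exact error code is not specified here:

.. code-block:: cpp

    // Request 1024 threads per block although the kernel is bounded to 256.
    bounded_copy<<<dim3(1), dim3(1024)>>>(d_in, d_out, n);

    hipError_t err = hipGetLastError();
    if (err != hipSuccess)
    {
        printf("Kernel launch failed: %s\n", hipGetErrorString(err));
    }
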
MIN_WARPS_PER_EXECUTION_UNIT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -210,11 +211,11 @@ The available registers per Compute Unit are listed in
:doc:`rocm:reference/gpu-arch-specs`. Beware that these values are per Compute
Unit, not per Execution Unit. On AMD GPUs a Compute Unit consists of 4 Execution
Units, also known as SIMDs, each with their own register file. For more
-information see :doc:`understand/hardware_implementation`.
+information see :doc:`../understand/hardware_implementation`.
:cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``.

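The field named above can be read through the usual device-property query; a
minimal sketch:

.. code-block:: cpp

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main()
    {
        hipDeviceProp_t props;
        if (hipGetDeviceProperties(&props, 0) != hipSuccess)
            return 1;

        // Execution units (SIMDs) per compute unit, and the number of compute units.
        printf("Execution units per CU: %d\n", props.executionUnitsPerMultiprocessor);
        printf("Compute units:          %d\n", props.multiProcessorCount);
        return 0;
    }
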
Porting from CUDA __launch_bounds
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

CUDA defines the ``__launch_bounds`` qualifier which works similar to
``__launch_bounds__``:
@@ -223,10 +224,12 @@ CUDA defines the ``__launch_bounds`` qualifier which works similar to
__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
-The first parameter is the same as HIP's implementation. The second parameter of
-``__launch_bounds`` must be converted to the format used ``__launch_bounds__``,
-which uses warps and execution units rather than blocks and multiprocessors.
-This conversion is performed automatically by :doc:`HIPIFY <hipify:index>`.
+The first parameter is the same as HIP's implementation, but
+``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
+``MIN_WARPS_PER_EXECUTION_UNIT``, which uses warps and execution units rather
+than blocks and multiprocessors. This conversion is performed automatically by
+:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
+equation.

.. code-block:: cpp
@@ -520,8 +523,8 @@ the following workaround:
#. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled.
#. Modify kernels that use ``__threadfence_system()`` as follows:

-* Ensure the kernel operates only on fine-grained system memory, which should be allocated with
-  ``hipHostMalloc()``.
+* Ensure the kernel operates only on fine-grained system memory, which should be
+  allocated with ``hipHostMalloc()``.
* Remove ``memcpy`` for all allocated fine-grained system memory regions.

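A sketch of the allocation mentioned in the first bullet above: one way to
request fine-grained (coherent) host memory is the ``hipHostMallocCoherent``
flag; that flag choice is an assumption here, not something stated in the change,
and the buffer size is illustrative:

.. code-block:: cpp

    void* host_buf = nullptr;
    size_t bytes = 1024 * sizeof(float);

    // Assumption: request coherent (fine-grained) host memory explicitly.
    hipError_t err = hipHostMalloc(&host_buf, bytes, hipHostMallocCoherent);
    if (err != hipSuccess)
    {
        // handle the allocation failure
    }
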
.. _synchronization_functions:
docs/how-to/kernel_language_cpp_support.rst: 7 changes (3 additions & 4 deletions)
@@ -44,10 +44,9 @@ Assertions
--------------------------------------------------------------------------------

The ``assert`` function is supported in device code. Assertions are used for
-debugging purposes. When the input expression equals to zero, the execution will
-be stopped.
-HIP provides its own implementation for ``assert`` for usage in device code in
-``hip/hip_runtime.h``
+debugging purposes. When the input expression equals zero, the execution will be
+stopped. HIP provides its own implementation for ``assert`` for usage in device
+code in ``hip/hip_runtime.h``.

.. code-block:: cpp
