Review fixes

ROCm · Nov 28, 2024 · 00a1e34 · 00a1e34
1 parent 0cf8414
commit 00a1e34
Show file tree

Hide file tree

Showing 2 changed files with 23 additions and 22 deletions.
diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst
@@ -68,8 +68,9 @@ There are some restrictions on the parameters of kernels. Kernels can't:
 Kernels can have variadic template parameters, but only one parameter pack,
 which must be the last item in the template parameter list.
 
-Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels can
-not be called from the device.
+.. note::
+    Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels
+    can not be called from the device.
 
 Calling __global__ functions
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -79,8 +80,8 @@ need an additional configuration, that specifies the grid and block dimensions
 (i.e. the amount of threads to be launched), as well as specifying the amount of
 shared memory per block and which stream to execute the kernel on.
 
-Kernels are called using the ``<<<>>>`` syntax known from CUDA, but HIP also
-supports the ``hipLaunchKernelGGL`` macro.
+Kernels are called using the triple chevron ``<<<>>>`` syntax known from CUDA,
+but HIP also supports the ``hipLaunchKernelGGL`` macro.
 
 When using ``hipLaunchKernelGGL``, the first five configuration parameters must
 be:
@@ -167,8 +168,8 @@ launched with more threads than ``MAX_THREADS_PER_BLOCK``.
 
 If no ``__launch_bounds__`` are specified, ``MAX_THREADS_PER_BLOCK`` is
 the maximum block size supported by the device (see
-:doc:`reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK`` allows
-the compiler to use more resources per thread than an unconstrained
+:doc:`../reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK``
+allows the compiler to use more resources per thread than an unconstrained
 compilation. This might however reduce the amount of blocks that can run
 concurrently on a CU, thereby reducing occupancy and trading thread-level
 parallelism for instruction-level parallelism.
@@ -187,9 +188,9 @@ When launching kernels HIP will validate the launch configuration to make sure
 the requested block size is not larger than ``MAX_THREADS_PER_BLOCK`` and
 return an error if it is exceeded.
 
-If :doc:`AMD_LOG_LEVEL <how-to/logging>` is set, detailed information will be
-shown in the error log message, including the launch configuration of the
-kernel and the specified ``__launch_bounds__``.
+If :doc:`AMD_LOG_LEVEL <./logging>` is set, detailed information will be shown
+in the error log message, including the launch configuration of the kernel and
+the specified ``__launch_bounds__``.
 
 MIN_WARPS_PER_EXECUTION_UNIT
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -210,11 +211,11 @@ The available registers per Compute Unit are listed in
 :doc:`rocm:reference/gpu-arch-specs`. Beware that these values are per Compute
 Unit, not per Execution Unit. On AMD GPUs a Compute Unit consists of 4 Execution
 Units, also known as SIMDs, each with their own register file. For more
-information see :doc:`understand/hardware_implementation`.
+information see :doc:`../understand/hardware_implementation`.
 :cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``.
 
 Porting from CUDA __launch_bounds
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 CUDA defines the ``__launch_bounds`` qualifier which works similar to
 ``__launch_bounds__``:
@@ -223,10 +224,12 @@ CUDA defines the ``__launch_bounds`` qualifier which works similar to
 
   __launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
 
-The first parameter is the same as HIP's implementation. The second parameter of
-``__launch_bounds`` must  be converted to the format used ``__launch_bounds__``,
-which uses warps and execution units rather than blocks and multiprocessors.
-This conversion is performed automatically by :doc:`HIPIFY <hipify:index>`.
+The first parameter is the same as HIP's implementation, but
+``MIN_BLOCKS_PER_MULTIPROCESSOR`` must  be converted to
+``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than
+blocks and multiprocessors. This conversion is performed automatically by
+:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
+equation.
 
 .. code-block:: cpp
 
@@ -247,7 +250,6 @@ Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
 Instead, users are encouraged to use the ``__launch_bounds__`` directive since
 the parameters are more intuitive and portable than micro-architecture details
 like registers. The directive allows per-kernel control.
-``__launch_bounds__`` works on both AMD and NVIDIA platforms.
 
 Memory space qualifiers
 ================================================================================
@@ -520,8 +522,8 @@ the following workaround:
 #. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled.
 #. Modify kernels that use ``__threadfence_system()`` as follows:
 
-* Ensure the kernel operates only on fine-grained system memory, which should be allocated with
-  ``hipHostMalloc()``.
+* Ensure the kernel operates only on fine-grained system memory, which should be
+  allocated with ``hipHostMalloc()``.
 * Remove ``memcpy`` for all allocated fine-grained system memory regions.
 
 .. _synchronization_functions:

diff --git a/docs/how-to/kernel_language_cpp_support.rst b/docs/how-to/kernel_language_cpp_support.rst
@@ -44,10 +44,9 @@ Assertions
 --------------------------------------------------------------------------------
 
 The ``assert`` function is supported in device code. Assertions are used for
-debugging purposes. When the input expression equals to zero, the execution will
-be stopped.
-HIP provides its own implementation for ``assert`` for usage in device code in
-``hip/hip_runtime.h``
+debugging purposes. When the input expression equals zero, the execution will be
+stopped. HIP provides its own implementation for ``assert`` for usage in device
+code in ``hip/hip_runtime.h``.
 
 .. code-block:: cpp